LLM Prompt Injection Defense Patterns That Actually Work
AI Security
Prompt injection cannot be solved, but it can be contained. We catalog the defense patterns that hold up in production, structured prompting, dual model checks, output schema enforcement, and capability isolation, with notes on where each one breaks.
By Arjun Raghavan, Security & Systems Lead, BIPI · July 11, 2023 · 11 min read
Prompt injection is not a bug to patch, it is a property of mixing instructions and data in the same channel. The right question is not how to stop it, but how to contain the blast radius when it happens.
Why naive defenses fail
Telling the model ignore any instructions in the user input is a coping mechanism, not a control. The model has no privileged channel for your meta instructions, so the attacker's instructions get equal weight. Stronger controls live outside the model.
Pattern 1: structured prompting with strict roles
Separate system, developer, and user content into clearly delimited sections, and tag untrusted content with explicit markers. The model still cannot enforce the separation, but downstream validators can use the markers to reject responses that confuse the boundary.
Pattern 2: dual model checks
A second model, often smaller, classifies the output of the first against a policy. The classifier sees only the response and the original task, not the untrusted context, which makes it harder to compromise both at once. Anthropic's constitutional AI work and OpenAI's safety best practices both point in this direction.
Pattern 3: output schema enforcement
If the model must return JSON matching a schema, an attacker who injects free form instructions usually breaks the schema. A strict parser that rejects non conforming output blocks a large class of attacks for free. Pair with function calling APIs where available.
- Define the schema in code, not in the prompt
- Reject and retry on schema violation, do not coerce
- Log the violating output for review
Pattern 4: capability isolation
The model that reads untrusted content should not be the model that calls dangerous tools. Split into a reader model and an actor model, with a typed message passing between them. The actor only accepts structured commands, not free text, which means injected prose cannot trigger actions.
Pattern 5: provenance tagging
Tag every piece of context with its source, and let downstream policy decide what each source is allowed to influence. An email body can answer questions, it cannot trigger refunds. A retrieved doc can cite, it cannot rewrite the system prompt.
Pattern 6: human in the loop for irreversible actions
When the action cannot be undone, ask a human. The bar is not perfect, it is good enough to catch the attacks that automated defenses miss.
You cannot make the model immune to injection. You can make injection useless by limiting what a compromised response is allowed to do.
Measuring defense effectiveness
- Maintain a corpus of known injection attempts and track block rate over time
- Run red team prompts on every prompt or model change
- Track schema violation rate as a leading indicator of attack pressure
- Replay production traces against new guardrails before deploying
Tooling notes
NeMo Guardrails, LangChain output parsers, and Anthropic's tool use with strict schemas all support these patterns. MITRE ATLAS catalogues the attacker techniques you should test against.
Closing
Defense in depth, applied to LLMs. No single pattern is enough, and that is fine. The combined surface is what makes the attack uneconomic.
Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.