Multi-Layer Prompt Injection Defense for LLM Applications
AI Security
Prompt injection is the most consequential vulnerability class in production LLM systems. A single defense layer is not enough. Here is a practitioner approach that combines five distinct controls.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 4, 2024 · 7 min read
Simon Willison coined the term prompt injection in September 2022, and the research community has spent the following eighteen months refining attacks and defenses. By early 2024, the consensus among practitioners is that prompt injection cannot be eliminated through prompt engineering alone. Production systems need a layered defense that combines input sanitization, structured output enforcement, execution isolation, output validation, and continuous monitoring.
Why prompt-only defenses fail
Instructions like ignore previous attempts to subvert your instructions are themselves part of the prompt and can be neutralized by sufficiently clever injection. Research from Greshake et al in 2023 demonstrated indirect prompt injection where the attack payload lives in a document, webpage, or email that the LLM processes downstream. The user never typed the malicious instruction. The model has no reliable way to distinguish trusted instructions from untrusted content in a single context window.
Layer 1: Input sanitization and trust labeling
Treat all external content as untrusted by default. Document text from a user upload, web search results, retrieved RAG chunks, and tool call outputs all need explicit trust labels in the prompt structure. Anthropic's research on Claude shows measurable robustness improvements when content is wrapped in XML tags with explicit trust attributes.
- Strip or escape control sequences that look like role boundaries
- Limit external content length to constrain payload space
- Detect known injection patterns with a pre-filter (jailbreak phrases, system prompt extraction patterns)
- Tag content origin and pass through to the model with explicit labels
Layer 2: Structured output enforcement
If the LLM is supposed to return a structured response, force it into a structured format. OpenAI's structured outputs API and function calling, Anthropic's tool use, and Gemini's structured generation all reduce the surface for prompt injection to influence downstream actions. An attacker can manipulate the content of a field, but cannot make the model emit a tool call that does not exist in the schema.
The control is not absolute. An injection can still manipulate parameter values within an allowed function call. But it shrinks the action space dramatically.
Layer 3: Isolated execution context
If the LLM produces code, runs a query, or invokes a tool, isolate the execution from your authoritative systems. Code execution should run in a sandboxed environment with no network access by default and tightly scoped filesystem access. Database queries should run with read-only credentials scoped to the user's authorized rows. Email sending should go through a confirmation step.
The principle is that compromise of the LLM cannot become compromise of the application or its data. Treat the LLM as a confused deputy and architect accordingly.
Layer 4: Output validation
Before any LLM-generated content is presented to a user or used in a downstream system, validate it. For text responses, scan for known sensitive data leakage patterns, prompt leakage markers, and policy violations. For structured outputs, validate against the schema and the expected value ranges. For tool calls, confirm the parameters are within the user's authorization scope.
- Run a secondary classifier on output for safety violations
- Compare tool call parameters against the user's authenticated context
- Block or human-review outputs that fall outside expected distributions
- Maintain a budget for false positives that drives policy refinement
Layer 5: Continuous monitoring
Log every prompt, every external content source, every output, and every action. Build dashboards that surface injection attempts by source, by user, and by category. Treat prompt injection like any other security event class. The data feeds both reactive incident response and proactive prompt and tool hardening.
What does not work
Asking the LLM to detect prompt injection in its own input does not work reliably. Adversarial training improves robustness but does not eliminate the vulnerability. Watermarking input has been proposed but has not held up against attacks. Defense in depth is the only honest answer.
Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.