Prompt Injection and Jailbreak Are Not the Same Problem
AI Security
Teams routinely conflate prompt injection with jailbreak and end up with defenses that address neither well. The threat models are different, the attackers are different, and the controls that work for one rarely work for the other.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 14, 2024 · 6 min read
On a recent assessment we asked a security team to walk us through their LLM threat model. The slide had one row labeled 'prompt injection / jailbreak.' The mitigations were 'system prompt hardening' and 'output filtering.' Neither would stop either attack reliably, and the team did not realize they were facing two distinct threats with two distinct attacker populations.
The terms get used interchangeably in vendor marketing and casual conversation. The difference matters because the defenses are not the same. Confusing them produces security theater.
Jailbreak: the user is the attacker
A jailbreak is when the authenticated user of an LLM tries to get it to do something the operator does not want it to do. The attacker controls the entire input. The threat model is roughly equivalent to a sandbox escape. Defenses live in alignment, refusal training, output filters, and content moderation.
Jailbreak attackers are typically curious users, researchers, abuse-seekers, or competitors testing your safety posture. Their goal is to get the model to produce harmful content, leak system prompts, or bypass policy. The blast radius is usually one user session.
Prompt injection: a third party is the attacker
Prompt injection is when untrusted data flowing into the model's context window manipulates the model on behalf of someone other than the user. The user is the victim, not the attacker. The classic example is a customer support agent that reads a malicious email, follows instructions in the email body, and exfiltrates data from the user's account.
The threat model here is closer to cross-site scripting. The attacker is upstream of the user: in a document, an email, a webpage, a tool output, or anywhere else the model ingests external content. Defenses live in input segregation, capability scoping, and treating model context like untrusted data.
- Jailbreak: user vs operator. Defenses: alignment, content filters, refusal training.
- Prompt injection: third party vs user. Defenses: input isolation, tool permission scoping, context provenance tracking.
- Indirect prompt injection: third party content reaches the model through retrieval or tool use.
- Stored prompt injection: malicious content sits in your data store waiting for a user to query it.
Why teams conflate them
Both look like 'model does something it should not.' Both involve crafted text. Both show up in the same incident channels. The difference is which trust boundary failed, and that boundary is invisible in most product designs.
A team building a customer-facing chatbot worries mostly about jailbreak. A team building an agent that reads emails and takes actions on the user's behalf has prompt injection as their primary risk. We have seen agentic products ship with elaborate jailbreak defenses and zero protection against indirect injection through tool outputs. The result is a system that resists abuse from the user but trusts every email it reads.
Jailbreak defenses harden the model. Prompt injection defenses harden the architecture. You need both, and they are not interchangeable.
Different controls for different threats
For jailbreak, the controls are mostly inside the model boundary: refusal fine-tuning, output classifiers, content moderation APIs, multi-turn safety monitors. They reduce the success rate but do not eliminate it.
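A minimal sketch of what an output-side control looks like in practice: gate every model response behind a harm classifier before it reaches the user. The classify_harm() helper here is a hypothetical stand-in for whatever moderation API or fine-tuned classifier you actually run; the banned phrase list is illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ModerationResult:
    flagged: bool
    category: str | None = None


def classify_harm(text: str) -> ModerationResult:
    """Placeholder for a real output classifier or moderation API call."""
    banned_markers = ["example banned phrase"]  # illustrative only
    for marker in banned_markers:
        if marker in text.lower():
            return ModerationResult(flagged=True, category="policy_violation")
    return ModerationResult(flagged=False)


def respond_to_user(model_output: str) -> str:
    verdict = classify_harm(model_output)
    if verdict.flagged:
        # Refuse instead of forwarding the raw output; log for review.
        return "Sorry, I can't help with that."
    return model_output
```

The important property is that the gate sits outside the model's control: no matter what the user coaxed the model into generating, the classifier sees it before the user does.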
For prompt injection, the controls are mostly outside the model boundary. Tag untrusted content in the context window so the model can distinguish trust levels. Scope tool permissions tightly so a successful injection cannot exfiltrate data. Require user confirmation for sensitive actions. Log and review tool calls. None of these care whether the model 'detects' the injection.
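A rough sketch of the architecture-side controls under stated assumptions: an explicit tool allowlist, a confirmation gate for sensitive actions, and a log of every call. The tool names, the confirm() callback, and the execute() dispatcher are assumptions for illustration, not any specific framework's API.

```python
import logging

logger = logging.getLogger("tool_calls")

ALLOWED_TOOLS = {"search_docs", "read_email"}       # read-only by default
SENSITIVE_TOOLS = {"send_email", "delete_record"}   # require confirmation


def execute(name: str, args: dict) -> dict:
    """Stand-in for the real tool dispatcher."""
    return {"ok": True, "tool": name}


def call_tool(name: str, args: dict, confirm) -> dict:
    """Dispatch a tool call under scoped permissions.

    confirm is a callback that asks the human user, out of band, whether a
    sensitive action should proceed. The model never answers it.
    """
    if name not in ALLOWED_TOOLS | SENSITIVE_TOOLS:
        logger.warning("blocked tool call: %s %s", name, args)
        return {"error": f"tool {name!r} is not permitted for this agent"}
    if name in SENSITIVE_TOOLS and not confirm(name, args):
        logger.info("user declined sensitive action: %s %s", name, args)
        return {"error": "action cancelled by user"}
    logger.info("tool call: %s %s", name, args)
    return execute(name, args)
```

Note that nothing in this sketch inspects the model's output for malicious intent. Even if an injected instruction convinces the model to call send_email, the action stalls at the confirmation gate and leaves a log entry behind.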
What to put in your threat model
Two rows, not one. Each with its own attacker, asset, and control list. We tell clients to start every LLM threat modeling session by asking 'who controls each token in the context window' and 'what happens if any of them is malicious.' That question separates the two threats faster than any framework. The teams that conflate them never ask it.
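One way to make the "who controls each token" question concrete is to track provenance on every segment that enters the context window. The sketch below is a minimal illustration; the trust levels, the Segment shape, and the untrusted markers are assumptions, not a standard.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    OPERATOR = "operator"   # system prompt, written by you
    USER = "user"           # the authenticated user's messages
    EXTERNAL = "external"   # emails, web pages, tool outputs


@dataclass
class Segment:
    text: str
    source: str
    trust: Trust


def build_context(segments: list[Segment]) -> str:
    # Wrap external content in explicit markers so downstream controls
    # and reviewers can see which spans were attacker-controllable.
    parts = []
    for seg in segments:
        if seg.trust is Trust.EXTERNAL:
            parts.append(f"<untrusted source={seg.source!r}>\n{seg.text}\n</untrusted>")
        else:
            parts.append(seg.text)
    return "\n\n".join(parts)
```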
Read more field notes, explore our services, or get in touch at info@bipi.in.