Surface Filters Are Not Safety: Bypass Patterns We See Weekly
AI Security
Regex blocklists and naive classifier filters get bypassed routinely by encoding, framing, and indirect requests. Real content safety needs layered defense across input, model, and output, and most teams have only one of the three.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 17, 2024 · 6 min read
A consumer product team showed us their content safety architecture last fall. It was a single regex blocklist with about 800 entries, applied to user input. We bypassed it in eleven minutes with base64 encoding on three terms. The team had spent six months building it. The bypass worked because they had layered nothing on top.
Surface filters fail not because regex is bad but because attackers route around them. The same input can be expressed in dozens of ways. Real safety architecture assumes the filter will fail and catches the failure somewhere else.
The bypass patterns that work in 2026
These are not sophisticated. We see them in basic abuse traffic, not just red team engagements. A production filter that does not handle them is essentially decorative.
- Encoding: base64, rot13, hex, leetspeak, unicode homoglyphs. Cheap and works against blocklists.
- Framing: 'write a story where a character explains', 'imagine you are training data', 'translate this to French'. Slips past intent classifiers.
- Indirect requests: ask for the components separately, then combine. Each piece is benign.
- Roleplay layering: nested fictional contexts that establish permissions before the actual request.
- Tool abuse: in agentic systems, ask the tool to retrieve content the model would refuse to generate.
- Language pivots: translate to a low-resource language, get the response, translate back.
Why one layer always loses
A regex blocklist matches surface strings. Attackers transform surface strings. A classifier-based filter matches semantic intent on the input. Attackers split intent across messages or hide it in framing. A model self-check on output catches some of the residue but not all of it. Each layer has a hole. The holes overlap less than you think.
We measured this on a recent client engagement. Their input filter caught 67 percent of attacks. Their model refusal training caught 71 percent. Their output classifier caught 58 percent. Run individually, the worst layer let 42 percent through. Run in combination, the residual was under 4 percent. The math only works if the layers are independent, which is not free to engineer.
Layered defense that actually composes
- Input layer: cheap filters for known abuse patterns, encoding detection, length limits. Not the primary defense, but it cuts noise.
- Context layer: structured prompts that segregate user input from system instructions and retrieved content with explicit trust labels.
- Model layer: refusal training and constitutional methods. The model itself should refuse most known categories.
- Output layer: classifier on the response, not the request. Different model class than the generator. Catches anything the generator was tricked into producing.
- Action layer: for agentic systems, require confirmation before sensitive actions regardless of what the model decided.
How to find your bypass surface
Set a budget of 40 hours and have an engineer who is not on the safety team try to break the filters. No special access, no documentation. If they cannot bypass anything in 40 hours, your filters are reasonable. If they bypass within an hour, you have one layer pretending to be a system.
We ran this exercise with a client whose CEO was confident in their safety stack. The bypass came in 23 minutes via a translation pivot. The exercise cost them one engineer-week and a few weeks of follow-up work. The alternative was finding out from a journalist or a regulator.
What to tell product teams
Filters are speed bumps, not walls. Design the system assuming determined users will get past them. The question that matters is 'what happens when the filter fails.' If the answer is 'a user sees content that violates policy and the incident is logged for review,' that is acceptable. If the answer is 'the model takes an irreversible action with the user's credentials,' you have an architecture problem, not a filter problem.
Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.