BIPI
BIPI

Jailbreak Benchmarks Are Not Safety Certificates

AI Security

Public benchmarks like HarmBench and JailbreakBench measure narrow slices of attack behavior. Passing them tells you almost nothing about how your deployed model handles real adversaries with budget and patience.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 2, 2024 · 7 min read

#llm#jailbreak#red-team#ai-security

A product team showed us a slide last quarter claiming their assistant was 'safe' because it scored 94 percent refusal on HarmBench. Three weeks later we got it to leak its system prompt with a base64-wrapped roleplay payload that took twelve minutes to write. The benchmark was not lying. It was answering a different question than the one the team thought it was answering.

Public jailbreak suites have done real work for the field. Before HarmBench and JailbreakBench, vendor claims were unfalsifiable. Now there is a shared yardstick. The problem is that the yardstick measures attacks frozen in time against a fixed prompt template, and your attackers are neither frozen nor fixed.

What the benchmarks actually measure

HarmBench evaluates 400 behaviors across categories like cybercrime, harassment, and misinformation. JailbreakBench tracks 100 prompts with a leaderboard. Both run a single-turn attack against a single model variant with default sampling. That design choice makes results reproducible, which matters for science, but it strips out the conditions that produce real failures.

  • Single turn only. No conversation steering, no slow-burn priming across 20 messages.
  • Fixed temperature. Real users vary sampling, top_p, and seed across attempts until something sticks.
  • Public dataset. Vendors fine-tune on the test set. We have seen four cases where benchmark scores improved by 30 points with no change to actual safety.
  • English only for most rows. Translation pivots routinely break filters that pass benchmark English.

What we run instead during engagements

On a recent assessment for a fintech assistant, we ran the public benchmark first, scored 91 percent refusal, then ran our own protocol and got 38 percent. The gap was multi-turn priming, function-call abuse where the model trusted its own tool output, and indirect injection through retrieved documents. None of those are in HarmBench.

Our internal protocol layers four things on top of the public suites. We add multi-turn campaigns where each message is benign in isolation. We attack the retrieval layer because RAG context bypasses most safety training. We probe tool use, where the model treats its own function output as authoritative. And we run the attacks in the languages your customers actually speak, which for one client meant Tagalog and Vietnamese, not the French and German on the leaderboard.

Benchmark refusal rate is a floor, not a ceiling. If you score 60 percent on HarmBench you have a problem. If you score 95 percent you have an unknown.

Bridging benchmark to production risk

The translation we use with clients is simple. Benchmark score is a regression signal. Watch the delta when you change the model, prompt, or guardrail. Treat absolute numbers as marketing. For real risk, you need a private evaluation set built from your own threat model, refreshed every quarter, and never published.

  1. Build a private 200-prompt eval covering your top five abuse categories. Pull from real abuse reports if you have them.
  2. Run it against every model and prompt change before promotion. Track the delta, not the absolute.
  3. Add a multi-turn variant where each prompt has three follow-ups designed to escalate.
  4. Include 20 percent retrieval poisoning cases if you do RAG. This is where production breaks.
  5. Keep the eval set out of training data. Audit your fine-tuning pipeline for leakage.

What to tell leadership

When the CISO asks if the model is safe, the answer is never a benchmark number. The answer is what categories of harm matter, what your refusal rate is on a private eval that mirrors those, and what your detection coverage looks like for the failures that get past the model. The benchmark belongs in the appendix.

We have watched three different teams set OKRs on HarmBench score and then fail external red team exercises after hitting 95 percent. The score went up because the model learned the test, not because the system got safer. If you cannot answer the question 'what would a determined attacker with two weeks try' then a benchmark cannot answer it for you.

Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.