BIPI
BIPI

Building an Internal LLM Red Team Program

AI Security

External red teams find what they are paid to find. Internal teams find what hurts you in production. The skill mix, cadence, and reporting structure of an effective LLM red team look different from a traditional offensive security team.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 26, 2024 · 8 min read

#red-team#llm#ai-security#program

A well-known SaaS company asked us to help them stand up an internal LLM red team last year. Their first proposal was to take three traditional pentesters and have them learn LLMs. Six months in, they had filed 40 issues. Most were variations of 'we got it to swear.' None mapped to real product risk. We rebooted the program with a different team structure and the next 90 days produced 14 issues, four of which became P0s.

LLM red teaming is closer to product testing than to traditional offensive security. The bugs are often not exploits but misalignments between model behavior and product expectation. The skill mix and reporting structure have to reflect that.

Skill mix that works

The team we eventually built had four roles. Each role brings something the others lack. A team of all one type produces narrow findings.

  • Offensive security background: brings adversarial mindset, knowledge of injection patterns, exploit chaining intuition.
  • ML engineering background: understands model behavior, fine-tuning effects, evaluation methodology, what is fixable.
  • Domain expert: someone who knows the product domain, what users actually do, what failure looks like to a customer.
  • Linguist or social scientist: catches manipulation patterns, register shifts, cultural and language-specific bypasses that engineers miss.

We have run programs with three of these and they always have a blind spot. The most commonly skipped role is the linguist. Most teams do not realize they need one until a researcher finds a Tagalog-language jailbreak that their entire English-speaking team missed for six months.

Tooling stack we recommend

Open source tooling has matured but is fragmented. We help clients standardize on a stack rather than collecting tools.

  1. Garak or PyRIT for automated probe execution. PyRIT has better orchestration. Garak has more probes out of the box.
  2. Custom eval harness for product-specific tests. Off-the-shelf will not cover your business logic.
  3. Logging and replay system. Every interaction with the model under test should be recorded and replayable.
  4. Issue tracker integration that maps red team findings to product Jira or Linear with severity, reproducer, and fix owner.
  5. Private prompt corpus management. Not in any LLM provider's training data. Treat it like attack signatures.

Cadence and scope

We see two failure modes on cadence. Too rare, where red team runs once before launch and never again. Too constant, where the team is in a permanent fire-drill and never gets to research. Both produce thin findings.

The cadence we have settled on with most clients is a four-week cycle. Week one is research and probe development for the next focus area. Weeks two and three are active testing against current production and pre-release models. Week four is reporting, fix verification on previous findings, and program retrospective. Major model launches trigger a focused two-week deep dive that interrupts the cycle.

How findings should flow to product

Traditional security findings tend to flow to a security ticketing system and get triaged on a security severity scale. LLM red team findings rarely fit that model. Most are not exploitable in the traditional sense. They are behavior gaps, brand risk, and edge case handling failures.

We push clients to file findings directly into product engineering's tracker, not security's. Severity is a joint call between red team and product. Fix ownership is product. Security advises on whether the residual risk after fix is acceptable. This structure changes the political dynamic. Red team becomes a resource for product quality, not a blocker.

Reporting up

The CEO does not need to know how many jailbreak prompts you ran. They need to know whether residual risk on the model is rising, flat, or falling, and how that compares to peer products. We track three metrics over time. Refusal rate on a stable private eval. Mean time from finding to fix. Coverage of the abuse taxonomy by category.

Six months in, the SaaS client we mentioned had moved from 'is anyone testing this' to 'we can quote our refusal rate against three categories of buyer-relevant abuse and show the trend over six releases.' That is the maturity step that lets the program survive a budget review. Without it, every cycle is a renegotiation of why the team exists.

Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.