BIPI

The 25-Item Pre-Production Checklist for LLM Agents

Agentic AI

Shipping an agent without a checklist is how teams end up with five-figure bills and a security incident in the same week. We share the 25-item checklist we walk every client through before promoting to production.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 25, 2024 · 8 min read

#ai-agents #deployment #production

We have done agent production reviews for 14 clients in the past year. The pattern is consistent: teams build a working demo in two weeks, then spend four months hardening it. Most of that hardening is the same work across clients, so we wrote it down as a 25-item checklist. If you cannot tick every item, do not promote to production.

Cost and rate limiting

Cost overruns kill more agent projects than any other failure mode. Set hard ceilings before users touch the system.

  1. Per-tenant token budget enforced at the gateway, not just monitored
  2. Per-conversation token cap with a hard stop and explicit user message
  3. Per-tool-call cost ceiling for expensive tools like web search and code execution
  4. Daily and monthly spend alerts at 50, 80, and 95 percent of the budget
  5. Provider-side spending limits configured on every API key
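
As an illustration of items 1 and 4, here is a minimal sketch of a budget check that rejects a request before any tokens are spent, rather than reporting the overrun afterwards. The names `TenantBudget`, `BudgetExceeded`, and `reserve` are hypothetical, and the sketch assumes the gateway can estimate a request's token count up front.

```python
from dataclasses import dataclass, field

# Alert thresholds as fractions of the budget (50, 80, and 95 percent).
ALERT_THRESHOLDS = (0.50, 0.80, 0.95)


class BudgetExceeded(Exception):
    """Raised when a request would push a tenant past its hard ceiling."""


@dataclass
class TenantBudget:
    limit_tokens: int
    used_tokens: int = 0
    alerts_sent: set = field(default_factory=set)

    def reserve(self, tokens: int) -> list:
        """Reserve tokens before the model call.

        Returns the alert thresholds newly crossed by this reservation.
        """
        if self.used_tokens + tokens > self.limit_tokens:
            # Enforce, do not just monitor: refuse before any spend occurs.
            remaining = self.limit_tokens - self.used_tokens
            raise BudgetExceeded(
                f"request of {tokens} tokens exceeds remaining budget {remaining}"
            )
        self.used_tokens += tokens
        crossed = [
            t for t in ALERT_THRESHOLDS
            if self.used_tokens >= t * self.limit_tokens
            and t not in self.alerts_sent
        ]
        self.alerts_sent.update(crossed)
        return crossed
```

A real gateway would keep these counters in shared storage and reconcile the reservation against actual usage after the call. The sketch only shows the ordering that matters: check first, spend second.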

Prompt management

Prompts are code. Treat them like code.

  1. All prompts in version control, never hard-coded in deployment configs
  2. Prompt versioning with the ability to roll back to a previous version in under five minutes
  3. Per-environment prompt overrides for staging versus production
  4. Prompt review process: every prompt change goes through PR review
  5. Prompt eval suite that runs on every prompt change
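
Item 2 is easier to meet when rollback is a pointer move rather than a redeploy. The sketch below is one hypothetical way to model that with an in-memory `PromptRegistry`. In practice the versions would live in version control or a database, and the class and method names are assumptions for illustration.

```python
class PromptRegistry:
    """Keeps every published version of a prompt and a pointer to the active one."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of prompt texts, oldest first
        self._active = {}    # prompt name -> index of the active version

    def publish(self, name: str, text: str) -> int:
        """Store a new version, make it active, and return its version index."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        self._active[name] = len(versions) - 1
        return self._active[name]

    def rollback(self, name: str) -> int:
        """Point the active version one step back without deleting anything."""
        if self._active.get(name, 0) == 0:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        self._active[name] -= 1
        return self._active[name]

    def get(self, name: str) -> str:
        """Return the text of the currently active version."""
        return self._versions[name][self._active[name]]
```

Because old versions are never deleted, a rollback is a single cheap operation.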

Evals and quality

If you cannot measure quality, you cannot ship safely.

  1. Held-out eval set of at least 100 examples covering happy path, edge cases, and adversarial inputs
  2. Automated eval run on every code or prompt change with a quality gate
  3. Regression detection: alert if quality drops by more than 2 points
  4. Human review queue for sampled production conversations, target 1 percent sampling rate
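
The quality gate in items 2 and 3 can be a few lines in CI. This sketch assumes quality is a pass rate on a 0 to 100 scale and reads "2 points" as two points on that scale. Both are assumptions, as are the function names.

```python
def pass_rate(results: list) -> float:
    """Percentage of eval examples that passed, on a 0-100 scale."""
    if not results:
        raise ValueError("eval set is empty")
    return 100.0 * sum(1 for passed in results if passed) / len(results)


def quality_gate(baseline: float, candidate: float, max_drop: float = 2.0) -> tuple:
    """Compare a candidate score with the baseline.

    Returns (passed, reason). The gate fails when the candidate drops
    by more than max_drop points relative to the baseline.
    """
    drop = baseline - candidate
    if drop > max_drop:
        return False, f"regression of {drop:.1f} points exceeds {max_drop:.1f}"
    return True, "ok"
```

For example, `quality_gate(92.0, 89.5)` fails the gate, while `quality_gate(92.0, 90.0)` passes because the drop does not exceed the threshold.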

Observability

When something goes wrong you need to reconstruct what happened in minutes, not hours.

  1. Full trace per conversation: prompts, tool calls, tool responses, model outputs
  2. Trace retention of at least 30 days, longer for regulated industries
  3. Per-tenant dashboard showing cost, latency, error rate, and tool call distribution
  4. Alerting on p99 latency, error rate, and cost burn rate
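
A full trace per conversation (item 1) is mostly a matter of recording every step in one structured record. The sketch below shows one possible shape using only the standard library. The `ConversationTrace` and `TraceEvent` names and their fields are hypothetical, and payloads are assumed to be JSON-serialisable.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    kind: str      # "prompt", "tool_call", "tool_response", or "model_output"
    payload: dict
    ts: float = field(default_factory=time.time)


@dataclass
class ConversationTrace:
    tenant_id: str
    conversation_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, kind: str, **payload) -> None:
        """Append one timestamped event to the trace."""
        self.events.append(TraceEvent(kind=kind, payload=payload))

    def to_json(self) -> str:
        # asdict recurses into the nested TraceEvent dataclasses.
        return json.dumps(asdict(self))
```

Carrying `tenant_id` on the trace itself also makes a per-tenant dashboard a simple group-by.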

Security and abuse

Agents are a new attack surface. Prompt injection is not a theoretical risk.

  1. Input sanitisation: strip or escape user input before it reaches sensitive tool calls
  2. Output filtering: scan model outputs for PII, secrets, and policy violations
  3. Tool authorization: every tool call checks that the calling user owns the resource
  4. Audit log for every privileged tool invocation, retained for the compliance window
  5. Rate limiting per source IP and per user account, separate from token budgets
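
Tool authorization (item 3) works best as a wrapper that no tool can bypass. The sketch below is a minimal illustration. The `RESOURCE_OWNERS` table, the `requires_owner` decorator, and the `delete_document` tool are all hypothetical, and `user_id` is assumed to come from the authenticated session, never from model output.

```python
from functools import wraps


class NotAuthorized(Exception):
    """Raised when the calling user does not own the target resource."""


# Hypothetical ownership table: resource id -> owning user id.
RESOURCE_OWNERS = {"doc-1": "alice", "doc-2": "bob"}


def requires_owner(tool):
    """Wrap a tool so it refuses to run unless the caller owns the resource."""
    @wraps(tool)
    def guarded(user_id: str, resource_id: str, *args, **kwargs):
        # Unknown resources have no owner, so they are rejected as well.
        if RESOURCE_OWNERS.get(resource_id) != user_id:
            raise NotAuthorized(f"{user_id} does not own {resource_id}")
        return tool(user_id, resource_id, *args, **kwargs)
    return guarded


@requires_owner
def delete_document(user_id: str, resource_id: str) -> str:
    return f"deleted {resource_id}"
```

The check runs on every call, so a prompt injection that persuades the model to target another user's resource still fails at the tool boundary.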

Rollback and recovery

Every deployment will eventually have to be rolled back. Make it boring.

  1. Single command rollback to the previous prompt and model combination
  2. Blue-green deployment for model upgrades with traffic shifting
  3. Kill switch: ability to disable the agent and route users to a fallback path

If you cannot roll back in under five minutes you cannot ship safely. Practice it on a Wednesday before you need it on a Friday at 11 PM.
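
One way to keep rollback boring is to treat the prompt and model as a single release and keep the release history in memory or config, with the kill switch as a plain flag. The sketch below is an assumption about how that could look. `Release`, `Deployment`, and the returned strings are placeholders for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    prompt_version: str
    model: str


class Deployment:
    def __init__(self, initial: Release):
        self._history = [initial]
        self.agent_enabled = True  # kill switch: set to False to disable the agent

    @property
    def current(self) -> Release:
        return self._history[-1]

    def promote(self, release: Release) -> None:
        self._history.append(release)

    def rollback(self) -> Release:
        """Return to the previous prompt and model combination in one step."""
        if len(self._history) < 2:
            raise RuntimeError("no previous release to roll back to")
        self._history.pop()
        return self.current

    def route(self, message: str) -> str:
        # With the kill switch off, users go to the fallback path.
        if not self.agent_enabled:
            return "fallback"
        return f"agent[{self.current.model}]"
```

A rollback drill then amounts to calling `rollback()` in staging and confirming that `route` serves the earlier release.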

The two items teams skip most often

In our reviews, two items get skipped more than any others. The first is per-tenant token budgets enforced at the gateway. Teams monitor cost but do not enforce it. When a prompt injection or a bug fires, the cost is already incurred by the time the alert reaches a human. The second is rollback rehearsal. Teams have the rollback path documented but have never tested it. We make every client do a Wednesday rollback drill before the first production deploy.

Tenant isolation

For multi-tenant deployments, isolation has to be explicit at every layer.

  1. Tenant ID propagated through every log line and trace
  2. Vector index segregated by tenant; never query across tenants without explicit user permission
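
The safest form of item 2 is an index whose search API cannot be called without a tenant ID. The sketch below uses a toy in-memory index with cosine similarity to show the shape of that API. `TenantVectorIndex` is hypothetical, and a production system would apply the same scoping through its vector database's own namespace or filter mechanism.

```python
import math
from collections import defaultdict


class TenantVectorIndex:
    def __init__(self):
        self._by_tenant = defaultdict(list)  # tenant_id -> [(doc_id, vector)]

    def add(self, tenant_id: str, doc_id: str, vector: list) -> None:
        self._by_tenant[tenant_id].append((doc_id, vector))

    def search(self, tenant_id: str, query: list, k: int = 3) -> list:
        """Return up to k doc ids, drawn only from the given tenant's vectors."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        # .get avoids creating an empty entry for an unknown tenant.
        candidates = self._by_tenant.get(tenant_id, [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```

Because there is no code path that searches across tenants, a missing filter cannot silently leak data.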

What happens if you skip the checklist

We have post-mortems for clients who skipped items. The most expensive was a $47,000 bill in 18 hours from a misbehaving customer-facing agent that lacked a per-tenant cap. The most embarrassing was an agent that leaked one tenant's data to another tenant because the vector index was shared. Both were preventable. Both are on this checklist.

Print the checklist, paste it into your launch runbook, and refuse to promote without every item ticked. Your future self will thank you on the morning after the first production incident.
