BIPI

Rate Limiting LLM Agents: Token Buckets Are Not Enough

Agentic AI

Classic HTTP rate limiting falls apart when each agent call costs a variable number of tokens and triggers async tool fan-out. We share the multi-dimensional limiter architecture we run for production LLM agents.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 10, 2024 · 7 min read

#ai-agents · #rate-limiting · #operations

The first time we tried to rate limit an agentic workflow with a standard Nginx limit_req zone, we got paged at 2 AM. The agent had triggered 14 tool calls in a single user turn, each one a fresh model invocation, and we burned 380k tokens before the 100 requests-per-minute ceiling ever fired. HTTP rate limiting assumes uniform cost. Agents violate that assumption on every turn.

Why classic limiters fail

A typical /chat call to GPT-4o looks like one HTTP request from the gateway's perspective, but inside that request the agent might read 60k context tokens, write 4k output tokens, then issue six tool calls that each spawn another 8k-token model invocation. Counting requests instead of tokens lets a single chatty user spend more than 50 quiet users combined.

We measured this on a Claude Sonnet 4.5 deployment serving 220 tenants. The top 1 percent of conversations consumed 38 percent of tokens. Request-based limits caught none of them.

The three-dimensional budget

Our production limiter tracks three independent dimensions per tenant and refills them on different schedules:

  • Input tokens per minute: refilled linearly, capped at 4x the tenant's committed throughput. Protects upstream provider quota.
  • Output tokens per minute: refilled at half the input rate because output is 3x more expensive on most models and dominates the bill.
  • Tool calls per conversation: hard ceiling of 25 calls per user turn to break runaway loops. Soft warning at 10.

Each dimension is a separate Redis token bucket keyed on tenant_id. We use the redis-cell module so the atomic CL.THROTTLE command returns a retry-after hint in one round trip. Naive INCR plus EXPIRE leaks under contention; we proved this with a chaos test that spawned 800 concurrent agent sessions.
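redis-cell does the check-and-debit atomically on the server side; as a rough illustration of the per-dimension semantics, here is an in-process token-bucket sketch. The capacities and refill rates are illustrative, not our production numbers:

```python
import time

class TokenBucket:
    """Minimal in-process token bucket mirroring the semantics we get
    from redis-cell's CL.THROTTLE: a single check-and-debit that also
    yields a retry-after hint when the request is rejected."""

    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity
        self.updated = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now

    def try_debit(self, cost: float) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds)."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True, 0.0
        return False, (cost - self.tokens) / self.refill_per_sec

# One bucket per dimension per tenant (illustrative keys and rates).
buckets = {
    ("tenant-a", "tokens_in"):  TokenBucket(capacity=240_000, refill_per_sec=1_000),
    ("tenant-a", "tokens_out"): TokenBucket(capacity=120_000, refill_per_sec=500),
    ("tenant-a", "tool_calls"): TokenBucket(capacity=25, refill_per_sec=0.0),  # hard per-turn ceiling
}
```

The tool-call bucket gets a zero refill rate because it is reset per user turn rather than per wall-clock minute.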

Pre-flight estimation

Token cost is unknown until the model finishes streaming. We solved this with pre-flight estimation: count the input tokens with tiktoken before the API call, assume max_tokens for output, reserve that budget atomically, then refund the unused portion when the stream closes. A failed reservation returns 429 before any provider cost is incurred.
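A minimal sketch of that reserve-then-refund flow, assuming an in-memory budget and a crude characters-per-token estimator standing in for tiktoken and the atomic Redis operations we actually run:

```python
def estimate_input_tokens(prompt: str) -> int:
    # Production counts with tiktoken; ~4 chars/token is a crude stand-in here.
    return max(1, len(prompt) // 4)

class ReservationError(Exception):
    """Raised in place of the HTTP 429 the gateway would return."""

class Budget:
    """Reserve the worst case up front, refund the unused portion when
    the stream closes. In-memory sketch; the real debit and refund are
    single atomic Redis operations."""

    def __init__(self, tokens_available: int):
        self.tokens_available = tokens_available

    def reserve(self, prompt: str, max_tokens: int) -> int:
        # Worst case = estimated input + the full max_tokens output cap.
        worst_case = estimate_input_tokens(prompt) + max_tokens
        if worst_case > self.tokens_available:
            raise ReservationError("429: insufficient budget")  # rejected pre-flight
        self.tokens_available -= worst_case
        return worst_case

    def settle(self, reserved: int, actual_used: int) -> None:
        # Stream closed: refund whatever the model did not consume.
        self.tokens_available += max(0, reserved - actual_used)
```

A rejected reservation raises before any provider API call is made, which is exactly the property that keeps a stream from overdrawing mid-flight.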

Reserve the worst case, refund the unused. It is the only way to keep an agent from overdrawing while a stream is still in flight.

Async tool calls break naive accounting

When an agent issues parallel tool calls, our orchestrator fans out before any individual call has finished. A naive limiter sees N concurrent requests arrive at the same wall-clock instant and either admits all of them or rejects all of them; both are wrong. Instead, we hold the parent reservation open across the tool fan-out and debit each child reservation from it, so a single user turn can never exceed the parent budget no matter how aggressive the fan-out.

This required rewriting our tool-call dispatcher to pass the parent reservation id as a header. It took three days of work and eliminated 90 percent of the cost-overrun pages we used to see.
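Sketched in-process with hypothetical names (production performs each child debit as one atomic Redis operation keyed on the reservation id from that header):

```python
class ParentReservation:
    """One reservation held open for the whole user turn. Parallel tool
    calls debit from it instead of opening budgets of their own, so the
    fan-out is bounded by the parent no matter how wide it goes."""

    def __init__(self, reservation_id: str, budget_tokens: int):
        self.reservation_id = reservation_id
        self.remaining = budget_tokens

    def debit_child(self, tool_name: str, cost: int) -> bool:
        # Atomic in production; a plain check suffices for the sketch.
        if cost > self.remaining:
            return False  # this child is rejected, the others stand
        self.remaining -= cost
        return True

# Six parallel 8k-token tool calls against a 40k-token turn budget:
# the first five are admitted, the sixth is refused.
turn = ParentReservation("turn-123", budget_tokens=40_000)
results = [turn.debit_child(name, 8_000)
           for name in ("search", "fetch", "summarize", "rank", "dedupe", "extract")]
```

Note that rejection is per-child: the turn degrades gracefully instead of failing wholesale, which is the behavior the all-or-nothing naive limiter cannot provide.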

Per-tenant headroom and burst

Most tenants idle most of the time. We give every tenant a 10x burst capacity that refills over 15 minutes, so a researcher running a one-off batch does not get throttled while a misbehaving service stays bounded. The burst pool is separate from the sustained pool so a tenant cannot drain its long-term budget in a single minute.
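A sketch of the two-pool policy, with illustrative numbers and the 15-minute burst refill loop omitted:

```python
class DualPool:
    """Separate sustained and burst pools, per the policy above.
    Sustained is spent first; overflow comes out of the burst pool,
    so a one-off batch can burst without draining future minutes."""

    def __init__(self, sustained_per_min: int):
        self.sustained = sustained_per_min   # long-term, refilled each minute
        self.burst = 10 * sustained_per_min  # 10x pool, refilled over 15 min

    def try_spend(self, cost: int) -> bool:
        if cost <= self.sustained:
            self.sustained -= cost
            return True
        overflow = cost - self.sustained
        if overflow <= self.burst:
            self.burst -= overflow
            self.sustained = 0
            return True
        return False  # rejected; neither pool is touched
```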

What we monitor

We export four metrics to Prometheus per tenant: tokens_in_used_rate, tokens_out_used_rate, tool_calls_active, and reservation_overdraft_count. The last one should always be zero. If it is non-zero, the limiter has a bug and we treat it as a Sev 2.

If you are running agents in production and still rate limiting on HTTP requests, you are one prompt injection away from a five-figure bill. Move to token-aware buckets, add a per-conversation tool ceiling, and run the chaos test before a tenant runs it for you.

Read more field notes, explore our services, or get in touch at info@bipi.in.