Build LLM Cost Controls Before the Bill Surprises You.
Agentic AI
Most LLM cost incidents are not bugs. They are the system working as designed, with no rate limits, no per-tenant budgets, and a runaway loop in production. Here is the cost-control stack we ship by default.
By Arjun Raghavan, Security & Systems Lead, BIPI · April 12, 2026 · 7 min read
Three of our last five LLM-platform engagements started with a cost incident. Not a bug, not an outage, just a bill that was 4 to 12 times larger than the projection. In every case the team had built the LLM integration the way they would build any external API call: it works, it is in production, and there are no limits on it.
LLM APIs are not normal external APIs. The cost per call varies by 1000x depending on input length. There is no quota that fails closed. A bug that retries on a long context can multiply your monthly bill by 100x in a weekend. If you are building anything more than a prototype, cost controls are not optional.
The four-layer cost control stack
The pattern we build by default has four layers. Each catches a different class of cost incident.
Layer 1: Per-request token cap
Every LLM call goes through a single client wrapper. The wrapper enforces a token cap on both input and output (the provider's max_tokens parameter only limits output; the input cap is a count you check yourself before sending). If the input exceeds the cap, the request fails immediately, before it reaches the provider. This is the first line of defence against runaway prompts: someone wrote a function that concatenates a database row into the prompt, the row turned out to be 200K tokens of user-uploaded text, and the call would have cost $30. The cap turns it into a 400 error you can fix.
We typically set this at 32K input, 4K output for most use cases. Specific endpoints raise it explicitly with documentation.
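Here is a minimal sketch of that wrapper in Python. The provider client and its complete() call are placeholders for whatever SDK you actually use, and the token count is a rough characters-over-four heuristic; swap in the model's real tokenizer.

```python
# Layer 1 sketch: hard caps on input and output tokens, enforced before the
# request leaves your service. `provider_client.complete` is a placeholder.
MAX_INPUT_TOKENS = 32_000
MAX_OUTPUT_TOKENS = 4_000

class TokenCapExceeded(Exception):
    """Maps to a 400 at the API edge: the prompt is too large to send."""

def estimate_tokens(text: str) -> int:
    # Crude heuristic (~4 chars per token); use the model's tokenizer in production.
    return len(text) // 4

def capped_completion(provider_client, prompt: str,
                      max_input: int = MAX_INPUT_TOKENS,
                      max_output: int = MAX_OUTPUT_TOKENS):
    n = estimate_tokens(prompt)
    if n > max_input:
        # Fail fast, before the request ever reaches the provider.
        raise TokenCapExceeded(f"prompt is ~{n} tokens, cap is {max_input}")
    return provider_client.complete(prompt=prompt, max_tokens=max_output)
```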
Layer 2: Per-user / per-tenant rate limit
Token-based rate limit, not request-based. A user who sends 100 short prompts per minute is fine. A user who sends one prompt per minute that is 300K tokens is not. Most rate-limiting libraries count requests by default; for LLM use cases you need to count input tokens.
The implementation is a Redis sliding window keyed on (user_id, model). Every request increments by its input token count. The limit is per minute and per day. Users see a 429 with a clear error before they hit the cost ceiling.
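A simplified version of that check, using redis-py. For brevity this uses fixed per-minute and per-day counters rather than a true sliding window, and the limits are illustrative:

```python
# Layer 2 sketch: token-based rate limiting keyed on (user_id, model).
import time
import redis

r = redis.Redis()

# (window in seconds, input-token limit per window) -- illustrative numbers.
LIMITS = {"minute": (60, 60_000), "day": (86_400, 2_000_000)}

class RateLimited(Exception):
    """Maps to a 429 at the API edge."""

def check_token_rate_limit(user_id: str, model: str, input_tokens: int) -> None:
    now = int(time.time())
    for name, (window, limit) in LIMITS.items():
        bucket = now // window
        key = f"tokens:{user_id}:{model}:{name}:{bucket}"
        pipe = r.pipeline()
        pipe.incrby(key, input_tokens)   # count tokens, not requests
        pipe.expire(key, window * 2)     # let old buckets age out on their own
        used, _ = pipe.execute()
        if used > limit:
            raise RateLimited(f"{name} token limit of {limit} exceeded for {user_id}")
```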
Most cost incidents are one customer with a runaway loop. Per-tenant rate limiting turns that into one customer's problem instead of yours.
Layer 3: Daily budget with hard cutoff
Each tenant has a USD-denominated budget per day. The wrapper tracks the cumulative cost. When a request would push them over budget, it returns 402 Payment Required. This catches the case where a single tenant is doing legitimately heavy work that just costs more than expected, before it impacts the platform's overall margin.
For internal use cases (engineering team using a coding assistant, support team using a summariser), the budget is per-user or per-team. The principle is the same: someone has to opt into a higher number, with a paper trail.
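The budget check has the same shape. Prices, budgets, and the default below are placeholders, and a production version would make the read-and-increment atomic:

```python
# Layer 3 sketch: USD-denominated daily budget per tenant, hard cutoff.
import time
import redis

r = redis.Redis()

PRICE_PER_M_INPUT = {"small-model": 0.25, "large-model": 3.00}  # USD, illustrative
DAILY_BUDGET_USD = {"acme-corp": 50.00}                         # per-tenant config

class BudgetExceeded(Exception):
    """Maps to a 402 Payment Required at the API edge."""

def check_daily_budget(tenant_id: str, model: str, input_tokens: int) -> None:
    est_cost = input_tokens / 1_000_000 * PRICE_PER_M_INPUT[model]
    key = f"spend:{tenant_id}:{time.strftime('%Y-%m-%d')}"
    spent = float(r.get(key) or 0.0)
    budget = DAILY_BUDGET_USD.get(tenant_id, 10.00)  # conservative default
    if spent + est_cost > budget:
        raise BudgetExceeded(f"{tenant_id} would exceed ${budget:.2f} for today")
    r.incrbyfloat(key, est_cost)   # record the spend we are about to incur
    r.expire(key, 2 * 86_400)
```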
Layer 4: Provider-side spend alerts
OpenAI, Anthropic, and Google all support spend alerts and hard limits at the API key level. These are the last resort: if everything else has failed, the provider will refuse the call. Set them. The defaults are usually 'no limit, alert at $10K/month,' which is too late.
We split keys per environment (dev, staging, prod) and per tenant tier (free, pro, enterprise). Each key has its own limit. A bug in dev cannot eat the prod budget.
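The key split itself can be as simple as a naming convention in your deployment config. Environment-variable names here are illustrative:

```python
# Layer 4 sketch: one provider key per (environment, tenant tier), each with
# its own spend limit configured in the provider's dashboard.
import os

def provider_api_key(environment: str, tier: str) -> str:
    # e.g. LLM_API_KEY_PROD_ENTERPRISE, LLM_API_KEY_DEV_FREE
    return os.environ[f"LLM_API_KEY_{environment.upper()}_{tier.upper()}"]
```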
Things that look like cost controls but are not
- Cheaper model fallback. 'If the cheap model fails, use the expensive one' often turns into 'cheap model returns garbage that fails validation, falls back to expensive model on every request, bill triples.' Validate the cheap model's output and cap how often fallback is allowed to fire; see the sketch after this list.
- Caching alone. Caching helps amortise legitimate workload. It does not stop a runaway loop where the cache key changes every iteration.
- Per-feature flags. 'We disabled the AI feature when the bill spiked' is incident response, not cost control. By the time the human is in the loop, the bill has happened.
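On the first point, the fix is to treat fallback as a budgeted event rather than a default path. A rough sketch, with cheap_call, expensive_call, and is_valid standing in for your own functions:

```python
# Guarded fallback sketch: validate the cheap model's output, and cap how many
# times per window the expensive model is allowed to rescue it.
MAX_FALLBACKS_PER_MINUTE = 20
_fallbacks_this_minute = 0   # in practice, track this in Redis like Layer 2

def answer(prompt: str, cheap_call, expensive_call, is_valid) -> str:
    global _fallbacks_this_minute
    result = cheap_call(prompt)
    if is_valid(result):
        return result
    if _fallbacks_this_minute >= MAX_FALLBACKS_PER_MINUTE:
        # Fail loudly instead of silently tripling the bill.
        raise RuntimeError("fallback budget exhausted; cheap model is likely broken")
    _fallbacks_this_minute += 1
    return expensive_call(prompt)
```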
What the bill actually looks like
A real example from a client incident: an autocomplete feature called the LLM on every keystroke (debounced poorly), with the user's full document as context. The average document was 8K tokens. 200 users, 4 hours of typing each, 60 keystrokes per minute, and a debounce that still fired 12 times per minute on average. Total: roughly 576,000 calls and ~4.6 billion input tokens in one workday. At $3 per million tokens for the model in use, that is about $13,800 for the day. They had no budget cap. They found out three days later when they looked at their invoice.
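The back-of-the-envelope arithmetic, for anyone who wants to check it:

```python
# Incident arithmetic from the paragraph above.
users, hours, fires_per_min, doc_tokens = 200, 4, 12, 8_000
calls = users * hours * 60 * fires_per_min      # 576,000 LLM calls
tokens = calls * doc_tokens                      # ~4.6 billion input tokens
cost = tokens / 1_000_000 * 3.00                 # ~$13,800 at $3 per million
print(f"{calls:,} calls, {tokens:,} tokens, ${cost:,.0f}")
```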
All four layers above would have caught it: a per-endpoint request cap would have rejected the payload (an autocomplete call has no business carrying the user's full 8K-token document); the per-user token rate limit would have throttled each user within minutes; the daily budget would have cut the tenant off inside the first hour; and the provider-side alert would have fired at many times the normal daily spend.
Closing
LLM costs are not a 'we'll get to that later' problem. By the time you have a cost incident, the money is gone. Build the four-layer stack on day one. The defensive code is a few hundred lines. The alternative is an invoice that arrives before you know you have a bug.
Read more field notes, explore our services, or get in touch at info@bipi.in.