Prompt Caching Cost Math: When It Pays and When It Burns
Agentic AI
Anthropic's prompt caching can save 90 percent of input cost or burn 25 percent extra if your patterns are wrong. The math, the failure modes, and when to enable it.
By Arjun Raghavan, Security & Systems Lead, BIPI · March 25, 2026 · 6 min read
Prompt caching is one of those features that looks like a free 90 percent discount and turns out to be more nuanced. Used correctly on the right workload, it does deliver the discount. Use it carelessly on the wrong workload and the bookkeeping cost (the initial cache write is 25 percent more expensive than a normal call) eats the savings.
We have moved enough production workloads onto prompt caching now to know the math. Here is the framework we use to decide whether to turn it on.
How prompt caching actually works
When you mark a prefix of your prompt as cacheable, the model provider stores the KV-cache state at the end of that prefix. Subsequent requests with the same prefix skip recomputation, which is where the discount comes from. The cache lives for a TTL: Anthropic's default is 5 minutes, and longer TTLs are available at a higher cache-write price.
The prefix has to be byte-identical for the cache to hit. One whitespace difference, one updated date in the system prompt, one tool description tweak — cache miss, full price, plus a 25 percent surcharge to write the new cache.
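Here is roughly what that looks like with the Anthropic Python SDK. Treat it as a sketch: the model name, file path, and `max_tokens` value are placeholders, but the `cache_control` marker on the system block is the actual mechanism, and everything up to and including that block is the cacheable prefix.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The static prefix: a long system prompt reused across many calls.
# Everything up to and including the block carrying cache_control is cached;
# one changed byte anywhere in it means a miss and a fresh cache write.
STATIC_SYSTEM_PROMPT = open("system_prompt.txt").read()  # placeholder path

response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder: use whatever model you deploy
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # end of the cacheable prefix
        }
    ],
    messages=[{"role": "user", "content": "What changed in the Q3 report?"}],
)

# The usage object tells you whether this call wrote the cache or read it.
print(response.usage)
```

That usage object is also what feeds the hit-rate monitoring discussed further down.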
When caching saves money
Three patterns are clear wins.
- Long static prompts hit many times. A 50,000-token system prompt that is byte-identical on every call across thousands of users. Caching saves 90 percent of input cost on the static portion; the sketch after this list runs the numbers.
- Document chat. The user asks repeated questions about the same uploaded document. Cache the document (it is byte-identical across questions). Per-question cost drops to the new question's tokens plus the discounted cache read on the document.
- Few-shot examples that do not change. The 20 carefully crafted examples stay identical for the lifetime of the deployment, so steady traffic keeps them warm in the cache. Only the new user query incurs full pricing.
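To see why the first pattern is such a clear win, here is the back-of-the-envelope math as a sketch. The 1.25x write and 0.1x read multipliers are Anthropic's published rates for the 5-minute TTL; the per-million-token price is a placeholder, and the sketch assumes the cache stays warm between calls.

```python
# Rough savings for a long static prefix, assuming write = 1.25x and
# read = 0.1x the base input rate, and a warm cache after the first call.
PREFIX_TOKENS = 50_000
CALLS = 10_000
BASE_PRICE_PER_MTOK = 3.00  # placeholder: $/million input tokens for your model

def cost(tokens: int, multiplier: float) -> float:
    return tokens / 1_000_000 * BASE_PRICE_PER_MTOK * multiplier

uncached = CALLS * cost(PREFIX_TOKENS, 1.0)
cached = cost(PREFIX_TOKENS, 1.25) + (CALLS - 1) * cost(PREFIX_TOKENS, 0.1)

print(f"uncached: ${uncached:,.2f}")              # $1,500.00
print(f"cached:   ${cached:,.2f}")                # about $150
print(f"saving:   {1 - cached / uncached:.1%}")   # roughly 90%
```

The absolute dollars move with the model's rate; the ratio is what matters.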
When caching costs more
Three patterns lose money.
- Low cache hit rate. If your prefix changes per request (timestamps, dynamic context, A/B prompt variants), every call is a cache write, and the cached portion costs about 25 percent more than not caching at all (break-even math in the sketch after this list).
- Very short prompts. Caching has overhead, and a 200-token prompt is cheaper to recompute than to manage cache state. Below roughly 1,024 tokens, do not bother; most models will not cache a prefix that short anyway.
- Traffic bursty enough to evict the cache. If a deployment averages one call every 10 minutes, the 5-minute TTL means most calls are cache misses anyway. Compare the typical gap between calls to the TTL before turning it on.
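The first failure mode is just the same multipliers run backwards. Here is the expected cost of the cached portion as a function of hit rate, again as a sketch assuming the 1.25x write and 0.1x read multipliers; swap in your provider's numbers.

```python
# Expected cost of the cacheable prefix with caching on, per unit of uncached
# cost, as a function of hit rate. Multipliers assume the 5-minute TTL rates.
def cached_cost_ratio(hit_rate: float, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of the cached prefix relative to sending it uncached."""
    return hit_rate * read_mult + (1.0 - hit_rate) * write_mult

for h in (0.0, 0.2, 0.4, 0.6, 0.8, 0.95):
    print(f"hit rate {h:.0%}: {cached_cost_ratio(h):.2f}x the uncached cost")
```

With those multipliers the raw break-even sits a little above a 20 percent hit rate; the operational thresholds in the next section leave margin on top of that.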
Cache hit rate as a SLO
Once caching is enabled, treat the cache hit rate as an operational metric. Below 60 percent, caching is barely paying for itself. Below 40 percent, it is costing you money. Providers report cache reads and writes in each response's usage metadata; compute the hit rate from those counts and alert on it the way you alert on error rate.
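A sketch of how you might derive that hit rate from Anthropic's responses. It assumes the usage block carries `cache_read_input_tokens` and `cache_creation_input_tokens` alongside the regular `input_tokens`; the aggregation and the 60 percent threshold are ours, not part of any SDK.

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    """Running token counts pulled from each response's usage block."""
    read: int = 0      # cache_read_input_tokens: served from cache at the discounted rate
    written: int = 0   # cache_creation_input_tokens: cache writes at the surcharged rate
    uncached: int = 0  # input_tokens: everything outside the cached prefix, full price

    def record(self, usage) -> None:
        # assumption: `usage` is the Messages API usage object; the cache fields
        # may be absent or None when caching is not in play for a request
        self.read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.written += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.uncached += usage.input_tokens

    @property
    def hit_rate(self) -> float:
        """Share of cacheable tokens actually served from cache."""
        cacheable = self.read + self.written
        return self.read / cacheable if cacheable else 0.0

# stats = CacheStats()
# stats.record(response.usage)   # call per response, alert when stats.hit_rate < 0.6
```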
The decision framework
- If your prefix is over 4,000 tokens and changes less than once a day: enable caching.
- If your prefix is under 1,000 tokens or changes per-request: do not enable caching.
- Between those, instrument first, decide second. Measure expected hit rate against your traffic shape; the helper after this list rolls these rules into one place.
- If you enable caching, add a hit-rate SLO and alert below 60 percent.
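If it helps, here is the framework above packed into one helper. Every threshold in it comes from this post, not from any provider documentation, so treat them as starting points to tune against your own traffic.

```python
def caching_recommendation(
    prefix_tokens: int,
    prefix_changes_per_request: bool,
    prefix_changes_per_day: float,
    seconds_between_calls: float,
    cache_ttl_seconds: float = 300.0,   # assumption: the default 5-minute TTL
) -> str:
    """Rules of thumb from this post, not provider guidance."""
    if prefix_tokens < 1_000 or prefix_changes_per_request:
        return "skip: prefix too short or rebuilt per request, so it will never hit"
    if seconds_between_calls > cache_ttl_seconds:
        return "skip: typical gap between calls exceeds the TTL, so most calls will miss"
    if prefix_tokens > 4_000 and prefix_changes_per_day < 1:
        return "enable: long, stable prefix with traffic arriving inside the TTL"
    return "instrument first: measure the hit rate for this traffic shape before deciding"
```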
Closing
Prompt caching is real and saves real money in the right workloads. It is also a footgun if you turn it on without checking the math. The teams that benefit are the ones that treat it as a cost-engineering decision, not a feature-flag toggle. Run the numbers, monitor the hit rate, and your inference bill becomes a tunable knob instead of a surprise.
Read more field notes, explore our services, or get in touch at info@bipi.in.