
Agent Error Recovery: Retries, Circuit Breakers, and Human Handoff

Agentic AI

LLM agents fail in stranger ways than traditional services: hallucinated tool names, malformed JSON, transient model overload. We document the recovery patterns we run in production and where each one breaks.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 13, 2024 · 7 min read

#ai-agents · #error-handling · #reliability

Last quarter we shipped an agent that called a calendar API. On day three it started invoking a tool named create_meeting_v2. That tool did not exist. The model had hallucinated a version suffix that matched a deprecated internal name from training data. Our recovery layer caught it, surfaced the closest valid tool, and the agent self-corrected on the next turn. Without that layer the agent would have looped.

The five failure classes we see

After eighteen months of running agents for clients in fintech and healthtech, we sort errors into five buckets. Each one needs a different recovery strategy:

  1. Hallucinated tool names: model invents a tool that does not exist. Recovery: fuzzy match against the registered tool list, return a structured error with the closest three matches.
  2. Malformed JSON arguments: model emits valid JSON but with missing required fields or wrong types. Recovery: return a Pydantic validation error verbatim to the model (see the sketch after this list).
  3. Transient model errors: 529 overload, gateway timeouts, regional outages. Recovery: exponential backoff with jitter, fall back to a cheaper model if appropriate.
  4. Tool execution failures: downstream API returns 500, database timeout, expired credentials. Recovery: bounded retry, then surface to agent with explicit failure context.
  5. Logical loops: agent calls the same tool with the same arguments three times. Recovery: detect and force-terminate the turn.
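For bucket 2, the cheapest recovery is to let the validator do the talking. A minimal sketch, assuming Pydantic and a hypothetical CreateMeetingArgs schema for the calendar tool from the opening anecdote:

```python
from pydantic import BaseModel, ValidationError

class CreateMeetingArgs(BaseModel):
    # Hypothetical schema for illustration.
    title: str
    start_iso: str
    duration_minutes: int

def validate_tool_args(raw_args: dict) -> str | None:
    """Return None if the arguments validate, else the error text to feed back."""
    try:
        CreateMeetingArgs(**raw_args)
        return None
    except ValidationError as exc:
        # Return Pydantic's message verbatim: it names the missing field
        # and the expected type, which is exactly what the model needs
        # to self-correct on the next turn.
        return str(exc)
```

The point of returning the text verbatim is that the field names in the error match the schema the model was shown, so no paraphrasing layer can garble them.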

Why string-matching tool names is fragile

Our first version did exact string matching on tool names; anything unrecognised raised an exception. We learned the hard way that models drift, especially across provider updates. A Claude 3.5 to Claude 3.7 swap changed how the model formatted tool names in roughly 2 percent of completions. We now run a Levenshtein distance check with a threshold of 3, plus a semantic check using embeddings. Anything within either threshold gets auto-corrected with a log line. Anything outside both gets surfaced to the agent as a structured error.
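A sketch of the distance half of that check, with an illustrative tool registry; the embedding-based semantic check is provider-specific, so it is left out here:

```python
REGISTERED_TOOLS = ["create_meeting", "list_meetings", "cancel_meeting"]  # illustrative

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def resolve_tool_name(requested: str, threshold: int = 3):
    """Exact hit, auto-correction within the threshold, or the closest three."""
    if requested in REGISTERED_TOOLS:
        return requested, []
    ranked = sorted(REGISTERED_TOOLS, key=lambda t: levenshtein(requested, t))
    if levenshtein(requested, ranked[0]) <= threshold:
        return ranked[0], []       # auto-correct, with a log line in production
    return None, ranked[:3]        # structured error: closest three matches
```

The create_meeting_v2 hallucination from the opening sits at edit distance 3 from create_meeting, so this check auto-corrects it.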

Retry budgets, not retry counts

The classic retry-3-times-with-exponential-backoff pattern is wrong for agents. The agent is already in a loop. If we retry a failing tool three times internally and the agent then retries the same tool three more times in its own loop, we end up with nine attempts and a confused model. We moved to a single retry budget per agent turn: total retries across all tools cannot exceed five. When the budget is exhausted, we hand the failure to the agent with a clear message and stop intervening.

Retries inside an agent loop are not free. They cost tokens, they cost latency, and they teach the model that flaky tools are normal.
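A sketch of the budget, shared across every tool call in a turn. TransientToolError and BudgetExhausted are hypothetical names, and the backoff constants are illustrative:

```python
import random
import time

class TransientToolError(Exception): ...
class BudgetExhausted(Exception): ...

class RetryBudget:
    """One budget per agent turn, shared across all tools in that turn."""
    def __init__(self, max_retries: int = 5):
        self.remaining = max_retries

    def call(self, fn, *args, **kwargs):
        attempt = 0
        while True:
            try:
                return fn(*args, **kwargs)
            except TransientToolError as exc:
                if self.remaining == 0:
                    # Budget spent: hand the failure to the agent and stop intervening.
                    raise BudgetExhausted(str(exc)) from exc
                self.remaining -= 1
                attempt += 1
                time.sleep(random.uniform(0, min(2 ** attempt, 8)))  # backoff with full jitter
```

Because the budget object lives for the whole turn, a flaky tool cannot silently burn nine attempts across nested loops.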

Circuit breakers per tool

We run a Hystrix-style circuit breaker per tool, not per agent. If a tool fails more than 30 percent of calls in a rolling 60-second window, the breaker opens and we return a synthetic error to the agent without ever calling the downstream. The synthetic error includes the breaker state so the agent can route around the broken tool. This pattern saved us during a Stripe partial outage last March: agents auto-routed to the manual refund workflow within 90 seconds of the breaker tripping.
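A rough sketch of the per-tool breaker. The 30 percent ratio and 60-second window come from above; the min_calls floor is an assumption added here so a single failure on a quiet tool does not trip it:

```python
import time
from collections import deque

class ToolBreaker:
    def __init__(self, window_s: float = 60.0, max_failure_ratio: float = 0.30,
                 min_calls: int = 10):  # min_calls is an assumption, not from the post
        self.window_s = window_s
        self.max_failure_ratio = max_failure_ratio
        self.min_calls = min_calls
        self.calls = deque()  # (monotonic timestamp, succeeded: bool)

    def _trim(self) -> None:
        cutoff = time.monotonic() - self.window_s
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()

    def record(self, succeeded: bool) -> None:
        self.calls.append((time.monotonic(), succeeded))

    def is_open(self) -> bool:
        self._trim()
        if len(self.calls) < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.calls if not ok)
        return failures / len(self.calls) > self.max_failure_ratio

def call_with_breaker(breaker: ToolBreaker, fn, *args, **kwargs):
    if breaker.is_open():
        # Synthetic error, downstream never called. Exposing the breaker
        # state lets the agent route around the broken tool.
        return {"error": "tool_unavailable", "breaker": "open"}
    try:
        result = fn(*args, **kwargs)
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        raise
```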

When to escalate to a human

We escalate to a human queue when any of these conditions hold: the retry budget is exhausted, the user's intent is unparsable after two clarification attempts, the agent attempts a write operation that exceeds a tenant-defined risk threshold, or the conversation contains an explicit safety flag. The escalation payload includes the full tool trace and the model's last reasoning trace so the human is not starting from scratch.
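Those four conditions translate almost directly into a predicate. TurnState and its field names are hypothetical; risk_threshold is the tenant-defined value:

```python
from dataclasses import dataclass, field

@dataclass
class TurnState:
    retry_budget_exhausted: bool = False
    intent_parsed: bool = True
    clarification_attempts: int = 0
    write_risk_score: float = 0.0
    safety_flag: bool = False
    tool_trace: list = field(default_factory=list)
    last_reasoning: str = ""

def should_escalate(state: TurnState, risk_threshold: float) -> bool:
    return (state.retry_budget_exhausted
            or (not state.intent_parsed and state.clarification_attempts >= 2)
            or state.write_risk_score > risk_threshold
            or state.safety_flag)

def escalation_payload(state: TurnState) -> dict:
    # Full tool trace plus the model's last reasoning trace, so the
    # human is not starting from scratch.
    return {"tool_trace": state.tool_trace, "reasoning": state.last_reasoning}
```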

What we still get wrong

Detecting logical loops is the hardest problem. A loop is sometimes legitimate: the agent is polling for a result. Sometimes it is broken: the agent is repeatedly asking for confirmation. We use a heuristic that looks at semantic similarity of recent tool calls. False positive rate is around 4 percent. We are evaluating a small classifier trained on labelled traces to do better.
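The heuristic looks roughly like this. embed stands in for whatever embedding call is available; the 0.95 similarity threshold and the window of two prior calls (three near-identical calls in total) are illustrative:

```python
from collections import deque

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

class LoopDetector:
    """Flags the turn when recent tool calls are semantic near-duplicates."""
    def __init__(self, embed, window: int = 2, threshold: float = 0.95):
        self.embed = embed                  # embed(text) -> list[float]
        self.recent = deque(maxlen=window)  # vectors of prior calls
        self.threshold = threshold

    def observe(self, tool_name: str, args_json: str) -> bool:
        vec = self.embed(f"{tool_name} {args_json}")
        looping = (len(self.recent) == self.recent.maxlen and
                   all(cosine(vec, prior) >= self.threshold for prior in self.recent))
        self.recent.append(vec)
        return looping  # True -> force-terminate the turn
```

This layer is exactly where the 4 percent false positives come from: a legitimate polling loop and a broken confirmation loop look identical at the level of near-duplicate tool calls.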

Build the recovery layer before you ship the agent, not after the first production incident. The cost of retrofitting it is at least 4x the cost of building it in from day one, and your users remember the broken interactions for far longer than the working ones.

Read more field notes, explore our services, or get in touch at info@bipi.in.