BIPI

Production Tool-Use Failures: What Actually Breaks in Agent Deployments

Agentic AI

We pulled six weeks of production logs from agent deployments across three clients to catalogue how tool calling fails. Schema drift, parameter pollution, and runaway loops dominate. Here is what to instrument before you ship.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 16, 2024 · 8 min read

#ai-agents · #tool-use · #reliability

Six weeks. Three clients. 1.4 million tool invocations. We tagged every failure and sorted them into categories. The results were not what the model documentation predicted. Schema mismatches were the largest single category at 31 percent, followed by parameter pollution at 22 percent, ambiguous tool selection at 17 percent, runaway loops at 11 percent, and the long tail of everything else.

Schema mismatches dominate

  • 31% of failures were schema mismatches
  • 22% were parameter pollution
  • 11% were runaway tool loops
  • 4% were hallucinated tool names

A schema mismatch occurs when the model produces JSON that validates against the declared JSON Schema but does not match the runtime expectation. The classic case: a field declared as a string accepts both date strings and natural language like "next Tuesday". The schema does not catch it; the downstream Postgres insert blows up. We added a normalization layer between the model and every tool that re-validates arguments against stricter Pydantic types. The schema-mismatch failure rate dropped 60 percent in two weeks.
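A minimal sketch of that normalization layer, assuming Pydantic v2 and a hypothetical create_meeting tool whose JSON Schema only declares the date field as a plain string:

```python
from datetime import date as Date

from pydantic import BaseModel, ValidationError


class CreateMeetingArgs(BaseModel):
    title: str
    date: Date  # stricter than the JSON Schema's plain "string"


def normalize_tool_args(raw_args: dict) -> CreateMeetingArgs:
    """Re-validate model-produced arguments before they reach the tool."""
    try:
        return CreateMeetingArgs.model_validate(raw_args)
    except ValidationError as exc:
        # Surface the error to the agent loop instead of letting Postgres blow up.
        raise ValueError(f"create_meeting rejected arguments: {exc}") from exc


# "2024-04-16" coerces to a date; "next Tuesday" is rejected here, not downstream.
normalize_tool_args({"title": "Standup", "date": "2024-04-16"})
```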

Parameter pollution is sneakier than it sounds

Parameter pollution happens when the model passes extra fields the tool did not ask for. OpenAI's tool calling silently drops unknown fields by default. Anthropic's does not. We have logs where Claude passed a metadata field that the function happily ignored, but the model assumed the metadata had been written and continued the conversation as if a side effect had occurred. The agent then reported success to the user when nothing had actually persisted.

Fix: every tool function returns an echo of what was actually written, not just success: true. The model now has ground truth instead of having to infer it.
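A sketch of the echo pattern, using a hypothetical update_user_profile tool; the field names and the db_write call are illustrative, not our real API:

```python
from typing import Any

ALLOWED_FIELDS = {"display_name", "email", "timezone"}


def update_user_profile(user_id: str, **fields: Any) -> dict:
    accepted = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}
    ignored = sorted(set(fields) - ALLOWED_FIELDS)

    db_write(user_id, accepted)  # hypothetical persistence call

    # Echo exactly what was persisted and what was dropped, so the model
    # cannot assume a side effect happened for fields that were never written.
    return {"written": accepted, "ignored": ignored}
```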

Ambiguous tool names cause routing mistakes

We had three tools: search_users, search_user_history, find_user. The model picked between them inconsistently across runs. After renaming to search_users_by_email, get_user_audit_log, and lookup_user_by_id, the routing-error rate fell from 17 percent to under 2 percent. Tool names are not internal API names. They are part of the prompt and the model treats them as documentation.

  • Prefix tool names with the verb (search_, get_, create_, update_, delete_)
  • Include the entity type and the lookup key in the name when it disambiguates
  • Avoid abbreviations the model has not seen in training data
  • Test renames on a held-out eval set before shipping. We have seen renames that fixed one routing problem and created two new ones.
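For concreteness, here is what the renamed declarations look like in the OpenAI-style function schema format; the descriptions are illustrative, and the point is that each name carries the verb, the entity, and the lookup key:

```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_users_by_email",
            "description": "Search user accounts by email address.",
            "parameters": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_user_audit_log",
            "description": "Return the audit log entries for a single user id.",
            "parameters": {
                "type": "object",
                "properties": {"user_id": {"type": "string"}},
                "required": ["user_id"],
            },
        },
    },
]
```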

Runaway loops between tools

Two tools that call back into each other. Tool A returns a reference that the agent passes to tool B. Tool B returns a reference that the agent passes back to tool A. The agent thinks it is making progress. The token bill says otherwise. We saw a single conversation consume 240k tokens in one turn before the per-turn cap fired.

Every agent loop needs a forced stop, not just a soft limit. The model will not detect that it is looping. The infrastructure has to.

Our forced-stop heuristic: if the same tool is called with semantically similar arguments more than three times in a single turn, terminate the turn and return an explicit no-progress error to the user. We use embedding cosine similarity at 0.92 as the threshold. False positives are rare because legitimate retries usually have different arguments.
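A minimal sketch of that check, assuming a hypothetical embed() helper that returns a unit-normalized vector for the serialized arguments; the 0.92 threshold and three-call limit mirror the heuristic above:

```python
import json

import numpy as np

SIMILARITY_THRESHOLD = 0.92
MAX_SIMILAR_CALLS = 3


class NoProgressError(RuntimeError):
    """Raised to force-stop a turn that is looping between tools."""


def check_for_loop(turn_calls: list[dict], tool_name: str, args: dict) -> None:
    new_vec = embed(json.dumps(args, sort_keys=True))  # hypothetical embedder
    similar = 0
    for call in turn_calls:
        if call["tool"] != tool_name:
            continue
        # Cosine similarity; vectors are assumed unit-normalized.
        if float(np.dot(call["vec"], new_vec)) >= SIMILARITY_THRESHOLD:
            similar += 1
    if similar >= MAX_SIMILAR_CALLS:
        raise NoProgressError(
            f"{tool_name} called {similar + 1} times with near-identical arguments"
        )
    turn_calls.append({"tool": tool_name, "vec": new_vec})
```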

Hallucinated tool names are rarer than expected

Hallucinated tool names made up only 4 percent of failures, much lower than the model release notes had implied. Most modern frontier models stay within the declared tool registry. The cases we saw clustered around model upgrades: when we swapped GPT-4o for GPT-4.1, hallucinated tool names briefly spiked then settled. Run a smoke test on every model version bump before promoting to production.
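The smoke test can be as small as a handful of canned prompts asserted against the tool registry. A sketch, assuming a hypothetical run_agent_turn() harness that returns the tool calls a candidate model produced:

```python
REGISTERED_TOOLS = {"search_users_by_email", "get_user_audit_log", "lookup_user_by_id"}

SMOKE_PROMPTS = [
    "Find the account registered under priya@example.com",
    "Show me everything user 4821 changed last week",
]


def test_no_hallucinated_tool_names():
    for prompt in SMOKE_PROMPTS:
        calls = run_agent_turn(prompt, model="candidate")  # hypothetical harness
        for call in calls:
            assert call.tool_name in REGISTERED_TOOLS, (
                f"model called undeclared tool {call.tool_name!r} for {prompt!r}"
            )
```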

What to instrument before launch

We require every agent deployment to emit four traces per tool call: the raw model output before parsing, the parsed arguments, the tool response, and the tool latency. Without all four you cannot diagnose schema drift, parameter pollution, or routing mistakes. We ship to Honeycomb because the wide-event model fits agent traces, but any structured logging backend that supports high-cardinality fields will work.
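A minimal sketch of emitting the four traces as structured log events; the event names and fields are our convention by assumption, not a Honeycomb requirement:

```python
import json
import sys
import time


def emit(event: str, trace_id: str, **fields) -> None:
    # One wide JSON event per line on stdout; default=str keeps odd values loggable.
    print(json.dumps({"event": event, "trace_id": trace_id, **fields}, default=str), file=sys.stdout)


def traced_tool_call(trace_id: str, tool_name: str, raw_output: str, parse, run) -> dict:
    emit("tool.raw_model_output", trace_id, tool=tool_name, raw=raw_output)

    args = parse(raw_output)
    emit("tool.parsed_args", trace_id, tool=tool_name, args=args)

    start = time.monotonic()
    response = run(**args)
    emit("tool.response", trace_id, tool=tool_name, response=response)
    emit("tool.latency_ms", trace_id, tool=tool_name, ms=round((time.monotonic() - start) * 1000, 1))
    return response
```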

The smallest viable observability stack

If you cannot run Honeycomb or Datadog, the minimum we have shipped successfully is: stdout JSON logs with a trace_id per conversation, Loki for storage, Grafana for queries. It costs less than 50 dollars a month for a single-tenant deployment and it has caught real incidents.

Most agent failures are mundane. Fix the boring ones first. Schema validation, tool naming, and bounded retries get you 80 percent of the way to a reliable production agent.

Read more field notes, explore our services, or get in touch at info@bipi.in.