Debugging Agent Loops Without Losing Your Weekend
Agent loops fail in ways that traditional debuggers cannot reach. We share the playbook our team uses for trace replay, intermediate state capture, and behavior diffing across model versions.
By Arjun Raghavan, Security & Systems Lead, BIPI · May 5, 2024 · 7 min read
An ops lead pinged us at 11pm on a Saturday. Their procurement agent had spent four hours rewriting the same purchase order, calling the same approval tool 87 times before timing out. The logs showed the tool returning success each time. The agent did not seem to believe the success.
Welcome to debugging agent loops. The tooling most teams reach for, breakpoints and println, is mostly useless here. The bug is rarely in your code. It is in the interaction between your prompt, the model's reasoning, your tool responses, and accumulated context.
Capture everything, regret nothing
First investment, before any framework choice: capture the entire trace. That means the system prompt, the user message, every tool definition passed in, every tool call argument, every tool response, every model output including reasoning content if available, token counts per step, and timestamps. Persist it as a tree, not a flat list.
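Concretely, a stored trace can be as simple as a tree of step records hanging off the run's static inputs. The sketch below is one minimal way to shape it in Python; the field names are our own illustration, not any framework's schema.

```python
# Illustrative trace-node sketch; field names are assumptions, not a standard.
from dataclasses import dataclass, field
from typing import Any, Optional
import time
import uuid


@dataclass
class TraceStep:
    """One node in the trace tree: a model turn, a tool call, or a tool response."""
    kind: str                          # "model_output" | "tool_call" | "tool_response"
    payload: dict[str, Any]            # raw content: message, arguments, response body
    tokens: Optional[int] = None       # token count for this step, if known
    timestamp: float = field(default_factory=time.time)
    step_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    children: list["TraceStep"] = field(default_factory=list)  # sub-steps, e.g. parallel tool calls


@dataclass
class Trace:
    """The whole run: static inputs at the root, steps as a tree beneath it."""
    system_prompt: str
    user_message: str
    tool_definitions: list[dict[str, Any]]
    model: str                         # the pinned snapshot, not just the family
    root_steps: list[TraceStep] = field(default_factory=list)
```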
When the procurement agent failed, the only reason we solved it before sunrise was that the team had a trace store. We replayed step 12 into step 13 and saw the model interpret a 200 OK as a soft error because the tool's natural-language description included the phrase "will return success but may require retry." The model took it literally.
Trace replay as a first-class workflow
Trace replay means you can take a stored trace, swap one component (a tool response, a prompt, a model version), and rerun deterministically up to a chosen step. Without it, you are guessing. With it, you are running experiments.
- Replay with the same model and inputs to confirm reproduction.
- Replay swapping the bad tool response for a corrected one to confirm the loop unsticks.
- Replay with a different model to see if the failure is model-specific.
- Replay with a tweaked prompt to test the fix before deploying.
We build replay harnesses early in every engagement now. Two days of investment saves weeks of guesswork.
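The core of a harness is small. The sketch below builds on the trace shape above and assumes a generic chat-completion style client; the function and parameter names are illustrative, not any particular SDK's API.

```python
# Replay sketch; builds on the Trace/TraceStep shapes above. Names are
# illustrative, not a specific framework's API.
from typing import Callable, Optional


def replay(
    trace: Trace,
    until_step: int,
    call_model: Callable[[dict], dict],                         # live model call past the cutoff
    override_tool_responses: Optional[dict[str, str]] = None,   # step_id -> replacement response
    override_system_prompt: Optional[str] = None,
    override_model: Optional[str] = None,
) -> dict:
    """Rebuild the conversation from the stored trace up to `until_step`,
    swap in any overrides, then hand the context to a live model call."""
    messages = [
        {"role": "system", "content": override_system_prompt or trace.system_prompt},
        {"role": "user", "content": trace.user_message},
    ]
    overrides = override_tool_responses or {}

    # Steps before the cutoff come verbatim from the store (minus overrides),
    # so the model never runs for them and the prefix is fully deterministic.
    for step in trace.root_steps[:until_step]:
        if step.kind == "tool_response" and step.step_id in overrides:
            messages.append({"role": "tool", "content": overrides[step.step_id]})
        else:
            messages.append(step.payload)

    # From the cutoff onward, run the (possibly different) model live on the
    # reconstructed context and return its output for comparison.
    return call_model({
        "model": override_model or trace.model,
        "messages": messages,
        "tools": trace.tool_definitions,
        "temperature": 0,   # debug runs at temperature zero
    })
```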
Behavior diffing across versions
The other technique that pays off: behavior diffing. When you change a prompt, a tool description, or a model, run the change against a saved corpus of traces. Compare tool call counts, completion rates, and final outputs. Even with non-determinism, regressions usually show up as statistical shifts.
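One workable shape for the diff job, assuming each run in the corpus has already been reduced to a few counters; the metric names and the 10 percent tolerance are placeholders you would tune per task type.

```python
# Illustrative behavior-diff sketch: compare summary metrics across two corpora of runs.
from statistics import mean


def summarize(runs: list[dict]) -> dict[str, float]:
    """Reduce a corpus of runs to the handful of numbers worth comparing."""
    return {
        "completion_rate": mean(1.0 if r["completed"] else 0.0 for r in runs),
        "avg_tool_calls": mean(r["tool_call_count"] for r in runs),
        "avg_steps": mean(r["step_count"] for r in runs),
    }


def diff(baseline: list[dict], candidate: list[dict], tolerance: float = 0.10) -> list[str]:
    """Flag any metric that shifts by more than `tolerance` relative to baseline."""
    base, cand = summarize(baseline), summarize(candidate)
    regressions = []
    for metric, base_value in base.items():
        if base_value == 0:
            continue
        shift = (cand[metric] - base_value) / base_value
        if abs(shift) > tolerance:
            regressions.append(f"{metric}: {base_value:.2f} -> {cand[metric]:.2f} ({shift:+.0%})")
    return regressions
```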
On one engagement we caught a 30 percent regression in task completion the day a vendor silently rotated their model snapshot. Our nightly diff job flagged it. The team had been debugging unrelated symptoms for a week.
Determinism where you can get it
You will not get bit-exact determinism from a frontier model. You can get a lot closer than you think. Pin the model snapshot, not the family. Set temperature to zero for debug runs even if production runs warmer. Capture seeds where the provider exposes them. Use the same tool response cache during replay. None of this gives you C-style reproducibility. It gives you enough that two runs of the same trace usually agree on the macro behavior, which is what you need to compare fixes.
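In code, the pinning amounts to a small, boring config plus a cache. A sketch, assuming an OpenAI-style request dict and a plain in-process cache; the snapshot name and seed value are placeholders.

```python
# Sketch of a debug-run configuration; the snapshot name and seed are placeholders,
# and the tool cache is a plain dict keyed by (tool name, serialized args).
import json

DEBUG_MODEL = "provider/model-2024-05-01"   # pin the snapshot, not the family
DEBUG_PARAMS = {
    "model": DEBUG_MODEL,
    "temperature": 0,        # debug runs at zero even if production runs warmer
    "seed": 12345,           # only honored where the provider exposes seeds
}

_tool_cache: dict[tuple[str, str], str] = {}


def cached_tool_call(name: str, args: dict, live_call) -> str:
    """During replay, reuse the stored response for identical tool calls
    so the only source of variation left is the model itself."""
    key = (name, json.dumps(args, sort_keys=True))
    if key not in _tool_cache:
        _tool_cache[key] = live_call(name, args)
    return _tool_cache[key]
```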
What we put on the dashboard
- Steps per task, percentile distribution, alert when p95 doubles.
- Tool call repetition rate within a single trace, alert above a threshold.
- Self-loop detection: same tool plus same args twice in a row (sketched below).
- Completion rate by task type, with a 7-day moving average.
- Token cost per task, broken down by step.
These five signals catch most agent regressions before users notice. The procurement client now spots loops at step five instead of step 87. The Saturday-night pages stopped two weeks after we shipped the dashboard.
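For reference, the repetition and self-loop checks from the list are a few lines each. This sketch assumes each trace yields an ordered list of tool calls with a name and an args dict.

```python
# Repetition and self-loop checks; the tool-call shape is an assumption.
import json


def _signature(call: dict) -> tuple[str, str]:
    return (call["name"], json.dumps(call["args"], sort_keys=True))


def has_self_loop(tool_calls: list[dict]) -> bool:
    """True if the same tool is called with the same args twice in a row."""
    previous = None
    for call in tool_calls:
        sig = _signature(call)
        if sig == previous:
            return True
        previous = sig
    return False


def repetition_rate(tool_calls: list[dict]) -> float:
    """Share of calls in a trace that exactly repeat an earlier call."""
    seen: set[tuple[str, str]] = set()
    repeats = 0
    for call in tool_calls:
        sig = _signature(call)
        repeats += sig in seen
        seen.add(sig)
    return repeats / len(tool_calls) if tool_calls else 0.0
```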
Debugging agents is a separate discipline from debugging code. Treat it as one. Build the trace store, the replay harness, the diff job, and the dashboard. The first incident pays for all of it.
Read more field notes, explore our services, or get in touch at info@bipi.in.