BIPI

Agent Observability: What to Log, What to Alert On, What to Skip

Agentic AI

Generic OpenTelemetry will not tell you why your agent is misbehaving. We map the agent-specific signals that matter, the alerts that earn their pages, and where the current vendor landscape actually helps.

By Arjun Raghavan, Security & Systems Lead, BIPI · May 11, 2024 · 7 min read

#ai-agents · #observability · #telemetry

Most teams come to us with traces and dashboards already. They have OpenTelemetry, a metrics backend, structured logs. The agent still goes off the rails and nobody knows why for two days. The infrastructure is fine. The signals are wrong.

Agent observability is not just APM with a different logo. It needs signals that map to how agents actually fail: prompt drift, tool misuse, context bloat, plan thrash. The vendors get some of this right; none get all of it.

The signals that matter

Across two dozen agent deployments, these are the signals that have earned their place on every dashboard.

  • Tokens per request, broken down by step (system prompt, user, tool definitions, tool results, output).
  • Tool call count per task, with distribution percentiles.
  • Prompt version per request. Yes, version. If you cannot say which prompt produced which trace, you cannot debug.
  • Latency per step, not per request. The bottleneck is almost never where you think.
  • Tool error rate per tool, separated from model error rate.
  • Self-loop detection: same tool plus same args twice in a row.
  • Plan revision count: how many times did the agent rewrite its plan in this trace.
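As a rough starting point, here is what a per-step record covering those signals might look like. This is a sketch, not a standard schema: the field names are our own convention, and `emit` just writes a structured log line for whatever backend you already run.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional
import json
import time


@dataclass
class AgentStepRecord:
    # Field names are illustrative, not a standard schema.
    trace_id: str
    step_index: int
    step_kind: str                        # "model_call", "tool_call", "plan", ...
    prompt_version: str                   # hash of the prompt template in use
    tokens_system: int = 0
    tokens_user: int = 0
    tokens_tool_defs: int = 0
    tokens_tool_results: int = 0
    tokens_output: int = 0
    tool_name: Optional[str] = None
    tool_args_hash: Optional[str] = None  # enables self-loop detection
    tool_error: bool = False
    model_error: bool = False
    plan_revised: bool = False
    latency_ms: float = 0.0
    ts: float = field(default_factory=time.time)


def emit(record: AgentStepRecord) -> None:
    # One structured log line per step; ship it to whatever backend you run.
    print(json.dumps(asdict(record)))
```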

If your dashboard does not have prompt version on it, fix that first. We have spent days chasing regressions that turned out to be a prompt change three commits earlier. Tagging the trace with the prompt hash takes 20 minutes.
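A minimal sketch of that tagging, assuming a standard OpenTelemetry Python setup. The span name and attribute keys here are our own conventions, not official OTel semantic conventions.

```python
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("agent")


def prompt_hash(prompt_template: str) -> str:
    # Short, stable identifier for the exact prompt text in use.
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]


def run_step(prompt_template: str, user_input: str) -> None:
    with tracer.start_as_current_span("agent.step") as span:
        span.set_attribute("prompt.version", prompt_hash(prompt_template))
        span.set_attribute("prompt.length_chars", len(prompt_template))
        # ... build the messages, call the model, record the result ...
```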

Alerts worth paging on

Most teams over-alert and under-act. Here is our short list of alerts that have earned their pages; the first two checks are sketched in code below.

  1. Tool call repetition above a threshold within a single trace. Strong signal of stuck loops.
  2. Token spend per user per hour exceeds a budget. Cost control and abuse detection in one.
  3. Completion rate drop greater than 10 percent over a 24-hour rolling window.
  4. p95 step count doubled compared to the prior 7-day baseline.
  5. Any tool call with elevated risk classification, surfaced as a low-priority ticket per occurrence.

Note what is not on the list: latency. Agent latency varies wildly by task. Alerting on latency in isolation generates noise. Alert on completion rate and step count instead.
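For reference, a rough sketch of the first two checks, run over per-step records like the one sketched earlier. The thresholds, field names, and the hourly usage map are placeholders to tune per deployment, not a drop-in rule set.

```python
from collections import defaultdict
from typing import Iterable


def repeated_tool_calls(steps: Iterable[dict], threshold: int = 3) -> bool:
    """True if the same (tool, args) pair ran `threshold` times in a row."""
    streak, previous = 0, None
    for step in steps:
        key = (step.get("tool_name"), step.get("tool_args_hash"))
        if key[0] is None:              # not a tool call; reset the streak
            streak, previous = 0, None
            continue
        streak = streak + 1 if key == previous else 1
        previous = key
        if streak >= threshold:
            return True
    return False


def over_token_budget(usage_this_hour: defaultdict, user_id: str,
                      tokens_spent: int, budget: int = 500_000) -> bool:
    """Accumulate token spend per user within the current hour window."""
    usage_this_hour[user_id] += tokens_spent
    return usage_this_hour[user_id] > budget
```

The usage map is a `defaultdict(int)` reset on the hour, and the 500k budget is an arbitrary placeholder; set it from your actual cost ceiling.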

The vendor landscape, briefly and bluntly

We have shipped agent telemetry through Langfuse, Helicone, Arize Phoenix, and home-grown OpenTelemetry pipelines. Each has a sweet spot.

  • Langfuse: best of the bunch for prompt versioning, dataset evals tied to traces, and self-hosting if you have data residency requirements.
  • Helicone: lightest integration, drop-in proxy, fine if your needs are mostly cost and latency tracking.
  • Phoenix: strong on retrieval evaluation and embedding drift, weaker on tool-centric agent flows.
  • Roll your own on OpenTelemetry: works but expect three to six weeks of build to match what you would get from Langfuse out of the box.

Pick based on what the team will actually use. The best telemetry stack is the one your engineers open every day, not the most feature-complete one.

What we strip out, every time

Most homegrown agent dashboards have 30 charts and tell you nothing. We aggressively prune. The five charts that survive on every engagement: completion rate, tokens per task, tool calls per task, latency by step, and top failure modes by category. Everything else lives a layer down, available on click-through, not on the home view.
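If it helps, here is roughly how the headline numbers fall out of per-task summaries. The field names are again illustrative, and the latency-by-step breakdown comes from per-step records rather than task summaries, so it is left out here.

```python
from collections import Counter
from statistics import mean


def dashboard_numbers(tasks: list[dict]) -> dict:
    # `tasks` is one summary dict per completed or failed agent task.
    if not tasks:
        return {}
    failed = [t for t in tasks if t["status"] != "completed"]
    return {
        "completion_rate": 1 - len(failed) / len(tasks),
        "tokens_per_task": mean(t["total_tokens"] for t in tasks),
        "tool_calls_per_task": mean(t["tool_calls"] for t in tasks),
        # latency by step is built from per-step records, not shown here
        "top_failure_modes": Counter(
            t["failure_mode"] for t in failed
        ).most_common(5),
    }
```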

Observability is not about volume of data. It is about how quickly an engineer can answer: is something wrong, and if so, where? Build for that and the rest takes care of itself.

Read more field notes, explore our services, or get in touch at info@bipi.in.