BIPI

Detecting LLM Hallucinations in Production

AI Security

Hallucination is the failure mode that erodes user trust faster than any other in production LLM systems. Detection is hard, but a combination of techniques can catch the worst cases before they reach users.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 7, 2024 · 7 min read

#llm #hallucination #ai-reliability

Hallucination is the term of art for an LLM producing content that is fluent, confident, and incorrect. In production systems, hallucinations cause customer support agents to invent return policies, legal assistants to cite imaginary case law, and clinical decision support tools to recommend non-existent medications. Detection in production is harder than in evaluation because you do not have ground truth at inference time.

Categories of hallucination

Three categories matter for production detection. Intrinsic hallucinations contradict the source material the model was supposed to use. Extrinsic hallucinations are claims that go beyond the source material into invention. Faithfulness failures occur when the model produces output that does not address the user's actual question while sounding like it did.

Each category needs a different detection strategy. A single hallucination metric is not adequate for any production system that handles meaningful risk.

Uncertainty estimation

Token-level log probabilities, when available, give a rough proxy for model confidence. OpenAI exposes logprobs through the chat completions API. Anthropic's Claude API does not expose token logprobs directly, but you can estimate uncertainty through sampling consistency: generate the response several times at temperature 0.7 and measure agreement across the samples. High disagreement is a strong signal of uncertainty regardless of the model's stated confidence.

  • Calculate average token log probability and flag responses below threshold
  • Run self-consistency sampling (3 to 5 generations) for high-stakes outputs
  • Use the variance between samples as a confidence proxy
  • Calibrate thresholds per task type, not globally
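
A minimal sketch of the first two items, assuming the OpenAI Python SDK (v1.x); the model name, the logprob threshold, and the difflib-based agreement score are illustrative assumptions, not calibrated values.

```python
from itertools import combinations
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

def mean_logprob_flag(messages, model="gpt-4-turbo", threshold=-0.5):
    """Flag a response whose average token log probability falls below a threshold."""
    resp = client.chat.completions.create(model=model, messages=messages, logprobs=True)
    tokens = resp.choices[0].logprobs.content
    avg = sum(t.logprob for t in tokens) / len(tokens)
    return resp.choices[0].message.content, avg, avg < threshold

def self_consistency(messages, model="gpt-4-turbo", n=5, temperature=0.7):
    """Sample n completions and use pairwise disagreement as an uncertainty proxy."""
    samples = [
        client.chat.completions.create(
            model=model, messages=messages, temperature=temperature
        ).choices[0].message.content
        for _ in range(n)
    ]
    # Crude agreement: mean pairwise string similarity. Embedding or claim-level
    # comparison is more robust; this only illustrates the shape of the check.
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(samples, 2)]
    return samples, sum(sims) / len(sims)
```

Low average logprob or low pairwise agreement should route a response toward a caveat or review path rather than blocking it outright.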

Retrieval grounding checks

For RAG systems, every claim in the output should map to a retrieved chunk. The grounding check runs a secondary model (or the same model in a verification role) over the output and the retrieved context, asking whether each claim is supported. The Self-RAG paper from 2023 formalized this, but the basic approach works without specialized models.

Practical implementations split the output into atomic claims, run a verifier against each claim, and surface the unsupported claims either to the user as caveats or to a human reviewer for high-risk applications.
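
A hedged sketch of that pipeline: sentence splitting stands in for real atomic-claim extraction (often its own model call), and the SUPPORTED/UNSUPPORTED verifier prompt and model name are assumptions about how you phrase the check, not a fixed API.

```python
import re
from openai import OpenAI

client = OpenAI()

VERIFIER_PROMPT = """Context:
{context}

Claim: {claim}

Answer SUPPORTED if the claim is fully backed by the context, otherwise UNSUPPORTED."""

def unsupported_claims(answer: str, context: str, model: str = "gpt-4-turbo"):
    # Naive claim extraction: one claim per sentence.
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for claim in claims:
        verdict = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": VERIFIER_PROMPT.format(context=context, claim=claim)}],
            temperature=0,
        ).choices[0].message.content.strip().upper()
        if not verdict.startswith("SUPPORTED"):
            flagged.append(claim)
    return flagged  # surface as caveats, or route to human review for high-risk use
```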

Fact verification against authoritative sources

For domains with authoritative knowledge bases (medical drug interactions, legal citations, product catalogs), verify factual claims against the canonical source after generation. The LLM is the drafter, not the source of truth. A claim about a drug interaction goes through a lookup against an FDA database. A legal citation goes through a check against the case law database. A product reference goes through a SKU validation.

This is slower and more expensive than pure generation, but eliminates the worst category of hallucinations: confident assertions about checkable facts.
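
A minimal sketch of the drafter-plus-verifier split; the claim types and in-memory lookup tables are hypothetical stand-ins for whatever canonical store you actually own (an FDA interaction dataset, a case-law index, a product catalog).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    kind: str      # e.g. "drug_interaction", "citation", "sku"
    value: tuple   # e.g. ("warfarin", "ibuprofen") or ("SKU-1234",)

def verify(claim: Claim, catalogs: dict) -> bool:
    """Accept a claim only if it exists in the relevant canonical source."""
    lookup = catalogs.get(claim.kind)
    return lookup is not None and claim.value in lookup

# Usage: claims are extracted from the draft (by regex or a tagging model), and
# anything that fails verification is stripped, rewritten, or sent for review.
catalogs = {
    "sku": {("SKU-1234",), ("SKU-5678",)},
    "drug_interaction": {("warfarin", "ibuprofen")},
}
draft_claims = [Claim("sku", ("SKU-9999",)),
                Claim("drug_interaction", ("warfarin", "ibuprofen"))]
rejected = [c for c in draft_claims if not verify(c, catalogs)]
```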

Metric monitoring in production

  1. Track user feedback signal (thumbs down, follow-up correction, escalation)
  2. Track downstream consequence metrics (refund requests after AI customer support interactions, for example)
  3. Sample 1 to 5 percent of conversations for human review with hallucination tagging
  4. Build per-task accuracy dashboards segmented by model version
  5. Maintain a regression suite of known hallucination cases for model upgrades
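
For item 5, one way the regression suite can look with pytest; the case file format, the forbidden-phrase check, and the generate() placeholder are assumptions about your own stack, not a prescribed harness.

```python
import json
import pytest

with open("hallucination_cases.json") as f:
    # Each case: {"id": ..., "prompt": ..., "forbidden": ["30-day refund", ...]}
    CASES = json.load(f)

def generate(prompt: str) -> str:
    """Placeholder: wire this to your production generation path (model + RAG + guards)."""
    raise NotImplementedError

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_known_hallucination_does_not_recur(case):
    output = generate(case["prompt"]).lower()
    for phrase in case["forbidden"]:
        assert phrase.lower() not in output, f"regressed on {case['id']}: {phrase}"
```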

Model-specific behaviors in 2024

GPT-4 hallucinates differently from Claude 3 Opus and Gemini 1.5 Pro. Claude tends to be cautious, preferring to refuse rather than confabulate. GPT-4 hallucinates with high fluency, which makes its output more dangerous because it sounds authoritative. Gemini 1.5 Pro shows a different failure pattern in long-context use: it can lose track of earlier instructions deep into the context window. Tune your detection thresholds per model and re-tune on every model upgrade.
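
An illustrative per-model threshold map (the numbers are invented); the point is that thresholds live in configuration keyed by model and version, so an upgrade forces an explicit re-calibration instead of silently reusing old values.

```python
# Illustrative values only; calibrate against your own labeled data per task.
DETECTION_THRESHOLDS = {
    "gpt-4-turbo-2024-04-09": {"min_avg_logprob": -0.45, "min_agreement": 0.80},
    "claude-3-opus-20240229": {"min_agreement": 0.75},  # no token logprobs exposed
    "gemini-1.5-pro":         {"min_agreement": 0.70},
}
```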

What to tell users

Be honest. Surface uncertainty when you detect it. Refusing to answer is better than confidently lying. Inline citations build trust. A graceful fallback to human review for low-confidence outputs is a feature, not a defect. Users handle uncertainty well when it is communicated. They abandon products that hallucinate confidently.

Read more field notes, explore our services, or get in touch at info@bipi.in.