LLM Evaluation Beyond Vibes: Building an Eval Harness for Production
Agentic AI
'It feels right' is the most common evaluation method we audit. Sometimes that's enough. Often it isn't, and the failure shows up six weeks later when behavior drifts. Real evaluation is a test harness — automated, regression-safe, runs on every prompt change.
By Arjun Raghavan, Security & Systems Lead, BIPI · June 12, 2025 · 8 min read
'It feels right' is the most common LLM evaluation method we encounter when auditing AI systems. The team writes a prompt, tests three or four queries by hand, declares the system works, ships it. Sometimes that is enough. Often it is not, and the failure mode shows up six weeks later when the model provider quietly updates the underlying weights, or someone tweaks the prompt for an unrelated reason and an entire class of queries silently breaks.
Real evaluation is a test harness. Automated, regression-safe, runs on every prompt change. Here is the shape we deploy on production AI systems.
Why human evaluation does not scale
Humans are expensive, slow, and inconsistent. A typical prompt-engineering iteration touches the system twice an hour. If each touch needs three humans to manually grade fifty test cases, you have hired a lab. Most teams have not.
Worse, the things humans grade well (factual accuracy, tone, completeness) are also the things that drift slowly. By the time a human evaluator notices that quality has degraded, you are weeks behind the regression that caused it.
The four evaluation tiers
We layer evaluation in four tiers from cheap to expensive.
- Deterministic checks. The output must contain a specific JSON shape, must include a citation, must respond in under 800 tokens, must not contain a forbidden phrase. Pure string and structural matching. Runs in milliseconds. Catches the 60% of regressions that are 'the model stopped emitting valid JSON.'
- Embedding similarity. Compute the embedding of the model's output, compare against the embedding of a known-good answer. Above a threshold = pass. Catches 'the answer is semantically the same as before' regressions. Cheap.
- LLM-as-judge. A second LLM grades the output against a rubric. Useful for graded dimensions (helpfulness 1-5, factuality 1-5). Use a different model than the one being graded to avoid self-preference bias. Costs roughly the same as the original generation.
- Human spot-check. The 5-10% of test cases where the grade was ambiguous. A human reviews these weekly. Disagreements between human and LLM-as-judge feed back into rubric refinement.
LLM-as-judge: when it works
LLM-as-judge is the highest-leverage tier. It scales like deterministic tests but covers the 'is this actually a good answer' dimensions. It also has a known failure mode: the judge model agrees with itself. If you grade GPT-4 outputs with GPT-4, the judge is biased toward outputs that look like its own.
Mitigations: use a different model family for the judge (e.g., grade Claude with GPT-4 or vice versa). Provide explicit rubrics with examples of good and bad answers. Force the judge to cite specific phrases from the output that justify its score. Calibrate the judge against human-graded examples quarterly.
Test set design
The eval set determines what your system optimizes for. Get the set wrong and you get a system that passes evals but fails in production.
We build test sets in three layers. Golden cases: 50-100 hand-curated queries with hand-curated expected answers, covering the full spectrum of intents. These never change. Adversarial cases: queries designed to break the system — ambiguous inputs, prompt injections, malformed data, edge cases from production logs. Production samples: actual production queries (anonymised), sampled weekly. The system needs to handle real distribution, not just the cases the team imagined.
What we ship
Our default deployment is a CI step that runs evals on every prompt change.
- Eval set lives in source control as JSON. Each entry has input, expected_structure, expected_concepts, and metadata.
- Pull request opens. CI pulls the candidate prompt, runs all eval cases against the model.
- Tier-1 (deterministic) failures block the merge.
- Tier-2 (embedding) failures with similarity below threshold are surfaced as warnings.
- Tier-3 (LLM judge) outputs are summarized as a delta against main: helpfulness +0.3, factuality -0.1, etc.
- Tier-4 (human review) is queued for the weekly synthesis meeting.
- Production deploys log every model output to the eval store, so the test set is always growing from real traffic.
Closing
LLM systems without an eval harness ship on faith. The drift is silent. By the time someone notices, weeks of regressions have stacked. Build the harness. Make every prompt change pass it. Make every production failure end up as a new test. The discipline is the same as software testing — the tools are different, but the principle is identical.
Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.