BIPI
BIPI

Building Eval Harnesses for LLM Apps: Deterministic, Embedding, Judge

AI Security

You cannot ship LLM features safely without an eval harness. We walk through the three layers that matter, deterministic checks, embedding similarity, and judge model scoring, and how to wire them into CI so prompt changes do not silently regress.

By Arjun Raghavan, Security & Systems Lead, BIPI · July 20, 2023 · 11 min read

#evals#testing#llm#ci#mlflow

The first time a prompt change regresses a critical behavior in production, the team realizes evals are not optional. The second time, the team builds the harness. We recommend skipping ahead to the second time.

Three layers, three purposes

  • Deterministic checks, for invariants that must always hold
  • Embedding similarity, for fuzzy semantic matches against gold answers
  • Judge model scoring, for qualitative properties like helpfulness or tone

Deterministic checks

Schema conformance, refusal on disallowed inputs, presence of citations, length bounds, and forbidden phrases. These are fast, free, and catch most regressions. Run them on every commit.

Embedding similarity

Encode the model output and the reference answer, compare with cosine similarity, and threshold. Cheap, robust to paraphrase, and good at catching semantic drift. Useful for fact heavy responses where wording varies but meaning should not.

Judge model scoring

A stronger model scores the output against a rubric. Slower and more expensive, but the only way to evaluate qualitative properties at scale. Calibrate the judge with human labeled examples before trusting it.

Dataset hygiene

  1. Maintain a frozen regression set that never changes
  2. Add a growing exploratory set for new behaviors
  3. Include adversarial examples from red team work
  4. Tag every example with the behavior it covers
  5. Track coverage as you add new prompt features

Wiring into CI

Run the deterministic and embedding layers on every pull request, run the judge layer nightly or on release branches. Block merges on regressions in critical metrics, surface non critical regressions as warnings.

Tooling choices

Weights and Biases and MLflow both track eval runs over time and surface regressions. LangChain has an evaluators module, and Anthropic and OpenAI publish reference rubrics. NeMo Guardrails has a colang based eval flow.

An eval harness is the smallest commitment a team can make to taking LLM quality seriously. Ship one before the second prompt change.

Security oriented evals

Add a suite of injection attempts, jailbreaks, and policy probes. Track block rate over time as a leading indicator. When the rate drops, treat it as a regression even if user facing metrics look fine.

Closing

The harness is your seatbelt. It does not make the model safer, it makes the team faster, because every change can be reviewed against a clear signal instead of a feeling.

Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.