
LLM-as-Judge in 2026: Where It Agrees with Humans, Where It Lies


LLM judges have known biases (position, length, self-preference) that show up in production evals. We share calibration techniques, ensemble setups, and the cases where rubric-based scoring beats free-form judgment.

By Arjun Raghavan, Security & Systems Lead, BIPI · May 14, 2024 · 8 min read

#llm · #evaluation · #judges

A media client ran an A/B test between two assistant variants using an LLM judge. Variant B won by a comfortable 14 points. Two weeks later, customer satisfaction told the opposite story. We went back to the eval: the judge had favoured B because B's responses were 30 percent longer, and it carried a length bias the team had never tested for.

LLM-as-judge is the most useful evaluation tool we have for open-ended outputs. It is also the most easily fooled. Treat it like a measurement instrument: characterise its biases, calibrate it, and double-check the cases where stakes are high.

The biases you will see

Across our last 18 evaluation projects, four biases showed up consistently enough to plan around.

  • Position bias: the option presented first (or last, depending on the model) gets a small but real preference boost.
  • Length bias: longer answers score higher even when the longer parts are filler.
  • Self-preference: a model judging its own outputs against another's tends to score itself higher. This is well-documented across families.
  • Style bias: judges prefer answers in their own characteristic style, including markdown formatting, list-heavy structure, and certain politeness markers.

The size of these biases varies by model and task. We measure them on every engagement. Skipping that step is how you ship the media client's mistake.

Calibration that takes a day, saves a quarter

Before trusting any judge, run two calibration experiments.

  1. Position swap: run every pairwise comparison twice, with order flipped. If the judge changes its answer in more than 5 percent of cases, position bias is material. Apply ordering randomisation in production evals.
  2. Length-controlled pairs: hand-construct pairs where the worse answer is longer. Measure how often the judge prefers the longer one. If above 60 percent, the judge cannot be trusted on free-form length-varying outputs without controls. A sketch of both experiments follows this list.
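A minimal sketch of both experiments, assuming a judge(prompt, answer_a, answer_b) callable that returns "A" or "B"; the interface, data shapes, and thresholds are illustrative, not a specific judge API.

    import random

    def position_flip_rate(judge, pairs):
        """Experiment 1: run each comparison twice with the order swapped.

        pairs is a list of (prompt, answer_a, answer_b) tuples; judge is
        any callable returning "A" or "B" (placeholder interface).
        Returns the fraction of pairs where the verdict flipped with order.
        """
        flips = 0
        for prompt, a, b in pairs:
            first = judge(prompt, a, b)            # a shown first
            second = judge(prompt, b, a)           # b shown first
            # Map the swapped verdict back to the original labels.
            second_unswapped = "A" if second == "B" else "B"
            if first != second_unswapped:
                flips += 1
        return flips / len(pairs)

    def length_bias_rate(judge, controlled_pairs):
        """Experiment 2: hand-built pairs where the worse answer is longer.

        controlled_pairs is a list of (prompt, good_short, bad_long)
        tuples. Returns how often the judge prefers the longer, worse
        answer. Order is randomised so position bias does not leak in.
        """
        prefers_long = 0
        for prompt, good, bad in controlled_pairs:
            if random.random() < 0.5:
                verdict, longer = judge(prompt, good, bad), "B"
            else:
                verdict, longer = judge(prompt, bad, good), "A"
            if verdict == longer:
                prefers_long += 1
        return prefers_long / len(controlled_pairs)

The same order randomisation in the second function is the control you carry into production pairwise evals.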

On a financial services engagement, this calibration saved a launch. The proposed judge had a 71 percent length bias. We swapped to a rubric-based judge that scored each answer against fixed criteria, independent of the alternative. Bias dropped to manageable levels and the eval became defensible to the compliance team.

Rubric-based scoring beats free-form for high-stakes calls

Free-form judgment (which is better, A or B) is fast and cheap. It is also the most bias-prone setup. Rubric-based scoring, where the judge scores each answer against a fixed checklist, removes most of the comparative biases at the cost of more tokens per eval.
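A sketch of the rubric pattern; the criteria, weights, and the score_with_llm helper are placeholders for whatever judge call and checklist you actually use.

    # Illustrative rubric: fixed criteria, each scored without seeing
    # any alternative answer, then aggregated. Not a recommended set.
    RUBRIC = [
        ("accuracy",     "Are all factual claims supported by the source?", 0.4),
        ("completeness", "Does the answer address every part of the question?", 0.3),
        ("clarity",      "Is the answer clear and free of filler?", 0.3),
    ]

    def rubric_score(score_with_llm, prompt, answer):
        """Score one answer against the fixed rubric; no comparison involved.

        score_with_llm(criterion_question, prompt, answer) is assumed to
        return an integer 1-5 for that criterion. With weights summing
        to 1, the result stays in the 1-5 range.
        """
        return sum(
            weight * score_with_llm(question, prompt, answer)
            for _, question, weight in RUBRIC
        )

Because each answer is scored alone, there is no "longer than the alternative" signal for comparative length bias to latch onto; absolute verbosity can still leak into individual criteria, which is why the illustrative clarity criterion penalises filler.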

Our rule: if the eval will inform a launch decision or a compensation discussion, use rubrics. If it is a daily quality signal on a stable feature, free-form pairwise is acceptable, with order randomisation.

When to spend the money on humans

LLM judges are not a replacement for human evaluation in three cases. First, when the task requires domain expertise the model lacks (medical diagnosis, legal nuance, niche compliance). Second, when the bar is subjective in a way that humans agree on but models do not capture (brand voice, taste). Third, when you need defensibility for regulators or executives who will not accept model scores.

Even then, the right pattern is layered: LLM judges run nightly across thousands of cases, humans run weekly across a stratified sample of a few hundred. The LLM signal flags candidates. Humans confirm. This is roughly 5 to 10 percent of the cost of all-human evals with comparable reliability for most tasks we have benchmarked.
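One way to wire the layering; the score buckets and per-bucket count are illustrative, and judge_score stands in for whatever your nightly judge emits.

    import random
    from collections import defaultdict

    def stratified_human_sample(judged_cases, per_bucket=100, seed=0):
        """Pick the weekly human-review sample from nightly judge output.

        judged_cases is a list of dicts with at least a "judge_score"
        key (1-5 here for illustration). Stratifying by score means
        humans see failures and borderline cases, not a uniform draw
        dominated by easy passes.
        """
        rng = random.Random(seed)
        buckets = defaultdict(list)
        for case in judged_cases:
            if case["judge_score"] <= 2:
                buckets["flagged"].append(case)     # judge says bad: confirm
            elif case["judge_score"] == 3:
                buckets["borderline"].append(case)  # judge unsure: most informative
            else:
                buckets["pass"].append(case)        # spot-check for judge misses
        sample = []
        for bucket in buckets.values():
            sample.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
        return sample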

Agreement with humans, in numbers

  • 82-91%: frontier judge agreement with majority human label, summarisation
  • 67-78%: the same metric on long-form code review tasks
  • 54-63%: the same metric on subjective tone and brand voice

Numbers are from a mix of public benchmarks and our own engagement work. The pattern: agreement is high enough for routine quality monitoring, not high enough to bet a roadmap on without spot-checking. Use judges as a fast, biased instrument that you have characterised. That is how every measurement instrument has always worked. LLMs are not different; they just feel like they should be.

Read more field notes, explore our services, or get in touch at info@bipi.in.