LLM-as-a-judge

Using a model to score another model’s outputs against criteria, often faster and cheaper than human labeling.
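
At its simplest, the judge is a second model call that receives the task, the candidate answer, a rubric, and ideally a reference answer, and returns a score. A minimal sketch, assuming a `call_llm` callable you supply from whatever model client the team already uses; the rubric text and 1–5 scale are illustrative, not prescribed.

```python
from typing import Callable

JUDGE_RUBRIC = """Score the candidate answer from 1 (poor) to 5 (excellent) on:
- factual accuracy against the reference answer
- completeness relative to the question
- absence of unsupported claims
Return only the integer score."""

def judge_score(
    question: str,
    answer: str,
    reference: str,
    call_llm: Callable[[str], str],  # plug in the team's own model client here
) -> int:
    """Ask a judge model to grade one answer against a rubric and a reference."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Score:"
    )
    raw = call_llm(prompt)
    return int(raw.strip())  # assumes the judge follows the output format
```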

When to use it

  • You need large-scale evals but lack matching human labeling capacity.
  • You are comparing prompt or model variants where relative ranking matters more than absolute scores.
  • You are monitoring for drift daily on a limited budget.

PM decision impact

Model judges speed up experimentation but can introduce bias or false confidence. PMs must calibrate judges against human labels and monitor them for drift. Judges cut ops costs, but high-risk areas still need human spot checks (see the sketch below).
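
One cheap way to keep those spot checks in place is to route a fixed share of cases to human review each cycle, oversampling the high-risk ones. A sketch under the assumption that each eval record already carries a `risk` tag; the sampling rates are placeholders.

```python
import random

def spot_check_sample(records: list[dict], high_risk_rate: float = 0.2,
                      baseline_rate: float = 0.02, seed: int = 7) -> list[dict]:
    """Pick records for human review: oversample high-risk cases and
    lightly sample the rest to catch judge drift elsewhere."""
    rng = random.Random(seed)
    picked = []
    for rec in records:
        rate = high_risk_rate if rec.get("risk") == "high" else baseline_rate
        if rng.random() < rate:
            picked.append(rec)
    return picked
```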

How to do it in 2026

Define clear rubrics and provide reference answers. Calibrate judge outputs against human scores on a sampled set; adjust prompts or switch to a more reliable judge model where agreement is low. In 2026, the common pattern is to ensemble judges (a cheap model first, escalating ties to a specialist judge) to balance cost and fidelity; a minimal routing sketch follows.
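
A sketch of that routing, assuming pairwise comparisons and two judge callables you supply (a cheap one and a specialist), each returning "A", "B", or "tie".

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

def ensemble_judge(
    question: str,
    answer_a: str,
    answer_b: str,
    cheap_judge: Callable[[str, str, str], Verdict],
    specialist_judge: Callable[[str, str, str], Verdict],
) -> Verdict:
    """Send every comparison to the cheap judge; escalate only ties
    to the pricier specialist."""
    verdict = cheap_judge(question, answer_a, answer_b)
    if verdict == "tie":
        # Only pay for the specialist when the cheap judge cannot separate them.
        verdict = specialist_judge(question, answer_a, answer_b)
    return verdict
```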

Example

A 300-case eval uses an LLM judge whose scores show a 0.86 Spearman correlation with human labels. That lets the team test five prompt variants overnight and pick a winner with 5% higher quality at the same cost.
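
The 0.86 figure is the output of a calibration check like the one below; a sketch assuming scipy is installed and that judge and human scores for the same sampled cases sit in two parallel lists (the numbers here are made up).

```python
from scipy.stats import spearmanr

# Illustrative paired scores on the same sampled cases. Real runs would load
# these from the eval harness; a few hundred pairs give a more stable estimate.
human_scores = [5, 3, 4, 2, 5, 1, 4, 3, 2, 5]
judge_scores = [4, 3, 5, 2, 5, 2, 4, 3, 1, 5]

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Judge-human Spearman correlation: {rho:.2f} (p={p_value:.3f})")
# Compare rho against the bar the team has set (0.86 in the example above);
# if it drops, revisit the rubric or the judge model before trusting new rankings.
```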

Common mistakes

  • Blindly trusting judge scores without calibration.
  • Using the same model family for generation and judging, which inflates scores through self-preference.
  • Not updating rubrics as product requirements change.

Last updated: February 2, 2026