Using a model to score another model’s outputs against criteria, often faster and cheaper than human labeling.
Model judges speed up experimentation but can introduce bias or false confidence, so PMs must calibrate them against human labels and monitor for drift. Judges cut ops costs, but high-risk areas still need human spot checks.
Define clear rubrics and provide reference answers. Calibrate judge outputs against human scores on a sampled set, then adjust prompts or switch to a more trustworthy judge model as needed. In 2026, the common pattern is ensemble judging, a cheap model scoring everything with a specialist judge re-scoring ties, to balance cost and fidelity (see the sketch below).
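A minimal sketch of that ensemble pattern, assuming a 1-5 rubric and a `call_model(model_name, prompt)` wrapper around whatever LLM API you use; the model names, rubric text, and tie band are illustrative assumptions, not a specific vendor's recommendation.

```python
# Sketch: ensemble judging. A cheap judge scores every case, and a
# specialist judge re-scores "ties" (ambiguous mid-range scores).
# `call_model` is a placeholder for your actual LLM API call.
from typing import Callable

RUBRIC = """Score the ANSWER against the REFERENCE on a 1-5 scale:
5 = fully correct and complete, 3 = partially correct, 1 = wrong.
Reply with the number only."""

def judge(call_model: Callable[[str, str], str],
          model: str, question: str, answer: str, reference: str) -> int:
    prompt = (f"{RUBRIC}\n\nQUESTION: {question}\n"
              f"REFERENCE: {reference}\nANSWER: {answer}")
    reply = call_model(model, prompt)
    return int(reply.strip()[0])  # assumes the judge complies with "number only"

def ensemble_judge(call_model: Callable[[str, str], str],
                   question: str, answer: str, reference: str,
                   cheap: str = "cheap-judge",            # placeholder name
                   specialist: str = "specialist-judge",  # placeholder name
                   tie_band: tuple = (2, 4)) -> int:
    """Route every case to the cheap judge; escalate ambiguous scores."""
    score = judge(call_model, cheap, question, answer, reference)
    if tie_band[0] <= score <= tie_band[1]:  # mid-range score: escalate
        score = judge(call_model, specialist, question, answer, reference)
    return score
```

Confident 1s and 5s stay with the cheap model, so the specialist only runs on the hard middle band; widening or narrowing `tie_band` trades cost against fidelity.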
A 300-case eval uses an LLM judge aligned to human labels (0.86 Spearman correlation). This lets the team test five prompt variants overnight and pick a winner that delivers +5% quality at the same cost.
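That alignment figure can be checked directly: score a sampled set with both humans and the judge, then measure rank agreement. A minimal sketch using scipy.stats.spearmanr; the score arrays are stand-ins for real eval data.

```python
# Sketch: check judge-human agreement on a labeled calibration sample.
# The scores below are stand-in data; load them from your eval run.
from scipy.stats import spearmanr

human_scores = [5, 3, 4, 1, 2, 5, 4, 3, 2, 4]   # human labels (1-5)
judge_scores = [5, 3, 5, 1, 2, 4, 4, 3, 1, 4]   # LLM judge on the same cases

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# A common practice (an assumption, not a universal rule): only trust the
# judge for unattended runs once rho clears a pre-agreed threshold such as
# ~0.8; otherwise revise the rubric or swap judge models and re-calibrate.
```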