A curated collection of test cases with trusted answers used to judge model quality over time.
Golden sets anchor quality discussions. PMs decide composition (edge cases, critical intents), ownership, and refresh cadence. A too-small or stale set gives false confidence; a well-maintained one reduces incidents and accelerates approvals.
Collect real user queries and failure cases; anonymize and label with expected outcomes and rationales. Keep size lean (50–200 per feature) and refresh monthly. In 2026, semi-automate refresh with sampled production traffic reviewed by humans.
A support bot’s 120-case golden set catches a regression where refunds above $500 are denied incorrectly. Fix ships before rollout, avoiding a spike in tickets and refunds backlog.