A repeatable pipeline that scores model or agent outputs against test cases and business metrics before and after changes.
The harness lets PMs ship faster with less fear: it defines the bar for quality and catches breaking changes before users do. It also shapes what you measure; if the harness ignores latency or cost, those regressions slip through. Good harnesses shorten release cycles and reduce incident risk.
Build a small, representative golden set per feature, including failure modes and edge cases. Automate runs in CI for every prompt or dependency change. Track accuracy, refusal rate, latency, and cost. In 2026, add grounding checks and PII leak detectors by default, and store historical runs to spot drift.
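A minimal sketch of what such a harness can look like, in Python. Assumptions to be loud about: the golden set is a JSONL file of `{"prompt", "expected"}` pairs, `call_model` is a hypothetical stand-in for your real model or agent entry point, and the refusal markers and PII regexes are illustrative placeholders for proper detectors.

```python
import json
import re
import time
from dataclasses import dataclass
from statistics import mean


def call_model(prompt: str) -> str:
    """Hypothetical entry point; replace with your real model or agent call."""
    return "stub response: " + prompt


# Crude PII gate: emails and US-style phone numbers. A real harness would
# use a dedicated detector, but even a regex catches obvious leaks.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),          # email
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # phone
]

# Illustrative refusal markers; tune to your model's actual refusal style.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")


@dataclass
class Result:
    correct: bool
    refused: bool
    pii_leak: bool
    latency_ms: float


def run_case(case: dict) -> Result:
    """Score a single golden case, timing the model call."""
    start = time.perf_counter()
    output = call_model(case["prompt"])
    latency_ms = (time.perf_counter() - start) * 1000
    return Result(
        correct=case["expected"].lower() in output.lower(),
        refused=output.lower().startswith(REFUSAL_MARKERS),
        pii_leak=any(p.search(output) for p in PII_PATTERNS),
        latency_ms=latency_ms,
    )


def run_harness(golden_path: str) -> dict:
    """Run every case in the golden set and aggregate the tracked metrics."""
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f]
    results = [run_case(c) for c in cases]
    return {
        "accuracy": mean(r.correct for r in results),
        "refusal_rate": mean(r.refused for r in results),
        "pii_leaks": sum(r.pii_leak for r in results),
        "p50_latency_ms": sorted(r.latency_ms for r in results)[len(results) // 2],
    }
```

In CI, the metrics dict from `run_harness` gets written to storage on every run; the historical series is what lets you spot drift rather than only point-in-time failures.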
Before updating the retrieval corpus, the team runs the harness: accuracy is up 8 points, latency is up 90 ms but still under SLA, and there are no new PII leaks. The update ships the same day instead of waiting on manual QA.
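That ship/no-ship decision can itself be automated as a CI gate over the harness output. A sketch, assuming the metrics dict produced by `run_harness` above; the SLA value and the baseline/candidate numbers are illustrative, chosen only to mirror the deltas in the example.

```python
# Hypothetical CI gate: compare a candidate run against the stored
# baseline and fail the build on any regression past its threshold.
SLA_LATENCY_MS = 800          # illustrative SLA, not from the source
MAX_ACCURACY_DROP = 0.0       # any accuracy drop blocks the change


def gate(baseline: dict, candidate: dict) -> list[str]:
    """Return a list of regression reasons; an empty list means ship."""
    failures = []
    if candidate["accuracy"] < baseline["accuracy"] - MAX_ACCURACY_DROP:
        failures.append("accuracy regressed")
    if candidate["p50_latency_ms"] > SLA_LATENCY_MS:
        failures.append("latency over SLA")
    if candidate["pii_leaks"] > baseline["pii_leaks"]:
        failures.append("new PII leaks")
    return failures


# The retrieval-corpus update above passes this gate: accuracy up,
# latency higher but still under SLA, no new leaks.
baseline = {"accuracy": 0.78, "p50_latency_ms": 620, "pii_leaks": 0}
candidate = {"accuracy": 0.86, "p50_latency_ms": 710, "pii_leaks": 0}
assert not gate(baseline, candidate)  # empty list: safe to ship
```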