Live experiments that measure the impact of model changes on real user traffic, often via A/B tests or shadow deployments.
Online evals tie quality to business metrics. PMs choose guardrails, success metrics, and stop conditions, and must weigh experiment speed against risk while keeping the blast radius limited. Good online evals prevent costly rollbacks and build confidence in AI features.
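A minimal sketch of how those choices might be written down, assuming a hypothetical in-house spec; the class name, metric names, and thresholds are illustrative, not from any particular eval platform:

```python
from dataclasses import dataclass

@dataclass
class OnlineEvalSpec:
    """Hypothetical experiment spec agreed on before any traffic is exposed."""
    name: str
    success_metric: str                 # business metric the change must move
    guardrails: dict[str, float]        # leading indicator -> worst acceptable value
    stop_conditions: dict[str, float]   # metric -> level that halts the experiment immediately
    max_traffic_pct: float = 5.0        # blast radius: cap on exposed traffic

# Illustrative spec for a summarization prompt change.
summarizer_v2 = OnlineEvalSpec(
    name="summarizer-prompt-v2",
    success_metric="task_success_rate",
    guardrails={"p95_latency_ms": 2500, "refusal_rate": 0.03},
    stop_conditions={"error_rate": 0.02, "safety_flag_rate": 0.001},
    max_traffic_pct=5.0,
)
```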
Start with small traffic (1–5%), monitor leading indicators (errors, latency, refusals) first, then business KPIs. Use shadow mode for high-risk changes. In 2026, merge eval telemetry with cost dashboards so decisions include unit economics, not just accuracy.
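One way to encode that ramp logic, reusing the hypothetical spec above; the metric names, cost parameters, and thresholds are assumptions for illustration, not a definitive implementation:

```python
def ramp_decision(spec: OnlineEvalSpec,
                  treatment: dict[str, float],
                  control: dict[str, float],
                  cost_per_request: float,
                  cost_budget: float) -> str:
    """Return 'rollback', 'hold', or 'ramp' for the next traffic step."""
    # Stop conditions trip an immediate rollback.
    for metric, limit in spec.stop_conditions.items():
        if treatment.get(metric, 0.0) > limit:
            return "rollback"
    # Leading-indicator guardrails must hold before any ramp.
    for metric, limit in spec.guardrails.items():
        if treatment.get(metric, 0.0) > limit:
            return "hold"
    # Unit economics: don't ramp a change that blows the cost budget.
    if cost_per_request > cost_budget:
        return "hold"
    # The business KPI must beat control before exposure increases.
    if treatment[spec.success_metric] <= control[spec.success_metric]:
        return "hold"
    return "ramp"
```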
A new summarization prompt shows +4% task success and -12% latency in a 10% A/B test. No safety regressions observed; rollout progresses to 50% the same day.
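Running the same sketch with numbers that mirror this scenario (the absolute values are illustrative) yields a "ramp" decision:

```python
# Metrics mirroring the scenario above (absolute values are illustrative).
control = {"task_success_rate": 0.62, "p95_latency_ms": 2300,
           "refusal_rate": 0.01, "error_rate": 0.004}
treatment = {"task_success_rate": 0.645, "p95_latency_ms": 2024,   # ~+4% success, -12% latency
             "refusal_rate": 0.01, "error_rate": 0.004, "safety_flag_rate": 0.0}

print(ramp_decision(summarizer_v2, treatment, control,
                    cost_per_request=0.0031, cost_budget=0.004))   # -> "ramp"
```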