Online evals

Live experiments that measure model changes with real user traffic, often via A/B tests or shadow deployments.
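To make the shadow-deployment pattern concrete, here is a minimal Python sketch, assuming hypothetical production_model and candidate_model clients and a handle_request entry point (none of these names come from the glossary entry). The candidate's output is logged for later comparison but never shown to the user.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("shadow_eval")
executor = ThreadPoolExecutor(max_workers=4)

# Placeholder model clients; substitute your real serving interfaces.
def production_model(prompt: str) -> str:
    return f"[prod] summary of: {prompt}"

def candidate_model(prompt: str) -> str:
    return f"[candidate] summary of: {prompt}"

def handle_request(prompt: str) -> str:
    """Serve the production model; mirror the request to the candidate in shadow mode."""
    response = production_model(prompt)

    def shadow_call() -> None:
        try:
            shadow_response = candidate_model(prompt)
            # Logged for offline comparison; never returned to the user.
            log.info("shadow_eval prompt=%r prod=%r candidate=%r",
                     prompt, response, shadow_response)
        except Exception:
            log.exception("shadow call failed")

    executor.submit(shadow_call)
    return response
```

In an A/B test, by contrast, the candidate's response is actually served to a slice of users, which is why ramp limits and stop conditions matter.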

When to use it

  • After offline evals look good and you need real-world confirmation.
  • Testing changes that affect engagement, conversion, or support load.
  • Validating safety or latency under real traffic patterns.

PM decision impact

Online evals tie model quality to business metrics. PMs choose the guardrails, success metrics, and stop conditions, and must weigh experiment speed against risk by keeping the blast radius limited. Good online evals prevent costly rollbacks and build confidence in AI features.
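One way to make guardrails and stop conditions explicit before launch is to write them down as data the rollout tooling can enforce. A minimal sketch, where the StopCondition structure, metric names, and thresholds are all illustrative assumptions rather than any particular platform's API:

```python
from dataclasses import dataclass

@dataclass
class StopCondition:
    metric: str          # telemetry key reported by the experiment platform
    threshold: float     # limit that triggers an automatic halt
    direction: str       # "max" = halt if above threshold, "min" = halt if below

# Hypothetical guardrails for an AI feature experiment; tune to your product.
GUARDRAILS = [
    StopCondition(metric="error_rate", threshold=0.02, direction="max"),
    StopCondition(metric="p95_latency_ms", threshold=2500, direction="max"),
    StopCondition(metric="safety_flag_rate", threshold=0.001, direction="max"),
    StopCondition(metric="task_success_rate", threshold=0.70, direction="min"),
]

def should_halt(observed: dict[str, float]) -> bool:
    """Return True if any guardrail is breached and the rollout should stop."""
    for g in GUARDRAILS:
        value = observed.get(g.metric)
        if value is None:
            continue
        if g.direction == "max" and value > g.threshold:
            return True
        if g.direction == "min" and value < g.threshold:
            return True
    return False
```

Writing stop conditions as data means they can be reviewed alongside the prompt or model change and enforced automatically rather than negotiated mid-incident.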

How to do it in 2026

Start with a small traffic slice (1–5%) and watch leading indicators (errors, latency, refusals) before business KPIs. Use shadow mode for high-risk changes. In 2026, merge eval telemetry with cost dashboards so rollout decisions account for unit economics, not just accuracy.
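A rough sketch of that ramp logic, assuming a hypothetical next_traffic_fraction helper fed by whatever health checks your monitoring exposes; the stages and boolean inputs are illustrative, not a standard API:

```python
RAMP_STAGES = [0.01, 0.05, 0.10, 0.50, 1.00]  # fraction of traffic exposed

def next_traffic_fraction(current: float,
                          leading_ok: bool,
                          business_ok: bool,
                          margin_ok: bool) -> float:
    """Advance one ramp stage only when leading indicators,
    business KPIs, and unit economics all look healthy."""
    if not leading_ok:
        return 0.0          # halt immediately on error/latency/refusal regressions
    if not (business_ok and margin_ok):
        return current      # hold at the current slice and gather more data
    for stage in RAMP_STAGES:
        if stage > current:
            return stage    # move to the next ramp stage
    return current          # already at full rollout

# e.g. next_traffic_fraction(0.05, True, True, True) -> 0.10
```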

Example

A new summarization prompt shows +4% task success and -12% latency in a 10% A/B. No safety regressions observed; rollout progresses to 50% the same day.
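Before ramping on a result like this, it helps to confirm the lift is larger than noise. A minimal sketch of a two-proportion z-test on task success, with illustrative sample counts that are not taken from the example above:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> float:
    """Return the two-sided p-value for a difference in success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative counts only: control 70% vs. treatment 74% task success.
p_value = two_proportion_z_test(success_a=1400, n_a=2000,
                                success_b=1480, n_b=2000)
print(f"p-value ~ {p_value:.4f}")  # a small p-value supports ramping further
```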

Common mistakes

  • Skipping blast-radius limits and exposing all users at once.
  • Measuring only clicks or engagement without correctness checks.
  • Ignoring cost impacts, leading to negative margins at scale.

Last updated: February 2, 2026