Latency budget

The maximum total response time you can spend across model calls, tool invocations, and orchestration while still meeting UX and business goals.

When to use it

  • Designing multi-step AI flows with tools and retrieval.
  • Planning experiments that may add extra model passes.
  • Keeping SLOs for chat or realtime interactions.

PM decision impact

Latency shapes adoption and satisfaction. PMs allocate budget per step, decide where to cache, and choose when to stream, trading quality against speed and cost. Clear budgets prevent scope creep and keep the experience snappy.

How to do it in 2026

Set p95 targets per surface (e.g., 1.5 s for chat, 4 s for long-form generation). Break down where time is spent and cache deterministic steps. Stream partial results when helpful. In 2026, maintain a latency scoreboard by feature and block releases that exceed budgets unless justified.
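The per-step breakdown above can be sketched in code. This is a minimal illustration, not a production monitor: the step names, sample data, and budget values are all hypothetical, and the 1.5 s chat target comes from the example in this section.

```python
import random
from statistics import quantiles

# Hypothetical per-step latency samples in seconds (names and ranges are illustrative).
random.seed(0)
samples = {
    "retrieval": [random.uniform(0.1, 0.4) for _ in range(200)],
    "model_call": [random.uniform(0.5, 1.2) for _ in range(200)],
    "post_process": [random.uniform(0.05, 0.2) for _ in range(200)],
}

# Per-step p95 budgets that together sum to the 1.5 s chat target.
budgets = {"retrieval": 0.4, "model_call": 0.9, "post_process": 0.2}

def p95(values):
    """95th percentile: quantiles(n=20) returns 19 cut points; index 18 is 95%."""
    return quantiles(values, n=20)[18]

for step, vals in samples.items():
    observed = p95(vals)
    status = "OK" if observed <= budgets[step] else "OVER BUDGET"
    print(f"{step}: p95={observed:.2f}s budget={budgets[step]}s {status}")
```

A scoreboard like this makes the release-blocking rule mechanical: any step whose observed p95 exceeds its slice of the budget gets flagged before shipping.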

Example

A triage flow had a p95 of 2.8 s. Precomputing embeddings nightly and capping tool retries brought p95 down to 1.6 s with no quality loss, lifting CSAT by 0.2 points.
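"Capping tool retries" means giving each tool call a wall-clock deadline rather than a fixed retry count alone. A minimal sketch, assuming a generic flaky callable (`fn`, the retry count, and the backoff values are illustrative):

```python
import time

def call_with_deadline(fn, deadline_s, max_retries=2, base_backoff_s=0.05):
    """Retry a flaky tool call, but never past an overall wall-clock deadline.

    Every retry re-checks the remaining budget before spending more time,
    so one slow tool cannot consume the whole latency budget.
    """
    start = time.monotonic()
    last_exc = None
    for attempt in range(max_retries + 1):
        remaining = deadline_s - (time.monotonic() - start)
        if remaining <= 0:
            break  # budget exhausted: fail fast instead of timing out upstream
        try:
            return fn()
        except Exception as exc:  # in production, catch the tool's error types
            last_exc = exc
            # Exponential backoff, never sleeping past the deadline.
            time.sleep(min(base_backoff_s * 2 ** attempt, max(remaining, 0)))
    raise TimeoutError(f"tool call exceeded {deadline_s}s budget") from last_exc
```

Failing fast lets the orchestrator fall back (cached answer, degraded response) inside the budget instead of propagating a timeout to the user.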

Common mistakes

  • Tracking averages instead of p95/p99, missing tail pain.
  • Ignoring client-side rendering time in the budget.
  • Allowing retries without time caps, causing timeouts.
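The first mistake is easy to demonstrate: a mean can look healthy while the tail is painful. A small sketch with made-up latencies (the numbers are illustrative):

```python
from statistics import mean, quantiles

# Illustrative latencies in seconds: most requests are fast, 5% hit the tail.
latencies = [0.3] * 95 + [4.0] * 5

avg = mean(latencies)                  # looks healthy on a dashboard
p95 = quantiles(latencies, n=20)[18]   # exposes the tail
p99 = quantiles(latencies, n=100)[98]

print(f"mean={avg:.2f}s p95={p95:.2f}s p99={p99:.2f}s")
# The mean stays under 0.5 s while p95 and p99 sit near 4 s.
```

This is why the budget should be stated and tracked as p95/p99 per surface, not as an average.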


Last updated: February 2, 2026