Token budget

The maximum number of tokens you allocate to a single request, across prompt, tool calls, and output, in order to control latency and cost.

When to use it

  • You must guarantee fast responses (e.g., under 1 s) at peak load.
  • Finance asks for predictable unit economics on AI features.
  • You keep adding context blocks and risk hitting provider context or rate limits.

PM decision impact

Token budgets are product constraints like memory or battery. They shape how much context you can afford, which model tier you choose, and how often you retry. PMs balance quality gains from more context against cost, speed, and rate-limit risks. Clear budgets unblock engineers and make experiments comparable.
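
To make the unit-economics point concrete, here is a worked sketch in Python. The per-token prices below are invented placeholders, not real provider rates; substitute your provider's published pricing.

```python
# Worked sketch: cost per 1,000 calls under a fixed token budget.
# The prices below are invented placeholders, not real provider rates.
PRICE_PER_MTOK_INPUT = 3.00    # $ per 1M input tokens (assumption)
PRICE_PER_MTOK_OUTPUT = 15.00  # $ per 1M output tokens (assumption)

def cost_per_1k_calls(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of 1,000 requests at a fixed per-request budget."""
    per_call = (input_tokens * PRICE_PER_MTOK_INPUT
                + output_tokens * PRICE_PER_MTOK_OUTPUT) / 1_000_000
    return per_call * 1_000

# A 1,500-token prompt plus a 300-token reply:
print(cost_per_1k_calls(1_500, 300))  # 9.0 -> $9.00 per 1,000 calls
```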

How to do it in 2026

  • Set per-route token caps for prompt, tools, and output.
  • Measure p95 tokens per route and tie them to your latency SLOs.
  • Add guards that truncate non-critical context first (see the sketch after this list).
  • Maintain a budget dashboard per feature: tokens, cost per 1,000 calls, and quality deltas after each change.
  • Use streaming responses to hide latency when budgets are tight.
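
A minimal sketch of a cap-and-truncate guard follows, assuming an OpenAI-style tokenizer via the tiktoken library. The route names, cap values, and block priorities are placeholder assumptions, not a definitive implementation.

```python
# Minimal sketch of a cap-and-truncate prompt guard. Route names, cap
# values, and block priorities are placeholder assumptions; tiktoken is
# one common tokenizer choice for OpenAI-style models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ROUTE_CAPS = {"triage": 700, "search": 1_200}  # prompt-token cap per route

def fit_to_budget(blocks: list[tuple[int, str]], route: str) -> str:
    """blocks are (priority, text) pairs; higher priority = more critical."""
    cap = ROUTE_CAPS[route]
    used, kept = 0, []
    # Let the most critical blocks claim budget first...
    for i, (_, text) in sorted(enumerate(blocks),
                               key=lambda p: p[1][0], reverse=True):
        n = len(enc.encode(text))
        if used + n > cap:
            continue  # non-critical context is dropped; the call never fails
        kept.append((i, text))
        used += n
    # ...but emit the survivors in their original prompt order.
    return "\n\n".join(text for _, text in sorted(kept))

prompt = fit_to_budget(
    [(3, "System rules..."), (2, "User message..."), (1, "Retrieved docs...")],
    route="triage",
)
```

The output side of the budget is typically enforced through the request's maximum-output-tokens setting (commonly named max_tokens) rather than in application code.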

Example

A triage bot had a p95 latency of 2.8 s. Capping the system prompt plus in-context examples at 700 tokens and trimming citations to the top 3 retrieved chunks brought p95 down to 1.4 s, while accuracy dipped only 1.5%. Monthly inference spend fell 18%.
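
The bot's own code is not part of this example, but the "top 3 chunks" trim is easy to picture. A hypothetical sketch, with the chunk fields and upstream retriever assumed:

```python
# Hypothetical sketch of the "top 3 chunks" trim; the chunk fields and
# the upstream retriever are assumptions, not the bot's actual code.
def top_k_chunks(chunks: list[dict], k: int = 3) -> list[dict]:
    """Keep only the k highest-scoring retrieved citations."""
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]

citations = top_k_chunks(
    [{"text": "chunk A", "score": 0.91},
     {"text": "chunk B", "score": 0.42},
     {"text": "chunk C", "score": 0.77},
     {"text": "chunk D", "score": 0.65}]
)  # -> chunks A, C, D
```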

Common mistakes

  • Tracking average tokens instead of p95/p99, which misses tail latency issues (illustrated in the sketch after this list).
  • Leaving output length unbounded, which produces surprise inference costs.
  • Changing prompts without updating budgets and alerts.
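
A small illustration of the first mistake, using invented sample data: the mean looks safely under budget while the p95 sits at the cap-busting tail.

```python
# Invented sample data: 90 light requests and 10 heavy ones.
import numpy as np

tokens = np.array([400] * 90 + [3_000] * 10)
print(tokens.mean())              # 660.0  -- looks safely under budget
print(np.percentile(tokens, 95))  # 3000.0 -- the tail blows the cap
```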

Last updated: February 2, 2026