The maximum number of tokens you allocate per request, spanning prompt, tool definitions, and output, to control latency and cost.
Token budgets are product constraints, much like memory or battery life. They shape how much context you can afford, which model tier you choose, and how often you retry. PMs balance the quality gains from more context against cost, speed, and rate-limit risk. Clear budgets unblock engineers and make experiments comparable.
Set per-route token caps (prompt, tools, output). Measure p95 token counts and tie them to latency SLOs. Add guards that truncate non-critical context first, as sketched below. In 2026, maintain a budget dashboard per feature: tokens, cost per 1,000 calls, and quality deltas after each change. Use streaming responses to hide latency when budgets are tight.
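A minimal sketch of such a guard, assuming tiktoken's cl100k_base encoding as the tokenizer; the cap values, CAPS, and apply_prompt_budget are illustrative names, not any specific product's API:

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

# Illustrative per-route caps; tune per feature and model tier.
CAPS = {"prompt": 1200, "tools": 400, "output": 500}

def count_tokens(text: str) -> int:
    return len(ENC.encode(text))

def apply_prompt_budget(system: str, chunks: list[str], question: str) -> str:
    """Assemble a prompt under CAPS["prompt"], dropping non-critical
    context chunks first while keeping system and question intact."""
    budget = CAPS["prompt"] - count_tokens(system) - count_tokens(question)
    kept: list[str] = []
    for chunk in chunks:  # assumed ranked most-relevant first
        cost = count_tokens(chunk)
        if cost > budget:
            break  # drop this and all lower-ranked chunks
        kept.append(chunk)
        budget -= cost
    return "\n\n".join([system, *kept, question])
```

Logging count_tokens per route alongside latency gives you the p95 numbers the dashboard needs.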
A triage bot had a p95 latency of 2.8 s. Capping the system prompt plus few-shot examples at 700 tokens and trimming citations to the top 3 chunks cut p95 to 1.4 s, while accuracy dipped only 1.5%. Monthly inference spend fell 18%.
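A sketch of those two changes, again assuming tiktoken for counting; the names and the hard token-level truncation are hypothetical stand-ins for however the bot actually assembles its prompt:

```python
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

HEADER_CAP = 700      # cap on system prompt + few-shot examples
TOP_K_CITATIONS = 3   # keep only the highest-ranked retrieved chunks

def truncate_to(text: str, max_tokens: int) -> str:
    # Hard-truncate at the token level, then decode back to text.
    return ENC.decode(ENC.encode(text)[:max_tokens])

def build_triage_prompt(header: str, ranked_chunks: list[str], ticket: str) -> str:
    parts = [truncate_to(header, HEADER_CAP),
             *ranked_chunks[:TOP_K_CITATIONS],
             ticket]
    return "\n\n".join(parts)
```

Because shorter prompts also mean fewer billed input tokens, the latency win and the spend reduction come from the same change.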