Context window

The maximum number of tokens a model can attend to at once, spanning both input and output.

When to use it

  • Designing flows that combine long documents, chat history, and tool calls.
  • Choosing between model families with different window sizes and prices.
  • Planning evals to catch truncation or lost instructions.

PM decision impact

The context window dictates feature ceilings: how long a conversation can run, how rich a brief can be, and whether you need retrieval at all. Bigger windows enable richer UX but cost more and can hurt latency. PMs must set truncation policies and decide when to upgrade models versus invest in compaction or retrieval.

How to do it in 2026

Calculate worst-case token usage per flow (instructions + user text + history + citations + output). Reserve fixed budgets for guardrails and tool schemas, plus headroom for variance. Implement smart truncation (oldest-first for chat, low-signal sections first for docs). In 2026, pair long-context models for drafting with cheaper short-context models for quick actions to balance cost and quality.
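To make the budgeting concrete, here is a minimal Python sketch of that worst-case calculation. All numbers (a 128k window, 4k reserved output, 400-token guardrails, 10% headroom) are illustrative assumptions, not any specific model's limits.

    CONTEXT_WINDOW = 128_000   # assumed model limit, not a real model's spec
    MAX_OUTPUT = 4_000         # reserved for the response so it never cuts off
    GUARDRAILS = 400           # safety instructions, never truncated
    TOOL_SCHEMAS = 1_200       # assumed size of tool definitions
    HEADROOM = 0.10            # 10% buffer for tokenizer variance

    def fits(instructions: int, user_text: int, history: int, citations: int) -> bool:
        """True if the worst-case request stays inside the usable window."""
        used = instructions + user_text + history + citations
        used += GUARDRAILS + TOOL_SCHEMAS + MAX_OUTPUT
        return used <= CONTEXT_WINDOW * (1 - HEADROOM)

    # A long brief with heavy chat history blows the budget:
    print(fits(instructions=2_000, user_text=6_000, history=100_000, citations=8_000))
    # False -> truncate or summarize before calling the model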

Example

A research assistant started truncating safety instructions after 30 turns, causing two incident tickets. Adding conversation summarization every 8 turns and reserving 400 tokens for guardrails eliminated truncation while keeping latency under 1.6 s.
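One way to implement that fix, sketched in Python: fold all but the newest 8 turns into a running summary and assemble the prompt with guardrails first, so they are exempt from trimming. The summarize helper is hypothetical, standing in for a cheap model call; this illustrates the approach described above, not the team's actual code.

    SUMMARIZE_EVERY = 8

    def compact_history(turns, summarize):
        """Fold all but the newest turns into a single summary turn."""
        if len(turns) <= SUMMARIZE_EVERY:
            return turns
        old, recent = turns[:-SUMMARIZE_EVERY], turns[-SUMMARIZE_EVERY:]
        return [f"[summary] {summarize(old)}"] + recent

    def build_prompt(guardrails, turns, summarize):
        """Guardrails go first and are never part of what gets compacted."""
        return [guardrails] + compact_history(turns, summarize)

    # Usage with a stub summarizer standing in for a cheap model call:
    msgs = build_prompt("Follow safety policy.",
                        [f"turn {i}" for i in range(20)],
                        summarize=lambda old: f"{len(old)} earlier turns condensed")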

Common mistakes

  • Ignoring output tokens when sizing windows, leading to mid-sentence cutoffs (see the sketch after this list).
  • Letting history grow unchecked, which pushes system instructions out of the window.
  • Choosing a larger window without measuring quality vs. cost tradeoffs.
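A minimal sketch of how to avoid the first two mistakes, assuming a hypothetical count_tokens helper backed by your tokenizer: fix the output and system budgets up front, then drop the oldest turns first.

    def trim(system, turns, count_tokens, window=128_000, max_output=4_000):
        """Oldest-first trim; the system prompt and output budget are fixed."""
        budget = window - max_output - count_tokens(system)
        kept = []
        for turn in reversed(turns):   # walk newest-first
            cost = count_tokens(turn)
            if cost > budget:
                break                  # oldest turns fall off, never the system prompt
            kept.append(turn)
            budget -= cost
        return [system] + kept[::-1]   # restore chronological order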


Last updated: February 2, 2026