Context compaction

Reducing and reshaping context (summaries, salience scoring, deduplication) so key facts fit within token and latency budgets.

When to use it

  • Conversations or documents routinely exceed the model's context window.
  • Latency or cost spikes after adding history or citations.
  • You need to protect critical instructions from truncation.

PM decision impact

Compaction keeps quality high without upgrading to pricier long-context models. PMs choose what to discard versus keep, balancing fidelity and speed. Poor compaction can hide risks (missing safety text) or degrade answer quality, directly affecting trust, support tickets, and unit economics.

How to do it in 2026

Apply layered compaction: dedupe chunks, score relevance, then summarize into short canonical notes. Reserve protected tokens for safety and persona blocks. In 2026, use structured summaries (bullets with source IDs) so they can be re-expanded if needed. Measure impact via offline QA plus latency and cost deltas.
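The layered pipeline above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a keyword-overlap relevance score as a stand-in for a real scorer; the names `Chunk` and `compact` are illustrative, not a specific library's API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    source_id: str
    text: str

def compact(chunks, query, budget_tokens, protected):
    # 1) Dedupe: drop chunks whose normalized text repeats.
    seen, unique = set(), []
    for c in chunks:
        key = " ".join(c.text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(c)

    # 2) Score relevance: keyword overlap with the query
    #    (swap in an embedding or cross-encoder scorer in practice).
    q = set(query.lower().split())
    scored = sorted(unique, key=lambda c: -len(q & set(c.text.lower().split())))

    # 3) Summarize into bullets with source IDs, spending only the
    #    budget left after reserving protected tokens.
    budget = budget_tokens - sum(len(p.split()) for p in protected)
    bullets, used = [], 0
    for c in scored:
        cost = len(c.text.split())
        if used + cost > budget:
            break
        bullets.append(f"- [{c.source_id}] {c.text}")
        used += cost

    # Protected blocks (safety, persona) always survive compaction.
    return "\n".join(protected + bullets)
```

The source-ID prefix on each bullet is what keeps the summary re-expandable later.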

Example

A product discovery assistant summarized each 20-minute call into five bullet insights with source IDs. Token use per turn dropped 40%, latency improved by 600 ms, and researchers still recovered exact quotes through source IDs when needed.
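The quote recovery in this example comes down to a lookup keyed on the source ID embedded in each bullet. A hypothetical sketch, where the `TRANSCRIPTS` store and the "[id]" bullet format are assumptions rather than the assistant's actual schema:

```python
import re

# Hypothetical store mapping source IDs to original transcripts.
TRANSCRIPTS = {
    "call-12": "Full transcript: 'We need export to CSV before Q2.'",
}

def expand(bullet):
    # Pull the source ID out of the bullet's "[id]" prefix, then look up
    # the original transcript so researchers can recover exact quotes.
    m = re.match(r"- \[(?P<sid>[^\]]+)\]", bullet)
    return TRANSCRIPTS.get(m.group("sid")) if m else None
```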

Common mistakes

  • Summarizing away nuance like numeric targets or dates, causing wrong recommendations.
  • Running compaction synchronously on the critical path, adding more latency than it saves.
  • Failing to keep source references, making audits and citations impossible.
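The second mistake — synchronous compaction on the critical path — is usually fixed by compacting in the background while the request path reads the latest snapshot. A minimal sketch using a thread, with a stand-in compaction step (keep the last three pending turns); the class and method names are illustrative, not a framework API:

```python
import threading

class ContextStore:
    def __init__(self):
        self._lock = threading.Lock()
        self.turns = []          # raw, not-yet-compacted history
        self.compacted = []      # last background-compacted snapshot

    def append(self, turn):
        with self._lock:
            self.turns.append(turn)

    def snapshot(self):
        # The request path reads the latest compacted snapshot plus any
        # turns that arrived since; it never waits on compaction.
        with self._lock:
            return list(self.compacted) + list(self.turns)

    def compact_in_background(self):
        def worker():
            with self._lock:
                pending, self.turns = self.turns, []
            # Stand-in compaction: keep only the last 3 pending turns.
            summary = pending[-3:]
            with self._lock:
                self.compacted = (self.compacted + summary)[-10:]
        t = threading.Thread(target=worker)
        t.start()
        return t
```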


Last updated: February 2, 2026