Reducing and reshaping context (summaries, salience scoring, deduplication) so key facts fit within token and latency budgets.
Compaction keeps answer quality high without upgrading to pricier long-context models. PMs decide what to keep versus discard, balancing fidelity against speed. Poor compaction can hide risks (e.g., dropped safety text) or degrade answer quality, directly affecting user trust, support ticket volume, and unit economics.
Apply layered compaction: dedupe chunks, score each for relevance, then summarize the survivors into short canonical notes (see the sketch below). Reserve protected tokens for safety and persona blocks so compaction can never crowd them out. In 2026, use structured summaries (bullets with source IDs) so the originals can be re-expanded when needed. Measure impact via offline QA plus latency and cost deltas.
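A minimal Python sketch of that layered pipeline, assuming chunks arrive as dicts with `id` and `text` fields. All names here (`compact`, `TOKEN_BUDGET`, `PROTECTED_BLOCKS`) are illustrative, and the keyword-overlap scorer, character-count token estimate, and first-sentence summarizer are stand-ins for a real embedding scorer, tokenizer, and LLM summarizer:

```python
import hashlib

# Hypothetical protected blocks: always kept verbatim, never compacted.
PROTECTED_BLOCKS = [
    "SAFETY: Refuse requests for medical dosage advice.",
    "PERSONA: Friendly, concise product coach.",
]
TOKEN_BUDGET = 1200       # assumed per-turn context budget
PROTECTED_RESERVE = 200   # tokens set aside for the protected blocks

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 chars per token); swap in a real tokenizer.
    return max(1, len(text) // 4)

def dedupe(chunks: list[dict]) -> list[dict]:
    """Layer 1: drop verbatim-duplicate chunks via content hashing."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

def score_relevance(chunk: dict, query: str) -> float:
    """Layer 2: toy salience score via keyword overlap with the query."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk["text"].lower().split())
    return len(q_terms & c_terms) / max(1, len(q_terms))

def summarize(chunk: dict) -> str:
    """Layer 3: compress into a short canonical note with a source ID,
    so the original can be re-expanded later. Stubbed as first-sentence
    extraction; in practice this would be an LLM call."""
    first_sentence = chunk["text"].split(".")[0].strip()
    return f"- {first_sentence} [src:{chunk['id']}]"

def compact(chunks: list[dict], query: str) -> str:
    """Run all three layers, filling the budget that remains after
    reserving protected tokens for safety and persona blocks."""
    budget = TOKEN_BUDGET - PROTECTED_RESERVE
    ranked = sorted(dedupe(chunks),
                    key=lambda c: score_relevance(c, query),
                    reverse=True)
    notes, used = [], 0
    for chunk in ranked:
        note = summarize(chunk)
        cost = approx_tokens(note)
        if used + cost > budget:
            break  # budget exhausted; lower-relevance chunks are dropped
        notes.append(note)
        used += cost
    return "\n".join(PROTECTED_BLOCKS + notes)

# Usage with made-up research chunks (note the exact duplicate):
chunks = [
    {"id": "call-17", "text": "Users abandon checkout when shipping costs appear late."},
    {"id": "call-17b", "text": "Users abandon checkout when shipping costs appear late."},
    {"id": "call-22", "text": "Mobile users want saved payment methods. Support asked twice."},
]
print(compact(chunks, "why do users abandon checkout"))
```

The ordering matters: dedupe first so duplicates never consume scoring or summarization effort, and deduct the protected reserve before ranking so safety text survives even on the busiest turns.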
A product discovery assistant summarized each 20-minute call into five bullet insights with source IDs. Token use per turn dropped 40%, latency improved by 600 ms, and researchers still recovered exact quotes through source IDs when needed.