How you split documents or histories into pieces for indexing, so that retrieval balances relevance, completeness, and speed.
Chunking determines recall quality and token cost. PMs choose the boundaries (semantic vs. fixed), the overlap between chunks, and the maximum chunk size. Good chunking reduces hallucinations and gets users to answers faster; poor chunking forces the model to stitch context together from fragments and slows everything down.
Start with semantic or heading-based splits, then cap chunk size in tokens (e.g., 200–400) with a small overlap for continuity. Keep source IDs and titles with each chunk for citation clarity. In 2026, tune chunk sizes per collection (support vs. docs) and A/B test recall and latency on real queries, not synthetic ones.
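To make the recipe concrete, here is a minimal sketch of heading-based chunking with a token cap and proportional overlap. It assumes a whitespace split as a stand-in tokenizer; `Chunk`, `chunk_by_headings`, and the defaults are illustrative, not any specific library's API.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    source_id: str   # carried through for citation clarity
    title: str       # nearest heading above this chunk
    text: str

def tokens(text: str) -> list[str]:
    # Stand-in tokenizer: whitespace split. In practice, count tokens
    # with your embedding model's real tokenizer instead.
    return text.split()

def chunk_by_headings(doc: str, source_id: str,
                      max_tokens: int = 250, overlap: float = 0.10) -> list[Chunk]:
    """Split on markdown headings, then window each section at
    max_tokens with a proportional overlap between consecutive chunks."""
    chunks: list[Chunk] = []
    # Split into alternating heading / body parts, keeping the headings.
    sections = re.split(r"^(#{1,6} .+)$", doc, flags=re.MULTILINE)
    title = "untitled"
    step = max(1, int(max_tokens * (1 - overlap)))  # stride between windows
    for part in sections:
        part = part.strip()
        if not part:
            continue
        if re.match(r"^#{1,6} ", part):
            title = part.lstrip("# ").strip()
            continue
        toks = tokens(part)
        for start in range(0, len(toks), step):
            window = toks[start:start + max_tokens]
            chunks.append(Chunk(source_id, title, " ".join(window)))
            if start + max_tokens >= len(toks):
                break
    return chunks
```

Swapping the stand-in tokenizer for the embedding model's real tokenizer is the first change to make before trusting any 200–400 token cap, since whitespace words and model tokens diverge.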
Switching release notes to 250-token semantic chunks with 10% overlap improved the correct-answer rate by 9 points and cut p95 latency by 300 ms, because fewer irrelevant chunks were retrieved.
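For the A/B comparison itself, a small sketch of the two numbers worth tracking per variant, assuming you log a correctness judgment and a latency for every real query; `evaluate` and the field names are hypothetical.

```python
import statistics

def evaluate(correct: list[bool], latencies_ms: list[float]) -> dict:
    """Summarize one chunking variant over a set of real queries:
    correct-answer rate plus p95 end-to-end latency."""
    return {
        "correct_rate": sum(correct) / len(correct),
        # quantiles(n=20) returns 19 cut points; the last one is p95
        "p95_latency_ms": statistics.quantiles(latencies_ms, n=20)[-1],
    }

# Run both chunking variants over the same query set, then compare:
# evaluate(variant_a_correct, variant_a_latencies)
# evaluate(variant_b_correct, variant_b_latencies)
```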