Chunking strategy

How you split documents or conversation histories into pieces for indexing, so that retrieval balances relevance, completeness, and speed.

When to use it

  • Building RAG over mixed-length documents (FAQs, specs, tickets).
  • You see hallucinations from missing context or irrelevant chunks.
  • Latency is rising because chunks are too small or each query retrieves too many of them.

PM decision impact

Chunking determines recall quality and token cost. PMs choose boundaries (semantic vs. fixed), overlap, and maximum size. Good chunking reduces hallucinations and speeds users to answers; poor chunking forces the model to stitch context and slows everything down.

How to do it in 2026

Start with semantic or heading-based splits, then cap chunks by token count (e.g., 200–400) with a small overlap for continuity. Keep source IDs and titles attached to each chunk for citation clarity. In 2026, tune chunk sizes per collection (support vs. docs) and A/B test recall and latency on real queries, not synthetic ones.
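The approach above can be sketched in a few lines. This is a minimal illustration, not a production splitter: it splits on Markdown-style headings, approximates token counts by whitespace (a real pipeline would use the embedding model's tokenizer), and all function and field names are assumptions for this example.

```python
# Heading-based chunking with a token cap and overlap.
# Token counting is approximated by whitespace splitting; swap in the
# embedding model's tokenizer for real use. All names are illustrative.

def chunk_document(doc_id, title, text, max_tokens=300, overlap_tokens=30):
    """Split on heading lines, then cap each section by token count."""
    # First pass: split into sections at lines starting with '#'.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
    if current:
        sections.append("\n".join(current))

    # Second pass: cap each section at max_tokens, sliding forward with
    # overlap so sentences cut at a boundary still appear in one chunk whole.
    chunks = []
    step = max_tokens - overlap_tokens
    for section in sections:
        words = section.split()
        for start in range(0, len(words), step):
            chunks.append({
                "doc_id": doc_id,   # kept for citations
                "title": title,     # kept for citation clarity
                "text": " ".join(words[start:start + max_tokens]),
            })
            if start + max_tokens >= len(words):
                break
    return chunks
```

Carrying `doc_id` and `title` on every chunk is what makes citations readable later; the overlap is what preserves continuity across chunk boundaries.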

Example

Switching release notes to 250-token semantic chunks with 10% overlap improved correct-answer rate by 9 points and cut p95 latency by 300 ms, because fewer irrelevant chunks were retrieved per query.

Common mistakes

  • Using one chunk size for all content types, hurting either recall or latency.
  • Dropping headings or metadata, making citations unreadable.
  • Overlapping too much, which bloats the index and slows queries.
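The first mistake, one chunk size for everything, is usually avoided by keying chunk settings to the collection. A minimal sketch; the collection names, parameter values, and fallback defaults here are illustrative assumptions, not recommendations:

```python
# Hypothetical per-collection chunking config. Values are illustrative;
# tune them by A/B testing recall and latency on real queries.
CHUNKING_CONFIG = {
    "support_tickets": {"max_tokens": 200, "overlap_pct": 0.10, "split": "semantic"},
    "product_docs":    {"max_tokens": 400, "overlap_pct": 0.05, "split": "heading"},
    "faqs":            {"max_tokens": 150, "overlap_pct": 0.00, "split": "per_entry"},
}

def settings_for(collection):
    # Fall back to a conservative default for unknown collections.
    return CHUNKING_CONFIG.get(
        collection,
        {"max_tokens": 300, "overlap_pct": 0.10, "split": "semantic"},
    )
```

Keeping these numbers in one config also makes the overlap mistake visible: a reviewer can spot a 50% overlap at a glance instead of finding it in index bloat later.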

Last updated: February 2, 2026