A second-pass model or heuristic that orders retrieved items by relevance before feeding them to the LLM.
Reranking typically improves answer precision and user trust for a modest latency and compute cost. Product managers weigh that added latency and cost against fewer hallucinations and a lower support load. Because the top-ranked results are more reliable, reranking also determines how many citations you can show without cluttering the interface.
Use a lightweight cross-encoder or an LLM-based reranker on the top 20–50 retrieved items. Optimize for your KPI (for example, answer correctness or click-through rate). In 2026, run rerankers on GPUs or specialized services to keep p95 latency low, and run win-rate experiments against a retrieval-only control path.
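As a rough illustration of the second-pass step, the sketch below rescores a pool of first-pass candidates and keeps the best few. Everything named here is hypothetical: `rerank`, `overlap_score`, `pool_size`, and `keep` are illustrative, and the token-overlap scorer is only a toy stand-in for a real cross-encoder or hosted reranking service, which would be passed in as `score_fn`.

```python
from typing import Callable, List, Sequence, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder (assumption, not a real model):
    the fraction of query tokens that also appear in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    pool_size: int = 50,
    keep: int = 5,
) -> List[Tuple[str, float]]:
    """Score the first `pool_size` retrieved candidates and return the best `keep`."""
    # Only the head of the first-pass list is rescored, which bounds latency.
    pool = candidates[:pool_size]
    scored = [(passage, score_fn(query, passage)) for passage in pool]
    # The sort is stable, so tied scores keep their first-pass retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


# Example: candidates arrive in first-pass retrieval order.
retrieved = [
    "Shipping times for international orders",
    "How to reset your account password",
    "Reset a forgotten password from the login page",
]
for passage, score in rerank("reset password", retrieved, keep=2):
    print(f"{score:.2f}  {passage}")
```

Passing the scorer as a parameter keeps the control path simple to build for a win-rate experiment: the retrieval-only arm skips `rerank` and uses the first `keep` candidates as retrieved.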
Adding a 50 ms reranker to support search cut irrelevant citations by 38% and improved self-serve resolution from 46% to 55% without breaching the 1.5 s latency SLO.
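The figures in this example can be restated as a relative lift and a share of the latency budget, which is often how the trade-off is presented. The short calculation below is a sketch using only the numbers quoted above; the helper names `relative_lift` and `budget_share` are hypothetical, and it assumes the 50 ms is the full latency the reranker adds per request.

```python
def relative_lift(before: float, after: float) -> float:
    """Relative change of a rate, expressed as a fraction of the baseline."""
    return (after - before) / before


def budget_share(step_ms: float, slo_ms: float) -> float:
    """Fraction of a latency SLO consumed by a single pipeline step."""
    return step_ms / slo_ms


baseline_resolution = 0.46   # self-serve resolution before reranking
reranked_resolution = 0.55   # self-serve resolution after reranking

absolute_points = (reranked_resolution - baseline_resolution) * 100
lift = relative_lift(baseline_resolution, reranked_resolution)
share = budget_share(step_ms=50, slo_ms=1500)

print(f"Absolute gain: {absolute_points:.0f} percentage points")
print(f"Relative lift: {lift:.1%}")
print(f"Share of the 1.5 s SLO used by the reranker: {share:.1%}")
```

The absolute gain is 9 percentage points, which is roughly a 20% relative lift over the 46% baseline, while the reranker consumes about 3% of the 1.5 s budget.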