Reranking

A second-pass model or heuristic that orders retrieved items by relevance before feeding them to the LLM.

When to use it

  • Recall is high but precision is low—users see wrong citations.
  • Latency budgets allow an extra 50–150 ms to improve quality.
  • You’re adding more sources and need to keep the context focused and outputs concise.

PM decision impact

Reranking improves answer precision and user trust, usually at a modest latency and compute cost. PMs weigh that added latency and cost against reduced hallucinations and lower support load. It also determines how many citations you can show without cluttering the interface.

How to do it in 2026

Use lightweight cross-encoder or LLM-based rerankers on the top 20–50 retrieved items. Optimize for your KPI (answer correctness, click rate). In 2026, run rerankers on GPUs or specialized inference services to keep p95 latency low, and run win-rate experiments against a retrieval-only control.
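
As a minimal sketch, here is what the second pass can look like with a cross-encoder, assuming the sentence-transformers library and a public MS MARCO checkpoint; the model name, the candidate window, and the keep-5 cutoff are illustrative choices, not recommendations.

```python
from sentence_transformers import CrossEncoder

# Illustrative checkpoint (assumption): any cross-encoder reranker works here.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    """Score each (query, doc) pair jointly and keep the highest-scoring docs."""
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]

# Stand-in for the top 20–50 hits from first-pass retrieval.
retrieved = [
    "To reset your password, open Settings > Security and choose Reset.",
    "Our pricing page lists all subscription tiers.",
    "Password resets expire after 24 hours for security reasons.",
]
print(rerank("how do I reset my password?", retrieved, keep=2))
```

The key design point is that the cross-encoder scores query and document together, which is where the precision gain over the first-pass retriever comes from.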

Example

Adding a 50 ms reranker to support search cut irrelevant citations by 38% and improved self-serve resolution from 46% to 55% without breaching the 1.5 s latency SLO.

Common mistakes

  • Reranking too few candidates, so the correct document never reaches the reranker.
  • Running heavy rerankers synchronously when latency budgets are tight (see the timeout fallback sketched after this list).
  • Not aligning reranker training data with actual user queries.
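
A common mitigation for the latency pitfall is to give the reranker a hard time budget and fall back to the original retrieval order when it overruns. Below is a minimal sketch using Python's standard library; the 100 ms budget and the `rerank_fn` callable are illustrative assumptions, not prescribed values.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# One shared worker pool; shutting a pool down per call would block on the slow task.
_pool = ThreadPoolExecutor(max_workers=4)

def rerank_with_budget(query, candidates, rerank_fn, budget_s=0.1):
    """Run rerank_fn under a hard time budget; on timeout, return the
    original retrieval order so the request still meets its latency SLO."""
    future = _pool.submit(rerank_fn, query, candidates)
    try:
        return future.result(timeout=budget_s)
    except TimeoutError:
        future.cancel()  # best effort; an already-running task finishes in the background
        return candidates  # graceful degradation: retrieval-only order
```

The trade-off is explicit: on timeout the user gets retrieval-only precision for that one request, but the latency SLO holds.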

Last updated: February 2, 2026