Stage 2: Re-ranking Methods¶
This section covers methods for the second stage of the RAG pipeline: precisely scoring the candidate documents retrieved in Stage 1.
Overview¶
Stage 2 re-ranking focuses on precision over speed. Since the candidate set is small (typically 10-1000 documents), we can afford more expensive computations to get highly accurate relevance scores.
Why Re-ranking is Needed¶
The Stage 1 Limitation
Bi-encoders (Stage 1) encode query and document independently:
No interaction between query and document tokens
Can’t perform complex reasoning about relevance
Limited to similarity in embedding space
The Stage 2 Solution
Re-rankers encode query and document jointly:
Full attention between all query-document token pairs
Can perform complex relevance reasoning
Much higher accuracy at the cost of speed (contrast sketched below)
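To make the contrast concrete, here is a minimal sketch using sentence-transformers; the query, document, and scores are purely illustrative, and the model names are the same ones used in the pipeline example later in this section:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how do vaccines work"
doc = "Vaccines train the immune system to recognize pathogens."

# Bi-encoder (Stage 1): encode query and document independently, then compare vectors
bi_encoder = SentenceTransformer('BAAI/bge-base-en-v1.5')
q_emb, d_emb = bi_encoder.encode([query, doc], convert_to_tensor=True)
bi_score = util.cos_sim(q_emb, d_emb)  # similarity in embedding space, no token-level interaction

# Cross-encoder (Stage 2): one forward pass over the concatenated pair, full cross-attention
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
ce_score = cross_encoder.predict([(query, doc)])  # one relevance score per (query, document) pair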
The Accuracy Gain¶
Typical improvements when adding Stage 2:
Dataset: MS MARCO Dev (1000 candidates from Stage 1)
Bi-encoder only: MRR@10 = 0.311
+ Cross-encoder: MRR@10 = 0.389 (+25% improvement!)
Dataset: Natural Questions
Bi-encoder only: Top-10 Accuracy = 0.68
+ Cross-encoder: Top-10 Accuracy = 0.81 (+19% improvement!)
Key Insight: Re-ranking the top-100 candidates with a cross-encoder provides massive gains for just 100 forward passes (~5 seconds).
Architecture Types¶
Cross-Encoders¶
Most Common: BERT-based cross-encoder
Concatenates: [CLS] query [SEP] document [SEP]
Self-attention across all tokens
Classification head predicts relevance
Highest accuracy
Variants:
MonoBERT: BERT cross-encoder for binary classification
MonoT5: T5 model generates “true”/“false” token (see the sketch after this list)
RankT5: T5 generates relevance score directly
RankLlama: Large language model fine-tuned for ranking
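A rough sketch of the MonoT5 scoring pattern described above, assuming the castorini/monot5-base-msmarco checkpoint from Hugging Face; the prompt format and token handling follow the MonoT5 recipe, but treat the details as illustrative rather than a reference implementation:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('castorini/monot5-base-msmarco')
model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco')

def monot5_score(query: str, document: str) -> float:
    text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    true_id = tokenizer.encode("true", add_special_tokens=False)[0]
    false_id = tokenizer.encode("false", add_special_tokens=False)[0]
    # Relevance = probability mass on "true" vs "false" at the first decoding step
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()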
Poly-Encoders¶
Middle Ground: Faster than a cross-encoder, more accurate than a bi-encoder
Document → Multiple learned “codes” (e.g., 64 codes)
Query attends to codes
Much faster than a cross-encoder, since codes can be pre-computed offline (see the sketch below)
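A toy sketch of the poly-encoder scoring step (the number of codes, dimensions, and random tensors are illustrative; in the real model the codes come from learned attention over document tokens):

import torch

def poly_encoder_score(query_vec: torch.Tensor, doc_codes: torch.Tensor) -> torch.Tensor:
    # query_vec: (d,) query embedding; doc_codes: (m, d) codes precomputed offline per document
    attn = torch.softmax(doc_codes @ query_vec, dim=0)  # query attends over the m codes
    context = attn @ doc_codes                          # query-conditioned document vector (d,)
    return torch.dot(context, query_vec)                # final relevance score

query_vec = torch.randn(768)
doc_codes = torch.randn(64, 768)  # e.g., 64 learned codes, computed once at indexing time
score = poly_encoder_score(query_vec, doc_codes)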
LLM Re-rankers¶
Latest Trend: Zero-shot re-ranking with instruction-tuned LLMs
Prompt LLM: “Is this passage relevant to this query?”
No training needed
Can provide explanations
Expensive but highly effective (prompting sketched below)
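A minimal zero-shot pointwise sketch, assuming the openai Python client and an illustrative model name; listwise variants instead ask the model to order all candidates in a single prompt:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completion model works

def is_relevant(query: str, passage: str) -> bool:
    prompt = (
        f"Query: {query}\n\nPassage: {passage}\n\n"
        "Is this passage relevant to the query? Answer only Yes or No."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

query = "what causes ocean tides"
candidates = ["Tides are driven by the gravitational pull of the moon and sun.",
              "The city council approved a new budget on Tuesday."]
reranked = sorted(candidates, key=lambda p: is_relevant(query, p), reverse=True)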
Organization of This Section¶
Cross-Encoders (Cross-Encoders for Re-ranking)
Traditional BERT-based re-rankers
MonoT5 and RankT5
Training strategies
Implementation guide
LLM Re-rankers (LLM-Based Re-rankers)
Zero-shot prompting approaches
RankGPT, RankLlama
Listwise vs pointwise ranking
Cost-performance trade-offs
When to Use Stage 2¶
✅ You Need Stage 2 When:
Top-10 accuracy is critical (user sees only first page)
False positives are costly (e.g., medical, legal)
You can afford 1-10 second latency
Final answer quality >> speed
❌ You Can Skip Stage 2 When:
Latency must be < 100ms (real-time autocomplete)
Top-100 recall is all that matters (no precision needed)
Very simple queries (BM25 or bi-encoder sufficient)
Candidates from Stage 1 are already very precise
The Two-Stage Pipeline¶
Standard Configuration:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: Fast retrieval (100-1000 candidates) with a bi-encoder
bi_encoder = SentenceTransformer('BAAI/bge-base-en-v1.5')
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
hits = util.semantic_search(bi_encoder.encode(query, convert_to_tensor=True), corpus_emb, top_k=100)[0]
candidates = [corpus[hit['corpus_id']] for hit in hits]

# Stage 2: Precise re-ranking (keep the top-10 by cross-encoder score)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = cross_encoder.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:10]
Cost Analysis:
Stage 1: ANN search over 10M docs ≈ 0.1s (with FAISS)
Stage 2: 100 docs × 0.05s = 5s (cross-encoder)
Total: ~5s (vs ~140 hours if the cross-encoder scored the full 10M-document corpus!)
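The same arithmetic as a quick sanity check (the constants are the illustrative figures above, not benchmarks):

# Back-of-the-envelope latency using the figures quoted above
ann_search_s = 0.1                         # FAISS ANN query over 10M docs
rerank_s = 100 * 0.05                      # 100 cross-encoder passes at ~50 ms each
brute_force_h = 10_000_000 * 0.05 / 3600   # cross-encoder over every document
print(f"two-stage: {ann_search_s + rerank_s:.1f}s, brute force: {brute_force_h:.0f}h")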
Next Steps¶
See Cross-Encoders for Re-ranking for traditional BERT-based re-rankers
See LLM-Based Re-rankers for modern LLM-based approaches
See Reranker Survey: Models, Libraries, and Frameworks for a comprehensive survey of 22+ reranking methods
See Late Interaction (ColBERT) for a late-interaction approach that can replace both stages