Overview of RAG and the Two-Stage Pipeline¶
Retrieval-Augmented Generation (RAG) systems typically employ a two-stage architecture that balances efficiency with accuracy. Understanding the division of labor between these two stages is crucial for selecting appropriate methods and architectures.
The Two-Stage Architecture¶
┌─────────┐      ┌──────────────────┐      ┌──────────────────┐      ┌─────────┐
│  Query  │ ───► │     Stage 1      │ ───► │     Stage 2      │ ───► │   LLM   │
│         │      │    (Top-1000)    │      │     (Top-10)     │      │         │
└─────────┘      │  Fast Retrieval  │      │    Precision     │      └─────────┘
                 └──────────────────┘      │    Re-ranking    │
                                           └──────────────────┘
Conceptual flow: Query → Fast Retrieval (Stage 1) → Precision Re-ranking (Stage 2) → LLM Generation
Stage 1: Retrieval (Candidate Selection)¶
Goal: Efficiently fetch a small set of candidate documents (e.g., top-100 to top-1000) from a massive collection (millions to billions of documents).
Key Requirement: Speed - Must process millions of documents in milliseconds.
Trade-off: Sacrifices some accuracy for speed by using simpler similarity computations.
Approaches¶
Sparse Retrieval (BM25)
Uses keyword matching and term frequency statistics
Extremely fast (inverted index lookup)
Limited by vocabulary mismatch problem
Best for: Keyword-heavy queries, legal/medical domains
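As a concrete sketch of sparse retrieval, the snippet below scores a query against a toy corpus with the `rank_bm25` package (the package choice, corpus, and query are illustrative, not prescribed by this documentation):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 ranks documents using term frequency and inverse document frequency.",
    "Dense retrievers encode queries and documents into vectors.",
    "Cross-encoders jointly attend over query-document token pairs.",
]
# Precompute per-term statistics (tf, idf, document lengths) from a tokenized corpus
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "term frequency ranking"
# Score all documents against the tokenized query; return the single best match
print(bm25.get_top_n(query.lower().split(), corpus, n=1)[0])
```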
Dense Retrieval (Bi-Encoders)
Encodes queries and documents into fixed-size vectors independently
Similarity computed via dot product or cosine similarity
Can pre-compute and index all document vectors
Captures semantic meaning beyond keywords
Examples: DPR, ANCE, BGE, E5
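Because the two encoders run independently, all document vectors can be computed once, offline. A minimal sketch with `sentence-transformers` (the model and example texts are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Documents are encoded once, offline, and stored in an index
doc_embeddings = model.encode(
    ["How do I reset my password?", "Shipping takes 3-5 business days."],
    convert_to_tensor=True,
)

# At query time only the query needs a forward pass; similarity is then a
# cheap cosine computation over the precomputed vectors
query_embedding = model.encode("password recovery help", convert_to_tensor=True)
print(util.cos_sim(query_embedding, doc_embeddings))  # shape (1, 2)
```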
Late Interaction (ColBERT)
Hybrid approach: stores multiple vectors per document (one per token)
More expressive than single-vector but still indexable
Bridges gap between Stage 1 and Stage 2
Can serve both roles depending on implementation
Hybrid Methods
Combines sparse (BM25) and dense retrieval
Leverages complementary strengths
Examples: SPLADE, DENSPI, Semantic Residual
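One standard way to combine the two result lists (a generic fusion technique, not specific to the systems named above) is reciprocal rank fusion, where each document earns 1/(k + rank) from every list it appears in:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; k=60 is the conventional smoothing constant."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]   # hypothetical sparse results
dense_ranking = ["d1", "d9", "d3"]  # hypothetical dense results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['d1', 'd3', 'd9', 'd7'] -- documents found by both retrievers rise to the top
```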
Key Characteristics¶
| Property | Description |
|---|---|
| Indexing | Pre-computes document representations offline |
| Retrieval Speed | Sub-second for millions of documents |
| Architecture | Dual-encoder (query and doc encoded separately) |
| Similarity | Simple operations (dot product, cosine) |
| Recall Focus | Prioritizes not missing relevant documents |
Stage 2: Re-ranking (Precision Scoring)¶
Goal: Precisely score the small set of candidates (e.g., 100 documents) retrieved in Stage 1 to produce a final ranked list (e.g., top-10).
Key Requirement: Accuracy - Must identify the truly relevant documents with high precision.
Trade-off: More computational cost is acceptable since candidate set is small.
Approaches¶
Cross-Encoders
Concatenates query and document: `[CLS] query [SEP] document [SEP]`
Full self-attention between all query-document token pairs
Most accurate but slowest
Must score each (query, doc) pair independently
Examples: BERT re-ranker, MonoT5, RankLlama
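To make the joint encoding concrete, here is a minimal sketch that scores one query-document pair with Hugging Face `transformers`, using the same checkpoint as the configurations later on this page (the query and document text are made up):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

query = "what causes rain"
doc = "Rain forms when water vapor condenses into droplets heavy enough to fall."

# Passing a text pair builds the joint input: [CLS] query [SEP] document [SEP]
inputs = tokenizer(query, doc, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze()  # single relevance logit
print(float(score))
```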
Poly-Encoders
Middle ground between bi-encoders and cross-encoders
Uses learned “attention codes” to aggregate information
Faster than cross-encoders, more accurate than bi-encoders
Can cache some computation
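The aggregation can be sketched in a few lines of PyTorch. This is a simplified, untrained illustration of the attention-code mechanism from the poly-encoder paper (Humeau et al., 2019); the dimensions are made up, and real codes are learned jointly with the encoders:

```python
import torch
import torch.nn.functional as F

num_codes, dim = 64, 768
codes = torch.randn(num_codes, dim)  # "attention codes" (learned in practice)
ctx_tokens = torch.randn(32, dim)    # query/context token embeddings (length 32)
cand_vec = torch.randn(dim)          # precomputed candidate document vector

# Each code attends over the context tokens, yielding num_codes global vectors
attn = F.softmax(codes @ ctx_tokens.T, dim=-1)  # (num_codes, 32)
ctx_vecs = attn @ ctx_tokens                    # (num_codes, dim)

# The candidate attends over those vectors; the score is a final dot product
weights = F.softmax(ctx_vecs @ cand_vec, dim=-1)  # (num_codes,)
score = (weights @ ctx_vecs) @ cand_vec           # scalar relevance score
print(score)
```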
Late Interaction (ColBERT as Re-ranker)
Performs MaxSim operation over token pairs
More fine-grained than single-vector similarity
Can be used for both retrieval and re-ranking
Efficient enough for top-100 candidates
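The MaxSim operation itself is only a few lines. A minimal PyTorch sketch with illustrative shapes (real ColBERT L2-normalizes embeddings, so these dot products stand in for cosine similarities, and punctuation tokens are masked):

```python
import torch

q_embs = torch.randn(8, 128)    # 8 query token embeddings (ColBERT uses dim 128)
d_embs = torch.randn(150, 128)  # 150 document token embeddings

# Similarity between every query token and every document token
sim = q_embs @ d_embs.T  # (8, 150)

# MaxSim: each query token keeps only its best-matching document token;
# the document's relevance is the sum of these maxima over query tokens
score = sim.max(dim=1).values.sum()
print(score)
```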
Key Characteristics¶
| Property | Description |
|---|---|
| Indexing | Not needed (online scoring) |
| Scoring Speed | Slower per pair is acceptable (only 10-1000 candidates to score) |
| Architecture | Cross-encoder (joint encoding of query and doc) |
| Interaction | Full attention between query and document |
| Precision Focus | Prioritizes ranking truly relevant docs at the top |
The Efficiency-Accuracy Trade-off¶
Accuracy ↑
         │
         │                                   ┌─── Cross-Encoder (Stage 2)
         │                               ┌───┤
         │                           ┌───┘   └─── Poly-Encoder
         │                       ┌───┘
         │                   ┌───┘   ColBERT (Can do both!)
         │               ┌───┘
         │           ┌───┘   Dense Bi-Encoder (Stage 1)
         │       ┌───┘
         │   ┌───┘   BM25 (Sparse)
         │┌──┘
         └─────────────────────────────────────────► Computational Cost
Why Two Stages?¶
Computational Reality
Assumptions: BERT-base model, single V100 GPU, MS MARCO-scale corpus (8.8M passages)
Running a cross-encoder on 10 million documents × 1 query = 10 million forward passes
At ~50ms per forward pass (BERT-base): 500,000 seconds = 139 hours per query ❌
Dense retrieval on 10 million documents: ~100ms per query (FAISS HNSW index) ✅
Then cross-encoder on top-100 candidates: ~5 seconds per query ✅
Total: ~5 seconds vs 139 hours
Note
Model-specific latencies (100 docs, V100 GPU):
TinyBERT: 0.5-1s
MiniLM-L6: 1.5-2.5s
BERT-base: 5-8s
BERT-large: 15-20s
The Pipeline Math
Two-stage cost ≈ O(log N) + O(k · C)   (ANN retrieval, then re-ranking of k candidates)
Brute-force cost = O(N · C)            (cross-encoder over the entire corpus)
Where:
- N = corpus size (millions)
- k = candidates to re-rank (100-1000)
- C = cross-encoder cost per query-document pair
Note
Complexity Clarification: ANN index construction is O(N log N), but query time is O(log N) for HNSW or O(√N) for IVF-based methods. The query-time complexity is what matters for latency.
Since k ≪ N, this is vastly more efficient than O(N · C).
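To ground the retrieval half of this math, here is a minimal FAISS sketch of the HNSW query path mentioned in the note above (the vector count, dimension, and graph parameter of 32 are illustrative):

```python
import faiss
import numpy as np

dim, n_docs = 768, 100_000
doc_vectors = np.random.rand(n_docs, dim).astype("float32")

# HNSW graph index: built once offline, queried in roughly O(log N)
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = neighbors per node in the graph
index.add(doc_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 100)  # top-100 candidates in milliseconds
print(ids.shape)  # (1, 100)
```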
Common Configurations¶
Configuration 1: Standard Two-Stage¶
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: Dense Retrieval (encode the corpus once, then search)
bi_encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=100)[0]
candidates = [corpus[hit["corpus_id"]] for hit in hits]

# Stage 2: Cross-Encoder Re-ranking
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(query, cand) for cand in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
final_results = [cand for cand, _ in ranked[:10]]
Use case: Maximum accuracy, acceptable latency (1-5 seconds)
Configuration 2: Hybrid Retrieval + Cross-Encoder¶
# Stage 1: Hybrid (Sparse + Dense). Assumes a rank_bm25 index (bm25) plus the
# bi_encoder, corpus_embeddings, util, and cross_encoder from Configuration 1
bm25_results = bm25.get_top_n(query.split(), corpus, n=100)
hits = util.semantic_search(bi_encoder.encode(query, convert_to_tensor=True),
                            corpus_embeddings, top_k=100)[0]
dense_results = [corpus[hit["corpus_id"]] for hit in hits]
candidates = list(dict.fromkeys(bm25_results + dense_results))  # merge + dedupe (~150 docs)

# Stage 2: Cross-Encoder re-ranking (rank() returns hits sorted by score)
final_results = cross_encoder.rank(query, candidates, return_documents=True, top_k=10)
Use case: Robust to both keyword and semantic queries
Configuration 3: ColBERT-only (Single Stage)¶
# ColBERT does both retrieval and fine-grained late-interaction matching
# (sketched here via the RAGatouille wrapper around the ColBERTv2 checkpoint)
from ragatouille import RAGPretrainedModel
colbert = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
colbert.index(collection=corpus, index_name="rag_index")
results = colbert.search(query, k=10)
Use case: When you want late interaction quality with single-stage simplicity
Configuration 4: Three-Stage (Sparse → Dense → Cross)¶
# Stage 1a: BM25 (very fast, 10K candidates; rank_bm25 index as above)
bm25_candidates = bm25.get_top_n(query.split(), corpus, n=10_000)

# Stage 1b: Dense retrieval narrows 10K down to 100
cand_embeddings = bi_encoder.encode(bm25_candidates, convert_to_tensor=True)
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, cand_embeddings, top_k=100)[0]
top_100 = [bm25_candidates[hit["corpus_id"]] for hit in hits]

# Stage 2: Cross-Encoder (final top-10)
final_results = cross_encoder.rank(query, top_100, return_documents=True, top_k=10)
Use case: Maximum recall (BM25 casts a wide net) + maximum precision (cross-encoder)
When to Use What¶
| Method | Best For | Latency | Use Case |
|---|---|---|---|
| BM25 Only | Keyword queries | < 50ms | Legal search, exact match |
| Bi-Encoder Only | Semantic search | 50-200ms | FAQ matching, chatbots |
| Bi-Encoder + Cross | Accuracy critical | 1-5s | Question answering, RAG |
| Hybrid + Cross | Robustness critical | 2-10s | Enterprise search |
| ColBERT | Balance of both | 200-500ms | Research, precision needs |
Key Research Questions¶
The field continues to evolve around several key questions:
For Stage 1 (Retrieval)
How to mine hard negatives that improve discrimination?
How to handle false negatives in training data?
How to make models robust to domain shift?
How to efficiently update indexes as models improve?
For Stage 2 (Re-ranking)
How to train cross-encoders with limited labeled data?
How to distill cross-encoder knowledge into faster models?
How to handle position bias in training data?
How to make re-rankers calibrated (output meaningful probabilities)?
Cross-Cutting
Can we learn a single model that does both stages well?
How to leverage LLMs as zero-shot re-rankers?
How to optimize end-to-end (retrieval + re-ranking + generation)?
Organization of This Documentation¶
This documentation is organized to reflect the two-stage architecture:
Stage 1: Retrieval (see Stage 1: Retrieval Methods)
Sparse methods (BM25)
Dense baselines (DPR, RepBERT)
Hard negative mining - The core focus of dense retrieval research
Late interaction (ColBERT)
Hybrid approaches
Stage 2: Re-ranking (see Stage 2: Re-ranking Methods)
Cross-encoders
Poly-encoders
LLM-based re-rankers
Each section provides:
Paper surveys with links to code
Detailed explanations of key innovations
Implementation guidance
Trade-off analysis
Next Steps¶
Start with Stage 1: Retrieval Methods to understand dense retrieval fundamentals
Focus on Hard Negative Mining for the critical bottleneck in training
See Stage 2: Re-ranking Methods for precision scoring approaches
Refer to Contributing to add papers or improve documentation