Stage 1: Retrieval Methods¶
This section covers methods for the first stage of the RAG pipeline: efficiently retrieving candidate documents from large corpora.
Stage 1 Topics:
- Sparse Retrieval Methods
- Dense Baselines & Fixed Embeddings
- Hard Negative Mining
- Late Interaction (ColBERT)
- Short Answer
- Long Answer: Why Late Interaction is Special
- The ColBERT Architecture
- ColBERT Literature
- Other Multi-Vector Methods
- How ColBERT Does Both Stages
- Performance Comparison
- When to Use ColBERT
- Implementation Example
- The Index Size Problem
- Advanced Topic: MaxSim Operation
- Comparison with Cross-Encoders
- Poly-Encoders: Another Middle Ground
- Research Directions
- Next Steps
- Hybrid Dense-Sparse Methods
- Pre-training Methods for Dense Retrievers
- Joint Learning of Retrieval and Indexing
- Literature Survey: Retrieval Methods
Overview¶
Stage 1 retrieval must balance two competing demands:
- **Speed**: process millions of documents in milliseconds
- **Recall**: don't miss relevant documents
The solution is to use architectures that allow document representations to be pre-computed offline, so that query-time scoring reduces to an efficient similarity search.
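As a minimal sketch of this idea (all names, shapes, and the random "embeddings" here are illustrative assumptions, not a real encoder): the corpus is embedded once offline, and each query then costs only one matrix-vector product plus a top-k selection.

```python
import numpy as np

# Hypothetical sketch: Stage 1 retrieval over pre-computed document embeddings.
# Documents are encoded offline once; at query time only the query is encoded,
# and scoring reduces to a single matrix-vector product.

rng = np.random.default_rng(0)
DIM = 64

# Offline: encode the corpus once and store the normalized vectors.
doc_embeddings = rng.standard_normal((10_000, DIM)).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def retrieve_top_k(query_vec: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k highest-scoring documents (cosine similarity)."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = doc_embeddings @ query_vec     # one matvec over the whole corpus
    return np.argpartition(-scores, k)[:k]  # unsorted top-k candidate indices

query = rng.standard_normal(DIM).astype(np.float32)
candidates = retrieve_top_k(query, k=10)
```

In production this brute-force scan would be replaced by an approximate nearest-neighbor index (e.g. FAISS or HNSW), but the division of labor is the same: heavy encoding offline, cheap similarity search online.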
Evolution of Stage 1 Methods¶
| Era | Method Type | Key Innovation | Representative Papers |
|---|---|---|---|
| Pre-2020 | Sparse (BM25) | Inverted index, TF-IDF | Traditional IR |
| 2020 | Dense Baselines | Dual-encoder with BERT | DPR, RepBERT |
| 2021 | Hard Negatives | Dynamic mining, denoising | ANCE, RocketQA, ADORE |
| 2021-2022 | Late Interaction | Token-level representations | ColBERT, ColBERTv2 |
| 2022-2023 | Sampling Strategies | Curriculum, score-based | TAS-Balanced, SimANS |
| 2023-2024 | LLM Integration | Synthetic negatives, prompting | SyNeg, LLM embeddings |
Key Dimensions¶
When evaluating Stage 1 methods, consider:
**Architecture**

- Dual-Encoder: independent encoding of queries and documents (fastest)
- Late Interaction: token-level matching (more accurate)
- Hybrid: combines sparse and dense signals

**Training Strategy**

- Negative Mining: how to find informative negatives
- Knowledge Distillation: learn from cross-encoder teachers
- Curriculum Learning: progressive training difficulty

**Index Structure**

- Dense Vector: a single vector per document
- Multi-Vector: multiple vectors per document (e.g., ColBERT)
- Learned Index: end-to-end optimized index structures
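The architectural contrast above can be sketched in a few lines (shapes and the random token "embeddings" are illustrative assumptions): a dual-encoder scores a pair with one dot product, while late interaction keeps one vector per token and applies the ColBERT-style MaxSim operator.

```python
import numpy as np

# Illustrative sketch of the two scoring regimes; not a real encoder API.
# Dual-encoder: one vector per query/document. Late interaction: one vector
# per token, scored with MaxSim.

rng = np.random.default_rng(1)
DIM = 32

def dual_encoder_score(q_vec: np.ndarray, d_vec: np.ndarray) -> float:
    """Single-vector scoring: one dot product per (query, document) pair."""
    return float(q_vec @ d_vec)

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """Multi-vector scoring: each query token takes its best-matching
    document token, and the per-token maxima are summed (MaxSim)."""
    sim = q_tokens @ d_tokens.T          # (n_q, n_d) token similarity matrix
    return float(sim.max(axis=1).sum())  # best match per query token, summed

q_tokens = rng.standard_normal((8, DIM))    # e.g. 8 query tokens
d_tokens = rng.standard_normal((120, DIM))  # e.g. 120 document tokens
score = maxsim_score(q_tokens, d_tokens)
```

Note the index-size consequence: MaxSim requires storing every document token vector, which is why multi-vector indexes are much larger than single-vector ones.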
The Central Challenge: Hard Negative Mining¶
The quality of Stage 1 retrievers depends critically on the negative examples used during training. This is explored in depth in Hard Negative Mining, which covers:
The hard negative problem
Evolution from static to dynamic mining
Denoising strategies
Curriculum learning
LLM-based generation
This is the primary bottleneck in dense retrieval research today.
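To make the role of negatives concrete, here is a hedged sketch of a contrastive (InfoNCE-style) training objective that combines in-batch negatives with one mined hard negative per query. All tensors are random stand-ins; a real trainer would use encoder outputs and a framework with autograd.

```python
import numpy as np

# Hypothetical sketch: contrastive loss with in-batch + mined hard negatives.
# The better the hard negatives, the more informative the gradient signal.

rng = np.random.default_rng(2)
B, DIM = 4, 16                        # batch size, embedding dimension

q = rng.standard_normal((B, DIM))     # query embeddings
pos = rng.standard_normal((B, DIM))   # positive document embeddings
hard = rng.standard_normal((B, DIM))  # one mined hard negative per query

def info_nce_loss(q, pos, hard):
    # Candidates per query: its own positive, the other B-1 in-batch
    # positives (reused as negatives), and its mined hard negative.
    in_batch = q @ pos.T                                    # (B, B)
    hard_scores = np.sum(q * hard, axis=1, keepdims=True)   # (B, 1)
    logits = np.concatenate([in_batch, hard_scores], axis=1)  # (B, B+1)
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for query i is its own positive at column i.
    return float(-log_probs[np.arange(B), np.arange(B)].mean())

loss = info_nce_loss(q, pos, hard)
```

The design choice this illustrates: in-batch negatives are nearly free but mostly easy, so adding mined hard negatives (high-scoring but non-relevant documents) is what sharpens the decision boundary; the methods surveyed in Hard Negative Mining differ mainly in how those negatives are found and denoised.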