# Pre-training Methods for Dense Retrievers
Standard dense retrievers start from general pre-trained models (BERT, RoBERTa) and are fine-tuned on retrieval tasks. However, specialized pre-training strategies can provide a better initialization, leading to stronger final performance.
## Why Pre-training Matters for Retrieval
**The Problem with Standard BERT**

BERT was pre-trained for masked language modeling and next sentence prediction, tasks quite different from “is this passage relevant to this query?”
**The Solution**

Pre-train specifically for retrieval using:

- Unsupervised contrastive learning on documents
- Inverse tasks (generate a query from a passage)
- Retrieval-augmented objectives
- Corpus-aware objectives
## Pre-training Methods Literature
### Unsupervised Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Latent Retrieval for Weakly Supervised Open Domain QA (ORQA) | Lee et al. | ACL 2019 |  | Inverse Cloze Task (ICT): pre-trains by predicting which passage a sentence came from, generating pseudo-queries automatically and enabling unsupervised data generation at scale (see the sketch below). |
| Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever) | Izacard et al. | TMLR 2022 |  | Contrastive learning + augmentation: learns robust unsupervised features via contrastive learning with aggressive data augmentation. State-of-the-art zero-shot retrieval with no labels needed. |
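
To make the Inverse Cloze Task concrete, here is a minimal sketch of ICT-style pseudo-query generation followed by contrastive training with in-batch negatives. It uses sentence-transformers for brevity and a naive sentence splitter; `passages` is a placeholder corpus, and this is not ORQA's exact recipe (which, for example, sometimes keeps the selected sentence inside the passage).

```python
import random

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder corpus: any list of raw passages from your collection.
passages = [
    "Dense retrieval maps queries and documents into one vector space. "
    "Relevance is scored with a dot product. "
    "Approximate nearest-neighbour search makes this fast at scale.",
    # ... more passages ...
]

def ict_example(passage: str):
    """Inverse Cloze Task: a random sentence becomes the pseudo-query,
    the rest of the passage becomes its positive context."""
    sentences = [s.strip() for s in passage.split(". ") if s.strip()]
    if len(sentences) < 2:
        return None
    i = random.randrange(len(sentences))
    pseudo_query = sentences[i]
    context = " ".join(sentences[:i] + sentences[i + 1:])
    return InputExample(texts=[pseudo_query, context])

train_examples = [ex for ex in map(ict_example, passages) if ex is not None]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Start from a general encoder and pre-train with in-batch negatives.
model = SentenceTransformer("bert-base-uncased")
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)
```

The key point is that the training pairs are generated for free from the corpus itself, which is what makes this usable at pre-training scale.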
### Supervised Retrieval-Augmented Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| REALM: Retrieval-Augmented Language Model Pre-Training (REALM) | Guu et al. | ICML 2020 |  | End-to-end retrieval pre-training: jointly pre-trains the retriever and the language model, refreshing the index during pre-training. Computationally heavy but powerful for end-to-end QA (see the sketch below). |
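
The core trick can be written down compactly: the probability of the target text is marginalized over retrieved documents, so gradients flow into the retriever through the retrieval distribution. Below is a minimal PyTorch sketch of that marginalization only; the encoders, masking, and the periodic index refresh from the paper are omitted, and the toy tensors stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

def realm_style_loss(query_emb: torch.Tensor,
                     doc_embs: torch.Tensor,
                     lm_nll_per_doc: torch.Tensor) -> torch.Tensor:
    """Marginalize the language-model loss over retrieved documents.

    query_emb:      (d,)   query embedding from the retriever
    doc_embs:       (k, d) embeddings of the top-k retrieved documents
    lm_nll_per_doc: (k,)   LM negative log-likelihood of the target,
                           conditioned on each retrieved document
    """
    retrieval_scores = doc_embs @ query_emb              # (k,)
    log_p_doc = F.log_softmax(retrieval_scores, dim=0)   # log p(z | x)
    # log p(y | x) = logsumexp_z [ log p(z | x) + log p(y | x, z) ]
    log_p_target = torch.logsumexp(log_p_doc - lm_nll_per_doc, dim=0)
    return -log_p_target                                  # scalar loss

# Toy usage with random tensors standing in for real model outputs.
loss = realm_style_loss(
    query_emb=torch.randn(128, requires_grad=True),
    doc_embs=torch.randn(8, 128, requires_grad=True),
    lm_nll_per_doc=torch.rand(8, requires_grad=True),
)
loss.backward()  # gradients reach both the retrieval scores and the LM losses
```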
### Architecture-Aware Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Condenser: a Pre-training Architecture for Dense Retrieval (Condenser) | Gao & Callan | EMNLP 2021 |  | Skip-connection head: an architectural modification that forces global information into the CLS token, making CLS better suited for representing the entire document (see the sketch below). |
| Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval (coCondenser) | Gao & Callan | ACL 2022 |  | Corpus-aware contrastive learning: unsupervised contrastive learning at the corpus level that aligns spans of the same document without labels. Strong zero-shot performance. |
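
As a rough illustration of the skip-connection head, the sketch below mixes the final-layer CLS vector with token states from an earlier layer and runs them through a small extra Transformer stack; during pre-training the MLM loss is computed on this head's output, so the CLS vector is forced to carry passage-level information. This is a simplified PyTorch sketch with generic Transformer layers, not the paper's exact implementation, and the toy tensors stand in for real BERT hidden states.

```python
import torch
import torch.nn as nn

class CondenserStyleHead(nn.Module):
    """Extra layers that see the late CLS vector plus early token states."""
    def __init__(self, hidden: int = 768, num_layers: int = 2, num_heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)

    def forward(self, early_hidden: torch.Tensor, late_hidden: torch.Tensor) -> torch.Tensor:
        # Skip connection: final-layer CLS + earlier-layer token representations.
        mixed = torch.cat([late_hidden[:, :1], early_hidden[:, 1:]], dim=1)
        # During pre-training, the MLM head is applied to this output, so the
        # late CLS must summarize the passage for reconstruction to succeed.
        return self.blocks(mixed)

# Toy usage: batch of 2 sequences of length 128 with BERT-base hidden size.
head = CondenserStyleHead()
early = torch.randn(2, 128, 768)  # hidden states from a middle encoder layer
late = torch.randn(2, 128, 768)   # hidden states from the final encoder layer
out = head(early, late)           # (2, 128, 768), fed to the MLM head
```

coCondenser keeps this architecture and adds a corpus-level contrastive loss over spans sampled from the same document.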
### Multi-View Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
|  | Zhang et al. | arXiv 2021 | NA | Multi-view generation: learns explicit views of a document for different retrieval intents. Anti-collapse regularization prevents the views from becoming identical, helping handle diverse information needs (see the sketch below). |
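
The multi-view idea and its anti-collapse term can be illustrated generically: a document gets several view embeddings, a query scores against the best-matching view, and a regularizer penalizes views that become too similar. The PyTorch sketch below is an illustrative formulation under those assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multiview_score(query_emb: torch.Tensor, view_embs: torch.Tensor) -> torch.Tensor:
    """Score a document by its best-matching view (max over views)."""
    # query_emb: (d,), view_embs: (k_views, d)
    return (view_embs @ query_emb).max()

def anti_collapse_penalty(view_embs: torch.Tensor) -> torch.Tensor:
    """Penalize views of the same document that become too similar."""
    v = F.normalize(view_embs, dim=-1)     # (k_views, d)
    sim = v @ v.T                          # pairwise cosine similarities
    off_diag = sim - torch.eye(len(v))     # zero out the diagonal
    return off_diag.clamp(min=0).mean()    # high when views collapse

# Toy usage: 4 views of one document, 128-dim embeddings.
views = torch.randn(4, 128, requires_grad=True)
query = torch.randn(128)
loss = -multiview_score(query, views) + 0.1 * anti_collapse_penalty(views)
loss.backward()
```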
## Comparison: Fine-tuning vs Pre-training

| Dimension | Standard Fine-tuning | Specialized Pre-training |
|---|---|---|
| Starting Point | General BERT/RoBERTa | Retrieval-optimized model |
| CLS Token | Optimized for MLM | Optimized for full-document representation |
| Zero-shot | Poor (~0.3 MRR) | Good (~0.45 MRR) |
| Fine-tuning Data | Needs more data | Needs less data |
| Training Cost | Lower (fine-tune only) | Higher (pre-train + fine-tune) |
| Final Performance | Good | Better (+3-7%) |
## When to Use Pre-trained Models
**✅ Use Retrieval-Specific Pre-training When:**

- Zero-shot performance matters (new domains)
- Limited fine-tuning data is available
- You want the best possible final performance
- You have the computational budget for pre-training
**✅ Use Pre-trained Models (from others):**

The most practical approach is to use existing pre-trained models:
```python
from sentence_transformers import SentenceTransformer

# Modern pre-trained retrievers (recommended)
model = SentenceTransformer('BAAI/bge-base-en-v1.5')      # retrieval-oriented pre-training + contrastive fine-tuning
model = SentenceTransformer('intfloat/e5-base-v2')        # multi-stage contrastive pre-training
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5',
                            trust_remote_code=True)       # large-scale contrastive pre-training
```
These are all pre-trained with retrieval-specific objectives and ready to fine-tune.
**❌ Don’t Pre-train From Scratch Unless:**

- You have a unique domain with massive unlabeled data
- You’re doing research on pre-training itself
- Existing models fail completely on your domain
## Recommended Practice

**For Most Projects:**

1. Start with a pre-trained retrieval model (e.g., BGE, E5)
2. Fine-tune on your specific task with good hard negatives (see the sketch below)
3. Result: about 95% of custom pre-training performance at roughly 10% of the cost
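
A minimal sketch of steps 1-2 with sentence-transformers, assuming you already have (query, positive, hard negative) triples, for example mined with BM25 or an earlier model; the example data and batch size are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder triples: (query, positive_passage, hard_negative_passage).
triples = [
    ("what is dense retrieval?",
     "Dense retrieval encodes queries and passages into vectors ...",
     "Sparse retrieval relies on exact term matching such as BM25 ..."),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triples]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# The explicit hard negative plus all other in-batch passages act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)
```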
**For Research/Scale:**

1. Use a Contriever or coCondenser approach
2. Pre-train on your domain corpus, unsupervised (see the sketch below)
3. Then fine-tune on labeled data
4. Result: best possible performance, at significant cost
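
For the unsupervised domain pre-training step, a simple label-free starting point is to sample two random spans from each document and train them as a positive pair with in-batch negatives, in the spirit of Contriever/coCondenser. The sketch below is a simplified version of that idea (no MLM loss, naive whitespace spans); `documents` is a placeholder.

```python
import random

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder: raw, unlabeled documents from your target domain.
documents = [
    "Document one about your domain, long enough to sample spans from ...",
    "Document two with different content from the same corpus ...",
]

def span_pair(doc: str, span_len: int = 64) -> InputExample:
    """Sample two random word spans from the same document as a positive pair."""
    words = doc.split()
    if len(words) <= span_len:
        return InputExample(texts=[doc, doc])
    starts = [random.randrange(len(words) - span_len) for _ in range(2)]
    spans = [" ".join(words[s:s + span_len]) for s in starts]
    return InputExample(texts=spans)

train_examples = [span_pair(d) for d in documents]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Spans from other documents in the batch act as negatives.
model = SentenceTransformer("bert-base-uncased")
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)

# Afterwards, fine-tune this model on your labeled data as usual.
```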
## Implementation Example

**Using Pre-trained Contriever**
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Load a model pre-trained with a contrastive objective
model = SentenceTransformer('facebook/contriever')

# Already decent zero-shot: encode, then search by similarity
# (`query`, `corpus`, and `train_data` are your own data)
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
results = util.semantic_search(query_embedding, corpus_embeddings, top_k=10)

# Fine-tune on your data for even better performance
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in train_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=3)
```
## Next Steps

- See Dense Baselines & Fixed Embeddings for standard fine-tuning approaches
- See Hard Negative Mining for improving fine-tuning with better negatives
- See Joint Learning of Retrieval and Indexing for jointly optimizing pre-training and indexing