# Pre-training Methods for Dense Retrievers
Standard dense retrievers start from general pre-trained models (BERT, RoBERTa) and are fine-tuned on retrieval tasks. However, specialized pre-training strategies can provide a better initialization, leading to stronger final performance.
## Why Pre-training Matters for Retrieval
**The Problem with Standard BERT**

BERT was pre-trained for masked language modeling and next sentence prediction, tasks quite different from “is this passage relevant to this query?”
**The Solution**

Pre-train specifically for retrieval using:

- Unsupervised contrastive learning on documents
- Inverse tasks (generate a query from a passage)
- Retrieval-augmented objectives
- Corpus-aware objectives
## Pre-training Methods Literature
### Unsupervised Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Latent Retrieval for Weakly Supervised Open Domain QA (ORQA) | Lee et al. | ACL 2019 |  | Inverse Cloze Task (ICT): pre-trains by predicting which passage a sentence came from, generating pseudo-queries automatically and enabling unsupervised data generation at scale (see the sketch below). |
| Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever) | Izacard et al. | TMLR 2022 |  | Contrastive learning + augmentation: learns robust unsupervised features via contrastive learning with aggressive data augmentation. State-of-the-art zero-shot retrieval with no labels needed. |
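
To make the Inverse Cloze Task concrete, here is a minimal sketch of ICT-style pseudo-query generation followed by contrastive training with in-batch negatives. It uses sentence-transformers for brevity and a naive sentence splitter; `passages` is a placeholder corpus, and this is not ORQA's exact recipe (which, for example, sometimes keeps the selected sentence inside the passage).

```python
import random

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder corpus: any list of raw passages from your collection.
passages = [
    "Dense retrieval maps queries and documents into one vector space. "
    "Relevance is scored with a dot product. "
    "Approximate nearest-neighbour search makes this fast at scale.",
    # ... more passages ...
]

def ict_example(passage: str):
    """Inverse Cloze Task: a random sentence becomes the pseudo-query,
    the rest of the passage becomes its positive context."""
    sentences = [s.strip() for s in passage.split(". ") if s.strip()]
    if len(sentences) < 2:
        return None
    i = random.randrange(len(sentences))
    pseudo_query = sentences[i]
    context = " ".join(sentences[:i] + sentences[i + 1:])
    return InputExample(texts=[pseudo_query, context])

train_examples = [ex for ex in map(ict_example, passages) if ex is not None]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Start from a general encoder and pre-train with in-batch negatives.
model = SentenceTransformer("bert-base-uncased")
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)
```

The key point is that the training pairs are generated for free from the corpus itself, which is what makes this usable at pre-training scale.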
### Supervised Retrieval-Augmented Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| REALM: Retrieval-Augmented Language Model Pre-Training (REALM) | Guu et al. | ICML 2020 |  | End-to-end retrieval pre-training: jointly pre-trains the retriever and the language model, refreshing the index during pre-training. Computationally heavy but powerful for end-to-end QA (see the sketch below). |
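
The core trick can be written down compactly: the probability of the target text is marginalized over retrieved documents, so gradients flow into the retriever through the retrieval distribution. Below is a minimal PyTorch sketch of that marginalization only; the encoders, masking, and the periodic index refresh from the paper are omitted, and the toy tensors stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

def realm_style_loss(query_emb: torch.Tensor,
                     doc_embs: torch.Tensor,
                     lm_nll_per_doc: torch.Tensor) -> torch.Tensor:
    """Marginalize the language-model loss over retrieved documents.

    query_emb:      (d,)   query embedding from the retriever
    doc_embs:       (k, d) embeddings of the top-k retrieved documents
    lm_nll_per_doc: (k,)   LM negative log-likelihood of the target,
                           conditioned on each retrieved document
    """
    retrieval_scores = doc_embs @ query_emb              # (k,)
    log_p_doc = F.log_softmax(retrieval_scores, dim=0)   # log p(z | x)
    # log p(y | x) = logsumexp_z [ log p(z | x) + log p(y | x, z) ]
    log_p_target = torch.logsumexp(log_p_doc - lm_nll_per_doc, dim=0)
    return -log_p_target                                  # scalar loss

# Toy usage with random tensors standing in for real model outputs.
loss = realm_style_loss(
    query_emb=torch.randn(128, requires_grad=True),
    doc_embs=torch.randn(8, 128, requires_grad=True),
    lm_nll_per_doc=torch.rand(8, requires_grad=True),
)
loss.backward()  # gradients reach both the retrieval scores and the LM losses
```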
### Architecture-Aware Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Condenser: a Pre-training Architecture for Dense Retrieval (Condenser) | Gao & Callan | EMNLP 2021 |  | Skip-connection head: an architectural modification that forces global information into the CLS token, making CLS better suited for representing the entire document (see the sketch below). |
| Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval (coCondenser) | Gao & Callan | ACL 2022 |  | Corpus-aware contrastive learning: unsupervised contrastive learning at the corpus level that aligns spans of the same document without labels. Strong zero-shot performance. |
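
As a rough illustration of the skip-connection head, the sketch below mixes the final-layer CLS vector with token states from an earlier layer and runs them through a small extra Transformer stack; during pre-training the MLM loss is computed on this head's output, so the CLS vector is forced to carry passage-level information. This is a simplified PyTorch sketch with generic Transformer layers, not the paper's exact implementation, and the toy tensors stand in for real BERT hidden states.

```python
import torch
import torch.nn as nn

class CondenserStyleHead(nn.Module):
    """Extra layers that see the late CLS vector plus early token states."""
    def __init__(self, hidden: int = 768, num_layers: int = 2, num_heads: int = 12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)

    def forward(self, early_hidden: torch.Tensor, late_hidden: torch.Tensor) -> torch.Tensor:
        # Skip connection: final-layer CLS + earlier-layer token representations.
        mixed = torch.cat([late_hidden[:, :1], early_hidden[:, 1:]], dim=1)
        # During pre-training, the MLM head is applied to this output, so the
        # late CLS must summarize the passage for reconstruction to succeed.
        return self.blocks(mixed)

# Toy usage: batch of 2 sequences of length 128 with BERT-base hidden size.
head = CondenserStyleHead()
early = torch.randn(2, 128, 768)  # hidden states from a middle encoder layer
late = torch.randn(2, 128, 768)   # hidden states from the final encoder layer
out = head(early, late)           # (2, 128, 768), fed to the MLM head
```

coCondenser keeps this architecture and adds a corpus-level contrastive loss over spans sampled from the same document.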
### Multi-View Pre-training

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
|  | Zhang et al. | arXiv 2021 | NA | Multi-view generation: learns explicit views of a document for different retrieval intents. Anti-collapse regularization prevents the views from becoming identical, helping handle diverse information needs (see the sketch below). |
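
The multi-view idea and its anti-collapse term can be illustrated generically: a document gets several view embeddings, a query scores against the best-matching view, and a regularizer penalizes views that become too similar. The PyTorch sketch below is an illustrative formulation under those assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def multiview_score(query_emb: torch.Tensor, view_embs: torch.Tensor) -> torch.Tensor:
    """Score a document by its best-matching view (max over views)."""
    # query_emb: (d,), view_embs: (k_views, d)
    return (view_embs @ query_emb).max()

def anti_collapse_penalty(view_embs: torch.Tensor) -> torch.Tensor:
    """Penalize views of the same document that become too similar."""
    v = F.normalize(view_embs, dim=-1)     # (k_views, d)
    sim = v @ v.T                          # pairwise cosine similarities
    off_diag = sim - torch.eye(len(v))     # zero out the diagonal
    return off_diag.clamp(min=0).mean()    # high when views collapse

# Toy usage: 4 views of one document, 128-dim embeddings.
views = torch.randn(4, 128, requires_grad=True)
query = torch.randn(128)
loss = -multiview_score(query, views) + 0.1 * anti_collapse_penalty(views)
loss.backward()
```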
## Comparison: Fine-tuning vs Pre-training

| Dimension | Standard Fine-tuning | Specialized Pre-training |
|---|---|---|
| Starting Point | General BERT/RoBERTa | Retrieval-optimized model |
| CLS Token | Optimized for MLM | Optimized for full-document representation |
| Zero-shot | Poor (~0.3 MRR) | Good (~0.45 MRR) |
| Fine-tuning Data | Needs more data | Needs less data |
| Training Cost | Lower (fine-tune only) | Higher (pre-train + fine-tune) |
| Final Performance | Good | Better (+3-7%) |
## When to Use Pre-trained Models
**✅ Use Retrieval-Specific Pre-training When:**

- Zero-shot performance matters (new domains)
- Limited fine-tuning data is available
- You want the best possible final performance
- You have the computational budget for pre-training
**✅ Use Pre-trained Models (from others):**

The most practical approach is to use existing pre-trained models:
```python
from sentence_transformers import SentenceTransformer

# Modern pre-trained retrievers (recommended)
model = SentenceTransformer('BAAI/bge-base-en-v1.5')      # retrieval-oriented pre-training + contrastive fine-tuning
model = SentenceTransformer('intfloat/e5-base-v2')        # multi-stage contrastive pre-training
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5',
                            trust_remote_code=True)       # large-scale contrastive pre-training
```
These are all pre-trained with retrieval-specific objectives and ready to fine-tune.
**❌ Don’t Pre-train From Scratch Unless:**

- You have a unique domain with massive unlabeled data
- You’re doing research on pre-training itself
- Existing models fail completely on your domain
## Recommended Practice

**For Most Projects:**

1. Start with a pre-trained retrieval model (e.g., BGE, E5)
2. Fine-tune on your specific task with good hard negatives (see the sketch below)
3. Result: about 95% of custom pre-training performance at roughly 10% of the cost
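
A minimal sketch of steps 1-2 with sentence-transformers, assuming you already have (query, positive, hard negative) triples, for example mined with BM25 or an earlier model; the example data and batch size are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder triples: (query, positive_passage, hard_negative_passage).
triples = [
    ("what is dense retrieval?",
     "Dense retrieval encodes queries and passages into vectors ...",
     "Sparse retrieval relies on exact term matching such as BM25 ..."),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triples]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# The explicit hard negative plus all other in-batch passages act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)
```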
**For Research/Scale:**

1. Use a Contriever or coCondenser approach
2. Pre-train on your domain corpus, unsupervised (see the sketch below)
3. Then fine-tune on labeled data
4. Result: best possible performance, at significant cost
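
For the unsupervised domain pre-training step, a simple label-free starting point is to sample two random spans from each document and train them as a positive pair with in-batch negatives, in the spirit of Contriever/coCondenser. The sketch below is a simplified version of that idea (no MLM loss, naive whitespace spans); `documents` is a placeholder.

```python
import random

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder: raw, unlabeled documents from your target domain.
documents = [
    "Document one about your domain, long enough to sample spans from ...",
    "Document two with different content from the same corpus ...",
]

def span_pair(doc: str, span_len: int = 64) -> InputExample:
    """Sample two random word spans from the same document as a positive pair."""
    words = doc.split()
    if len(words) <= span_len:
        return InputExample(texts=[doc, doc])
    starts = [random.randrange(len(words) - span_len) for _ in range(2)]
    spans = [" ".join(words[s:s + span_len]) for s in starts]
    return InputExample(texts=spans)

train_examples = [span_pair(d) for d in documents]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)

# Spans from other documents in the batch act as negatives.
model = SentenceTransformer("bert-base-uncased")
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=1)

# Afterwards, fine-tune this model on your labeled data as usual.
```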
## Implementation Example

**Using Pre-trained Contriever**
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Load a model pre-trained with a contrastive objective
model = SentenceTransformer('facebook/contriever')

# Already decent zero-shot: encode, then search by similarity
# (`query`, `corpus`, and `train_data` are your own data)
query_embedding = model.encode(query, convert_to_tensor=True)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
results = util.semantic_search(query_embedding, corpus_embeddings, top_k=10)

# Fine-tune on your data for even better performance
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in train_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=3)
```
## Next Steps

- See Dense Baselines & Fixed Embeddings for standard fine-tuning approaches
- See Hard Negative Mining for improving fine-tuning with better negatives
- See Joint Learning of Retrieval and Indexing for jointly optimizing pre-training and indexing