Pre-training Methods for Dense Retrievers

Standard dense retrievers start from general-purpose pre-trained models (BERT, RoBERTa) and are fine-tuned on retrieval tasks. However, specialized pre-training strategies can provide a better initialization, leading to stronger final performance.

Why Pre-training Matters for Retrieval

The Problem with Standard BERT

BERT was pre-trained for masked language modeling and next sentence prediction—tasks quite different from “is this passage relevant to this query?”

The Solution

Pre-train specifically for retrieval using:

  • Unsupervised contrastive learning on documents

  • Inverse tasks (generate query from passage)

  • Retrieval-augmented objectives

  • Corpus-aware objectives

Pre-training Methods Literature

Unsupervised Pre-training

| Paper | Author | Venue | Code | Key Innovation |
| --- | --- | --- | --- | --- |
| Latent Retrieval for Weakly Supervised Open Domain QA (ORQA) | Lee et al. | ACL 2019 | Code | Inverse Cloze Task (ICT): pre-trains by predicting which passage a sentence came from, generating pseudo-queries automatically. Unsupervised data generation at scale (see the sketch after this table). |
| Unsupervised Dense Information Retrieval with Contrastive Learning (Contriever) | Izacard et al. | TMLR 2022 | Code | Contrastive + augmentation: robust unsupervised representations via contrastive learning and aggressive data augmentation. State-of-the-art zero-shot retrieval with no labels needed (see the sketch after this table). |
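For concreteness, here is a minimal sketch (not the authors' code) of the two unsupervised recipes above: ICT-style pseudo-query construction and Contriever-style independent random cropping, both trained with an in-batch InfoNCE loss. The encoder checkpoint, sentence-splitting heuristic, crop ratio, temperature, and toy documents are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def ict_pair(passage: str):
    """Inverse Cloze Task: one sentence acts as the pseudo-query,
    the remaining sentences act as the positive passage."""
    sentences = [s for s in passage.split(". ") if s]
    i = random.randrange(len(sentences))
    return sentences[i], ". ".join(sentences[:i] + sentences[i + 1:])

def crop_pair(passage: str, ratio: float = 0.5):
    """Contriever-style augmentation: two independent random crops
    of the same document form a positive pair."""
    tokens = passage.split()
    span = max(1, int(len(tokens) * ratio))
    def crop():
        start = random.randrange(max(1, len(tokens) - span + 1))
        return " ".join(tokens[start:start + span])
    return crop(), crop()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)  # mean pooling

def info_nce_loss(queries, passages, temperature: float = 0.05):
    """In-batch negatives: the i-th passage is the positive for the i-th query."""
    q = F.normalize(embed(queries), dim=-1)
    p = F.normalize(embed(passages), dim=-1)
    scores = (q @ p.t()) / temperature
    labels = torch.arange(len(queries))
    return F.cross_entropy(scores, labels)

docs = [
    "Dense retrieval maps queries and passages to vectors. Relevance is vector similarity.",
    "BERT was pre-trained with masked language modeling. Its CLS token is not tuned for retrieval.",
    "The Inverse Cloze Task predicts which passage a sentence was taken from. It needs no labels.",
    "Contrastive pre-training with random cropping yields robust zero-shot retrievers. It scales well.",
]
pairs = [ict_pair(d) for d in docs]          # or: [crop_pair(d) for d in docs]
loss = info_nce_loss([q for q, _ in pairs], [c for _, c in pairs])
loss.backward()
```

Either pair-construction function can feed the same loss; the only difference between the ICT and Contriever recipes in this sketch is how positives are generated.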

Supervised Retrieval-Augmented Pre-training

| Paper | Author | Venue | Code | Key Innovation |
| --- | --- | --- | --- | --- |
| REALM: Retrieval-Augmented Language Model Pre-Training | Guu et al. | ICML 2020 | Code | End-to-end retrieval pre-training: jointly pre-trains the retriever and the language model, with the index refreshed during pre-training. Computationally heavy but powerful for end-to-end QA (see the sketch after this table). |
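A heavily simplified sketch of REALM's training signal, assuming a fixed set of top-K candidate documents: the masked-token likelihood is marginalized over retrieved documents, so the retriever receives gradients through the retrieval distribution. The function name and tensor shapes below are illustrative, not the original implementation.

```python
import torch
import torch.nn.functional as F

def realm_style_loss(query_vec, doc_vecs, log_p_y_given_x_and_doc):
    """
    query_vec:               (H,)   query embedding from the retriever
    doc_vecs:                (K, H) embeddings of the top-K retrieved documents
    log_p_y_given_x_and_doc: (K,)   reader's log-likelihood of the masked tokens given each doc
    """
    retrieval_scores = doc_vecs @ query_vec             # (K,) dot-product relevance
    log_p_doc = F.log_softmax(retrieval_scores, dim=0)  # log p(z | x) over candidates
    # log p(y | x) = logsumexp_z [ log p(z | x) + log p(y | x, z) ]
    log_marginal = torch.logsumexp(log_p_doc + log_p_y_given_x_and_doc, dim=0)
    return -log_marginal

# Toy usage with random tensors standing in for real encoder/reader outputs.
H, K = 128, 8
loss = realm_style_loss(torch.randn(H), torch.randn(K, H), -torch.rand(K))
```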

Architecture-Aware Pre-training

| Paper | Author | Venue | Code | Key Innovation |
| --- | --- | --- | --- | --- |
| Condenser: a Pre-training Architecture for Dense Retrieval | Gao & Callan | EMNLP 2021 | Code | Skip-connection head: an architectural modification that forces global information into the CLS token, making the CLS better suited to representing the entire document (see the sketch after this table). |
| Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval (coCondenser) | Gao & Callan | ACL 2022 | Code | Corpus-aware contrastive: unsupervised contrastive learning at the corpus level that aligns document spans without labels. Strong zero-shot performance. |
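A hedged sketch of the Condenser idea (the layer indices, head depth, and generic Transformer layers are assumptions, not the paper's exact architecture): a short "head" predicts masked tokens from the backbone's late CLS vector concatenated with early token states, which pressures the CLS vector to summarize the whole document. coCondenser additionally adds an unsupervised corpus-level contrastive loss over span representations, which is omitted here.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CondenserStyleHead(nn.Module):
    def __init__(self, name="bert-base-uncased", early_layer=6, head_layers=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name, output_hidden_states=True)
        layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
        self.head = nn.TransformerEncoder(layer, num_layers=head_layers)
        self.early_layer = early_layer
        self.mlm = nn.Linear(768, self.backbone.config.vocab_size)

    def forward(self, **batch):
        out = self.backbone(**batch)
        early = out.hidden_states[self.early_layer]   # (B, L, 768) early token states
        late_cls = out.last_hidden_state[:, :1]       # (B, 1, 768) final-layer CLS
        # Skip connection: the late CLS replaces the early CLS, then a short
        # Transformer head predicts masked tokens from this combination.
        head_in = torch.cat([late_cls, early[:, 1:]], dim=1)
        return self.mlm(self.head(head_in))           # MLM logits for the head loss

# Toy forward pass
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["dense retrieval benefits from better CLS representations"], return_tensors="pt")
logits = CondenserStyleHead()(**batch)  # (1, seq_len, vocab_size)
```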

Multi-View Pre-training

| Paper | Author | Venue | Code | Key Innovation |
| --- | --- | --- | --- | --- |
| MVR: Multi-View Representation for Dense Retrieval | Zhang et al. | arXiv 2021 | NA | Multi-view generation: learns explicit views for different retrieval intents, with anti-collapse regularization that prevents the views from becoming identical. Handles diverse information needs (see the sketch after this table). |
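A loosely hedged sketch of the multi-view idea: each document is encoded into several view vectors, relevance is taken as the best-matching view, and a diversity penalty discourages the views from collapsing onto one another. The view count and the exact form of the penalty are illustrative, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def multi_view_score(query_vec, doc_views):
    """query_vec: (H,), doc_views: (V, H) -> scalar max-over-views similarity."""
    return (doc_views @ query_vec).max()

def anti_collapse_penalty(doc_views):
    """Penalize pairwise similarity between a document's own views."""
    v = F.normalize(doc_views, dim=-1)
    sims = v @ v.t()
    off_diag = sims - torch.diag(torch.diag(sims))  # zero out self-similarities
    return off_diag.abs().mean()

# Toy usage with random vectors standing in for encoder outputs.
H, V = 128, 4
query, views = torch.randn(H), torch.randn(V, H)
loss = -multi_view_score(query, views) + 0.1 * anti_collapse_penalty(views)
```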

Comparison: Fine-tuning vs Pre-training

| Dimension | Standard Fine-tuning | Specialized Pre-training |
| --- | --- | --- |
| Starting Point | General BERT/RoBERTa | Retrieval-optimized model |
| CLS Token | Optimized for MLM | Optimized for full-document representation |
| Zero-shot | Poor (~0.3 MRR) | Good (~0.45 MRR) |
| Fine-tuning Data | Needs more data | Needs less data |
| Training Cost | Lower (fine-tune only) | Higher (pre-train + fine-tune) |
| Final Performance | Good | Better (+3-7%) |

When to Use Pre-trained Models

Use Retrieval-Specific Pre-training When:

  • Zero-shot performance matters (new domains)

  • Limited fine-tuning data available

  • Want best possible final performance

  • Have computational budget for pre-training

Use Pre-trained Models (from others):

In most cases, the practical approach is to start from an existing retrieval pre-trained checkpoint:

```python
from sentence_transformers import SentenceTransformer

# Modern pre-trained retrievers (recommended)
model = SentenceTransformer('BAAI/bge-base-en-v1.5')   # retrieval-oriented pre-training + contrastive fine-tuning
model = SentenceTransformer('intfloat/e5-base-v2')     # multi-stage contrastive pre-training
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5',
                            trust_remote_code=True)    # Contriever-style contrastive pre-training
```

These are all pre-trained with retrieval-specific objectives and ready to fine-tune.
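One practical caveat (check each model card): several of these checkpoints expect model-specific text prefixes or instructions at encode time. For example, the e5 family expects "query: " and "passage: " prefixes, roughly as in the sketch below; the similarity printout is only a toy illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('intfloat/e5-base-v2')
q_emb = model.encode("query: how do dense retrievers work?", normalize_embeddings=True)
p_emb = model.encode(["passage: Dense retrievers embed queries and passages into vectors.",
                      "passage: Sparse retrieval relies on exact term matching."],
                     normalize_embeddings=True)
print(util.cos_sim(q_emb, p_emb))  # the first passage should score higher
```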

Don’t Pre-train From Scratch Unless:

  • You have a unique domain with massive amounts of unlabeled data

  • You’re doing research on pre-training itself

  • Existing models fail completely on your domain

Implementation Example

Using Pre-trained Contriever

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Load a checkpoint pre-trained with a contrastive objective
# (mean pooling is added automatically when loading the raw HF checkpoint)
model = SentenceTransformer('facebook/contriever')

# Already reasonable zero-shot performance: embed, then rank by cosine similarity
# (query: str, corpus: list of passage strings -- your own data)
query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)
results = util.semantic_search(query_emb, corpus_emb, top_k=10)

# Fine-tune on your data for even better performance
# (train_data: list of (query, positive_passage) pairs)
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in train_data
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch negatives: other passages in the batch serve as negatives
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, loss)], epochs=3)
```
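As an optional sanity check, the built-in IR evaluator can compare retrieval quality before and after fine-tuning. This continues with the model from the snippet above; the query, corpus, and relevance labels are hypothetical placeholders.

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

queries = {"q1": "what is dense retrieval?"}
corpus = {"d1": "Dense retrieval embeds queries and passages into a shared vector space.",
          "d2": "Sparse retrieval scores documents by exact term overlap."}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dev")
print(evaluator(model))  # retrieval metrics (e.g. MRR, nDCG, recall at several cutoffs)
```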

Next Steps