Dense Baselines & Fixed Embeddings

This section covers the foundational papers that established dense retrieval as a viable alternative to sparse methods like BM25.

The Dense Retrieval Revolution

Before 2020, first-stage information retrieval was dominated by sparse lexical methods such as BM25. The key innovation of dense retrieval was to use pre-trained language models such as BERT to produce semantic embeddings that match queries and documents by meaning rather than by exact keyword overlap.

Dense Passage Retrieval (DPR)

The Foundation Paper

  • Paper: Dense Passage Retrieval for Open-Domain Question Answering

  • Authors: Karpukhin et al.

  • Venue: EMNLP 2020

  • Key Innovation: In-batch + BM25 Static: the standard dual-encoder baseline using in-batch negatives and static BM25 hard negatives. Established that dense retrieval can outperform BM25.

Key Components

  1. Architecture: Dual-encoder (separate BERT for query and passage)

  2. Training: In-batch negatives + BM25-mined hard negatives (loss sketched after this list)

  3. Similarity: Dot product of embeddings

  4. Index: FAISS for fast approximate nearest neighbor search
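
A minimal sketch of the in-batch negative objective from component 2, assuming a batch of query embeddings paired with their gold-passage embeddings (the function and variable names are illustrative, not DPR's actual training code):

import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    # q_emb, p_emb: (B, d); passage i is the positive for query i, all other passages act as negatives
    scores = q_emb @ p_emb.T                                   # (B, B) dot-product similarities
    labels = torch.arange(q_emb.size(0), device=q_emb.device)  # positives lie on the diagonal
    return F.cross_entropy(scores, labels)                     # softmax over the batch, NLL of the positive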

Why It Worked

  • Pre-trained BERT captures semantic meaning

  • Hard negatives (from BM25) force discrimination (mining sketch after this list)

  • Efficient indexing makes it practical
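
As a toy illustration of mining BM25 hard negatives, here is a sketch using the rank_bm25 package (the corpus, query, and selection rule are illustrative assumptions, not DPR's actual mining pipeline):

from rank_bm25 import BM25Okapi

corpus = [
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "What is the capital of France?"
gold_passage = corpus[0]

# Take top BM25 hits that are not the gold passage as hard negatives for training
ranked = bm25.get_top_n(query.lower().split(), corpus, n=len(corpus))
hard_negatives = [p for p in ranked if p != gold_passage][:1]
print(hard_negatives)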

Code Example

from transformers import (
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    DPRContextEncoder, DPRContextEncoderTokenizer,
)
import torch

# Load the DPR query and passage encoders (with their tokenizers)
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Encode the query
query = "What is the capital of France?"
with torch.no_grad():
    query_emb = q_encoder(**q_tokenizer(query, return_tensors="pt")).pooler_output

# Search a FAISS index built offline over pre-computed passage embeddings
# (the `index` construction is sketched below)
scores, indices = index.search(query_emb.numpy(), 100)
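
Continuing the example, a hedged sketch of how the passage index could be built offline with the context encoder and an exact inner-product FAISS index (the passages and variable names are illustrative):

import faiss
import numpy as np

passages = [
    "Paris is the capital and largest city of France.",
    "Berlin is the capital and largest city of Germany.",
]

# Encode passages with the context encoder (done once, offline)
with torch.no_grad():
    ctx_inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    passage_embs = ctx_encoder(**ctx_inputs).pooler_output.numpy()

# DPR scores query/passage pairs with a dot product, so use an inner-product index
index = faiss.IndexFlatIP(passage_embs.shape[1])
index.add(np.ascontiguousarray(passage_embs.astype("float32")))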

Fixed Embeddings: RepBERT

Extreme Efficiency

  • Paper: RepBERT: Contextualized Text Embeddings for First-Stage Retrieval

  • Authors: Zhan et al.

  • Venue: arXiv 2020

  • Key Innovation: Contextualized Fixed-Length: fixed-length contextualized embeddings that achieve efficiency comparable to bag-of-words methods while maintaining semantic understanding.

Key Innovation

RepBERT showed that you could get dense retrieval quality with near-BM25 speed by:

  1. Pre-computing all passage embeddings offline

  2. Using highly optimized indexing (quantization, compression; see the sketch after this list)

  3. Simple dot product similarity (no expensive operations)
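
As a rough sketch of point 2, a compressed FAISS index (product quantization) can shrink the footprint of pre-computed embeddings dramatically; the corpus size and index parameters below are illustrative assumptions, not RepBERT's published configuration:

import faiss
import numpy as np

d = 768                                                # BERT hidden size
embs = np.random.rand(100_000, d).astype("float32")   # stand-in for pre-computed passage embeddings

# Uncompressed float32 vectors cost 768 * 4 = 3072 bytes per passage;
# IVF + product quantization (64 codes of 8 bits) stores roughly 64 bytes per passage.
index = faiss.index_factory(d, "IVF1024,PQ64", faiss.METRIC_INNER_PRODUCT)
index.train(embs)      # learn coarse centroids and PQ codebooks offline
index.add(embs)

index.nprobe = 16      # inverted lists scanned per query (speed/recall trade-off)
scores, ids = index.search(embs[:1], 10)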

Performance vs Speed

Speed →
BM25: ████████████████████ (fastest, ~1ms)
RepBERT: ███████████████ (fast, ~5ms)
DPR: ██████████ (medium, ~10ms)
ColBERT: ████ (slower, ~50ms)
Cross-Encoder: █ (slowest, ~1000ms per doc)

Comparison: DPR vs RepBERT

| Dimension | DPR | RepBERT |
| --- | --- | --- |
| Architecture | Dual BERT encoders (separate query and passage encoders) | Single shared BERT encoder |
| Embedding Size | 768-d (BERT hidden size) | 768-d (BERT hidden size) |
| Training | In-batch + BM25 hard negatives | In-batch negatives only |
| Speed | ~10 ms per query | ~5 ms per query |
| Index Size | Standard (4 bytes per dimension, float32) | Can be quantized heavily |
| Best For | Accuracy | Speed / efficiency |

When to Use Each

Use DPR When:

  • Accuracy is more important than speed

  • You have good hard negative mining

  • Standard FAISS index is acceptable

  • You want to follow the established recipe from the literature

Use RepBERT When:

  • Speed is critical (near-BM25 performance needed)

  • Index size must be minimal (e.g., edge deployment)

  • You don’t have the resources for hard negative mining

  • You want the simplest possible dense retrieval setup

Modern Successors

Both DPR and RepBERT have been superseded by more advanced methods, but they remain important baselines. Modern alternatives include:

Contriever (Facebook AI, 2022)

Unsupervised dense retrieval with contrastive learning. No labels needed!

BGE (BAAI, 2023)

State-of-the-art dense retriever with advanced hard negative mining.

E5 (Microsoft, 2023)

Multi-stage pre-training with massive scale (hundreds of millions of pairs).

Nomic-Embed (Nomic AI, 2024)

Open-source, high-quality embeddings with permissive license.

Implementation Recommendations

For New Projects in 2024

Don’t implement DPR/RepBERT from scratch. Instead:

from sentence_transformers import SentenceTransformer

# Use modern pre-trained model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')  # Better than DPR

# Or for speed
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')  # Better than RepBERT
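
A short usage sketch for either model (some models, e.g. BGE, recommend a query instruction prefix; check the model card. The query and passages below are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

query = "What is the capital of France?"
passages = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital of Germany.",
]

# Encode and rank by cosine similarity (normalized embeddings make dot product == cosine)
q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)
print(util.cos_sim(q_emb, p_embs))   # shape (1, 2); higher score = better match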

Why?

  • Pre-trained on larger datasets

  • Better hard negative mining during training

  • Optimized inference

  • Active maintenance

But Still Study DPR/RepBERT Because:

  • Understand foundational architecture

  • Baseline for your own research

  • Many papers compare against them

  • Core concepts still apply

Next Steps