Dense Baselines & Fixed Embeddings¶
This section covers the foundational papers that established dense retrieval as a viable alternative to sparse methods like BM25.
The Dense Retrieval Revolution¶
Before 2020, information retrieval was dominated by BM25. The key innovation of dense retrieval was to use pre-trained language models (BERT) to create semantic embeddings that could match queries and documents by meaning, not just keywords.
Dense Passage Retrieval (DPR)¶
The Foundation Paper
| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Dense Passage Retrieval for Open-Domain Question Answering | Karpukhin et al. | EMNLP 2020 | | In-batch + BM25 static: the standard dual-encoder baseline using in-batch negatives and static BM25 hard negatives. Established that dense retrieval can outperform BM25. |
Key Components
- Architecture: Dual-encoder (separate BERT encoders for query and passage)
- Training: In-batch negatives + BM25-mined hard negatives (see the loss sketch after this list)
- Similarity: Dot product of embeddings
- Index: FAISS for fast approximate nearest-neighbor search
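The training objective pairs each query with its gold passage and treats the other passages in the batch (plus any BM25-mined hard negatives) as negatives. A minimal sketch of that idea; the function name and tensor shapes are assumptions for illustration, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def dpr_style_loss(q_emb, pos_emb, hard_neg_emb=None):
    """In-batch negative NLL over dot-product scores (sketch, not the official DPR code).

    q_emb, pos_emb: (B, d) query / positive-passage embeddings.
    hard_neg_emb:   (B, d) optional BM25-mined hard-negative embeddings.
    """
    passages = pos_emb if hard_neg_emb is None else torch.cat([pos_emb, hard_neg_emb], dim=0)
    scores = q_emb @ passages.T               # (B, B) or (B, 2B) similarity matrix
    labels = torch.arange(q_emb.size(0))      # the positive for query i sits in column i
    return F.cross_entropy(scores, labels)    # softmax over in-batch (+ hard) negatives
```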
Why It Worked
- Pre-trained BERT captures semantic meaning
- Hard negatives (from BM25) force the model to discriminate between lexically similar passages (see the mining sketch after this list)
- Efficient indexing makes it practical at scale
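For the hard negatives, DPR takes top-ranked BM25 passages that are not the gold answer. A rough sketch of that idea using the rank_bm25 package; the helper name and the simple string-equality filter are simplifications of my own, not DPR's exact procedure:

```python
from rank_bm25 import BM25Okapi

def mine_bm25_hard_negatives(query, positive, corpus, k=5):
    """Return top BM25 passages that are not the gold positive (illustrative sketch)."""
    bm25 = BM25Okapi([doc.split() for doc in corpus])     # whitespace tokenization for brevity
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    # High-scoring but wrong passages are "hard": lexically close to the query,
    # so the dense model must learn more than keyword overlap to reject them.
    return [corpus[i] for i in ranked if corpus[i] != positive][:k]
```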
Code Example
```python
from transformers import DPRContextEncoder, DPRQuestionEncoder, DPRQuestionEncoderTokenizer
import torch

# Load the DPR encoders (the context encoder embeds passages offline)
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

# Encode the query
query = "What is the capital of France?"
with torch.no_grad():
    query_emb = q_encoder(**q_tokenizer(query, return_tensors="pt")).pooler_output

# Search a pre-built FAISS index of passage embeddings
scores, indices = index.search(query_emb.numpy(), k=100)
```
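The `index` above is assumed to have been built offline from passage embeddings. A minimal sketch of that step, continuing the snippet above (the `passages` list is a placeholder; `ctx_encoder` comes from the previous block):

```python
import faiss
import numpy as np
import torch
from transformers import DPRContextEncoderTokenizer

# ctx_encoder is loaded in the snippet above; its tokenizer is loaded here
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

passages = ["Paris is the capital and largest city of France.",
            "Berlin is the capital of Germany."]  # placeholder corpus

# Embed every passage once, offline, with the context encoder
with torch.no_grad():
    inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    passage_embs = ctx_encoder(**inputs).pooler_output.numpy().astype(np.float32)

# Exact dot-product index; large corpora typically use an approximate index (IVF/HNSW) instead
index = faiss.IndexFlatIP(passage_embs.shape[1])
index.add(passage_embs)
```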
Fixed Embeddings: RepBERT¶
Extreme Efficiency
| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| RepBERT: Contextualized Text Embeddings for First-Stage Retrieval | Zhan et al. | arXiv 2020 | | Contextualized fixed-length: fixed-length passage embeddings with contextualization. Achieves efficiency comparable to bag-of-words methods while maintaining semantic understanding. |
Key Innovation
RepBERT showed that you could get dense retrieval quality at near-BM25 speed by:

- Pre-computing all passage embeddings offline
- Using highly optimized indexing (quantization, compression), as sketched after this list
- Using simple dot-product similarity (no expensive scoring operations)
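To make the quantization/compression point concrete, here is a small FAISS product-quantization sketch. It is not taken from the RepBERT code; the dimension, sub-vector count, and random data are illustrative only:

```python
import faiss
import numpy as np

d, m, nbits = 768, 96, 8                       # 96 sub-vectors x 8 bits = 96 bytes per passage
passage_embs = np.random.rand(10_000, d).astype(np.float32)  # stand-in for real embeddings

# Product quantization compresses 768 floats (3072 bytes) per vector down to 96 bytes
index = faiss.IndexPQ(d, m, nbits, faiss.METRIC_INNER_PRODUCT)
index.train(passage_embs)                      # learn the PQ codebooks
index.add(passage_embs)                        # only the compressed codes are stored

query = np.random.rand(1, d).astype(np.float32)
scores, ids = index.search(query, 10)          # approximate dot-product search
```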
Performance vs Speed
```text
Speed →
BM25:          ████████████████████ (fastest, ~1ms)
RepBERT:       ███████████████      (fast, ~5ms)
DPR:           ██████████           (medium, ~10ms)
ColBERT:       ████                 (slower, ~50ms)
Cross-Encoder: █                    (slowest, ~1000ms per doc)
```
Comparison: DPR vs RepBERT¶
| Dimension | DPR | RepBERT |
|---|---|---|
| Architecture | Dual BERT encoders | Single shared BERT |
| Embedding Size | 768-d (BERT hidden) | 768-d (BERT hidden) |
| Training | In-batch + BM25 negatives | In-batch only |
| Speed | ~10ms per query | ~5ms per query |
| Index Size | Standard (4 bytes/dim) | Can be quantized heavily |
| Best For | Accuracy | Speed/efficiency |
When to Use Each¶
Use DPR When:

- Accuracy is more important than speed
- You have good hard-negative mining
- A standard FAISS index is acceptable
- You want to follow the best practices established in the literature

Use RepBERT When:

- Speed is critical (near-BM25 latency is needed)
- Index size must be minimal (e.g., edge deployment)
- You don't have the resources for hard-negative mining
- You want the simplest possible dense-retrieval setup
Modern Successors¶
Both DPR and RepBERT have been superseded by more advanced methods, but they remain important baselines. Modern alternatives include:
- Contriever (Facebook AI, 2022): Unsupervised dense retrieval trained with contrastive learning; no labels needed.
- BGE (BAAI, 2023): State-of-the-art dense retriever with advanced hard-negative mining.
- E5 (Microsoft, 2023): Multi-stage pre-training at massive scale (hundreds of millions of text pairs).
- Nomic-Embed (Nomic AI, 2024): Open-source, high-quality embeddings with a permissive license.
Implementation Recommendations¶
For New Projects in 2024
Don’t implement DPR/RepBERT from scratch. Instead:
```python
from sentence_transformers import SentenceTransformer

# Use a modern pre-trained model (better than DPR out of the box)
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Or, for speed (better than RepBERT)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
```
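A short usage sketch for the model loaded above; the example strings are mine, and note that the BGE model card recommends adding a query instruction prefix for retrieval, which is omitted here for brevity:

```python
# Embed the query and candidate passages, then rank by cosine similarity
query_emb = model.encode("what is dense retrieval?", normalize_embeddings=True)
doc_embs = model.encode(
    [
        "Dense retrieval matches queries and passages in a shared embedding space.",
        "BM25 ranks documents by weighted term overlap with the query.",
    ],
    normalize_embeddings=True,
)
print(doc_embs @ query_emb)  # dot product of unit vectors = cosine similarity
```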
Why?
- Pre-trained on larger datasets
- Better hard-negative mining during training
- Optimized inference
- Active maintenance
But Still Study DPR/RepBERT Because:
- To understand the foundational dual-encoder architecture
- As a baseline for your own research
- Many papers still compare against them
- The core concepts still apply
Next Steps¶
- See Hard Negative Mining for how modern methods improve over DPR's static negatives
- See Late Interaction (ColBERT) for a more expressive, token-level approach to representations
- See Pre-training Methods for Dense Retrievers for methods that improve the base encoders