2. Primary Benchmarks
2.1 MS MARCO (Microsoft Machine Reading Comprehension)
Overview
The de facto standard benchmark for passage retrieval, derived from Bing search logs.
| Property | Value |
|---|---|
| Corpus Size | 8,841,823 passages |
| Training Queries | 502,939 (with ~1 relevant passage each) |
| Dev Queries | 6,980 (official small dev set) |
| Avg. Query Length | 5.96 words |
| Avg. Passage Length | 55.98 words |
| Annotation Type | Sparse (typically 1 positive per query) |
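The collection, queries, and sparse qrels can be inspected with standard tooling. Below is a minimal sketch using the ir_datasets package; the dataset IDs follow that package's naming, and any loader exposing the official files works equally well.

```python
# Minimal sketch: inspecting MS MARCO passage data with ir_datasets
# (pip install ir_datasets).
import ir_datasets

dev = ir_datasets.load("msmarco-passage/dev/small")

print(dev.docs_count())      # ~8.84M passages (corpus shared across splits)
print(dev.queries_count())   # 6,980 dev queries

for qrel in dev.qrels_iter():
    # Sparse labels: typically a single relevant passage per query.
    print(qrel.query_id, qrel.doc_id, qrel.relevance)
    break
```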
Characteristics
Sparse Annotations: Most queries have only 1 labeled positive passage, despite multiple relevant passages existing in the corpus. This creates a significant false negative problem.
Web Domain: Queries reflect real user search behavior (navigational, informational, transactional)
Answer Extraction: Original task was extractive QA; passages contain answer spans
Evaluation Protocol
MRR@10: Primary metric (Mean Reciprocal Rank of first relevant result in top-10)
Recall@1000: Secondary metric for retrieval, measuring candidate-generation quality (both metrics are sketched in code below)
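A minimal sketch of both metrics, computed per query over a ranked list of passage IDs; function names are illustrative.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of labeled positives retrieved within the top k candidates."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# Corpus-level MRR@10 / Recall@1000 are the means of these per-query values
# over all dev queries.
```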
Known Limitations
Incomplete Labels: Estimated 30-50% of top-retrieved passages are unlabeled positives
Selection Bias: Annotators only judged candidate passages surfaced by the original retrieval pipeline, biasing labels toward what that pipeline ranked highly (in practice, lexical matches)
Single Positive: Contrastive learning with only 1 positive limits training signal
Implications for Hard Negative Mining
The sparse annotation makes MS MARCO particularly challenging for hard negative mining:
Mining from top BM25 or dense-retrieval results carries a high false-negative risk
Cross-encoder denoising is essential: RocketQA found that roughly 70% of top-retrieved, unlabeled passages were actually relevant
Models trained on MS MARCO with unfiltered mined negatives may learn to push away semantically similar (but unlabeled) passages; a denoising sketch follows this list
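A minimal denoising sketch in the spirit of RocketQA, using an off-the-shelf sentence-transformers cross-encoder. The model name and score threshold are placeholders; the threshold applies to the scorer's raw relevance scores and would need calibration in practice.

```python
from sentence_transformers import CrossEncoder

# Off-the-shelf MS MARCO cross-encoder; any pairwise relevance scorer works.
scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def denoise_negatives(query, mined_passages, threshold=5.0):
    """Drop mined 'negatives' that the cross-encoder scores as likely positives.

    `threshold` is illustrative: mined passages whose raw relevance score
    exceeds it are treated as probable false negatives and removed.
    """
    scores = scorer.predict([(query, passage) for passage in mined_passages])
    return [p for p, s in zip(mined_passages, scores) if s < threshold]
```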
2.2 Natural Questions (NQ)
Overview
Google’s open-domain question answering dataset based on real search queries.
| Property | Value |
|---|---|
| Corpus | Wikipedia (21M passages in DPR split) |
| Training Queries | 79,168 |
| Dev Queries | 8,757 |
| Test Queries | 3,610 |
| Avg. Query Length | 9.2 words |
| Annotation Type | Dense (multiple annotators, long/short answers) |
Characteristics
Natural Queries: Real Google search queries (more natural than synthetic)
Wikipedia Corpus: Well-structured, factual content
Multiple Answer Types: Short answer spans + long answer paragraphs
Higher Quality Labels: Multiple annotators reduce noise
Evaluation Protocol
Top-k Accuracy: Whether any of the top-k retrieved passages contains a gold answer string (see the sketch after this list)
Recall@k: Standard retrieval recall
Exact Match (EM): For end-to-end QA evaluation
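A minimal sketch of the top-k answer-containment check, using SQuAD-style string normalization; helper names are illustrative.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def hits_at_k(retrieved_passages, gold_answers, k=20):
    """1.0 if any of the top-k passages contains any normalized gold answer."""
    for passage in retrieved_passages[:k]:
        haystack = normalize(passage)
        if any(normalize(answer) in haystack for answer in gold_answers):
            return 1.0
    return 0.0

# Top-k accuracy is the mean of hits_at_k over all test questions.
```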
2.3 BEIR (Benchmarking IR)
Overview
The gold standard for zero-shot retrieval evaluation, comprising 18 diverse datasets; the subset most commonly used in published comparisons is listed below.
| Dataset | Domain | Corpus Size | Task Description |
|---|---|---|---|
| MS MARCO | Web | 8.8M | Passage retrieval from Bing |
| TREC-COVID | Biomedical | 171K | COVID-19 scientific literature |
| NFCorpus | Biomedical | 3.6K | Nutrition and medical |
| NQ | Wikipedia | 2.7M | Open-domain QA |
| HotpotQA | Wikipedia | 5.2M | Multi-hop reasoning |
| FiQA | Finance | 57K | Financial QA |
| ArguAna | Misc | 8.7K | Argument retrieval |
| Touché-2020 | Misc | 382K | Argument retrieval (web) |
| CQADupStack | StackExchange | 457K | Duplicate question detection |
| Quora | Web | 523K | Duplicate question pairs |
| DBPedia | Wikipedia | 4.6M | Entity retrieval |
| SCIDOCS | Scientific | 25K | Citation prediction |
| FEVER | Wikipedia | 5.4M | Fact verification |
| Climate-FEVER | Scientific | 5.4M | Climate fact-checking |
| SciFact | Scientific | 5.2K | Scientific claim verification |
Evaluation Protocol
Primary Metric: nDCG@10 (position-weighted relevance)
Aggregation: Average nDCG@10 across all 18 datasets
No Fine-tuning: Models must be evaluated zero-shot, without any dataset-specific training (an evaluation sketch follows this list)
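A minimal zero-shot evaluation sketch using the beir package with an off-the-shelf dense encoder. The dataset (SciFact) and encoder name are placeholders; the download URL follows the pattern used in the beir repository.

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (test split only; no fine-tuning).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Encode and retrieve with a frozen dense encoder.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")
results = retriever.retrieve(corpus, queries)

# nDCG@10 is the per-dataset number; the reported BEIR score is its
# macro-average across datasets.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```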
Why BEIR Matters for Hard Negative Mining
BEIR tests whether negative mining strategies produce generalizable representations:
Domain Diversity: Finance, biomedical, legal, scientific, web
Task Diversity: QA, fact-checking, argument retrieval, duplicate detection
Corpus Size Variation: 3.6K to 8.8M documents
Query Style Variation: Keywords, questions, claims, arguments
Note
Key Finding: Models with aggressive hard negative mining (e.g., ANCE with frequent refresh) sometimes show strong MS MARCO performance but weaker BEIR generalization. This suggests overfitting to MS MARCO’s specific negative distribution.
2.4 MTEB (Massive Text Embedding Benchmark)
Overview
Comprehensive benchmark covering 8 embedding tasks across 58+ datasets.
| Task | Metric | Description |
|---|---|---|
| Retrieval | nDCG@10 | Information retrieval (includes BEIR) |
| Reranking | MAP | Re-ordering candidate documents |
| Classification | Accuracy | Text categorization |
| Clustering | V-measure | Document grouping |
| Pair Classification | AP | Binary relationship prediction |
| STS | Spearman ρ | Semantic textual similarity |
| Summarization | Spearman ρ | Summary quality scoring |
| Bitext Mining | F1 | Cross-lingual alignment |
Relevance to Dense Retrieval Research
MTEB’s retrieval subset includes 15 BEIR datasets, making it the standard point of comparison for embedding models; the same evaluation can also be run locally with the mteb package (sketched after the list below). The leaderboard at HuggingFace provides:
Standardized evaluation protocols
Fair comparison across models
Task-specific and aggregate rankings
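A minimal sketch of a local run with the mteb package and a SentenceTransformer encoder. The model and task names are placeholders, and newer mteb versions may prefer selecting tasks via mteb.get_tasks.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any retrieval tasks from the MTEB/BEIR subset can be listed here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["SciFact", "NFCorpus"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```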
2.5 RTEB (Retrieval Text Embedding Benchmark)
Overview
A newer benchmark specifically designed for retrieval, with private test sets to prevent data contamination.
Key Differentiators from MTEB
| Feature | MTEB | RTEB |
|---|---|---|
| Test Sets | Public (risk of contamination) | Private + Public |
| Evaluation | Local scripts | Server-side submission |
| Focus | 8 embedding tasks | Retrieval only |
| Metric | Task-specific averages | nDCG@10 |
| Leakage Risk | High (test sets in training data) | Low (private evaluation) |
Implications for Research
For Development: Use MTEB/BEIR for ablation studies and hyperparameter tuning
For Publication: Submit to RTEB for credible zero-shot generalization claims
For Hard Negative Mining: RTEB’s private sets better test true generalization