# Hard Negative Mining
The quality of negative samples is the primary determinant of dense retrieval performance.
This section covers strategies for mining “hard” negatives—documents that are semantically similar to the query but not relevant. This is the core bottleneck in dense retrieval research and the main focus of modern improvements.
> **Important**
>
> **The Arms Race Metaphor:** Advanced negative mining is a co-evolutionary arms race between the retriever (the student model being trained) and the sampler (the mechanism that finds negatives). As the model improves, the negatives must get harder.
## The Hard Negative Problem

See the detailed explanation in the main documentation: The Hard Negative Problem

**Quick Summary:**

- **Easy Negatives:** random, unrelated documents → the model separates them almost immediately and learns little
- **Hard Negatives:** semantically similar but irrelevant → force fine-grained discrimination
- **False Negatives:** actually relevant but unlabeled as such → actively damage training

The goal is to mine negatives in the "Goldilocks zone": hard enough to be challenging, but not so hard that they are actually unlabeled positives (false negatives).
## Evolution of Hard Negative Strategies

| Generation | Strategy | Example Papers | Pros | Cons |
|---|---|---|---|---|
| 1st Gen | Random / In-batch | DPR, RepBERT | Simple, fast | Easy negatives |
| 2nd Gen | Static BM25 | DPR (enhanced) | Lexically hard | Goes stale quickly |
| 3rd Gen | Dynamic ANN Refresh | ANCE | Always fresh | Expensive |
| 4th Gen | Cross-encoder Denoised | RocketQA, PAIR | Filters false negatives | Needs a cross-encoder |
| 5th Gen | Curriculum / Smart Sampling | TAS-Balanced, SimANS | Efficient | Complex design |
| 6th Gen | LLM-Synthetic | SyNeg | Perfect calibration | LLM cost |
## Hard Negative Mining Papers

### Core Dynamic Mining Methods

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE) | Xiong et al. | ICLR 2021 | | **Dynamic Index Refresh:** asynchronously refreshes the ANN index with the latest model checkpoint to find negatives that are hard for the current model state, not just the initial model. |
| RocketQA: An Optimized Training Approach to Dense Passage Retrieval | Qu et al. | NAACL 2021 | | **Cross-Batch & Denoising:** shares negatives across GPUs (cross-batch) and filters ~70% of false negatives with a cross-encoder teacher; the huge negative pool improves discrimination. |
| Optimizing Dense Retrieval Model Training with Hard Negatives (ADORE) | Zhan et al. | SIGIR 2021 | | **Query-side Finetuning:** ADORE optimizes the query encoder against a fixed document index to generate dynamic negatives efficiently; STAR adds stabilization with random negatives. |
| Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently | Zhan et al. | arXiv 2020 | NA | **Comprehensive Analysis:** systematic study of training strategies, including negative sampling, loss functions, and architecture choices. Essential reading for understanding the trade-offs. |
| Neural Passage Retrieval with Improved Negative Contrast | Lu et al. | arXiv 2020 | NA | **Improved Contrast:** enhanced negative contrast mechanism for better discrimination between relevant and irrelevant passages. |
| Learning Robust Dense Retrieval Models from Incomplete Relevance Labels | Prakash et al. | SIGIR 2021 | | **Incomplete Labels:** robust training methodology that handles incomplete and noisy relevance labels, common in real-world datasets. |
### Smart Sampling Strategies

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling (TAS-Balanced) | Hofstätter et al. | SIGIR 2021 | | **Topic Aware Sampling:** clusters queries into topics and samples negatives that are topically related but distinct. Dual-teacher distillation (pairwise + in-batch). Trains on a single GPU in under 48 hours. |
| PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval | Ren et al. | ACL Findings 2021 | NA | **Passage-Centric Loss:** enforces structure among the passages themselves, not just query-passage pairs. Uses a cross-encoder teacher with confidence thresholds to denoise. |
| SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval | Zhou et al. | EMNLP 2022 | | **Ambiguous Zone Sampling:** samples negatives from the "ambiguous zone", ranked neither too high (likely false negatives) nor too low (too easy). Avoids both extremes. |
| Curriculum Learning for Dense Retrieval Distillation (CL-DRD) | Zeng et al. | SIGIR 2022 | NA | **Curriculum-based Distillation:** trains on progressively harder negatives (easy → medium → hard); progressive difficulty controls noise and stabilizes training, guided by a cross-encoder teacher. |
### Efficiency and Scalability

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage (GradCache) | Gao et al. | RepL4NLP 2021 | | **Memory-Efficient In-Batch:** gradient caching enables very large batch sizes (thousands) with almost constant memory. More in-batch negatives means harder negatives for free. |
| Efficient Training of Retrieval Models Using Negative Cache | Lindgren et al. | NeurIPS 2021 | | **Negative Cache:** caches negatives from previous batches for reuse, amortizing the cost of hard negative mining across steps; cache rotation mitigates the stale-negative problem. |
| Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval | Lu et al. | EMNLP 2021 | NA | **Multi-Stage Training:** synthetic pre-training → fine-tuning → negative sampling, with progressive hardness across the stages. |
### Advanced False Negative Handling

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| | EMNLP Authors | EMNLP Findings 2023 | NA | **Automatic Detection & Correction:** uses perplexity-based noise detection to identify false negatives; an EMA (exponential moving average) model provides the correction signal. |
| | arXiv Authors | arXiv 2024 | NA | **Confidence Regularization:** adds a regularizer to the NCE (noise contrastive estimation) loss that adapts to the model's confidence in the negative labels, softening penalties for uncertain negatives. |
| TriSampler: A Better Negative Sampling Principle for Dense Retrieval | arXiv Authors | arXiv 2024 | NA | **Quasi-Triangular Principle:** models the triangular relationship among query, positive, and negative, ensuring negatives are far from the positive but informative relative to the query. |
### LLM-Enhanced Methods (Latest Frontier)

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval | arXiv Authors | arXiv 2024 | NA | **LLM-Synthetic Generation:** uses LLMs to generate text that is semantically similar to the positive but factually contradictory, creating "perfect" hard negatives that may not exist in the corpus. Risk: an adversarial gap (the model learns to detect LLM output rather than relevance). |
## Key Implementation Strategies

For detailed implementation guidance, see the full documentation section on each strategy.

### Strategy 1: ANCE-Style Dynamic Mining

**Architecture:** two asynchronous processes

- **Training process:** reads hard negatives, trains the model, saves checkpoints
- **Generation process:** loads the latest checkpoint, re-encodes the corpus, mines new negatives

**When to use:** large-scale settings where SOTA performance is critical and distributed infrastructure is available

**Challenge:** catastrophic forgetting if the index is refreshed too frequently
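A minimal synchronous sketch of the refresh loop is below; ANCE itself runs training and index refresh asynchronously on separate workers. The `train_steps` helper and the `positives` lookup are assumptions for illustration, and FAISS stands in for any ANN index:

```python
import faiss
import numpy as np

def ance_style_loop(model, corpus, queries, positives,
                    refresh_every=10_000, total_steps=100_000, k=100):
    """Periodically re-encode the corpus with the current checkpoint and
    re-mine negatives, so they stay hard for the *current* model."""
    for _ in range(0, total_steps, refresh_every):
        # 1. Rebuild the ANN index from fresh document embeddings
        doc_embs = np.asarray(
            model.encode(corpus, normalize_embeddings=True), dtype=np.float32)
        index = faiss.IndexFlatIP(doc_embs.shape[1])
        index.add(doc_embs)

        # 2. Mine top-k neighbors per query, dropping labeled positives
        q_embs = np.asarray(
            model.encode(queries, normalize_embeddings=True), dtype=np.float32)
        _, ids = index.search(q_embs, k)
        negatives = {
            q: [corpus[i] for i in row if corpus[i] not in positives[q]]
            for q, row in zip(queries, ids)
        }

        # 3. Train until the next refresh (hypothetical helper)
        train_steps(model, negatives, num_steps=refresh_every)
```

The refresh interval trades index freshness against the cost of re-encoding the corpus.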
### Strategy 2: RocketQA-Style Cross-Batch Denoising

**Architecture:** share negatives across GPUs, then filter with a cross-encoder

1. Collect negatives from all GPUs in the cluster (a huge pool)
2. Score the pool with a cross-encoder and filter out false negatives
3. Train the bi-encoder on the denoised set

**When to use:** multi-GPU setups where false negatives are a major concern

**Advantage:** detecting ~70% of false negatives significantly improves training quality
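A hedged sketch of the cross-batch step with `torch.distributed` (names are illustrative; this is not RocketQA's released code). Denoising then proceeds as in the Phase 3 snippet further below, scoring the pooled negatives with a cross-encoder teacher:

```python
import torch
import torch.distributed as dist

def cross_batch_doc_embeddings(local_docs: torch.Tensor) -> torch.Tensor:
    """Gather document embeddings from every GPU so each worker sees
    world_size * batch_size candidate negatives instead of batch_size."""
    world_size = dist.get_world_size()
    gathered = [torch.zeros_like(local_docs) for _ in range(world_size)]
    dist.all_gather(gathered, local_docs)
    # all_gather returns tensors detached from the autograd graph;
    # re-insert the local copy so gradients still flow through this
    # worker's own documents (a common trick in contrastive training)
    gathered[dist.get_rank()] = local_docs
    return torch.cat(gathered, dim=0)  # (world_size * batch, dim)
```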
### Strategy 3: TAS-Balanced-Style Topic Clustering

**Architecture:** cluster queries, then sample within clusters

1. Cluster queries by topic (e.g., k-means on query embeddings)
2. For each query, sample negatives from the same cluster
3. Use a balanced margin loss to handle noise

**When to use:** you want efficiency (single GPU) and have a representative query distribution

**Advantage:** very efficient; matches ANCE-level performance in far less time
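A minimal sketch of the clustering step with scikit-learn k-means; the cluster count is an illustrative choice, and the pairwise-teacher balancing that TAS-Balanced adds on top is omitted. `model` and `train_queries` are assumed to exist:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster queries by topic in the bi-encoder's embedding space
q_embs = model.encode(train_queries)
topics = KMeans(n_clusters=2000, random_state=0).fit_predict(q_embs)

def topic_batch(rng: np.random.Generator, batch_size: int = 32) -> list:
    """Draw a batch from a single topic cluster, so in-batch negatives
    come from related queries and are therefore harder."""
    topic = rng.choice(topics)  # picks a topic, weighted by its size
    members = np.where(topics == topic)[0]
    picked = rng.choice(members, size=min(batch_size, len(members)),
                        replace=False)
    return [train_queries[i] for i in picked]

batch = topic_batch(np.random.default_rng(0))
```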
### Strategy 4: SimANS-Style Ambiguous Sampling

**Architecture:** score-based filtering around the positive

1. Retrieve top-k candidates with the student model
2. Keep the "ambiguous zone": sim ∈ [positive_score - margin, positive_score + margin]
3. These candidates are hard but unlikely to be false negatives

**When to use:** you want something simple to implement that works well in practice

**Advantage:** avoids both extremes (too easy, too hard)
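The SimANS paper makes step 2 probabilistic rather than a hard cutoff: candidates are sampled with probability peaked where their score is close to the positive's. A sketch under that reading, with illustrative hyperparameters `a` and `b`:

```python
import numpy as np

def ambiguous_sample(neg_scores, pos_score, n_samples=8, a=0.5, b=0.0,
                     rng=None):
    """Sample negative indices with p_i ∝ exp(-a * (s_i - s_pos - b)^2):
    candidates scored far below the positive (too easy) or at/above it
    (likely false negatives) both get vanishing probability."""
    rng = rng or np.random.default_rng()
    s = np.asarray(neg_scores, dtype=np.float64)
    logits = -a * (s - pos_score - b) ** 2
    p = np.exp(logits - logits.max())  # subtract max for stability
    p /= p.sum()
    return rng.choice(len(s), size=min(n_samples, len(s)),
                      replace=False, p=p)
```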
## Bird's Eye View: Which Strategy to Choose?

| Strategy | Complexity | Performance | Speed | False Neg Rate | Best For |
|---|---|---|---|---|---|
| In-Batch (Baseline) | Very Low | Baseline | Very Fast | High (15-25%) | Prototyping |
| Static BM25 | Low | +5-8% | Fast | Medium (8-15%) | First iteration |
| ANCE (Dynamic) | Very High | SOTA | Slow | Low (3-6%) | Research, scale |
| RocketQA (Denoised) | High | SOTA -2% | Medium | Very Low (<2%) | Multi-GPU |
| TAS-Balanced | Medium | SOTA -5% | Fast | Low (3-5%) | Recommended |
| SimANS | Low | SOTA -7% | Fast | Low (2-4%) | Recommended |
| SyNeg (LLM) | Medium | SOTA -3% | Medium | Very Low (<1%) | Domain-specific |
## Recommended Path for Practitioners

**Phase 1: Baseline (Week 1)**

Start with in-batch negatives to establish a baseline:

```python
from sentence_transformers import SentenceTransformer, losses

# In-batch negatives: each positive in the batch acts as a negative
# for every other query in the batch
model = SentenceTransformer('bert-base-uncased')
loss = losses.MultipleNegativesRankingLoss(model)
```
**Phase 2: Static Hard Negatives (Week 2)**

Add BM25-mined negatives for a 5-8% improvement. A sketch with `rank_bm25` (any BM25 index works; `corpus`, `query`, and `positives` are assumed to exist):

```python
from rank_bm25 import BM25Okapi

# Mine negatives offline: top BM25 hits that are not labeled positives
bm25 = BM25Okapi([doc.split() for doc in corpus])
candidates = bm25.get_top_n(query.split(), corpus, n=100)
hard_negs = [doc for doc in candidates if doc not in positives]
```
**Phase 3: Cross-Encoder Denoising (Week 3)**

Implement the two-step pipeline for a 10-15% improvement:

```python
from sentence_transformers import CrossEncoder

# Filter mined negatives with a cross-encoder teacher
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
scores = cross_encoder.predict([(query, neg) for neg in hard_negs])
# The 0.5 cutoff lives in the teacher's score space; calibrate it per model
denoised = [neg for neg, score in zip(hard_negs, scores) if score < 0.5]
```
**Phase 4: Smart Sampling (Week 4)**

Add score-based filtering (SimANS-style) for another 5-10%:

```python
# Keep only negatives in the "ambiguous zone" around the positive's score
q_emb = model.encode(query)
pos_score = float(model.similarity(q_emb, model.encode(positive)))
margin = 0.1
neg_scores = model.similarity(q_emb, model.encode(denoised))[0]
ambiguous_negatives = [
    neg for neg, s in zip(denoised, neg_scores)
    if pos_score - margin < s < pos_score + margin
]
```
**Phase 5: Curriculum (Optional, Week 5)**

Wrap training in a curriculum for stability (`train` here is a schematic stand-in for your fit loop):

```python
# Stage 1: easy (in-batch negatives only)
train(model, in_batch_data, epochs=2)
# Stage 2: medium (BM25-mined negatives)
train(model, bm25_data, epochs=3)
# Stage 3: hard (cross-encoder-denoised negatives)
train(model, denoised_data, epochs=5)
```
This progression gives you 85-90% of SOTA performance with manageable complexity.
## Common Pitfalls

**❌ Pitfall 1: Not measuring the false negative rate**

- Assume a 10-20% false negative rate in naive mining
- Validate on a subset with exhaustive labels (see the sketch after this list)

**❌ Pitfall 2: Using only the hardest negatives**

- Balance hard and medium negatives
- The very hardest candidates are often false negatives

**❌ Pitfall 3: Ignoring computational cost**

- ANCE requires roughly 3-5x more compute than a two-step pipeline
- Factor in infrastructure cost, not just metrics

**❌ Pitfall 4: Not using a curriculum**

- A curriculum is low-cost and high-reward
- It almost always improves results by 3-5%

**❌ Pitfall 5: Treating all queries equally**

- Query difficulty varies widely
- Consider query-specific thresholds
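For Pitfall 1, one rough estimator: score a random sample of mined (query, negative) pairs with a strong cross-encoder and count how many it judges relevant. The threshold is an assumption and sits in the teacher's score space, so calibrate it against a labeled subset:

```python
import random
from sentence_transformers import CrossEncoder

def estimate_false_negative_rate(mined_pairs, sample_size=500, threshold=0.9):
    """Fraction of mined (query, negative) pairs that a cross-encoder
    teacher scores as relevant, i.e. likely false negatives."""
    sample = random.sample(mined_pairs, min(sample_size, len(mined_pairs)))
    teacher = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    scores = teacher.predict(sample)
    # threshold is teacher-specific; calibrate on exhaustively labeled data
    return sum(s > threshold for s in scores) / len(sample)
```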
## Next Steps

- See Overview for the full explanation of the hard negative problem
- See The Hard Negative Problem for a detailed analysis of why baselines fail
- See Dense Baselines & Fixed Embeddings for the foundational papers (DPR, RepBERT)
- See Late Interaction (ColBERT) for ColBERT's approach to negatives