Hard Negative Mining
====================

**The quality of negative samples is the primary determinant of dense retrieval performance.**
This section covers strategies for mining "hard" negatives: documents that are semantically
similar to the query but not relevant. This is the **core bottleneck** in dense retrieval
research and the main focus of modern improvements.

.. important::

   **The Arms Race Metaphor**: Advanced negative mining is a co-evolutionary arms race
   between the retriever (the student model being trained) and the sampler (the mechanism
   for finding negatives). As the model improves, the negatives must get harder.

The Hard Negative Problem
--------------------------

See the detailed explanation in the main documentation: :doc:`../hard_negatives`

**Quick Summary:**

* **Easy Negatives**: Random, unrelated documents → Model learns too quickly, no challenge
* **Hard Negatives**: Semantically similar but irrelevant → Forces fine-grained discrimination
* **False Negatives**: Actually relevant but mislabeled → Actively damages training

The goal is to mine negatives in the "Goldilocks zone": hard enough to be challenging,
but not so hard that they are actually false negatives in disguise.

Evolution of Hard Negative Strategies
--------------------------------------

.. list-table:: Six Generations of Negative Mining
   :header-rows: 1
   :widths: 15 25 25 20 15

   * - Generation
     - Strategy
     - Example Papers
     - Pros
     - Cons
   * - **1st Gen**
     - Random / In-batch
     - DPR, RepBERT
     - Simple, fast
     - Easy negatives
   * - **2nd Gen**
     - Static BM25
     - DPR (enhanced)
     - Lexically hard
     - Stale quickly
   * - **3rd Gen**
     - Dynamic ANN Refresh
     - ANCE
     - Always fresh
     - Expensive
   * - **4th Gen**
     - Cross-encoder Denoised
     - RocketQA, PAIR
     - Filters false negatives
     - Needs cross-encoder
   * - **5th Gen**
     - Curriculum / Smart Sampling
     - TAS-Balanced, SimANS
     - Efficient
     - Complex design
   * - **6th Gen**
     - LLM-Synthetic
     - SyNeg
     - Precise difficulty calibration
     - LLM cost

Hard Negative Mining Papers
----------------------------

Core Dynamic Mining Methods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE) `_
     - Xiong et al.
     - ICLR 2021
     - `Code `_
     - **Dynamic Index Refresh**: Asynchronously refreshes the ANN index with the latest model checkpoint to find negatives that are hard for the *current* model state, not just the initial one.
   * - `RocketQA: An Optimized Training Approach to Dense Passage Retrieval `_
     - Qu et al.
     - NAACL 2021
     - `Code `_
     - **Cross-Batch & Denoising**: Shares negatives across GPUs (cross-batch negatives) and filters ~70% of false negatives with a cross-encoder teacher. The huge negative pool improves discrimination.
   * - `Optimizing Dense Retrieval Model Training with Hard Negatives (ADORE) `_
     - Zhan et al.
     - SIGIR 2021
     - `Code `_
     - **Query-side Finetuning**: ADORE optimizes the query encoder against a fixed document index to generate dynamic negatives efficiently. STAR adds stabilization with random negatives.
   * - `Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently `_
     - Zhan et al.
     - arXiv 2020
     - NA
     - **Comprehensive Analysis**: Systematic study of training strategies, including negative sampling, loss functions, and architecture choices. Essential reading for understanding the trade-offs.
   * - `Neural Passage Retrieval with Improved Negative Contrast `_
     - Lu et al.
     - arXiv 2020
     - NA
     - **Improved Contrast**: Enhanced negative-contrast mechanism for better discrimination between relevant and irrelevant passages.
   * - `Learning Robust Dense Retrieval Models from Incomplete Relevance Labels `_
     - Prakash et al.
     - SIGIR 2021
     - `Code `_
     - **Incomplete Labels**: Robust training methodology that handles the incomplete and noisy relevance labels common in real-world datasets.

Smart Sampling Strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling (TAS-Balanced) `_
     - Hofstätter et al.
     - SIGIR 2021
     - `Code `_
     - **Topic Aware Sampling**: Clusters queries into topics and samples negatives that are topically related but distinct. Dual-teacher distillation (pairwise + in-batch). Trains on a single GPU in under 48 hours.
   * - `PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval `_
     - Ren et al.
     - ACL Findings 2021
     - NA
     - **Passage-Centric Loss**: Enforces structure among the passages themselves (not just query-passage pairs). Uses a cross-encoder teacher with confidence thresholds to denoise.
   * - `SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval `_
     - Zhou et al.
     - EMNLP 2022
     - `Code `_
     - **Ambiguous Zone Sampling**: Samples negatives from the "ambiguous zone", ranked neither too high (false-positive risk) nor too low (too easy). Avoids both extremes.
   * - `Curriculum Learning for Dense Retrieval Distillation (CL-DRD) `_
     - Zeng et al.
     - SIGIR 2022
     - NA
     - **Curriculum-based Distillation**: Trains on progressively harder negatives (easy → medium → hard). Progressive difficulty controls noise and stabilizes training, with cross-encoder teacher guidance.

Efficiency and Scalability
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `Scaling Deep Contrastive Learning Batch Size with Almost Constant Peak Memory Usage (GradCache) `_
     - Gao et al.
     - RepL4NLP 2021
     - `Code `_
     - **Memory-Efficient In-Batch**: Gradient caching enables very large batch sizes (thousands) with near-constant memory. More in-batch negatives means harder negatives for free.
   * - `Efficient Training of Retrieval Models Using Negative Cache `_
     - Lindgren et al.
     - NeurIPS 2021
     - `Code `_
     - **Negative Cache**: Caches negatives from previous batches for reuse, amortizing the cost of hard negative mining across steps. Cache rotation mitigates the stale-negative problem.
   * - `Multi-stage Training with Improved Negative Contrast for Neural Passage Retrieval `_
     - Lu et al.
     - EMNLP 2021
     - NA
     - **Multi-Stage Training**: Synthetic pre-training → fine-tuning → negative sampling, with progressive hardness across stages. Uses synthetic data and multi-stage refinement.

Advanced False Negative Handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `Noisy Pair Corrector for Dense Retrieval `_
     - EMNLP Authors
     - EMNLP Findings 2023
     - NA
     - **Automatic Detection & Correction**: Uses perplexity-based noise detection to identify false negatives. An EMA (exponential moving average) model provides the correction signal.
   * - `Mitigating the Impact of False Negatives in Dense Retrieval with Contrastive Confidence Regularization (CCR) `_
     - arXiv Authors
     - arXiv 2024
     - NA
     - **Confidence Regularization**: Adds a regularizer to the NCE (noise contrastive estimation) loss that adjusts based on the model's confidence in negative labels, softening penalties for uncertain negatives.
   * - `TriSampler: A Better Negative Sampling Principle for Dense Retrieval `_
     - arXiv Authors
     - arXiv 2024
     - NA
     - **Quasi-Triangular Principle**: Models the triangular relationship among query, positive, and negative, ensuring negatives are far from the positive but informative relative to the query.

LLM-Enhanced Methods (Latest Frontier)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval `_
     - arXiv Authors
     - arXiv 2024
     - NA
     - **LLM-Synthetic Generation**: Uses LLMs to generate text that is semantically similar to the positive but factually contradictory, creating "perfect" hard negatives that may not exist in the corpus. Risk: an adversarial gap, where the model learns to detect LLM-generated text rather than true irrelevance.

Key Implementation Strategies
------------------------------

For detailed implementation guidance, see the full documentation section on each strategy.

Strategy 1: ANCE-Style Dynamic Mining
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Architecture**: Two asynchronous processes, sketched below

1. **Training Process**: Reads hard negatives, trains the model, saves checkpoints
2. **Generation Process**: Loads the latest checkpoint, re-encodes the corpus, mines new negatives

**When to Use**: Large-scale settings where SOTA performance is critical and distributed infrastructure is available

**Challenge**: Catastrophic forgetting if the index is refreshed too frequently
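Below is a minimal, single-process sketch of the refresh loop, assuming a
``sentence-transformers`` bi-encoder and a FAISS flat index. The model name,
toy corpus, and ``train_one_epoch`` stub are illustrative placeholders, not
ANCE's released code; the actual method runs encoding and training as
separate asynchronous jobs that share checkpoints on disk.

.. code-block:: python

    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    corpus = ["first passage ...", "second passage ...", "third passage ..."]
    queries = ["example query"]
    refresh_interval = 2  # refresh every N epochs; every step is too expensive

    def rebuild_index(model, corpus):
        # Re-encode the corpus with the *current* checkpoint and rebuild.
        emb = model.encode(corpus, normalize_embeddings=True)
        index = faiss.IndexFlatIP(emb.shape[1])  # inner product on unit vectors
        index.add(np.asarray(emb, dtype=np.float32))
        return index

    def mine_hard_negatives(model, index, queries, k=2):
        # Top-k retrieved passages become hard negatives
        # (filter out labeled positives before training).
        q = np.asarray(model.encode(queries, normalize_embeddings=True),
                       dtype=np.float32)
        _, doc_ids = index.search(q, k)
        return doc_ids

    for epoch in range(4):
        if epoch % refresh_interval == 0:
            index = rebuild_index(model, corpus)
            negative_ids = mine_hard_negatives(model, index, queries)
        # train_one_epoch(model, queries, negative_ids)  # your training loop

The refresh interval is the central trade-off: refresh too often and you pay
repeated full-corpus re-encoding; refresh too rarely and the negatives go stale.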
Strategy 2: RocketQA-Style Cross-Batch Denoising
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Architecture**: Share negatives across GPUs + cross-encoder filtering

1. Collect negatives from all GPUs in the cluster (a huge pool)
2. Use a cross-encoder to score candidates and filter out false negatives
3. Train the bi-encoder on the denoised set

**When to Use**: Multi-GPU setup, and false negatives are a major concern

**Advantage**: Detecting and removing ~70% of false negatives significantly improves training quality; the cross-batch step is sketched below
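The sketch below shows the cross-batch step, assuming PyTorch distributed
training with an initialized process group (e.g., via ``torchrun``);
``gather_cross_batch`` and ``cross_batch_loss`` are illustrative names, not
RocketQA's released API. The denoising step follows the same pattern as the
Phase 3 snippet later in this section.

.. code-block:: python

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F

    def gather_cross_batch(passage_emb: torch.Tensor) -> torch.Tensor:
        # Collect passage embeddings from every GPU so each query sees
        # world_size x batch_size candidates instead of batch_size.
        gathered = [torch.zeros_like(passage_emb)
                    for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, passage_emb)
        # all_gather returns tensors detached from the autograd graph;
        # re-insert the local shard so gradients still flow through it.
        gathered[dist.get_rank()] = passage_emb
        return torch.cat(gathered, dim=0)

    def cross_batch_loss(query_emb, passage_emb, temperature=0.05):
        all_passages = gather_cross_batch(passage_emb)
        scores = query_emb @ all_passages.T / temperature
        # The positive for local query i sits at global index rank * batch + i.
        batch = query_emb.size(0)
        labels = (torch.arange(batch, device=scores.device)
                  + dist.get_rank() * batch)
        return F.cross_entropy(scores, labels)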
Strategy 3: TAS-Balanced-Style Topic Clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Architecture**: Cluster queries → sample from the same cluster

1. Cluster queries by topic (e.g., k-means on query embeddings)
2. For each query, sample negatives from the same cluster
3. Use a balanced margin loss to handle noise

**When to Use**: Want efficiency (single GPU) and have access to the query distribution

**Advantage**: Very efficient; matches ANCE performance in far less time (see the sketch below)
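A sketch of the clustering and batch-composition steps, assuming
scikit-learn k-means over query embeddings; the function names are
illustrative and the balanced margin loss is omitted for brevity.

.. code-block:: python

    import numpy as np
    from sklearn.cluster import KMeans
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def cluster_queries(queries, n_clusters=100):
        # Step 1: k-means over query embeddings yields topic clusters.
        emb = model.encode(queries, normalize_embeddings=True)
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(emb)

    def sample_topic_batch(queries, cluster_ids, batch_size=32, seed=0):
        # Step 2: fill each batch from a single cluster, so every other
        # query's positive acts as a topically hard in-batch negative.
        rng = np.random.default_rng(seed)
        cluster = rng.choice(np.unique(cluster_ids))
        members = np.flatnonzero(cluster_ids == cluster)
        picks = rng.choice(members, size=min(batch_size, len(members)),
                           replace=False)
        return [queries[i] for i in picks]

Because all queries in a batch share a topic, ordinary in-batch negatives
become topically hard without any extra index or teacher at mining time.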
Strategy 4: SimANS-Style Ambiguous Sampling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Architecture**: Score-based filtering around the positive

1. Retrieve top-k candidates with the student model
2. Filter to the "ambiguous zone": sim ∈ [positive_score - margin, positive_score + margin]
3. These candidates are hard but unlikely to be false negatives

**When to Use**: Simple to implement, works well in practice (a concrete filter appears in Phase 4 below)

**Advantage**: Avoids both extremes (too easy, too hard)

Bird's Eye View: Which Strategy to Choose?
-------------------------------------------

.. list-table:: Strategy Comparison
   :header-rows: 1
   :widths: 20 15 15 15 15 20

   * - Strategy
     - Complexity
     - Performance
     - Speed
     - False Neg Rate
     - Best For
   * - In-Batch (Baseline)
     - Very Low
     - Baseline
     - Very Fast
     - High (15-25%)
     - Prototyping
   * - Static BM25
     - Low
     - +5-8%
     - Fast
     - Medium (8-15%)
     - First iteration
   * - ANCE (Dynamic)
     - Very High
     - SOTA
     - Slow
     - Low (3-6%)
     - Research, Scale
   * - RocketQA (Denoised)
     - High
     - SOTA -2%
     - Medium
     - Very Low (<2%)
     - Multi-GPU
   * - TAS-Balanced
     - Medium
     - SOTA -5%
     - Fast
     - Low (3-5%)
     - **Recommended**
   * - SimANS
     - Low
     - SOTA -7%
     - Fast
     - Low (2-4%)
     - **Recommended**
   * - SyNeg (LLM)
     - Medium
     - SOTA -3%
     - Medium
     - Very Low (<1%)
     - Domain-specific

Recommended Path for Practitioners
-----------------------------------

**Phase 1: Baseline (Week 1)**

Start with in-batch negatives to establish a baseline:

.. code-block:: python

    from sentence_transformers import SentenceTransformer, losses

    model = SentenceTransformer('bert-base-uncased')
    # Every other example's positive in the batch serves as its negative.
    loss = losses.MultipleNegativesRankingLoss(model)

**Phase 2: Static Hard Negatives (Week 2)**

Add BM25-mined negatives for a 5-8% improvement:

.. code-block:: python

    # Mine negatives offline. ``bm25`` stands for any BM25 backend
    # (e.g., Pyserini or rank_bm25); ``is_positive`` checks your labels.
    bm25_negatives = bm25.search(query, k=100)
    hard_negs = [neg for neg in bm25_negatives if not is_positive(neg)]

**Phase 3: Cross-Encoder Denoising (Week 3)**

Implement the two-step pipeline for a 10-15% improvement:

.. code-block:: python

    from sentence_transformers import CrossEncoder

    # Filter BM25 negatives with a cross-encoder: candidates that score
    # high are likely false negatives and are dropped. The 0.5 threshold
    # is a tunable hyperparameter.
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    scores = cross_encoder.predict([(q, neg) for neg in bm25_negatives])
    denoised = [neg for neg, score in zip(bm25_negatives, scores) if score < 0.5]

**Phase 4: Smart Sampling (Week 4)**

Add score-based filtering (SimANS-style) for another 5-10%:

.. code-block:: python

    # Mine in the "ambiguous zone" around the positive's score.
    # ``model.similarity`` operates on embeddings, so encode first.
    q_emb = model.encode(query)
    pos_score = float(model.similarity(q_emb, model.encode(positive)))
    margin = 0.1
    ambiguous_negatives = [
        neg for neg in candidates
        if pos_score - margin
           < float(model.similarity(q_emb, model.encode(neg)))
           < pos_score + margin
    ]

**Phase 5: Curriculum (Optional, Week 5)**

Wrap the stages in a curriculum for stability:

.. code-block:: python

    # ``train`` is shorthand for your fine-tuning routine (e.g., model.fit).
    # Stage 1: Easy (in-batch)
    train(model, in_batch_data, epochs=2)
    # Stage 2: Medium (BM25)
    train(model, bm25_data, epochs=3)
    # Stage 3: Hard (denoised)
    train(model, denoised_data, epochs=5)

This progression gives you **85-90% of SOTA performance** with manageable complexity.

Common Pitfalls
---------------

❌ **Pitfall 1**: Not measuring the false negative rate

* Assume a 10-20% false negative rate in naive mining
* Validate on a subset with exhaustive labels

❌ **Pitfall 2**: Using only the hardest negatives

* Balance hard and medium negatives
* The very hardest negatives are often false negatives

❌ **Pitfall 3**: Ignoring computational cost

* ANCE requires 3-5x more compute than a two-step pipeline
* Factor in infrastructure cost, not just metrics

❌ **Pitfall 4**: Not using a curriculum

* Curriculum is low-cost, high-reward
* Almost always improves results by 3-5%

❌ **Pitfall 5**: Treating all queries equally

* Query difficulty varies widely
* Consider query-specific thresholds

Next Steps
----------

* See :doc:`../dense_retrieval_paradigm` for the full explanation of the hard negative problem
* See :doc:`../hard_negatives` for a detailed analysis of why baselines fail
* See :doc:`dense_baselines` for the foundational papers (DPR, RepBERT)
* See :doc:`late_interaction` for ColBERT's approach to negatives