Stage 1: Retrieval Methods
===========================

This section covers methods for the first stage of the RAG pipeline:
efficiently retrieving candidate documents from large corpora.

.. toctree::
   :maxdepth: 2
   :caption: Stage 1 Topics:

   sparse
   dense_baselines
   hard_mining
   late_interaction
   hybrid
   pretraining
   joint_learning
   literature_survey/index

Overview
--------

Stage 1 retrieval must balance two competing demands:

* **Speed**: Process millions of documents in milliseconds
* **Recall**: Don't miss relevant documents

The solution is to use architectures that allow pre-computation and
efficient similarity search.

Evolution of Stage 1 Methods
-----------------------------

.. list-table:: Historical Evolution
   :header-rows: 1
   :widths: 15 25 30 30

   * - Era
     - Method Type
     - Key Innovation
     - Representative Papers
   * - Pre-2020
     - Sparse (BM25)
     - Inverted index, TF-IDF
     - Traditional IR
   * - 2020
     - Dense Baselines
     - Dual-encoder with BERT
     - DPR, RepBERT
   * - 2021
     - Hard Negatives
     - Dynamic mining, denoising
     - ANCE, RocketQA, ADORE
   * - 2021-2022
     - Late Interaction
     - Token-level representations
     - ColBERT, ColBERTv2
   * - 2022-2023
     - Sampling Strategies
     - Curriculum, score-based
     - TAS-Balanced, SimANS
   * - 2023-2024
     - LLM Integration
     - Synthetic negatives, prompting
     - SyNeg, LLM embeddings

Key Dimensions
--------------

When evaluating Stage 1 methods, consider:

**Architecture**

* **Dual-Encoder**: Independent encoding (fastest)
* **Late Interaction**: Token-level matching (more accurate)
* **Hybrid**: Combines sparse and dense

**Training Strategy**

* **Negative Mining**: How to find informative negatives?
* **Knowledge Distillation**: Learn from cross-encoder teachers
* **Curriculum Learning**: Progressive difficulty

**Index Structure**

* **Dense Vector**: Single vector per document
* **Multi-Vector**: Multiple vectors (e.g., ColBERT)
* **Learned Index**: End-to-end optimized structures

Quick Navigation
----------------

* :doc:`sparse` - BM25 and traditional IR methods
* :doc:`dense_baselines` - DPR, RepBERT (foundational papers)
* :doc:`hard_mining` - **Core focus**: Advanced negative mining strategies
* :doc:`late_interaction` - ColBERT and token-level methods
* :doc:`hybrid` - Combining sparse and dense
* :doc:`pretraining` - Pre-training strategies for dense retrievers
* :doc:`joint_learning` - Jointly optimizing retrieval and indexing

The Central Challenge: Hard Negative Mining
--------------------------------------------

The quality of Stage 1 retrievers depends critically on the negative
examples used during training. This is explored in depth in
:doc:`hard_mining`, which covers:

* The hard negative problem
* Evolution from static to dynamic mining
* Denoising strategies
* Curriculum learning
* LLM-based generation

This is the **primary bottleneck** in dense retrieval research today.
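
Illustrative Sketches
---------------------

The sketches below illustrate several of the ideas above in code. They
are minimal approximations under stated assumptions, not reference
implementations.

**Pre-computation and similarity search.** A minimal sketch of the
dual-encoder pattern from the overview: documents are encoded once
offline, only the query is encoded at search time, and a vector index
does the matching. The model name and three-document corpus are
illustrative placeholders; any bi-encoder would do.

.. code-block:: python

   import faiss
   import numpy as np
   from sentence_transformers import SentenceTransformer

   model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

   corpus = [
       "Dense retrieval encodes documents offline.",
       "BM25 relies on an inverted index.",
       "ColBERT stores one vector per token.",
   ]

   # Offline: encode every document once and build the index.
   doc_vecs = model.encode(corpus, normalize_embeddings=True).astype(np.float32)
   index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine here
   index.add(doc_vecs)

   # Online: only the query is encoded at search time.
   q_vec = model.encode(["how does dense retrieval scale?"],
                        normalize_embeddings=True).astype(np.float32)
   scores, ids = index.search(q_vec, 2)
   for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
       print(rank, round(float(s), 3), corpus[i])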
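
**In-batch negatives (dual-encoder training).** The standard contrastive
objective behind DPR-style dense baselines, written as a minimal PyTorch
sketch; the batch size and embedding dimension are illustrative, and the
random tensors stand in for encoder outputs.

.. code-block:: python

   import torch
   import torch.nn.functional as F

   def in_batch_nce_loss(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> torch.Tensor:
       """q_vecs[i] matches p_vecs[i]; every other passage in the batch
       serves as a negative for query i."""
       logits = q_vecs @ p_vecs.T             # (B, B) similarity matrix
       labels = torch.arange(q_vecs.size(0))  # diagonal entries are positives
       return F.cross_entropy(logits, labels)

   q = torch.randn(16, 768)  # query embeddings
   p = torch.randn(16, 768)  # passage embeddings
   print(in_batch_nce_loss(q, p))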
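
**Late interaction (MaxSim).** A toy version of ColBERT's scoring rule:
each query token takes its maximum similarity over all document tokens,
and those maxima are summed. Assumes L2-normalized token embeddings;
the shapes and random vectors are illustrative.

.. code-block:: python

   import numpy as np

   def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
       """Sum over query tokens of the best-matching document token."""
       sim = q_tokens @ d_tokens.T   # (n_q, n_d) token-level similarities
       return float(sim.max(axis=1).sum())

   rng = np.random.default_rng(0)
   q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
   d = rng.normal(size=(12, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
   print(maxsim_score(q, d))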
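
**Hybrid fusion.** One common way to combine sparse and dense rankings is
reciprocal rank fusion (RRF). This sketch fuses two illustrative doc-id
lists; ``k=60`` is the conventional default constant.

.. code-block:: python

   from collections import defaultdict

   def rrf(rankings, k=60):
       """Each document scores sum(1 / (k + rank)) across the input lists."""
       scores = defaultdict(float)
       for ranking in rankings:
           for rank, doc_id in enumerate(ranking, start=1):
               scores[doc_id] += 1.0 / (k + rank)
       return sorted(scores, key=scores.get, reverse=True)

   bm25_top = ["d3", "d1", "d7", "d2"]   # from the inverted index
   dense_top = ["d1", "d2", "d9", "d3"]  # from the ANN index
   print(rrf([bm25_top, dense_top]))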
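
**Dynamic hard-negative mining with denoising.** A sketch of the mining
loop discussed under the central challenge: retrieve candidates with the
*current* retriever (ANCE-style), then use a stronger cross-encoder to
filter out likely false negatives (RocketQA-style). The ``retrieve`` and
``ce_score`` callables, the threshold, and the toy corpus are all
hypothetical stand-ins for whatever models and data you use.

.. code-block:: python

   def mine_hard_negatives(query, positives, retrieve, ce_score,
                           top_k=100, n_neg=8, false_neg_threshold=0.9):
       # 1. Dynamic mining: search with the current retriever so the
       #    negatives stay hard as the model improves (ANCE-style).
       candidates = [d for d in retrieve(query, top_k) if d not in positives]

       # 2. Denoising: a high cross-encoder score suggests an unlabeled
       #    positive, so drop it rather than train on it (RocketQA-style).
       return [d for d in candidates
               if ce_score(query, d) < false_neg_threshold][:n_neg]

   # Toy stand-ins so the sketch runs end to end.
   fake_corpus = [f"doc{i}" for i in range(10)]
   retrieve = lambda q, k: fake_corpus[:k]
   ce_score = lambda q, d: 0.95 if d == "doc1" else 0.3  # doc1 looks positive
   print(mine_hard_negatives("q", positives={"doc0"},
                             retrieve=retrieve, ce_score=ce_score))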