Stage 1: Retrieval Methods

This section covers methods for the first stage of the RAG pipeline: efficiently retrieving candidate documents from large corpora.

Stage 1 Topics:

Overview

Stage 1 retrieval must balance two competing demands:

  • Speed: Process millions of documents in milliseconds

  • Recall: Don’t miss relevant documents

The solution is to use architectures that allow pre-computation and efficient similarity search.

Evolution of Stage 1 Methods

Historical Evolution

Era

Method Type

Key Innovation

Representative Papers

Pre-2020

Sparse (BM25)

Inverted index, TF-IDF

Traditional IR

2020

Dense Baselines

Dual-encoder with BERT

DPR, RepBERT

2021

Hard Negatives

Dynamic mining, denoising

ANCE, RocketQA, ADORE

2021-2022

Late Interaction

Token-level representations

ColBERT, ColBERTv2

2022-2023

Sampling Strategies

Curriculum, score-based

TAS-Balanced, SimANS

2023-2024

LLM Integration

Synthetic negatives, prompting

SyNeg, LLM embeddings

Key Dimensions

When evaluating Stage 1 methods, consider:

Architecture

  • Dual-Encoder: Independent encoding (fastest)

  • Late Interaction: Token-level matching (more accurate)

  • Hybrid: Combines sparse and dense

Training Strategy

  • Negative Mining: How to find informative negatives?

  • Knowledge Distillation: Learn from cross-encoder teachers

  • Curriculum Learning: Progressive difficulty

Index Structure

  • Dense Vector: Single vector per document

  • Multi-Vector: Multiple vectors (e.g., ColBERT)

  • Learned Index: End-to-end optimized structures

Quick Navigation

The Central Challenge: Hard Negative Mining

The quality of Stage 1 retrievers depends critically on the negative examples used during training. This is explored in depth in Hard Negative Mining, which covers:

  • The hard negative problem

  • Evolution from static to dynamic mining

  • Denoising strategies

  • Curriculum learning

  • LLM-based generation

This is the primary bottleneck in dense retrieval research today.