Literature Overview¶

This section provides access to all research papers covered in this documentation, organized by their role in the retrieval and re-ranking pipeline.

Note

Papers are now organized by topic! Instead of one large table, papers are distributed across focused sections below. This makes it easier to find papers relevant to your specific interest.

Papers by Stage¶

Stage 1: Retrieval¶

Papers focused on efficiently retrieving candidates from large corpora.

By Topic:

Sparse Methods (Sparse Retrieval Methods)
- BM25 and traditional IR
Dense Baselines (Dense Baselines & Fixed Embeddings)
- DPR (Karpukhin et al., EMNLP 2020)
- RepBERT (Zhan et al., arXiv 2020)
Hard Negative Mining (Hard Negative Mining)
- ANCE (Xiong et al., ICLR 2021) - Dynamic index refresh
- RocketQA (Qu et al., NAACL 2021) - Cross-batch denoising
- ADORE (Zhan et al., SIGIR 2021) - Query-side finetuning
- TAS-Balanced (Hofstätter et al., SIGIR 2021) - Topic-aware sampling
- SimANS (Zhou et al., EMNLP 2022) - Ambiguous negatives
- GradCache (Gao et al., RepL4NLP 2021) - Memory-efficient training
- CL-DRD (Zeng et al., SIGIR 2022) - Curriculum learning
- SyNeg (arXiv 2024) - LLM-driven synthetic negatives
- And many more…
Late Interaction (Late Interaction (ColBERT))
- ColBERT (Khattab & Zaharia, SIGIR 2020)
- ColBERTv2 (Santhanam et al., NAACL 2022)
- Poly-encoders (Humeau et al., ICLR 2020)
Hybrid Methods (Hybrid Dense-Sparse Methods)
- DENSPI (Seo et al., ACL 2019)
- Semantic Residual (Gao et al., ECIR 2021)
- DensePhrases (Lee et al., ACL 2021)
Pre-training (Pre-training Methods for Dense Retrievers)
- ORQA/ICT (Lee et al., ACL 2019)
- REALM (Guu et al., ICML 2020)
- Condenser (Gao & Callan, EMNLP 2021)
- coCondenser (Gao & Callan, ACL 2022)
- Contriever (Izacard et al., TMLR 2022)
Joint Learning (Joint Learning of Retrieval and Indexing)
- JPQ (Zhan et al., CIKM 2021)
- EHI/Poeem (arXiv 2023)

Stage 2: Re-ranking¶

Papers focused on precision scoring of candidates.

By Topic:

Cross-Encoders (Cross-Encoders for Re-ranking)
- BERT Cross-Encoder
- MonoT5 / RankT5
- Training strategies
LLM Re-rankers (LLM-Based Re-rankers)
- RankGPT
- RankLlama
- Zero-shot prompting approaches

Papers by Research Theme¶

By Key Innovation¶

Hard Negative Mining

The core bottleneck in dense retrieval—see Hard Negative Mining for:

Dynamic mining (ANCE)
Cross-encoder denoising (RocketQA)
Score-based sampling (SimANS)
Curriculum learning (CL-DRD)
LLM synthesis (SyNeg)

False Negative Handling

Methods that address the damaging effects of false negatives:

RocketQA: Cross-encoder filtering (~70% detection rate)
TAS-Balanced: Balanced margin reduces noise
Noisy Pair Corrector: Perplexity-based detection
CCR: Confidence regularization
TriSampler: Triangular relationship modeling

Training Efficiency

Methods that reduce computational cost:

GradCache: Memory-efficient large batches
Negative Cache: Amortized hard negative mining
TAS-Balanced: Single GPU training (<48h)
ADORE: Fixed document encoder
JPQ: Joint query-index optimization

Knowledge Distillation

Using strong teachers to train fast students:

RocketQA: Cross-encoder teacher
PAIR: Passage-centric similarity
TAS-Balanced: Dual-teacher (pairwise + in-batch)
ColBERTv2: Denoised supervision
CL-DRD: Curriculum distillation

By Dataset/Domain¶

Papers organized by evaluation dataset:

MS MARCO: Most papers (standard benchmark)
Natural Questions: DPR, REALM, ORQA
BEIR (zero-shot): Contriever, coCondenser, BGE
Domain-specific: Legal, medical, code search

Complete Chronological Timeline¶

2019

ORQA (Lee et al., ACL)
DENSPI (Seo et al., ACL)
Poly-encoders (Humeau et al., ICLR)

2020

DPR (Karpukhin et al., EMNLP) - The foundation
RepBERT (Zhan et al., arXiv)
REALM (Guu et al., ICML)
ANCE (Xiong et al., ICLR 2021, arXiv 2020)
RocketQA (Qu et al., NAACL 2021, arXiv 2020)
ColBERT (Khattab & Zaharia, SIGIR)

2021

TAS-Balanced (Hofstätter et al., SIGIR)
ADORE (Zhan et al., SIGIR)
PAIR (Ren et al., ACL Findings)
GradCache (Gao et al., RepL4NLP)
Negative Cache (Lindgren et al., NeurIPS)
Condenser (Gao & Callan, EMNLP)
DensePhrases (Lee et al., ACL)
JPQ (Zhan et al., CIKM)

2022

SimANS (Zhou et al., EMNLP)
CL-DRD (Zeng et al., SIGIR)
ColBERTv2 (Santhanam et al., NAACL)
coCondenser (Gao & Callan, ACL)
Contriever (Izacard et al., TMLR)

2023

Noisy Pair Corrector (EMNLP Findings)
EHI/Poeem (arXiv)
BGE (BAAI) - State-of-the-art embedding models
E5-Mistral (Microsoft) - LLM-based embeddings

2024

CCR (arXiv)
TriSampler (arXiv)
SyNeg (arXiv)
LLM2Vec (McGill) - Converting LLMs to text encoders
BGE-M3 (BAAI) - Multi-lingual, multi-granularity embeddings
Jina Embeddings v3 (Jina AI) - 8K context window embeddings
NV-Embed (NVIDIA) - Generalist embedding model

Quick Navigation¶

I want to improve my retrieval model’s accuracy:

→ Start with Hard Negative Mining

I’m building a system from scratch:

→ Read Overview of RAG and the Two-Stage Pipeline then Stage 1: Retrieval Methods

I need faster inference:

→ See Sparse Retrieval Methods (BM25) or Joint Learning of Retrieval and Indexing (compression)

I want better re-ranking:

→ See Cross-Encoders for Re-ranking or LLM-Based Re-rankers

I’m doing research on training techniques:

→ Deep dive into Hard Negative Mining and Overview (theory)

Contributing New Papers¶

See Contributing for how to add new papers to this collection.

When adding papers, please categorize them appropriately:

Stage 1 or Stage 2?
What’s the key innovation?
Which section does it best fit?