Literature Overview¶
This section indexes all of the research papers covered in this documentation, organized by their role in the retrieval and re-ranking pipeline.
Note
Papers are now organized by topic! Instead of one large table, papers are distributed across the focused sections below, making it easier to find the ones relevant to your specific interests.
Papers by Stage¶
Stage 1: Retrieval¶
Papers focused on efficiently retrieving candidates from large corpora.
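Nearly all of the dense methods below share one serving pattern: documents are embedded offline, and each query is embedded once and scored against the whole collection with an inner product. A minimal numpy sketch of that pattern, using exhaustive search (the embeddings are assumed to come from any bi-encoder in this section):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 100):
    """Score one query against precomputed document embeddings.

    query_vec:  (d,) vector from the query encoder
    doc_matrix: (N, d) matrix from the document encoder, built offline
    Returns the indices of the k highest-scoring documents, best first.
    """
    scores = doc_matrix @ query_vec        # (N,) inner-product scores
    idx = np.argpartition(-scores, k)[:k]  # unordered top-k in O(N)
    return idx[np.argsort(-scores[idx])]   # sort only the k survivors
```

In production the exhaustive matrix product is replaced by an approximate nearest-neighbour index (e.g. FAISS), which is precisely the retrieval/indexing coupling that the Joint Learning papers below optimize.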
By Topic:
Sparse Methods (Sparse Retrieval Methods)
BM25 and traditional IR
Dense Baselines (Dense Baselines & Fixed Embeddings)
DPR (Karpukhin et al., EMNLP 2020)
RepBERT (Zhan et al., arXiv 2020)
Hard Negative Mining (Hard Negative Mining)
ANCE (Xiong et al., ICLR 2021) - Dynamic index refresh
RocketQA (Qu et al., NAACL 2021) - Cross-batch denoising
ADORE (Zhan et al., SIGIR 2021) - Query-side finetuning
TAS-Balanced (Hofstätter et al., SIGIR 2021) - Topic-aware sampling
SimANS (Zhou et al., EMNLP 2022) - Ambiguous negatives
GradCache (Gao et al., RepL4NLP 2021) - Memory-efficient training
CL-DRD (Zeng et al., SIGIR 2022) - Curriculum learning
SyNeg (arXiv 2024) - LLM-driven synthetic negatives
And many more…
Late Interaction (Late Interaction (ColBERT)) - token-level MaxSim scoring; see the sketch after this list
ColBERT (Khattab & Zaharia, SIGIR 2020)
ColBERTv2 (Santhanam et al., NAACL 2022)
Poly-encoders (Humeau et al., ICLR 2020)
Hybrid Methods (Hybrid Dense-Sparse Methods)
DENSPI (Seo et al., ACL 2019)
Semantic Residual (Gao et al., ECIR 2021)
DensePhrases (Lee et al., ACL 2021)
Pre-training (Pre-training Methods for Dense Retrievers)
ORQA/ICT (Lee et al., ACL 2019)
REALM (Guu et al., ICML 2020)
Condenser (Gao & Callan, EMNLP 2021)
coCondenser (Gao & Callan, ACL 2022)
Contriever (Izacard et al., TMLR 2022)
Joint Learning (Joint Learning of Retrieval and Indexing)
JPQ (Zhan et al., CIKM 2021)
EHI/Poeem (arXiv 2023)
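The late-interaction entries replace the single dot product with ColBERT's MaxSim operator: each query token embedding is matched against its best document token, and the per-token maxima are summed. A minimal sketch, assuming the per-token embedding matrices have already been produced:

```python
import numpy as np

def maxsim_score(q_tokens: np.ndarray, d_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction between token embedding matrices.

    q_tokens: (Lq, d) query token embeddings
    d_tokens: (Ld, d) document token embeddings
    """
    sim = q_tokens @ d_tokens.T          # (Lq, Ld) token-level similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token
```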
Stage 2: Re-ranking¶
Papers focused on precision scoring of candidates.
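Stage 2 models read the query and candidate together, so every (query, passage) pair costs a full forward pass; that joint attention is what buys the precision gain over Stage 1. A minimal sketch of pairwise re-ranking with the Hugging Face transformers API, using one publicly available MS MARCO cross-encoder checkpoint (any pairwise re-ranker from this section can be swapped in):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # a common public checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForSequenceClassification.from_pretrained(NAME).eval()

def rerank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Jointly encode each (query, passage) pair and sort by relevance."""
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)  # one logit per pair
    return sorted(zip(scores.tolist(), candidates), reverse=True)
```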
By Topic:
Cross-Encoders (Cross-Encoders for Re-ranking)
BERT Cross-Encoder
MonoT5 / RankT5
Training strategies
LLM Re-rankers (LLM-Based Re-rankers)
RankGPT
RankLLaMA
Zero-shot prompting approaches (a prompt-construction sketch follows this list)
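RankGPT-style listwise re-ranking needs no scoring head at all: the candidates are serialized into a permutation prompt, and the LLM answers with an ordering such as [2] > [3] > [1]. A minimal, model-agnostic sketch of the prompt construction (the wording approximates RankGPT's published prompt; parsing the reply is omitted):

```python
def listwise_prompt(query: str, passages: list[str]) -> str:
    """Build a RankGPT-style permutation prompt for an instruction-tuned LLM."""
    lines = [f"I will provide you with {len(passages)} passages, each marked "
             f"with a numerical identifier []. Rank them by relevance to the "
             f"query: {query}\n"]
    for i, p in enumerate(passages, 1):
        lines.append(f"[{i}] {p}")
    lines.append("\nRank the passages above in descending order of relevance. "
                 "Answer only with identifiers, e.g. [2] > [1] > [3].")
    return "\n".join(lines)
```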
Papers by Research Theme¶
By Key Innovation¶
Hard Negative Mining
The core training bottleneck in dense retrieval; a minimal mining loop is sketched after this list. See Hard Negative Mining for:
Dynamic mining (ANCE)
Cross-encoder denoising (RocketQA)
Score-based sampling (SimANS)
Curriculum learning (CL-DRD)
LLM synthesis (SyNeg)
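Mechanically, these are variations on one loop: retrieve the top candidates for each training query, remove the labeled positives, and keep the rest as hard negatives; ANCE's contribution is periodically refreshing the index with the current encoder. A minimal sketch, where `encode` is a hypothetical batched text-to-embedding function:

```python
import numpy as np

def mine_hard_negatives(queries, corpus, positives, encode, k=50, n_neg=8):
    """One ANCE-style mining pass: top-ranked non-positives become negatives.

    positives: dict mapping query index -> set of gold passage indices
    encode:    hypothetical batched text -> (n, d) embedding function
    """
    doc_matrix = encode(corpus)               # refreshed each pass (ANCE)
    hard = {}
    for qi, q in enumerate(queries):
        scores = doc_matrix @ encode([q])[0]  # (N,) similarity scores
        ranked = np.argsort(-scores)[:k]      # current top-k candidates
        hard[qi] = [d for d in ranked if d not in positives[qi]][:n_neg]
    return hard
```

RocketQA additionally passes the mined candidates through a cross-encoder and keeps only the confidently irrelevant ones; that denoising step is sketched under False Negative Handling below.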
False Negative Handling
Methods that address the damaging effects of false negatives (a minimal denoising filter is sketched after this list):
RocketQA: Cross-encoder filtering (~70% detection rate)
TAS-Balanced: Balanced margin reduces noise
Noisy Pair Corrector: Perplexity-based detection
CCR: Confidence regularization
TriSampler: Triangular relationship modeling
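The shared mechanism is a confidence gate: a stronger model scores every mined "negative", and anything it rates as clearly relevant is presumed to be an unlabeled positive and dropped. A minimal sketch in the spirit of RocketQA's filter, where `cross_encoder_score` and the threshold value are hypothetical stand-ins:

```python
def denoise_negatives(query, mined_negatives, cross_encoder_score,
                      threshold=0.1):
    """Drop mined negatives that a stronger model believes are relevant.

    cross_encoder_score: hypothetical (query, passage) -> relevance in [0, 1]
    threshold: passages scoring above it are treated as likely false
               negatives (unlabeled positives) and removed
    """
    return [p for p in mined_negatives
            if cross_encoder_score(query, p) < threshold]
```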
Training Efficiency
Methods that reduce the computational cost of training (GradCache's two-pass trick is sketched after this list):
GradCache: Memory-efficient large batches
Negative Cache: Amortized hard negative mining
TAS-Balanced: Single GPU training (<48h)
ADORE: Fixed document encoder
JPQ: Joint query-index optimization
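GradCache merits a sketch because it decouples contrastive batch size from GPU memory: a first no-grad pass computes all representations in small chunks, the in-batch loss is differentiated with respect to those representations alone, and a second chunked pass pushes the cached gradients back through the encoder. A minimal single-device PyTorch sketch (the `encoder` module and pre-tokenized `q_batch`/`d_batch` tensors are assumptions):

```python
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, q_batch, d_batch, chunk=8, temp=0.05):
    """GradCache-style step: large-batch contrastive loss, small-batch memory."""
    # Pass 1: representations only, no activations stored.
    with torch.no_grad():
        q_reps = torch.cat([encoder(q_batch[i:i + chunk])
                            for i in range(0, len(q_batch), chunk)])
        d_reps = torch.cat([encoder(d_batch[i:i + chunk])
                            for i in range(0, len(d_batch), chunk)])

    # In-batch-negative loss on detached reps; cache gradients w.r.t. them.
    q_reps.requires_grad_(True)
    d_reps.requires_grad_(True)
    scores = q_reps @ d_reps.T / temp  # (B, B): diagonal pairs are positives
    loss = F.cross_entropy(scores, torch.arange(len(scores)))
    q_grads, d_grads = torch.autograd.grad(loss, [q_reps, d_reps])

    # Pass 2: re-encode chunk by chunk, injecting the cached gradients.
    for grads, batch in ((q_grads, q_batch), (d_grads, d_batch)):
        for i in range(0, len(batch), chunk):
            out = encoder(batch[i:i + chunk])  # activations exist this time
            out.backward(grads[i:i + chunk])   # accumulates into parameters
    return loss.item()
```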
Knowledge Distillation
Using strong teachers to train fast students (a Margin-MSE distillation loss is sketched after this list):
RocketQA: Cross-encoder teacher
PAIR: Passage-centric similarity
TAS-Balanced: Dual-teacher (pairwise + in-batch)
ColBERTv2: Denoised supervision
CL-DRD: Curriculum distillation
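A common concrete loss behind several of these teachers (notably TAS-Balanced's pairwise teacher) is Margin-MSE: the student bi-encoder matches the teacher's positive-to-negative margin rather than its absolute scores, sidestepping the scale mismatch between architectures. A minimal sketch:

```python
import torch.nn.functional as F

def margin_mse_loss(s_pos, s_neg, t_pos, t_neg):
    """Margin-MSE distillation.

    s_*: student bi-encoder scores; t_*: frozen cross-encoder teacher
    scores; all tensors of shape (B,). Only the pos-neg margin matters.
    """
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)
```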
By Dataset/Domain¶
Papers organized by evaluation dataset:
MS MARCO: Most papers (standard benchmark)
Natural Questions: DPR, REALM, ORQA
BEIR (zero-shot): Contriever, coCondenser, BGE
Domain-specific: Legal, medical, code search
Complete Chronological Timeline¶
2019
ORQA (Lee et al., ACL)
DENSPI (Seo et al., ACL)
Poly-encoders (Humeau et al., ICLR 2020, arXiv 2019)
2020
DPR (Karpukhin et al., EMNLP) - The foundation
RepBERT (Zhan et al., arXiv)
REALM (Guu et al., ICML)
ANCE (Xiong et al., ICLR 2021, arXiv 2020)
RocketQA (Qu et al., NAACL 2021, arXiv 2020)
ColBERT (Khattab & Zaharia, SIGIR)
2021
TAS-Balanced (Hofstätter et al., SIGIR)
ADORE (Zhan et al., SIGIR)
PAIR (Ren et al., ACL Findings)
GradCache (Gao et al., RepL4NLP)
Negative Cache (Lindgren et al., NeurIPS)
Condenser (Gao & Callan, EMNLP)
DensePhrases (Lee et al., ACL)
JPQ (Zhan et al., CIKM)
2022
SimANS (Zhou et al., EMNLP)
CL-DRD (Zeng et al., SIGIR)
ColBERTv2 (Santhanam et al., NAACL)
coCondenser (Gao & Callan, ACL)
Contriever (Izacard et al., TMLR)
2023
Noisy Pair Corrector (EMNLP Findings)
EHI/Poeem (arXiv)
BGE (BAAI) - State-of-the-art embedding models
E5-Mistral (Microsoft) - LLM-based embeddings
2024
CCR (arXiv)
TriSampler (arXiv)
SyNeg (arXiv)
LLM2Vec (McGill) - Converting LLMs to text encoders
BGE-M3 (BAAI) - Multi-lingual, multi-granularity embeddings
Jina Embeddings v3 (Jina AI) - 8K context window embeddings
NV-Embed (NVIDIA) - Generalist embedding model
Contributing New Papers¶
See Contributing for how to add new papers to this collection.
When adding papers, please categorize them appropriately:
Stage 1 or Stage 2?
What’s the key innovation?
Which section does it best fit?