Literature Overview

This section provides access to all research papers covered in this documentation, organized by their role in the retrieval and re-ranking pipeline.

Note

Papers are now organized by topic. Instead of one large table, they are distributed across the focused sections below, which makes it easier to find the papers relevant to a specific interest.

Papers by Stage

Stage 1: Retrieval

Papers focused on efficiently retrieving candidates from large corpora.

By Topic:

  • Sparse Methods (Sparse Retrieval Methods)

    • BM25 and traditional IR

  • Dense Baselines (Dense Baselines & Fixed Embeddings)

    • DPR (Karpukhin et al., EMNLP 2020)

    • RepBERT (Zhan et al., arXiv 2020)

  • Hard Negative Mining (Hard Negative Mining)

    • ANCE (Xiong et al., ICLR 2021) - Dynamic index refresh

    • RocketQA (Qu et al., NAACL 2021) - Cross-batch negatives, denoised hard negatives

    • ADORE (Zhan et al., SIGIR 2021) - Query-side finetuning

    • TAS-Balanced (Hofstätter et al., SIGIR 2021) - Topic-aware sampling

    • SimANS (Zhou et al., EMNLP 2022) - Ambiguous negatives

    • GradCache (Gao et al., RepL4NLP 2021) - Memory-efficient training

    • CL-DRD (Zeng et al., SIGIR 2022) - Curriculum learning

    • SyNeg (arXiv 2024) - LLM-driven synthetic negatives

    • And many more…

  • Late Interaction (Late Interaction (ColBERT))

    • ColBERT (Khattab & Zaharia, SIGIR 2020)

    • ColBERTv2 (Santhanam et al., NAACL 2022)

    • Poly-encoders (Humeau et al., ICLR 2020)

  • Hybrid Methods (Hybrid Dense-Sparse Methods)

    • DENSPI (Seo et al., ACL 2019)

    • Semantic Residual (Gao et al., ECIR 2021)

    • DensePhrases (Lee et al., ACL 2021)

  • Pre-training (Pre-training Methods for Dense Retrievers)

    • ORQA/ICT (Lee et al., ACL 2019)

    • REALM (Guu et al., ICML 2020)

    • Condenser (Gao & Callan, EMNLP 2021)

    • coCondenser (Gao & Callan, ACL 2022)

    • Contriever (Izacard et al., TMLR 2022)

  • Joint Learning (Joint Learning of Retrieval and Indexing)

    • JPQ (Zhan et al., CIKM 2021)

    • EHI/Poeem (arXiv 2023)

Stage 2: Re-ranking

Papers focused on precision scoring of candidates.

By Topic:

  • Cross-Encoders for Re-ranking

  • LLM-Based Re-rankers

Papers by Research Theme

By Key Innovation

Hard Negative Mining

The core bottleneck in dense retrieval. See Hard Negative Mining for:

  • Dynamic mining (ANCE)

  • Cross-encoder denoising (RocketQA)

  • Score-based sampling (SimANS)

  • Curriculum learning (CL-DRD)

  • LLM synthesis (SyNeg)

False Negative Handling

Methods that address the damaging effects of false negatives:

  • RocketQA: Cross-encoder filtering (the paper reports that roughly 70% of the top-retrieved unlabeled passages it inspected were actually relevant)

  • TAS-Balanced: Balanced margin reduces noise

  • Noisy Pair Corrector: Perplexity-based detection

  • CCR: Confidence regularization

  • TriSampler: Triangular relationship modeling
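The filtering idea behind several of the entries above can be sketched in a few lines. This is a minimal illustration, not the exact procedure of RocketQA or any other paper listed: it assumes each mined negative already carries a relevance score from some stronger teacher model (for example a cross-encoder), and it drops candidates the teacher considers likely relevant. The function name `filter_false_negatives`, the score scale, and the `threshold` value are all hypothetical.

```python
from typing import List, Tuple


def filter_false_negatives(
    candidates: List[Tuple[str, float]],
    threshold: float = 0.9,
) -> List[Tuple[str, float]]:
    """Keep only mined negatives whose teacher score is below `threshold`.

    `candidates` holds (doc_id, teacher_score) pairs. A high teacher score
    suggests the passage may be an unlabeled positive, so it is discarded
    rather than used as a training negative. The threshold is an assumed,
    illustrative value, not one taken from any paper.
    """
    return [(doc_id, score) for doc_id, score in candidates if score < threshold]


# Hypothetical mined negatives with teacher scores in [0, 1].
mined = [("d1", 0.97), ("d2", 0.40), ("d3", 0.12), ("d4", 0.91)]
kept = filter_false_negatives(mined, threshold=0.9)
print(kept)
```

In this toy input, `d1` and `d4` are treated as probable false negatives and removed, while `d2` and `d3` remain available as training negatives. In practice the threshold would need tuning per teacher and per dataset.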

Training Efficiency

Methods that reduce computational cost:

  • GradCache: Memory-efficient large batches

  • Negative Cache: Amortized hard negative mining

  • TAS-Balanced: Single GPU training (<48h)

  • ADORE: Fixed document encoder

  • JPQ: Joint query-index optimization

Knowledge Distillation

Using strong teachers to train fast students:

  • RocketQA: Cross-encoder teacher

  • PAIR: Passage-centric similarity

  • TAS-Balanced: Dual-teacher (pairwise + in-batch)

  • ColBERTv2: Denoised supervision

  • CL-DRD: Curriculum distillation
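One pairwise objective commonly used for this kind of teacher-student training is a margin-based mean squared error: the student is trained so that its score gap between a positive and a negative passage matches the teacher's gap. The sketch below is a plain-Python illustration under that assumption. It is not presented as the exact loss of any specific paper listed above, and the function name `margin_mse` and the toy scores are hypothetical.

```python
from typing import Sequence


def margin_mse(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_pos: Sequence[float],
    teacher_neg: Sequence[float],
) -> float:
    """Mean squared error between student and teacher score margins.

    For each (query, positive, negative) triple, the margin is
    score(positive) - score(negative). The loss penalizes the squared
    difference between the student's margin and the teacher's margin.
    """
    n = len(student_pos)
    if n == 0 or not (n == len(student_neg) == len(teacher_pos) == len(teacher_neg)):
        raise ValueError("all score lists must be non-empty and of equal length")

    total = 0.0
    for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg):
        diff = (sp - sn) - (tp - tn)  # student margin minus teacher margin
        total += diff * diff
    return total / n


# Two hypothetical training triples.
loss = margin_mse(
    student_pos=[2.0, 1.0],
    student_neg=[1.0, 1.5],
    teacher_pos=[3.0, 2.0],
    teacher_neg=[1.0, 0.0],
)
print(loss)  # 3.625
```

Matching margins rather than raw scores lets the student and teacher use different score scales, which is useful when the teacher is a cross-encoder and the student is a dot-product bi-encoder.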

By Dataset/Domain

Papers organized by evaluation dataset:

  • MS MARCO: Most papers (standard benchmark)

  • Natural Questions: DPR, REALM, ORQA

  • BEIR (zero-shot): Contriever, coCondenser, BGE

  • Domain-specific: Legal, medical, code search

Complete Chronological Timeline

2019

  • ORQA (Lee et al., ACL)

  • DENSPI (Seo et al., ACL)

  • Poly-encoders (Humeau et al., ICLR 2020, arXiv 2019)

2020

  • DPR (Karpukhin et al., EMNLP) - The foundation

  • RepBERT (Zhan et al., arXiv)

  • REALM (Guu et al., ICML)

  • ANCE (Xiong et al., ICLR 2021, arXiv 2020)

  • RocketQA (Qu et al., NAACL 2021, arXiv 2020)

  • ColBERT (Khattab & Zaharia, SIGIR)

2021

  • TAS-Balanced (Hofstätter et al., SIGIR)

  • ADORE (Zhan et al., SIGIR)

  • PAIR (Ren et al., ACL Findings)

  • GradCache (Gao et al., RepL4NLP)

  • Negative Cache (Lindgren et al., NeurIPS)

  • Condenser (Gao & Callan, EMNLP)

  • DensePhrases (Lee et al., ACL)

  • JPQ (Zhan et al., CIKM)

2022

  • SimANS (Zhou et al., EMNLP)

  • CL-DRD (Zeng et al., SIGIR)

  • ColBERTv2 (Santhanam et al., NAACL)

  • coCondenser (Gao & Callan, ACL)

  • Contriever (Izacard et al., TMLR)

2023

  • Noisy Pair Corrector (EMNLP Findings)

  • EHI/Poeem (arXiv)

  • BGE (BAAI) - General-purpose text embedding models

  • E5-Mistral (Microsoft) - LLM-based embeddings

2024

  • CCR (arXiv)

  • TriSampler (arXiv)

  • SyNeg (arXiv)

  • LLM2Vec (McGill) - Converting LLMs to text encoders

  • BGE-M3 (BAAI) - Multilingual, multi-granularity embeddings

  • Jina Embeddings v3 (Jina AI) - 8K context window embeddings

  • NV-Embed (NVIDIA) - Generalist embedding model

Quick Navigation

I want to improve my retrieval model’s accuracy:

→ Start with Hard Negative Mining

I’m building a system from scratch:

→ Read Overview of RAG and the Two-Stage Pipeline, then Stage 1: Retrieval Methods

I need faster inference:

→ See Sparse Retrieval Methods (BM25) or Joint Learning of Retrieval and Indexing (compression)

I want better re-ranking:

→ See Cross-Encoders for Re-ranking or LLM-Based Re-rankers

I’m doing research on training techniques:

→ Deep dive into Hard Negative Mining and Overview (theory)

Contributing New Papers

See Contributing for how to add new papers to this collection.

When adding papers, please categorize them appropriately:

  • Stage 1 or Stage 2?

  • What’s the key innovation?

  • Which section does it best fit?