Hybrid Dense-Sparse Methods¶
Hybrid methods combine the strengths of both sparse (BM25) and dense (neural) retrieval to achieve robustness across different query types.
The Complementarity Principle¶
BM25 Strengths:

- Exact keyword matching (entity names, IDs, codes)
- No vocabulary mismatch for technical terms
- Works well for rare terms

Dense Strengths:

- Semantic matching (synonyms, paraphrases)
- Handles natural language questions
- Captures context and intent

Together, the two cover both keyword-heavy and semantic queries, as the toy example below illustrates.
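A minimal sketch of the complementarity, assuming the rank_bm25 and sentence-transformers packages; the documents, queries, and model name are illustrative choices, not from the papers below:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["Error code E1047 indicates a failed TLS handshake.",
        "The connection could not be established securely."]
bm25 = BM25Okapi([d.lower().split() for d in docs])
model = SentenceTransformer("all-MiniLM-L6-v2")

# Keyword query: BM25 pins the exact error code, which dense models may blur
print(bm25.get_scores("e1047".split()))  # high score only for doc 0

# Semantic query: shares no terms with doc 1, so BM25 scores it zero,
# while the dense model matches the paraphrase
query_vec = model.encode("why won't it connect safely?")
print(util.cos_sim(query_vec, model.encode(docs)))
```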
Hybrid Methods Literature¶
| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DENSPI) | Seo et al. | ACL 2019 | — | Dense-sparse phrase index: combines dense vectors with sparse phrase matching for real-time phrase-level retrieval. Stores billions of phrase representations for precise extraction. |
| Complementing Lexical Retrieval with Semantic Residual Embedding | Gao et al. | ECIR 2021 | NA | Learn what BM25 misses: a neural model learns the semantic "residual" that BM25 fails to capture, achieving efficient complementarity through an orthogonalization objective. |
| DensePhrases: Learning Dense Representations of Phrases at Scale | Lee et al. | ACL 2021 | — | Dense phrase embeddings: every phrase gets a dense vector, with novel phrase-level negative sampling. Can serve as a knowledge base for multi-hop reasoning. |
Implementation Patterns¶
Pattern 1: Score Fusion¶
Simplest approach: retrieve with both, combine scores.
```python
# Retrieve from both indices (bm25 and bi_encoder are assumed to return
# result objects whose .scores attribute is a {doc_id: score} mapping)
bm25_results = bm25.search(query, k=100)
dense_results = bi_encoder.search(query, k=100)

# Min-max normalize each score set to [0, 1] so the two ranges are comparable
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo + 1e-9) for doc, s in scores.items()}

bm25_scores = normalize(bm25_results.scores)
dense_scores = normalize(dense_results.scores)

# Weighted fusion; a doc missing from one index contributes 0 there
alpha = 0.5  # tunable weight
combined = {
    doc_id: alpha * bm25_scores.get(doc_id, 0.0)
            + (1 - alpha) * dense_scores.get(doc_id, 0.0)
    for doc_id in bm25_scores.keys() | dense_scores.keys()
}

# Rank by combined score and keep the top 100
final_results = sorted(combined, key=combined.get, reverse=True)[:100]
```
Tuning alpha: use a validation set to find the optimal weight (typically 0.3-0.7); a sketch follows.
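A minimal grid search for alpha; `fuse()` and `validation_mrr()` are hypothetical helpers (the former applies the weighted fusion above, the latter is any rank metric over validation relevance judgments):

```python
best_alpha, best_score = 0.5, float("-inf")
for alpha in [round(0.05 * i, 2) for i in range(21)]:  # 0.0, 0.05, ..., 1.0
    rankings = {q: fuse(q, alpha) for q in validation_queries}
    score = validation_mrr(rankings, validation_qrels)
    if score > best_score:
        best_alpha, best_score = alpha, score
```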
Pattern 2: Cascade Filtering¶
Use the fast method first, then refine with the slower, more accurate one.

```python
# Stage 1a: BM25 casts a wide net (very fast)
bm25_candidates = bm25.search(query, k=10000)

# Stage 1b: the dense model re-scores the candidates; keep the top 100
dense_scores = bi_encoder.score(query, bm25_candidates)
top_100 = [doc for doc, _ in sorted(zip(bm25_candidates, dense_scores),
                                    key=lambda pair: pair[1],
                                    reverse=True)[:100]]
```

Advantage: BM25 is fast enough that retrieving 10K candidates adds only ~5ms, while the wider net improves recall for the dense re-ranking stage.
Pattern 3: Semantic Residual¶
Train a neural model to capture what BM25 misses.

```python
# From Gao et al., ECIR 2021: the training loss pushes the dense model
# to be orthogonal to BM25 ("lam" stands in for the weighting
# hyperparameter; "lambda" itself is a reserved word in Python)
loss = contrastive_loss(query, pos, neg) \
       + lam * orthogonality_loss(dense_scores, bm25_scores)
# At inference, the dense model focuses on the semantic gaps BM25 leaves
```
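The `orthogonality_loss` above is left abstract; one plausible instantiation (a sketch, not Gao et al.'s exact formulation) penalizes the batch-level correlation between the two score vectors:

```python
import torch

def orthogonality_loss(dense_scores: torch.Tensor,
                       bm25_scores: torch.Tensor) -> torch.Tensor:
    # Center both score vectors over the batch, then penalize their squared
    # cosine similarity: the loss is zero when the scores are uncorrelated,
    # i.e. when the dense model ranks along directions BM25 does not
    d = dense_scores - dense_scores.mean()
    b = bm25_scores - bm25_scores.mean()
    cos = (d * b).sum() / (d.norm() * b.norm() + 1e-9)
    return cos ** 2
```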
When to Use Hybrid¶
| Scenario | Recommendation |
|---|---|
| Mixed Query Types | Use hybrid (some queries are keyword-heavy, some semantic) |
| Technical Domains | Use hybrid (entity names need exact match, concepts need semantics) |
| Maximum Recall | Use hybrid (BM25 catches what dense misses and vice versa) |
| Unknown Query Distribution | Start with hybrid (safer than committing to one method) |
| Homogeneous Semantic Queries | Dense only (hybrid overhead not worth it) |
| Pure Keyword Search | BM25 only (faster, simpler) |
Empirical Results¶
Typical improvements from hybrid over either method alone:

Dataset: MS MARCO Dev

- BM25 only: MRR@10 = 0.187
- Dense only: MRR@10 = 0.311
- Hybrid (α = 0.5): MRR@10 = 0.336 (+8% relative over dense alone)

Dataset: Natural Questions

- BM25 only: Recall@100 = 0.73
- Dense only: Recall@100 = 0.85
- Hybrid (α = 0.4): Recall@100 = 0.89 (+4.7% relative over dense alone)

Pattern: hybrid helps most when the query distribution is diverse.
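For reference, the MRR@10 numbers above can be computed with a short helper (a sketch; the input format is an assumption):

```python
def mrr_at_10(rankings, relevant):
    """rankings: list of ranked doc-id lists, one per query;
    relevant: list of sets of relevant doc ids, aligned with rankings."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        # Reciprocal rank of the first relevant hit within the top 10
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)
```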
Implementation Resources¶
Libraries with Hybrid Support
```python
# Haystack 1.x framework (note: EmbeddingRetriever's parameter is
# `embedding_model`)
from haystack.nodes import BM25Retriever, EmbeddingRetriever
from haystack.pipelines import Pipeline

bm25 = BM25Retriever(document_store=document_store)
dense = EmbeddingRetriever(document_store=document_store,
                           embedding_model="BAAI/bge-base-en-v1.5")
# Fuse the two retrievers' results with a JoinDocuments node in a Pipeline
```
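Wiring the fusion in Haystack 1.x is typically done with a `JoinDocuments` node; a sketch assuming the 1.x Pipeline API:

```python
from haystack.nodes import JoinDocuments

pipeline = Pipeline()
pipeline.add_node(component=bm25, name="BM25", inputs=["Query"])
pipeline.add_node(component=dense, name="Dense", inputs=["Query"])
# join_mode="merge" combines document scores;
# "reciprocal_rank_fusion" is another built-in option
pipeline.add_node(component=JoinDocuments(join_mode="merge"),
                  name="Join", inputs=["BM25", "Dense"])

results = pipeline.run(query="what is hybrid retrieval?")
```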
```python
# LlamaIndex (0.x API)
from llama_index import VectorStoreIndex, SimpleKeywordTableIndex
from llama_index.query_engine import RetrieverQueryEngine
# Combines vector and keyword retrieval (e.g., through a custom retriever)
```
Best Practices¶

- Always tune the fusion weight (alpha) on a validation set
- Normalize scores before combining (the two methods produce different ranges)
- Consider query type classification: route to BM25 or dense based on the query (see the routing sketch below)
- Monitor both components: ensure neither is degraded
- Evaluate on a diverse benchmark (BEIR spans 18 datasets)
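For the query-type routing practice above, a toy heuristic router might look like this (the regex and routing rule are assumptions for illustration, not a published method):

```python
import re

def route(query: str) -> str:
    # Queries containing ID/code-like tokens or quoted phrases tend to be
    # lexical; everything else goes to the dense retriever
    looks_lexical = bool(re.search(r'[A-Z]+-?\d+|"', query))
    return "bm25" if looks_lexical else "dense"

route('error E1047')            # -> "bm25"
route("why won't it connect?")  # -> "dense"
```

In production this heuristic would usually be replaced by a small trained classifier over query features.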
Next Steps¶
- See Sparse Retrieval Methods for a detailed BM25 explanation
- See Dense Baselines & Fixed Embeddings for dense retrieval fundamentals
- See Hard Negative Mining for improving the dense component
- See Late Interaction (ColBERT) for a unified late-interaction approach