Hybrid Dense-Sparse Methods

Hybrid methods combine the strengths of both sparse (BM25) and dense (neural) retrieval to achieve robustness across different query types.

The Complementarity Principle

BM25 Strengths:

  • Exact keyword matching (entity names, IDs, codes)

  • No vocabulary mismatch for technical terms

  • Works well for rare terms

Dense Strengths:

  • Semantic matching (synonyms, paraphrases)

  • Handles natural language questions

  • Captures context and intent

Together: cover both keyword-heavy and semantic queries. For example, a lookup like "ticket ABC-1234" benefits from BM25's exact matching, while "how do I reset my password" benefits from dense retrieval's semantic matching.

Hybrid Methods Literature

| Paper | Author | Venue | Code | Key Innovation |
|---|---|---|---|---|
| Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DENSPI) | Seo et al. | ACL 2019 | Available | Dense-sparse phrase index: combines dense vectors with sparse phrase matching for real-time, phrase-level retrieval; stores billions of phrase representations for precise extraction. |
| Complementing Lexical Retrieval with Semantic Residual Embedding | Gao et al. | ECIR 2021 | NA | Learn what BM25 misses: a neural model learns the semantic "residual" that BM25 fails to capture, achieving efficient complementarity through an orthogonalization objective. |
| DensePhrases: Learning Dense Representations of Phrases at Scale | Lee et al. | ACL 2021 | Available | Dense phrase embeddings: every phrase gets a dense vector, with novel phrase-level negative sampling; can serve as a knowledge base for multi-hop reasoning. |

Implementation Patterns

Pattern 1: Score Fusion

Simplest approach: retrieve with both, combine scores.

# Retrieve top-100 candidates from both indices
# (each result is assumed to expose .doc_id and .score)
bm25_results = bm25.search(query, k=100)
dense_results = bi_encoder.search(query, k=100)

# Normalize scores to [0, 1] so the two scales are comparable
bm25_scores = normalize({r.doc_id: r.score for r in bm25_results})
dense_scores = normalize({r.doc_id: r.score for r in dense_results})

# Weighted fusion; a document missing from one list contributes 0 from that side
alpha = 0.5  # tunable weight
combined = {}
for doc_id in set(bm25_scores) | set(dense_scores):
    combined[doc_id] = (alpha * bm25_scores.get(doc_id, 0.0)
                        + (1 - alpha) * dense_scores.get(doc_id, 0.0))

# Rank by combined score
final_results = sorted(combined, key=combined.get, reverse=True)[:100]
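
A minimal min-max implementation of the normalize helper used above (an assumption about its behavior; any monotone rescaling to [0, 1] works):

def normalize(scores: dict) -> dict:
    # Min-max rescale a {doc_id: score} dict to the [0, 1] range
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores identical: map everything to 1.0
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}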

Tuning alpha: Use validation set to find optimal weight (typically 0.3-0.7).
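
A sketch of that tuning loop, assuming a labeled validation set and a hypothetical validation_mrr helper that runs the fusion above and returns MRR@10:

def tune_alpha(validation_queries, validation_mrr):
    # validation_mrr(queries, alpha) runs the fusion above and returns MRR@10
    best_alpha, best_mrr = 0.0, float("-inf")
    for alpha in [i / 10 for i in range(11)]:  # 0.0, 0.1, ..., 1.0
        mrr = validation_mrr(validation_queries, alpha)
        if mrr > best_mrr:
            best_alpha, best_mrr = alpha, mrr
    return best_alpha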

Pattern 2: Cascade Filtering

Use fast method first, then refine with slow method.

# Stage 1: BM25 (very fast) casts a wide net
bm25_candidates = bm25.search(query, k=10_000)

# Stage 2: dense bi-encoder scores the candidates and keeps the top 100
dense_scores = bi_encoder.score(query, bm25_candidates)
top_100 = [doc for doc, _ in sorted(zip(bm25_candidates, dense_scores),
                                    key=lambda pair: pair[1], reverse=True)[:100]]

Advantage: BM25 is so fast that retrieving 10K docs costs ~5ms extra but improves recall.
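
One way the dense scoring step above might be implemented, assuming a sentence-transformers bi-encoder (a sketch under that assumption, not the only option; the model name matches the one used later in this section):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

def dense_score(query, doc_texts):
    # Cosine similarity between the query and each candidate document
    q_emb = model.encode([query], normalize_embeddings=True)    # shape (1, dim)
    d_emb = model.encode(doc_texts, normalize_embeddings=True)  # shape (n, dim)
    return (d_emb @ q_emb[0]).tolist()                          # one score per candidate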

Pattern 3: Semantic Residual

Train neural model to capture what BM25 misses.

# From Gao et al., ECIR 2021
# The loss encourages the dense model to be orthogonal to BM25
lambda_weight = ...  # hyperparameter weighting the orthogonality term
loss = (contrastive_loss(query, pos, neg)
        + lambda_weight * orthogonality_loss(dense_scores, bm25_scores))

# At inference time, the dense scores focus on the semantic gaps BM25 leaves
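
One possible instantiation of the orthogonality_loss term (an illustrative assumption, not the exact formulation from the paper): penalize the squared correlation between the dense and BM25 scores over a batch.

import torch

def orthogonality_loss(dense_scores: torch.Tensor, bm25_scores: torch.Tensor) -> torch.Tensor:
    # Center both score vectors over the batch
    d = dense_scores - dense_scores.mean()
    b = bm25_scores - bm25_scores.mean()
    # Squared correlation: 0 when the dense scores are uncorrelated with BM25
    return (d @ b).pow(2) / ((d @ d) * (b @ b) + 1e-8)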

When to Use Hybrid

Use Case Recommendations

| Scenario | Recommendation |
|---|---|
| Mixed Query Types | Use hybrid (some queries are keyword-heavy, some semantic) |
| Technical Domains | Use hybrid (entity names need exact match, concepts need semantics) |
| Maximum Recall | Use hybrid (BM25 catches what dense misses and vice versa) |
| Unknown Query Distribution | Start with hybrid (safer than committing to one method) |
| Homogeneous Semantic Queries | Dense only (hybrid overhead not worth it) |
| Pure Keyword Search | BM25 only (faster, simpler) |

Empirical Results

Typical improvements from hybrid over single method:

Dataset: MS MARCO Dev
BM25 only:        MRR@10 = 0.187
Dense only:       MRR@10 = 0.311
Hybrid (α=0.5):   MRR@10 = 0.336  (+8% over dense alone!)

Dataset: Natural Questions
BM25 only:        Recall@100 = 0.73
Dense only:       Recall@100 = 0.85
Hybrid (α=0.4):   Recall@100 = 0.89  (+4.7% over dense alone)

Pattern: Hybrid helps most when query distribution is diverse.
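
For reference, MRR@10 (reported above) is the mean reciprocal rank of the first relevant document within the top 10 results; a minimal computation, assuming rankings and relevance judgments keyed by query id:

def mrr_at_10(rankings, relevant):
    # rankings: {query_id: [doc_id, ...]}, relevant: {query_id: set of doc_ids}
    total = 0.0
    for qid, docs in rankings.items():
        for rank, doc_id in enumerate(docs[:10], start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(rankings)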

Implementation Resources

Libraries with Hybrid Support

# Haystack (v1.x) framework
from haystack.nodes import BM25Retriever, EmbeddingRetriever, JoinDocuments
from haystack.pipelines import Pipeline

bm25 = BM25Retriever(document_store=document_store)
dense = EmbeddingRetriever(document_store=document_store,
                           embedding_model="BAAI/bge-base-en-v1.5")

# Fuse the two result lists with a JoinDocuments node (e.g. reciprocal rank fusion)
pipeline = Pipeline()
pipeline.add_node(component=bm25, name="BM25", inputs=["Query"])
pipeline.add_node(component=dense, name="Dense", inputs=["Query"])
pipeline.add_node(component=JoinDocuments(join_mode="reciprocal_rank_fusion"),
                  name="Join", inputs=["BM25", "Dense"])

# LlamaIndex
from llama_index import VectorStoreIndex, SimpleKeywordTableIndex
from llama_index.query_engine import RetrieverQueryEngine

# Combines vector and keyword retrieval

Best Practices

  1. Always tune the fusion weight (alpha) on a validation set

  2. Normalize scores before combining (the two methods produce scores on different ranges)

  3. Consider query-type classification: route each query to BM25 or dense retrieval based on its shape (a minimal routing sketch follows this list)

  4. Monitor both components to make sure neither has degraded

  5. Evaluate on a diverse benchmark (BEIR has 18 datasets)
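
A minimal query-routing heuristic for practice 3 above (purely illustrative assumptions about what counts as "keyword-like"):

import re

def route_query(query: str) -> str:
    tokens = query.lower().split()
    # Codes, IDs, and quoted phrases favor exact (BM25) matching
    has_code_like_token = bool(re.search(r"\b[A-Za-z0-9]+[-_][A-Za-z0-9]+\b", query))
    has_quoted_phrase = '"' in query
    # Natural-language questions favor dense retrieval
    looks_like_question = bool(tokens) and tokens[0] in {"what", "how", "why", "who", "when", "where"}

    if has_code_like_token or has_quoted_phrase:
        return "bm25"
    if looks_like_question:
        return "dense"
    return "hybrid"  # when unsure, fall back to hybrid retrieval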

Next Steps