Hybrid Dense-Sparse Methods ============================ Hybrid methods combine the strengths of both sparse (BM25) and dense (neural) retrieval to achieve robustness across different query types. The Complementarity Principle ------------------------------ **BM25 Strengths:** * Exact keyword matching (entity names, IDs, codes) * No vocabulary mismatch for technical terms * Works well for rare terms **Dense Strengths:** * Semantic matching (synonyms, paraphrases) * Handles natural language questions * Captures context and intent **Together:** Cover both keyword-heavy and semantic queries. Hybrid Methods Literature -------------------------- .. list-table:: :header-rows: 1 :widths: 25 12 8 10 45 * - Paper - Author - Venue - Code - Key Innovation * - `Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DENSPI) `_ - Seo et al. - ACL 2019 - `Code `_ - **Dense-Sparse Phrase Index**: Combines dense vectors with sparse phrase matching. Real-time phrase-level retrieval. Stores billions of phrase representations for precise extraction. * - `Complementing Lexical Retrieval with Semantic Residual Embedding `_ - Gao et al. - ECIR 2021 - NA - **Learn What BM25 Misses**: Neural model learns semantic "residual"—what BM25 fails to capture. Highly efficient complementarity through orthogonalization objective. * - `DensePhrases: Learning Dense Representations of Phrases at Scale `_ - Lee et al. - ACL 2021 - `Code `_ - **Dense Phrase Embeddings**: Every phrase gets dense vector. Novel negative sampling at phrase level. Can serve as knowledge base for multi-hop reasoning. Implementation Patterns ----------------------- Pattern 1: Score Fusion ^^^^^^^^^^^^^^^^^^^^^^^ Simplest approach: retrieve with both, combine scores. .. code-block:: python # Retrieve from both indices bm25_results = bm25.search(query, k=100) dense_results = bi_encoder.search(query, k=100) # Normalize scores to [0, 1] bm25_scores = normalize(bm25_results.scores) dense_scores = normalize(dense_results.scores) # Weighted fusion alpha = 0.5 # Tunable weight for doc_id in union(bm25_results, dense_results): combined_score = alpha * bm25_scores[doc_id] + (1 - alpha) * dense_scores[doc_id] # Rank by combined score final_results = rank_by_score(combined_score)[:100] **Tuning alpha**: Use validation set to find optimal weight (typically 0.3-0.7). Pattern 2: Cascade Filtering ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use fast method first, then refine with slow method. .. code-block:: python # Stage 1a: BM25 (very fast, cast wide net) bm25_candidates = bm25.search(query, k=10000) # Stage 1b: Dense re-rank to top-100 dense_scores = bi_encoder.score(query, bm25_candidates) top_100 = rank_by_score(bm25_candidates, dense_scores)[:100] **Advantage**: BM25 is so fast that retrieving 10K docs costs ~5ms extra but improves recall. Pattern 3: Semantic Residual ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Train neural model to capture what BM25 misses. .. code-block:: python # From Gao et al., ECIR 2021 # Loss encourages dense model to be orthogonal to BM25 loss = contrastive_loss(query, pos, neg) \\ + lambda * orthogonality_loss(dense_scores, bm25_scores) # At inference: dense focuses on semantic gaps When to Use Hybrid ------------------ .. list-table:: Use Case Recommendations :header-rows: 1 :widths: 30 70 * - Scenario - Recommendation * - **Mixed Query Types** - Use hybrid (some queries keyword-heavy, some semantic) * - **Technical Domains** - Use hybrid (entity names need exact match, concepts need semantics) * - **Maximum Recall** - Use hybrid (BM25 catches what dense misses and vice versa) * - **Unknown Query Distribution** - Start with hybrid (safer than choosing one) * - **Homogeneous Semantic Queries** - Dense only (hybrid overhead not worth it) * - **Pure Keyword Search** - BM25 only (faster, simpler) Empirical Results ----------------- Typical improvements from hybrid over single method: .. code-block:: text Dataset: MS MARCO Dev BM25 only: MRR@10 = 0.187 Dense only: MRR@10 = 0.311 Hybrid (α=0.5): MRR@10 = 0.336 (+8% over dense alone!) Dataset: Natural Questions BM25 only: Recall@100 = 0.73 Dense only: Recall@100 = 0.85 Hybrid (α=0.4): Recall@100 = 0.89 (+4.7% over dense alone) **Pattern**: Hybrid helps most when query distribution is diverse. Implementation Resources ------------------------- **Libraries with Hybrid Support** .. code-block:: python # Haystack framework from haystack.nodes import BM25Retriever, EmbeddingRetriever from haystack.pipelines import Pipeline bm25 = BM25Retriever(document_store) dense = EmbeddingRetriever(document_store, model="BAAI/bge-base-en-v1.5") # Haystack automatically fuses scores # LlamaIndex from llama_index import VectorStoreIndex, SimpleKeywordTableIndex from llama_index.query_engine import RetrieverQueryEngine # Combines vector and keyword retrieval Best Practices -------------- 1. **Always tune the fusion weight (alpha)** on validation set 2. **Normalize scores** before combining (different ranges) 3. **Consider query type classification**: route to BM25 or dense based on query 4. **Monitor both components**: ensure neither is degraded 5. **Evaluate on diverse benchmark** (BEIR has 18 datasets) Next Steps ---------- * See :doc:`sparse` for detailed BM25 explanation * See :doc:`dense_baselines` for dense retrieval fundamentals * See :doc:`hard_mining` for improving dense component * See :doc:`late_interaction` for ColBERT's unified approach