Sparse Retrieval Methods
========================

Sparse retrieval methods dominated information retrieval for decades before the advent of dense neural methods. They remain relevant today as fast, interpretable baselines and as components of hybrid systems.

BM25: The Classic Baseline
--------------------------

**Best Matching 25** (BM25) is the most widely used sparse retrieval algorithm.

Key Characteristics
^^^^^^^^^^^^^^^^^^^

* **Bag-of-Words**: Ignores word order, focuses on term frequency
* **Inverted Index**: Pre-computed data structure for fast lookup
* **TF-IDF Based**: Balances term frequency with document frequency
* **Deterministic**: No training required, pure statistics

The BM25 Formula
^^^^^^^^^^^^^^^^

.. math::

   \text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

Where:

- :math:`f(q_i, D)` = frequency of term :math:`q_i` in document :math:`D`
- :math:`|D|` = length of document :math:`D`
- :math:`\text{avgdl}` = average document length in the collection
- :math:`k_1` = term frequency saturation parameter (typically 1.2-2.0)
- :math:`b` = length normalization parameter (typically 0.75)

Strengths
^^^^^^^^^

✅ **Very Fast**: Term lookup in the inverted index is cheap, and scoring touches only the posting lists of query terms

✅ **No Training**: Works out-of-the-box on any text

✅ **Interpretable**: Scores can be explained via term matching

✅ **Robust**: Works well for keyword-heavy queries

✅ **Memory Efficient**: Only stores term statistics

Weaknesses
^^^^^^^^^^

❌ **Vocabulary Mismatch**: Fails when query and document use different words

❌ **No Semantics**: Can't match "car" with "automobile"

❌ **Keyword Dependent**: Poor for natural language questions

❌ **No Context**: Treats all occurrences of a word identically

When to Use BM25
----------------

BM25 is still the **best choice** for:

**Legal and Medical Search**
    Documents and queries use precise technical terminology.
    Example: "42 USC 1983" should match the exact statute.

**Code Search**
    Identifiers and function names should match exactly.
    Example: "numpy.array" shouldn't match "array list".

**Hybrid Stage 1**
    BM25 casts a wide net (e.g., 10K candidates), then dense retrieval re-ranks.

**Evaluation Baseline**
    Always compare dense methods against BM25 to show semantic improvement.

Implementation
--------------

**Python (Pyserini/Lucene)**

.. code-block:: python

    from pyserini.search import SimpleSearcher

    # Load a pre-built Lucene index of your corpus
    searcher = SimpleSearcher('indexes/my-corpus')
    searcher.set_bm25(k1=0.9, b=0.4)  # Tune hyperparameters

    # Search
    hits = searcher.search('what is the capital of France', k=100)
    for hit in hits:
        print(f"Doc {hit.docid}: {hit.score:.4f}")

**Python (Elasticsearch)**

.. code-block:: python

    from elasticsearch import Elasticsearch

    es = Elasticsearch('http://localhost:9200')

    # Elasticsearch ranks `match` queries with BM25 by default
    query = {
        "match": {
            "content": {
                "query": "capital of France",
                "operator": "or"
            }
        }
    }

    results = es.search(index="my-corpus", query=query, size=100)
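**Python (from scratch)**

To make the formula above concrete, here is a minimal self-contained sketch of BM25 scoring over an in-memory corpus. The toy documents, the whitespace tokenizer, and the Lucene-style smoothed IDF are illustrative assumptions, not any particular library's implementation.

.. code-block:: python

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.2, b=0.75):
        """Score every document in `docs` against `query` (illustrative sketch)."""
        tokenized = [d.lower().split() for d in docs]   # naive whitespace tokenizer
        N = len(tokenized)
        avgdl = sum(len(d) for d in tokenized) / N      # average document length
        df = Counter()                                  # document frequency per term
        for d in tokenized:
            df.update(set(d))
        scores = []
        for d in tokenized:
            tf = Counter(d)
            score = 0.0
            for q in query.lower().split():
                if tf[q] == 0:
                    continue
                # Smoothed IDF: ln(1 + (N - df + 0.5) / (df + 0.5))
                idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
                score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(score)
        return scores

    docs = ["Paris is the capital of France", "Berlin is the capital of Germany"]
    print(bm25_scores("capital of France", docs))  # the first document scores highest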
Modern Sparse Methods
---------------------

Recent research has developed **learned sparse** methods that retain BM25's efficiency while adding semantic capability.

SPLADE
^^^^^^

The **Sparse Lexical and Expansion** (SPLADE) model learns to:

* Up-weight important terms (like BM25)
* Add related terms not in the original text (expansion)
* All while maintaining sparse representations for efficient indexing

A minimal inference sketch using the Hugging Face ``transformers`` API and the public ``naver/splade-cocondenser-ensembledistil`` checkpoint (the official SPLADE repository wraps the same computation):

.. code-block:: python

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('naver/splade-cocondenser-ensembledistil')
    model = AutoModelForMaskedLM.from_pretrained('naver/splade-cocondenser-ensembledistil')

    inputs = tokenizer("what is the capital of France", return_tensors='pt')
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    # Sparse vector with learned weights: max over tokens of log(1 + ReLU(logit))
    query_vec = torch.log1p(torch.relu(logits)).amax(dim=1).squeeze()
    # Non-zero entries might include "capital", "france", "paris" (expansion!)

**Advantages over BM25:**

- Handles synonyms and related terms
- Still uses an inverted index
- Can leverage neural pre-training

Recommendations
---------------

**For New Projects**

1. **Start with BM25** as a baseline (always measure improvement over it)
2. **Add Dense Retrieval** (bi-encoder) for semantic matching
3. **Consider Hybrid** (BM25 + Dense) for robustness
4. **SPLADE** if you want learned sparse (best of both worlds)

**Hyperparameter Tuning**

BM25 has two key parameters:

* **k1** (term frequency saturation): Try 0.9, 1.2, 2.0
* **b** (length normalization): Try 0.4, 0.75, 1.0

Optimal values depend on your corpus:

- Short documents (e.g., tweets): lower b (0.3-0.5)
- Long documents (e.g., articles): higher b (0.7-0.9)
- Use grid search on a validation set (a sketch appears at the end of this page)

Resources
---------

**Libraries**

* `Pyserini <https://github.com/castorini/pyserini>`_ - Python wrapper for Lucene (BM25)
* `Elasticsearch <https://www.elastic.co/elasticsearch>`_ - Distributed search engine
* `SPLADE <https://github.com/naver/splade>`_ - Learned sparse retrieval

**Papers**

* Robertson & Zaragoza (2009) - "The Probabilistic Relevance Framework: BM25 and Beyond"
* Formal et al. (2021) - "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking"

**Datasets for Evaluation**

* MS MARCO Passage Ranking
* Natural Questions (NQ)
* BEIR (for zero-shot evaluation)

Next Steps
----------

* Proceed to :doc:`dense_baselines` to see how neural methods improve over BM25
* See :doc:`hybrid` for combining sparse and dense methods
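Appendix: Grid-Search Sketch
----------------------------

To ground the grid-search recommendation above, here is a minimal sketch that reuses the ``bm25_scores`` function from the from-scratch example earlier on this page. The toy validation set, the parameter grid, and accuracy@1 as the selection metric are illustrative assumptions; in practice you would evaluate a real index on held-out queries with a metric such as nDCG@10.

.. code-block:: python

    from itertools import product

    # Toy validation set: (query, index of the relevant document in `docs`)
    docs = ["Paris is the capital of France", "Berlin is the capital of Germany"]
    val_set = [("capital of France", 0), ("capital of Germany", 1)]

    best = (-1.0, None, None)
    for k1, b in product([0.9, 1.2, 2.0], [0.4, 0.75, 1.0]):
        # Accuracy@1: fraction of queries whose relevant document ranks first
        correct = 0
        for query, rel in val_set:
            scores = bm25_scores(query, docs, k1=k1, b=b)
            correct += int(max(range(len(docs)), key=scores.__getitem__) == rel)
        acc = correct / len(val_set)
        if acc > best[0]:
            best = (acc, k1, b)

    print(f"best k1={best[1]}, b={best[2]} (accuracy@1={best[0]:.2f})")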