Late Interaction (ColBERT)
===========================

**Do they do Stage-1 and Stage-2 together?**

Short Answer
------------

**Yes, late interaction models like ColBERT bridge the gap between Stage 1 and Stage 2.** They can serve as:

* **High-quality Stage 1**: Retrieval from millions of documents (with optimized indexing)
* **Fine-grained Stage 2**: Token-level matching that mimics cross-encoder precision

Long Answer: Why Late Interaction is Special
---------------------------------------------

**Standard Dense Retrieval (Stage 1)**

* Compresses the entire document into a single vector (768-d)
* Fast but loses nuance
* Can't capture fine-grained matches

**Cross-Encoders (Stage 2)**

* Full attention between every query-document token pair
* Extremely accurate but prohibitively slow
* Can't pre-compute (needs the query at inference time)

**Late Interaction (ColBERT) - The Bridge**

* Stores a vector for *every token* in the document
* Performs the interaction *after* independent encoding ("late")
* Fast enough for Stage 1, accurate enough for Stage 2

The ColBERT Architecture
-------------------------

.. code-block:: text

   Standard Bi-Encoder:

   ┌──────────┐                          ┌──────────┐
   │ Query    │ → BERT → [768-d]      →  │          │
   │          │                          │ Dot Prod │ → Score
   │ Document │ → BERT → [768-d]      →  │          │
   └──────────┘                          └──────────┘

   ColBERT (Late Interaction):

   ┌──────────┐                          ┌──────────┐
   │ Query    │ → BERT → [32 x 128-d]  → │          │
   │ (32 tok) │                          │  MaxSim  │ → Score
   │ Document │ → BERT → [200 x 128-d] → │          │
   │ (200 tok)│                          └──────────┘
   └──────────┘

MaxSim Operation: For each query token, find the maximum similarity with any document token, then sum across all query tokens.

ColBERT Literature
------------------

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT `_
     - Khattab & Zaharia
     - SIGIR 2020
     - `Code `_
     - **Late Interaction**: Retains token-level embeddings and computes MaxSim *after* independent encoding. Captures fine detail like a cross-encoder but is indexable like a bi-encoder.
   * - `ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction `_
     - Santhanam et al.
     - NAACL 2022
     - `Code `_
     - **Compression + Denoising**: Residual compression reduces index size by 6-10x. Denoised supervision from a cross-encoder improves quality. Enables billion-scale retrieval.

Other Multi-Vector Methods
---------------------------

.. list-table::
   :header-rows: 1
   :widths: 25 12 8 10 45

   * - Paper
     - Author
     - Venue
     - Code
     - Key Innovation
   * - `Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring `_
     - Humeau et al.
     - ICLR 2020
     - `Code `_
     - **Attention over Codes**: Learned codes summarize the query/context as multiple vectors; the pre-computed candidate embedding attends over these codes. A balance of speed and accuracy; versatile for dialogue systems.
   * - `ME-BERT: Multi-Vector Encoding for Document Retrieval `_
     - Luan et al.
     - arXiv 2020
     - NA
     - **Multi-Vector per Doc**: Multiple vectors per document capture diverse topics. Each vector represents a different aspect/sub-topic.

How ColBERT Does Both Stages
-----------------------------

ColBERTv2 with PLAID
^^^^^^^^^^^^^^^^^^^^

The key innovation is the **PLAID (Performance-optimized Late Interaction Driver)** index:

**Stage 1 Capability: Fast Retrieval**

1. **Centroid-based pruning**: Groups similar token embeddings around centroids so most documents can be ruled out cheaply (see the sketch after this list)
2. **Early termination**: Stops scoring a document once it clearly won't make the top-k
3. **Quantization**: Compresses each 128-d embedding from floating point to roughly 2 bits per dimension
4. **Result**: Can search 10M documents in ~50-100ms

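Below is a toy NumPy sketch of the centroid-pruning idea; all sizes and data are made up, and the real PLAID engine adds residual decompression and several progressively stricter filtering passes. The point is the shape of the computation: every document is first scored approximately from query-centroid similarities alone, and only the survivors receive exact MaxSim scoring.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   Q = rng.normal(size=(32, 128))             # query token embeddings (32 tokens)
   centroids = rng.normal(size=(1024, 128))   # k-means centroids over all doc token embeddings

   # Each document is stored as the centroid ids of its ~200 tokens.
   doc_centroid_ids = [rng.integers(0, 1024, size=200) for _ in range(10_000)]

   # 1. Score query tokens against every centroid once (cheap: a 32 x 1024 matrix).
   q_cent = Q @ centroids.T

   # 2. Approximate each document's MaxSim using only its centroids.
   approx = np.array([q_cent[:, ids].max(axis=1).sum() for ids in doc_centroid_ids])

   # 3. Keep a small candidate set for exact (decompressed) MaxSim scoring.
   candidates = np.argsort(approx)[::-1][:100]
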
**Stage 2 Capability: Fine-grained Matching**

1. **Token-level MaxSim**: Each query token finds its best-matching document token
2. **Captures phrases**: "capital of France" can match non-contiguous tokens
3. **Position awareness**: Different query tokens can match different parts of the document
4. **Result**: Accuracy approaching cross-encoders

Performance Comparison
----------------------

.. list-table:: Speed vs Accuracy Trade-off
   :header-rows: 1
   :widths: 20 20 20 20 20

   * - Method
     - Latency
     - Accuracy
     - Index Size
     - Use Case
   * - BM25
     - ~1ms
     - Baseline
     - Small (GB)
     - Keywords
   * - Bi-Encoder
     - ~10ms
     - +15%
     - Medium (10GB)
     - Semantic
   * - ColBERT
     - ~50ms
     - +25%
     - Large (100GB)
     - Both stages
   * - Cross-Encoder
     - ~1000ms/doc
     - +30%
     - None
     - Stage 2 only

**Key Insight**: By these numbers, ColBERT is roughly 5x slower than a bi-encoder but at least 20x faster than a cross-encoder.

When to Use ColBERT
-------------------

✅ **Use ColBERT As Primary Retriever When:**

* Accuracy is critical (medical, legal, research)
* You can afford the larger index (100-300GB for 10M docs)
* The latency budget allows 50-200ms
* You want a single-stage solution (no separate re-ranker needed)

✅ **Use ColBERT As Re-ranker When:**

* A bi-encoder retrieves the top-1000
* ColBERT MaxSim re-ranks them to the top-100
* A cross-encoder (optional) produces the final top-10
* Best of all worlds: fast initial retrieval, precise final ranking

❌ **Don't Use ColBERT When:**

* Index size is constrained (edge devices, mobile)
* You need sub-10ms latency (real-time autocomplete)
* The corpus is small enough to run a cross-encoder on everything (<10K docs)

Implementation Example
-----------------------

**Basic ColBERT Retrieval**

.. code-block:: python

   from colbert import Searcher
   from colbert.infra import ColBERTConfig

   # Initialize a searcher over a pre-built index
   config = ColBERTConfig(root="experiments/")
   searcher = Searcher(index="my_index", config=config)

   # Search; returns parallel lists of (passage_ids, ranks, scores)
   query = "What is the capital of France?"
   results = searcher.search(query, k=10)

   for passage_id, rank, score in zip(*results):
       print(f"Rank {rank}: {searcher.collection[passage_id]} (score: {score:.2f})")

**ColBERT as Re-ranker**

.. code-block:: python

   from sentence_transformers import SentenceTransformer, CrossEncoder, util
   import numpy as np

   # query: str and corpus: list[str] as above

   # Stage 1: Bi-encoder retrieves candidates
   bi_encoder = SentenceTransformer('BAAI/bge-base-en-v1.5')
   query_emb = bi_encoder.encode(query, convert_to_tensor=True)
   corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
   hits = util.semantic_search(query_emb, corpus_emb, top_k=1000)[0]
   candidates = [corpus[hit['corpus_id']] for hit in hits]

   # Stage 2: ColBERT re-ranks with token-level matching
   # (colbert_maxsim stands for any MaxSim scorer; see "Advanced Topic" below)
   colbert_scores = np.array([float(colbert_maxsim(query, doc)) for doc in candidates])

   # Top-100 after ColBERT re-ranking
   order = np.argsort(colbert_scores)[::-1]
   top_100 = [candidates[i] for i in order[:100]]

   # Stage 3 (optional): Cross-encoder for the final top-10
   cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
   ce_scores = cross_encoder.predict([(query, doc) for doc in top_100])
   final_top_10 = [doc for _, doc in sorted(zip(ce_scores, top_100), reverse=True)[:10]]

The Index Size Problem
-----------------------

**Why ColBERT Indexes Are Large**

Standard bi-encoder:

.. math::

   \text{Index Size} = N \times d \times 4 \text{ bytes}

   \text{For 10M docs: } 10^7 \times 768 \times 4 \approx 30\,\text{GB}

ColBERT:

.. math::

   \text{Index Size} = N \times \text{avg\_tokens} \times d \times 4 \text{ bytes}

   \text{For 10M docs: } 10^7 \times 200 \times 128 \times 4 \approx 1\,\text{TB}

**ColBERTv2 Solutions**:

1. **Compression**: Quantization reduces this to ~100GB (10x smaller; see the sketch below)
2. **Pruning**: Remove low-importance tokens
3. **Residual encoding**: Store deltas from centroids rather than full vectors

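This compression is chosen when the index is built. A minimal sketch with ColBERT's ``Indexer`` (``colbert-ir/colbertv2.0`` is the public ColBERTv2 checkpoint; the ``experiments/`` root, the index name, and ``collection.tsv`` are placeholders for your own setup):

.. code-block:: python

   from colbert import Indexer
   from colbert.infra import ColBERTConfig

   # nbits sets the per-dimension residual quantization (the ColBERTv2 compression above)
   config = ColBERTConfig(nbits=2, root="experiments/")

   indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)

   # collection.tsv: one passage per line, tab-separated (pid, passage text)
   indexer.index(name="my_index", collection="collection.tsv")

The resulting index is what the ``Searcher`` in the implementation example loads; ``nbits=1`` trades a little quality for an even smaller index.
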
Advanced Topic: MaxSim Operation
---------------------------------

The MaxSim operation is what gives ColBERT its power:

.. code-block:: python

   import torch

   def maxsim(query_embeddings, doc_embeddings):
       """
       query_embeddings: torch.Tensor of shape (num_query_tokens, 128)
       doc_embeddings:   torch.Tensor of shape (num_doc_tokens, 128)
       """
       # Compute all pairwise token similarities
       similarities = query_embeddings @ doc_embeddings.T  # (Q, D)

       # For each query token, find the max similarity with any doc token
       max_per_query_token = similarities.max(dim=1).values  # (Q,)

       # Sum across all query tokens
       score = max_per_query_token.sum()

       return score

**Why This Works**:

* Query token "capital" matches doc token "capital" (exact)
* Query token "France" matches doc tokens "French", "Paris" (semantic)
* Flexible matching while maintaining efficiency

Comparison with Cross-Encoders
-------------------------------

.. list-table:: ColBERT vs Cross-Encoder
   :header-rows: 1
   :widths: 30 35 35

   * - Dimension
     - ColBERT
     - Cross-Encoder
   * - **Encoding**
     - Independent (query, doc separate)
     - Joint ([CLS] query [SEP] doc [SEP])
   * - **Interaction**
     - Late (after encoding)
     - Early (full self-attention)
   * - **Pre-computation**
     - Yes (doc embeddings offline)
     - No (must encode each pair)
   * - **Speed for 100 docs**
     - ~50ms (MaxSim is cheap)
     - ~5000ms (100 forward passes)
   * - **Accuracy**
     - 90-95% of cross-encoder
     - Best possible
   * - **Best Use**
     - Stage 1 or Stage 2
     - Stage 2 only

Poly-Encoders: Another Middle Ground
-------------------------------------

Poly-encoders offer a different trade-off:

**Architecture**:

1. Query/context → BERT → multiple learned "code" vectors (e.g., 64 codes)
2. Candidate document → BERT → single vector (pre-computable offline)
3. The candidate vector attends over the query codes
4. The attention-weighted sum of codes is scored against the candidate vector

**Advantage**: More flexible aggregation than ColBERT's MaxSim

**Disadvantage**: More complex, less interpretable

**Use Case**: Dialogue systems, where candidates are short and interaction patterns are complex

Research Directions
-------------------

Current research on late interaction focuses on:

1. **Reducing Index Size**: Can we get ColBERT quality with bi-encoder index size?
2. **Dynamic Pruning**: Adaptively decide which tokens to keep
3. **Learned Aggregation**: Learn better aggregation operators than MaxSim
4. **Multi-modal**: Extend late interaction to images and video
5. **Long Documents**: Handle documents with thousands of tokens

Next Steps
----------

* See :doc:`hard_mining` for how ColBERT trains with hard negatives
* See :doc:`dense_baselines` for a comparison with standard bi-encoders
* See :doc:`hybrid` for combining ColBERT with BM25
* See :doc:`../stage2_reranking/cross_encoders` for true Stage 2 methods