Cross-Encoders for Re-ranking
==============================

Cross-encoders are the most accurate re-ranking models: they process the query and document jointly through a single transformer to produce precise relevance scores.

Architecture Overview
---------------------

**How Cross-Encoders Work**

Unlike bi-encoders, which encode the query and document separately, cross-encoders concatenate them and process them together:

.. code-block:: text

    Bi-Encoder (Stage 1):
    Query    → BERT → embedding_q ┐
                                  ├→ dot_product(emb_q, emb_d) → score
    Document → BERT → embedding_d ┘

    Cross-Encoder (Stage 2):
    [CLS] Query [SEP] Document [SEP] → BERT → [CLS] token → Linear → score

**The Key Difference**:

* **Bi-encoder**: Similarity in embedding space (fast, pre-computable)
* **Cross-encoder**: Full self-attention between query and document tokens (slow, accurate)

Why Cross-Encoders Are More Accurate
-------------------------------------

**Token-Level Interactions**

The transformer's self-attention lets every query token attend to every document token:

* The query token "capital" can attend to document tokens such as "capital", "city", and "largest"
* The model can perform multi-hop reasoning across tokens
* It captures semantic composition, not just bag-of-words similarity

**Example:**

Query: *"Who invented the telephone?"*

Document: *"Alexander Graham Bell patented the telephone in 1876"*

**Bi-encoder sees:**

- High similarity (both texts are about the telephone)
- But it can't connect "invented" → "patented" or "who" → "Alexander Graham Bell"

**Cross-encoder sees:**

- "who" attends to "Alexander Graham Bell" → answers the question
- "invented" attends to "patented" → semantic equivalence
- The full reasoning chain: this document answers the query

Implementation
--------------

**Using Sentence-Transformers**

.. code-block:: python

    from sentence_transformers import CrossEncoder

    # Load a pre-trained cross-encoder
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    # Score query-document pairs
    pairs = [
        ("What is the capital of France?", "Paris is the capital of France"),
        ("What is the capital of France?", "France is in Europe"),
        ("What is the capital of France?", "Best restaurants in Paris"),
    ]
    scores = model.predict(pairs)
    # e.g. scores: [0.98, 0.12, 0.35] - clearly ranks the correct answer first

**Training Your Own**

.. code-block:: python

    from torch.utils.data import DataLoader
    from sentence_transformers import CrossEncoder, InputExample

    # Prepare training data
    train_samples = [
        InputExample(texts=["query1", "relevant_doc"], label=1.0),
        InputExample(texts=["query1", "irrelevant_doc"], label=0.0),
        # ... more pairs
    ]
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

    # Initialize from pre-trained BERT
    model = CrossEncoder('bert-base-uncased', num_labels=1)

    # Train
    model.fit(
        train_dataloader=train_dataloader,
        epochs=3,
        warmup_steps=100
    )

Variants and Improvements
--------------------------

MonoT5
^^^^^^

Instead of BERT, MonoT5 uses T5, a text-to-text transformer:

.. code-block:: python

    # Input template fed to T5
    input_text = f"Query: {query} Document: {document} Relevant:"

    # T5 generates "true" or "false"
    output = model.generate(input_text)

    # Relevance score = probability of generating "true"

**Advantage**: T5's generative formulation may capture relevance better than a classification head.
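The snippet above is only a template; turning it into an actual score requires reading the probability of the "true" token at the first decoder step. Below is a minimal sketch of one way to do that with Hugging Face ``transformers``. The ``castorini/monot5-base-msmarco`` checkpoint name and the ``monot5_score`` helper are assumptions for illustration, not part of this guide's code:

.. code-block:: python

    import torch
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    # Assumed MonoT5 checkpoint; substitute whichever MonoT5-style model you use
    model_name = "castorini/monot5-base-msmarco"
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    model.eval()

    def monot5_score(query: str, document: str) -> float:
        """Relevance score = P("true") vs P("false") at the first decoder step."""
        prompt = f"Query: {query} Document: {document} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

        # Token ids for "true" and "false" (each is a single sentencepiece token)
        true_id = tokenizer.encode("true", add_special_tokens=False)[0]
        false_id = tokenizer.encode("false", add_special_tokens=False)[0]

        with torch.no_grad():
            # One decoder step: feed only the decoder start token
            decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
            logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]

        # Normalize over just the "true"/"false" logits
        probs = torch.softmax(logits[[true_id, false_id]], dim=0)
        return probs[0].item()

    monot5_score("What is the capital of France?", "Paris is the capital of France.")

Because each candidate is still scored independently, a MonoT5-style scorer can be dropped into Stage 2 exactly where a BERT-style cross-encoder would go.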
RankT5
^^^^^^

RankT5 is a T5 model that directly generates relevance scores:

.. code-block:: python

    # Input template fed to T5
    input_text = f"Query: {query} Document: {document} Score:"

    # T5 generates a digit "0", "1", "2", ... "9"
    output = model.generate(input_text)
    # i.e. 10-way relevance classification via generation

duoT5
^^^^^

duoT5 performs pairwise ranking with T5:

.. code-block:: python

    # Input template comparing two candidate documents
    input_text = f"Query: {query} Document1: {doc1} Document2: {doc2} More relevant:"

    # T5 generates "Document1" or "Document2"
    output = model.generate(input_text)

**Advantage**: Pairwise judgments are more stable than absolute scores; it is easier for the model to judge relative relevance.

Training Cross-Encoders with Hard Negatives
--------------------------------------------

**The Same Hard Negative Problem Applies!**

Cross-encoders also benefit from hard-negative training:

.. code-block:: python

    # Bad: random negatives
    train_data = [(query, positive, random_doc), ...]

    # Better: BM25-mined negatives
    train_data = [(query, positive, bm25_hard_neg), ...]

    # Best: bi-encoder-mined negatives
    # (documents the bi-encoder ranked highly but that are actually irrelevant)
    bi_encoder_hits = bi_encoder.search(query, k=100)
    hard_negs = [doc for doc in bi_encoder_hits if not is_relevant(doc)]
    train_data = [(query, positive, hard_neg) for hard_neg in hard_negs]

**Why This Works**: The cross-encoder learns to correct the bi-encoder's mistakes. Training it on the bi-encoder's hardest errors makes it the perfect "teacher" for Stage 2.

Performance Benchmarks
-----------------------

.. list-table:: Cross-Encoder Performance (MS MARCO)
   :header-rows: 1
   :widths: 30 20 20 30

   * - Model
     - MRR@10
     - Latency (100 docs)
     - Size
   * - Bi-encoder only
     - 0.311
     - ~10ms
     - 400MB
   * - + MiniLM-L6 Cross-encoder
     - 0.389
     - ~3s
     - 90MB
   * - + MiniLM-L12 Cross-encoder
     - 0.402
     - ~5s
     - 130MB
   * - + BERT-base Cross-encoder
     - 0.416
     - ~8s
     - 420MB
   * - + BERT-large Cross-encoder
     - 0.428
     - ~15s
     - 1.3GB

**Trade-off**: Larger models are more accurate but slower.

Cost-Effective Choices
-----------------------

**For Production (Recommended)**

.. code-block:: python

    # MiniLM-L6: roughly 85% of BERT-large's accuracy at roughly 10% of the latency
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

**For Research / Maximum Accuracy**

.. code-block:: python

    # Use a large cross-encoder (e.g. BERT-large or T5-large),
    # or the ELECTRA-based checkpoint below as a faster alternative to BERT
    model = CrossEncoder('cross-encoder/ms-marco-electra-base')

**For Budget-Constrained Deployments**

.. code-block:: python

    # Use a TinyBERT cross-encoder (custom trained),
    # or use ColBERT as the re-ranker (better speed-accuracy trade-off
    # than a small cross-encoder)

Deployment Considerations
--------------------------

**Batching**

.. code-block:: python

    # Don't score one pair at a time
    for doc in candidates:
        score = model.predict([(query, doc)])  # ❌ Slow!

    # Batch all pairs together
    pairs = [(query, doc) for doc in candidates]
    scores = model.predict(pairs)  # ✅ Fast! (GPU batching)

**GPU vs CPU**

* GPU: ~50-100 pairs/second
* CPU: ~10-20 pairs/second
* For 100 candidates: 1-2s on GPU, 5-10s on CPU

**Caching**

For frequently seen documents, cache scores:

.. code-block:: python

    cache = {}  # {(query_hash, doc_hash): score}

    key = (hash(query), hash(doc))
    if key in cache:
        score = cache[key]
    else:
        score = model.predict([(query, doc)])[0]
        cache[key] = score

Next Steps
----------

* See :doc:`llm_rerankers` for using large language models as re-rankers
* See :doc:`../stage1_retrieval/late_interaction` for ColBERT as an alternative
* See :doc:`../stage1_retrieval/hard_mining` for training data quality