Cross-Encoders for Re-ranking¶
Cross-encoders are the most accurate re-ranking models, processing query and document jointly through a single transformer to produce precise relevance scores.
Architecture Overview¶
How Cross-Encoders Work
Unlike bi-encoders, which encode query and document separately, cross-encoders concatenate the two and process them together in a single forward pass:
Bi-Encoder (Stage 1):
Query    → BERT → embedding_q ┐
                              ├→ dot_product(emb_q, emb_d) → score
Document → BERT → embedding_d ┘
Cross-Encoder (Stage 2):
[CLS] Query [SEP] Document [SEP] → BERT → [CLS] token → Linear → score
The Key Difference:
Bi-encoder: Similarity in embedding space (fast, pre-computable)
Cross-encoder: Full self-attention between query-document tokens (slow, accurate)
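The difference is easy to see in code. A minimal sketch with sentence-transformers, assuming the public all-MiniLM-L6-v2 bi-encoder and ms-marco-MiniLM-L-6-v2 cross-encoder checkpoints:
from sentence_transformers import SentenceTransformer, CrossEncoder, util
query = "What is the capital of France?"
doc = "Paris is the capital of France"
# Stage 1: bi-encoder - two independent encodings, compared in vector space
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
emb_q = bi_encoder.encode(query, convert_to_tensor=True)
emb_d = bi_encoder.encode(doc, convert_to_tensor=True)
bi_score = util.dot_score(emb_q, emb_d).item()
# Stage 2: cross-encoder - one forward pass over the concatenated pair
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
cross_score = cross_encoder.predict([(query, doc)])[0]
Because emb_d does not depend on the query, document embeddings can be pre-computed offline, which is exactly why Stage 1 is fast and Stage 2 is not.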
Why Cross-Encoders Are More Accurate¶
Token-Level Interactions
The transformer’s self-attention allows every query token to attend to every document token:
Query “capital” can attend to doc “capital”, “city”, “largest”, etc.
Can perform multi-hop reasoning across tokens
Captures semantic composition (not just bag-of-words similarity)
Example:
Query: “Who invented the telephone?”
Document: “Alexander Graham Bell patented the telephone in 1876”
Bi-encoder sees:
- High embedding similarity (both mention "telephone")
- But it can't connect "invented" → "patented" or "who" → "Alexander Graham Bell"
Cross-encoder sees:
- "who" attends to "Alexander Graham Bell" → the answer to the question
- "invented" attends to "patented" → semantic equivalence
- The full reasoning chain: this document answers the query
Implementation¶
Using Sentence-Transformers
from sentence_transformers import CrossEncoder
# Load pre-trained cross-encoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Score query-document pairs
pairs = [
    ("What is the capital of France?", "Paris is the capital of France"),
    ("What is the capital of France?", "France is in Europe"),
    ("What is the capital of France?", "Best restaurants in Paris")
]
scores = model.predict(pairs)
# Higher score = more relevant: the correct answer is ranked clearly first
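Re-ranking is then just a sort by these scores. Continuing from the snippet above, where candidates stands in for whatever the first-stage retriever returned:
query = "What is the capital of France?"
candidates = [
    "Paris is the capital of France",
    "France is in Europe",
    "Best restaurants in Paris"
]
pairs = [(query, doc) for doc in candidates]
scores = model.predict(pairs)
# Sort candidates by score, highest (most relevant) first
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")
Recent sentence-transformers releases also expose a model.rank(query, candidates) helper that wraps this predict-and-sort loop in one call.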
Training Your Own
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample
# Prepare training data
train_samples = [
    InputExample(texts=["query1", "relevant_doc"], label=1.0),
    InputExample(texts=["query1", "irrelevant_doc"], label=0.0),
    # ... more pairs
]
# fit() expects a DataLoader, not a raw list of InputExamples
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
# Initialize from pre-trained BERT (num_labels=1 -> a single relevance score)
model = CrossEncoder('bert-base-uncased', num_labels=1)
# Train
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100
)
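After fit() finishes, the model can be saved and reloaded like any pre-trained cross-encoder (the output path here is arbitrary):
model.save('output/my-cross-encoder')
reranker = CrossEncoder('output/my-cross-encoder')
print(reranker.predict([("query1", "relevant_doc")]))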
Variants and Improvements¶
MonoT5¶
Instead of BERT, uses T5 (text-to-text transformer):
# Input to T5
input_text = f"Query: {query} Document: {document} Relevant:"
# T5 generates
output = model.generate(input_text) # "true" or "false"
# Score = probability of generating "true"
Advantage: T5's generative formulation may capture relevance better than a separate classification head.
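A hedged sketch of that scoring loop with Hugging Face transformers; the castorini/monot5-base-msmarco checkpoint and the "true"/"false" target tokens follow the original MonoT5 setup, but treat the exact identifiers as assumptions:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('castorini/monot5-base-msmarco')
def monot5_score(query, document):
    input_text = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
    # Only the first decoded token matters, so skip full generation
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    true_id = tokenizer.encode("true", add_special_tokens=False)[0]
    false_id = tokenizer.encode("false", add_special_tokens=False)[0]
    # Score = P("true") after a softmax restricted to the true/false logits
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()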
RankT5¶
T5 that directly generates relevance scores:
input_text = f"Query: {query} Document: {document} Score:"
output = model.generate(input_text) # "0", "1", "2", ... "9"
# 10-way classification via generation
duoT5¶
Pairwise ranking with T5:
input_text = f"Query: {query} Document1: {doc1} Document2: {doc2} More relevant:"
output = model.generate(input_text) # "Document1" or "Document2"
Advantage: More stable than absolute scores (easier for model to judge relative relevance).
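Because duoT5 only compares two documents at a time, its outputs still have to be aggregated into a single ranking. Below is a sketch of a simple sum aggregation (one of the schemes used in the duoT5 work); pairwise_prob is a hypothetical wrapper around a duoT5 forward pass returning P(doc1 more relevant than doc2):
def rerank_pairwise(query, docs, pairwise_prob):
    # Each document's score is the sum of its win probabilities over all others
    totals = {doc: 0.0 for doc in docs}
    for i, d_i in enumerate(docs):
        for j, d_j in enumerate(docs):
            if i != j:
                totals[d_i] += pairwise_prob(query, d_i, d_j)
    return sorted(docs, key=lambda d: totals[d], reverse=True)
Note the quadratic number of model calls: in practice pairwise re-ranking is applied only to a small top-k, for example the top candidates produced by a pointwise re-ranker.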
Training Cross-Encoders with Hard Negatives¶
The Same Hard Negative Problem Applies!
Cross-encoders also benefit from hard negative training:
# Bad: Random negatives
train_data = [(query, positive, random_doc) for ...]
# Better: BM25 negatives
train_data = [(query, positive, bm25_hard_neg) for ...]
# Best: Bi-encoder mined negatives
# These are docs that bi-encoder ranked high but are actually irrelevant
bi_encoder_errors = bi_encoder.search(query, k=100)
hard_negs = [doc for doc in bi_encoder_errors if not is_relevant(doc)]
train_data = [(query, positive, hard_neg) for hard_neg in hard_negs]
Why This Works:
Cross-encoder learns to correct bi-encoder’s mistakes. Training it on bi-encoder’s hardest errors makes it the perfect “teacher” for Stage 2.
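A more concrete sketch of that mining loop, assuming a sentence-transformers bi-encoder, an in-memory corpus list of strings, and a qrels dict mapping each training query to its set of known-relevant documents (all three names are placeholders):
from sentence_transformers import SentenceTransformer, InputExample, util
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')            # the Stage-1 model (assumption)
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)  # corpus: list[str]
train_samples = []
for query, relevant_docs in qrels.items():                      # qrels: {query: set of relevant docs}
    q_emb = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=100)[0]
    # Highly ranked but not labeled relevant = the bi-encoder's hardest mistakes
    hard_negs = [corpus[h['corpus_id']] for h in hits
                 if corpus[h['corpus_id']] not in relevant_docs][:10]
    for pos in relevant_docs:
        train_samples.append(InputExample(texts=[query, pos], label=1.0))
    for neg in hard_negs:
        train_samples.append(InputExample(texts=[query, neg], label=0.0))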
Performance Benchmarks¶
| Model | Accuracy | Latency (100 docs) | Size |
|---|---|---|---|
| Bi-encoder only | 0.311 | ~10ms | 400MB |
| + MiniLM-L6 cross-encoder | 0.389 | ~3s | 90MB |
| + MiniLM-L12 cross-encoder | 0.402 | ~5s | 130MB |
| + BERT-base cross-encoder | 0.416 | ~8s | 420MB |
| + BERT-large cross-encoder | 0.428 | ~15s | 1.3GB |
Trade-off: Larger models = better accuracy but slower.
Cost-Effective Choices¶
For Production (Recommended)
# MiniLM-L6: 85% of BERT-large performance, 10% of latency
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
For Research/Maximum Accuracy
# For maximum accuracy, use a larger model such as BERT-large or a T5-based re-ranker
model = CrossEncoder('cross-encoder/ms-marco-electra-base')  # ELECTRA-base: BERT-base-sized, much faster than BERT-large
For Budget Constrained
# TinyBERT cross-encoder (custom trained)
# Or use ColBERT as re-ranker (better speed-accuracy than small cross-encoder)
Deployment Considerations¶
Batching
# Don't score one-by-one
for doc in candidates:
    score = model.predict([(query, doc)])  # ❌ Slow!
# Batch all pairs together
pairs = [(query, doc) for doc in candidates]
scores = model.predict(pairs)  # ✅ Fast! (GPU batching)
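predict() already splits the pairs into mini-batches internally; the batch_size argument can be raised to better fill GPU memory (the best value depends on sequence length and hardware):
scores = model.predict(pairs, batch_size=64, show_progress_bar=False)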
GPU vs CPU
GPU: ~50-100 pairs/second
CPU: ~10-20 pairs/second
For 100 candidates: 1-2s on GPU, 5-10s on CPU
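The device is chosen when the model is loaded; by default sentence-transformers picks a CUDA device when one is visible, but it can be set explicitly:
import torch
from sentence_transformers import CrossEncoder
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512, device=device)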
Caching
For frequently-seen documents, cache scores:
cache = {}  # {(query_hash, doc_hash): score}
query_hash, doc_hash = hash(query), hash(doc)  # in-process cache: built-in hash is fine; use hashlib if the cache is persisted
if (query_hash, doc_hash) in cache:
    score = cache[(query_hash, doc_hash)]
else:
    score = model.predict([(query, doc)])[0]
    cache[(query_hash, doc_hash)] = score
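A memory-bounded alternative using only the standard library; functools.lru_cache evicts the least-recently-used entries once maxsize is reached:
from functools import lru_cache
@lru_cache(maxsize=100_000)
def cached_score(query, doc):
    # Strings are hashable, so the (query, doc) pair itself serves as the cache key
    return float(model.predict([(query, doc)])[0])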
Next Steps¶
See LLM-Based Re-rankers for using large language models as re-rankers
See Late Interaction (ColBERT) for ColBERT as alternative
See Hard Negative Mining for training data quality