Cross-Encoders for Re-ranking

Cross-encoders are the most accurate re-ranking models, processing query and document jointly through a single transformer to produce precise relevance scores.

Architecture Overview

How Cross-Encoders Work

Unlike bi-encoders, which encode the query and document separately, cross-encoders concatenate them and process the pair jointly:

Bi-Encoder (Stage 1):
Query    → BERT → embedding_q ┐
                               ├→ dot_product(emb_q, emb_d) → score
Document → BERT → embedding_d ┘

Cross-Encoder (Stage 2):
[CLS] Query [SEP] Document [SEP] → BERT → [CLS] token → Linear → score
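
Concretely, the joint input and single output score can be reproduced with the Hugging Face transformers API (a minimal sketch using a standard MS MARCO cross-encoder checkpoint):

# Minimal sketch of the cross-encoder forward pass
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# The tokenizer builds the joint input: [CLS] query [SEP] document [SEP]
inputs = tokenizer("What is the capital of France?",
                   "Paris is the capital of France",
                   return_tensors="pt", truncation=True)

with torch.no_grad():
    score = model(**inputs).logits[0, 0].item()   # single relevance score from the [CLS] head
print(score)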

The Key Difference:

  • Bi-encoder: Similarity in embedding space (fast, pre-computable)

  • Cross-encoder: Full self-attention between query-document tokens (slow, accurate)

Why Cross-Encoders Are More Accurate

Token-Level Interactions

The transformer’s self-attention allows every query token to attend to every document token:

  • Query “capital” can attend to doc “capital”, “city”, “largest”, etc.

  • Can perform multi-hop reasoning across tokens

  • Captures semantic composition (not just bag-of-words similarity)

Example:

Query: “Who invented the telephone?”

Document: “Alexander Graham Bell patented the telephone in 1876”

Bi-encoder sees:

  • High vector similarity (both mention “telephone” and related vocabulary)

  • But it can’t connect “invented” → “patented” or “who” → “Alexander Graham Bell”

Cross-encoder sees:

  • “who” attends to “Alexander Graham Bell” → the answer to the question

  • “invented” attends to “patented” → semantic equivalence

  • Full reasoning chain: this document answers the query (the attention sketch below makes this concrete)
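
These token-level interactions can be inspected directly through the model’s attention weights. A rough sketch (averaging over all layers and heads is a simplification for illustration; exact weights depend on the checkpoint):

# Inspect which document tokens a query token attends to in a cross-encoder
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

enc = tokenizer("Who invented the telephone?",
                "Alexander Graham Bell patented the telephone in 1876",
                return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_attentions=True)

attn = torch.stack(out.attentions).mean(dim=(0, 1, 2))        # (seq_len, seq_len), averaged
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
q_pos = next(i for i, t in enumerate(tokens) if t.startswith("invent"))
top = sorted(zip(tokens, attn[q_pos].tolist()), key=lambda x: -x[1])[:5]
print(top)  # tokens "invented" attends to most (special tokens often rank high as well)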

Implementation

Using Sentence-Transformers

from sentence_transformers import CrossEncoder

# Load pre-trained cross-encoder
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score query-document pairs
pairs = [
    ("What is the capital of France?", "Paris is the capital of France"),
    ("What is the capital of France?", "France is in Europe"),
    ("What is the capital of France?", "Best restaurants in Paris")
]

scores = model.predict(pairs)
# e.g. scores ≈ [0.98, 0.12, 0.35] (illustrative) - the correct answer clearly ranks first
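
To turn these scores into a re-ranked list, sort the candidates by score (a small usage sketch reusing the pairs and scores from above):

# Re-rank candidates by cross-encoder score, highest first
import numpy as np

docs = [doc for _, doc in pairs]
order = np.argsort(scores)[::-1]
reranked = [(docs[i], float(scores[i])) for i in order]
# [("Paris is the capital of France", ...), ...]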

Training Your Own

from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Prepare training data: (query, document) pairs with relevance labels
train_samples = [
    InputExample(texts=["query1", "relevant_doc"], label=1.0),
    InputExample(texts=["query1", "irrelevant_doc"], label=0.0),
    # ... more pairs
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)

# Initialize from pre-trained BERT with a single-score head
model = CrossEncoder('bert-base-uncased', num_labels=1)

# Train
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100
)
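
After training, the fine-tuned re-ranker can be saved and reloaded like any other CrossEncoder (the output path below is illustrative):

# Save the fine-tuned model and reload it for inference
model.save("output/my-cross-encoder")
reranker = CrossEncoder("output/my-cross-encoder")
print(reranker.predict([("query1", "relevant_doc")]))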

Variants and Improvements

MonoT5

Instead of BERT, uses T5 (text-to-text transformer):

# Input to T5
input_text = f"Query: {query} Document: {document} Relevant:"

# T5 generates
output = model.generate(input_text)  # "true" or "false"

# Score = probability of generating "true"
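
A runnable sketch of this scoring with the transformers API; the castorini/monot5-base-msmarco checkpoint and the "true"/"false" token handling follow the published MonoT5 setup and are assumptions here, not something specified above:

# MonoT5-style scoring: one decoder step, then compare the logits of "true" vs "false"
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")   # assumed checkpoint
t5 = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")

query = "who invented the telephone"
document = "Alexander Graham Bell patented the telephone in 1876"
enc = tok(f"Query: {query} Document: {document} Relevant:", return_tensors="pt", truncation=True)

decoder_input = torch.tensor([[t5.config.decoder_start_token_id]])   # single decoding step
with torch.no_grad():
    logits = t5(**enc, decoder_input_ids=decoder_input).logits[0, -1]

true_id, false_id = tok.encode("true")[0], tok.encode("false")[0]    # first sub-token of each word
score = torch.softmax(logits[torch.tensor([true_id, false_id])], dim=0)[0].item()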

Advantage: T5’s generative formulation may capture relevance better than a classification head.

RankT5

T5 that directly generates relevance scores:

input_text = f"Query: {query} Document: {document} Score:"
output = model.generate(input_text)  # "0", "1", "2", ... "9"

# 10-way classification via generation

duoT5

Pairwise ranking with T5:

input_text = f"Query: {query} Document1: {doc1} Document2: {doc2} More relevant:"
output = model.generate(input_text)  # "Document1" or "Document2"

Advantage: Pairwise judgments are more stable than absolute scores (it is easier for the model to judge relative relevance).
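
Because duoT5 judges only two documents at a time, the pairwise preferences still need to be aggregated into a ranking; a common, simple heuristic is counting wins per document. In this sketch, pairwise_prefers() is a hypothetical wrapper around the duoT5 call above:

# Aggregate pairwise preferences into a ranking by counting wins per document
from collections import defaultdict
from itertools import combinations

def rank_by_wins(query, docs, pairwise_prefers):
    wins = defaultdict(int)
    for a, b in combinations(docs, 2):
        if pairwise_prefers(query, a, b):   # True if doc a is judged more relevant than doc b
            wins[a] += 1
        else:
            wins[b] += 1
    return sorted(docs, key=lambda d: wins[d], reverse=True)

Because the number of pairs grows quadratically, duoT5 is typically applied only to a small top-k, e.g. the top candidates from a monoT5 or cross-encoder pass.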

Training Cross-Encoders with Hard Negatives

The Same Hard Negative Problem Applies!

Cross-encoders also benefit from hard negative training:

# Bad: Random negatives
train_data = [(query, positive, random_doc) for ...]

# Better: BM25 negatives
train_data = [(query, positive, bm25_hard_neg) for ...]

# Best: Bi-encoder mined negatives
# These are docs the bi-encoder ranked highly but that are actually irrelevant
candidates = bi_encoder.search(query, k=100)   # pseudocode: top-100 results from the Stage-1 retriever
hard_negs = [doc for doc in candidates if not is_relevant(doc)]
train_data = [(query, positive, hard_neg) for hard_neg in hard_negs]

Why This Works:

The cross-encoder learns to correct the bi-encoder’s mistakes. Training it on the bi-encoder’s hardest errors makes it the perfect “teacher” for Stage 2.
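
To feed these mined triples into the CrossEncoder training loop shown earlier, expand each one into two labeled pairs (a sketch; train_data is the triple list built above):

# Expand (query, positive, hard_negative) triples into labeled pairs for CrossEncoder training
from sentence_transformers import InputExample

train_samples = []
for query, positive, hard_neg in train_data:
    train_samples.append(InputExample(texts=[query, positive], label=1.0))
    train_samples.append(InputExample(texts=[query, hard_neg], label=0.0))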

Performance Benchmarks

Cross-Encoder Performance (MS MARCO)

Model                      | MRR@10 | Latency (100 docs) | Size
---------------------------|--------|--------------------|-------
Bi-encoder only            | 0.311  | ~10ms              | 400MB
MiniLM-L6 Cross-encoder    | 0.389  | ~3s                | 90MB
MiniLM-L12 Cross-encoder   | 0.402  | ~5s                | 130MB
BERT-base Cross-encoder    | 0.416  | ~8s                | 420MB
BERT-large Cross-encoder   | 0.428  | ~15s               | 1.3GB

Trade-off: Larger models = better accuracy but slower.

Cost-Effective Choices

For Production (Recommended)

# MiniLM-L6: ~90% of BERT-large's MRR@10 at roughly one-fifth of the latency (see the table above)
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

For Research/Maximum Accuracy

# For top accuracy, use BERT-large or a T5-based re-ranker; ELECTRA-base is a strong middle ground
model = CrossEncoder('cross-encoder/ms-marco-electra-base')  # faster than BERT-large

For Budget-Constrained Deployments

# TinyBERT cross-encoder (custom trained)
# Or use ColBERT as re-ranker (better speed-accuracy than small cross-encoder)

Deployment Considerations

Batching

# Don't score one-by-one
for doc in candidates:
    score = model.predict([(query, doc)])  # ❌ Slow!

# Batch all pairs together
pairs = [(query, doc) for doc in candidates]
scores = model.predict(pairs)  # ✅ Fast! (GPU batching)
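
predict() also takes a batch_size argument (and a progress-bar flag), so the scoring batch size can be tuned independently of the number of candidates:

# Tune the scoring batch size independently of the candidate count
scores = model.predict(pairs, batch_size=64, show_progress_bar=False)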

GPU vs CPU

  • GPU: ~50-100 pairs/second

  • CPU: ~10-20 pairs/second

  • For 100 candidates: 1-2s on GPU, 5-10s on CPU
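
The target device is chosen when the model is loaded; a small sketch that falls back to CPU when no GPU is available:

# Load the re-ranker onto the GPU when available, otherwise the CPU
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device=device)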

Caching

For frequently-seen documents, cache scores:

cache = {}  # {(query_hash, doc_hash): score}

key = (hash(query), hash(doc))          # any stable hash of the raw strings works
if key in cache:
    score = cache[key]
else:
    score = model.predict([(query, doc)])[0]
    cache[key] = score
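
Alternatively, a bounded in-process cache with built-in eviction (a sketch; for multi-process deployments an external cache such as Redis would replace this):

# Bounded per-process cache; query/doc strings are hashable and key the cache directly
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_score(query: str, doc: str) -> float:
    return float(model.predict([(query, doc)])[0])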

Next Steps