LLM-Based Re-rankers¶
Large Language Models (LLMs) can perform re-ranking through prompting, offering zero-shot capability and explainability at the cost of higher latency and compute.
The LLM Re-ranking Paradigm¶
Traditional Cross-Encoder:
Requires training on labeled (query, doc, relevance) data
Tied to the specific task/domain it was trained on
Fast inference (~50ms per pair)
Provides no explanation of its scores
LLM Re-ranker:
Zero-shot via prompting (no training needed)
Generalizes across tasks
Slower inference (~500-2000ms per pair)
Can provide reasoning
Zero-Shot Prompting Approaches¶
Pointwise Relevance¶
Method: Ask LLM to judge each document independently.
prompt = f"""
Given the query: "{query}"
And the document: "{document}"
Is this document relevant to the query?
Answer with only "Yes" or "No".
"""
response = llm.generate(prompt)
# Normalize the output: models may vary capitalization or add whitespace
score = 1.0 if response.strip().lower().startswith("yes") else 0.0
Problem: Documents are judged in isolation, and binary scores give little signal for ordering them.
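A common mitigation is to score by the model's token probabilities rather than its text output, which yields a graded score that can be sorted directly. A minimal sketch, reusing the hypothetical llm client from above; first_token_logprobs is an assumed helper, and the exact call varies by provider:

import math

def pointwise_score(query: str, document: str) -> float:
    # Same prompt as above, but we read token probabilities instead of text
    prompt = (
        f'Given the query: "{query}"\n'
        f'And the document: "{document}"\n'
        'Is this document relevant to the query? Answer with only "Yes" or "No".'
    )
    # Assumed helper: maps candidate first tokens to their log-probabilities
    logprobs = llm.first_token_logprobs(prompt)
    p_yes = math.exp(logprobs.get("Yes", float("-inf")))
    p_no = math.exp(logprobs.get("No", float("-inf")))
    return p_yes / (p_yes + p_no) if (p_yes + p_no) > 0 else 0.0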
Pairwise Comparison¶
Method: Ask LLM to compare pairs of documents.
prompt = f"""
Query: "{query}"
Document A: "{doc_a}"
Document B: "{doc_b}"
Which document is more relevant to the query?
Answer with "A" or "B".
"""
response = llm.generate(prompt)
# Build ranking via pairwise comparisons (like bubble sort)
Advantage: Relative judgments are easier than absolute scores.
Problem: Requires O(n²) comparisons for n documents.
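Rather than scoring every one of the O(n²) pairs, a practical shortcut is to use the LLM as the comparator inside a standard sort, which needs roughly O(n log n) comparisons. A minimal sketch reusing the hypothetical llm.generate client from above; LLM comparisons are not guaranteed to be transitive, so the result is a heuristic ordering:

from functools import cmp_to_key

def pairwise_rerank(query, docs):
    def compare(doc_a, doc_b):
        prompt = (
            f'Query: "{query}"\n'
            f'Document A: "{doc_a}"\n'
            f'Document B: "{doc_b}"\n'
            'Which document is more relevant to the query? Answer with "A" or "B".'
        )
        answer = llm.generate(prompt).strip().upper()
        # A negative return value sorts doc_a ahead of doc_b
        return -1 if answer.startswith("A") else 1
    return sorted(docs, key=cmp_to_key(compare))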
Listwise Ranking¶
Method: Ask LLM to rank entire list at once.
prompt = f"""
Query: "{query}"
Rank the following documents by relevance:
[1] {doc_1}
[2] {doc_2}
[3] {doc_3}
...
[10] {doc_10}
Provide the ranking as a list of numbers (e.g., [3, 1, 7, ...]).
"""
response = llm.generate(prompt)
# Parse: [3, 1, 7, ...] means doc 3 is most relevant
Advantage: Single LLM call, considers all documents together.
Problem: Quality degrades beyond ~20 documents (context-length limits and position bias in long prompts).
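The parsing step also deserves care, because the model may return duplicated, missing, or out-of-range indices. A small defensive parser for the bracketed-list format used above (a sketch; a regex pulls out the numbers):

import re

def parse_listwise_ranking(response: str, docs: list) -> list:
    # Extract 1-based indices such as "[3, 1, 7]" and map them back to documents
    order = []
    for token in re.findall(r"\d+", response):
        idx = int(token) - 1
        if 0 <= idx < len(docs) and idx not in order:
            order.append(idx)
    # Append any documents the model omitted, preserving their original order
    order += [i for i in range(len(docs)) if i not in order]
    return [docs[i] for i in order]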
Sliding Window Approach¶
Method: For large candidate sets, re-rank one window of documents at a time and slide the window across the list over multiple passes.
# RankGPT-style sliding window: rank overlapping windows from the end of the
# list toward the front, so relevant documents can bubble up across windows
def sliding_window_rank(query, docs, window_size=20, step=10):
    ranked = list(docs)
    start = max(len(ranked) - window_size, 0)
    while True:
        window = ranked[start:start + window_size]
        ranked[start:start + window_size] = llm.listwise_rank(query, window)
        if start == 0:
            break
        start = max(start - step, 0)
    return ranked

sorted_docs = candidates.copy()
# Refine with multiple passes
for pass_num in range(3):
    sorted_docs = sliding_window_rank(query, sorted_docs)
Cost vs Performance¶
| Model | Latency (100 docs) | Cost (100 docs) | Accuracy | Zero-shot? |
|---|---|---|---|---|
| MiniLM Cross-encoder | ~2s | $0.000 | Good | No (needs training) |
| GPT-3.5 (pointwise) | ~60s | $0.20 | Better | Yes |
| GPT-4 (pointwise) | ~120s | $2.00 | Best | Yes |
| GPT-3.5 (listwise) | ~10s | $0.05 | Better | Yes |
| GPT-4 (listwise) | ~20s | $0.50 | Best | Yes |
| RankLlama (self-hosted) | ~30s | $0.000 | Good | Yes |
Key Insight: LLMs are 10-100x more expensive than trained cross-encoders but offer zero-shot capability.
When to Use LLM Re-rankers¶
✅ Use LLM Re-rankers When:
No training data available (pure zero-shot)
Need explainability (LLM can explain why doc is relevant)
Domain shifts frequently (no time to retrain)
Budget allows ($0.10-$1.00 per query acceptable)
Accuracy is paramount (research, high-value queries)
❌ Don’t Use When:
Serving millions of queries (cost prohibitive)
Latency < 5s required (LLMs too slow)
Have good training data (cross-encoder is better value)
Queries are simple (overkill)
Practical Implementation¶
Cost Optimization: Staged LLM Re-ranking
# Stage 1: Bi-encoder (10M → 1000)
candidates_1k = bi_encoder.search(query, corpus, top_k=1000)
# Stage 2: Fast cross-encoder (1000 → 100)
candidates_100 = cross_encoder.rerank(query, candidates_1k, top_k=100)
# Stage 3: LLM re-ranking (100 → 10)
# Only use expensive LLM on final 100
final_10 = llm_reranker.rerank(query, candidates_100, top_k=10)
Cost: 100 docs × $0.0005 = $0.05 per query (vs. $5.00 if the LLM scored a 10,000-document candidate set directly)
Caching for Repeated Queries
import hashlib

def get_cache_key(query, doc):
    # Stable key for a (query, document) pair
    return hashlib.md5(f"{query}::{doc}".encode()).hexdigest()

cache = {}  # swap for a persistent store (Redis, DynamoDB, etc.) in production

cache_key = get_cache_key(query, doc)
if cache_key in cache:
    score = cache[cache_key]          # cache hit: free
else:
    score = llm.rank(query, doc)      # cache miss: expensive LLM call
    cache[cache_key] = score
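If cached scores need to survive process restarts, a persistent store works the same way. A minimal sketch with redis-py, assuming a local Redis instance and reusing get_cache_key from above:

import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis instance

cache_key = get_cache_key(query, doc)
cached = r.get(cache_key)
if cached is not None:
    score = float(cached.decode())             # cache hit
else:
    score = llm.rank(query, doc)               # cache miss: expensive LLM call
    r.set(cache_key, score, ex=7 * 24 * 3600)  # expire after one week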
Batch Processing for Cost
# Instead of real-time, batch queries
queries_batch = collect_queries_for_10_minutes()
# Send all to LLM in one request (cheaper bulk pricing)
all_pairs = [(q, d) for q in queries_batch for d in candidates[q]]
all_scores = llm.batch_predict(all_pairs) # Bulk API rates
Best Practices¶
Prompt Engineering
# ❌ Bad prompt (ambiguous)
bad_prompt = "Is this relevant?"

# ✅ Good prompt (clear instructions)
good_prompt = f"""You are a search quality evaluator. Given the query and document below,
determine if the document directly answers the query or provides the
information the user is looking for. Consider factual accuracy and
completeness. Respond with only 'Relevant' or 'Not Relevant'.

Query: {query}
Document: {document}

Judgment:"""
Few-Shot Examples
Include examples in the prompt for better calibration:
Here are examples of relevant and irrelevant documents:
Example 1:
Query: "What is photosynthesis?"
Document: "Photosynthesis is the process by which plants convert sunlight to energy."
Judgment: Relevant
Example 2:
Query: "What is photosynthesis?"
Document: "Plants are green and grow in soil."
Judgment: Not Relevant
Now judge this pair:
Query: "{query}"
Document: "{document}"
Judgment:
Future Directions¶
Active Research Areas:
Distillation: Train small cross-encoder to mimic LLM judgments
Efficient Prompting: Compress documents before passing to LLM
Hybrid Scoring: Combine LLM with traditional cross-encoder
Explanation Generation: Use LLM to explain ranking decisions
Multi-modal: LLM re-rankers for images, videos
Emerging Models:
RankLlama: Llama 2 fine-tuned specifically for ranking
RankGPT: GPT-based with specialized prompting
PRP: Pairwise Ranking Prompting
Self-Consistency: Multiple LLM calls then vote
Next Steps¶
See Cross-Encoders for Re-ranking for traditional trained re-rankers
See Hard Negative Mining for improving training data
See Late Interaction (ColBERT) for a late-interaction alternative to cross-encoder re-ranking