LLM-Based Re-rankers

Large Language Models (LLMs) can perform re-ranking through prompting, offering zero-shot capability and explainability at the cost of higher latency and compute.

The LLM Re-ranking Paradigm

Traditional Cross-Encoder:

  • Requires training on labeled (query, doc, relevance) data

  • Fixed to specific task/domain

  • Fast inference (~50ms per pair)

  • No explanation

LLM Re-ranker:

  • Zero-shot via prompting (no training needed)

  • Generalizes across tasks

  • Slower inference (~500-2000ms per pair)

  • Can provide reasoning

Zero-Shot Prompting Approaches

Pointwise Relevance

Method: Ask LLM to judge each document independently.

prompt = f"""
Given the query: "{query}"
And the document: "{document}"

Is this document relevant to the query?
Answer with only "Yes" or "No".
"""

response = llm.generate(prompt)
# Normalize the output: models often reply "Yes." or "yes" rather than exactly "Yes"
score = 1.0 if response.strip().lower().startswith("yes") else 0.0

Problem: Documents are judged independently with coarse binary labels, so many candidates tie and cannot be meaningfully ordered relative to each other.
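
To turn these independent judgments into a ranking, score every candidate and sort. A minimal sketch, reusing the placeholder llm.generate client from the snippet above:

def pointwise_rerank(query, candidates):
    scored = []
    for doc in candidates:
        prompt = f"""
Given the query: "{query}"
And the document: "{doc}"

Is this document relevant to the query?
Answer with only "Yes" or "No".
"""
        response = llm.generate(prompt)
        score = 1.0 if response.strip().lower().startswith("yes") else 0.0
        scored.append((score, doc))
    # Binary scores produce many ties; Python's stable sort keeps the original
    # retrieval order within each tie group, which is why ranking quality is limited
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]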

Pairwise Comparison

Method: Ask LLM to compare pairs of documents.

prompt = f"""
Query: "{query}"

Document A: "{doc_a}"
Document B: "{doc_b}"

Which document is more relevant to the query?
Answer with "A" or "B".
"""

response = llm.generate(prompt)
# Build ranking via pairwise comparisons (like bubble sort)

Advantage: Relative judgments are easier than absolute scores.

Problem: Requires O(n²) comparisons for n documents.
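
One way to build the full ranking is to sort the candidates using the LLM comparison as the sort key. A minimal sketch, again using the placeholder llm.generate client; functools.cmp_to_key turns the A/B judgment into a comparator, so sorting needs O(n log n) comparisons rather than all O(n²) pairs, but each comparison is still an LLM call:

from functools import cmp_to_key

def llm_compare(query, doc_a, doc_b):
    prompt = f"""
Query: "{query}"

Document A: "{doc_a}"
Document B: "{doc_b}"

Which document is more relevant to the query?
Answer with "A" or "B".
"""
    response = llm.generate(prompt).strip().upper()
    # -1 sorts doc_a ahead of doc_b
    return -1 if response.startswith("A") else 1

def pairwise_rerank(query, candidates):
    return sorted(candidates, key=cmp_to_key(lambda a, b: llm_compare(query, a, b)))

Because LLM comparisons are not guaranteed to be transitive or symmetric, comparator-based sorting is a heuristic; comparing every pair and counting wins is more robust but brings back the O(n²) cost.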

Listwise Ranking

Method: Ask LLM to rank entire list at once.

prompt = f"""
Query: "{query}"

Rank the following documents by relevance:
[1] {doc_1}
[2] {doc_2}
[3] {doc_3}
...
[10] {doc_10}

Provide the ranking as a list of numbers (e.g., [3, 1, 7, ...]).
"""

response = llm.generate(prompt)
# Parse: [3, 1, 7, ...] means doc 3 is most relevant

Advantage: Single LLM call, considers all documents together.

Problem: Performance degrades beyond roughly 20 documents (context-length limits, and the model loses track of items in long lists).
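
Parsing the returned ranking is the fragile part of listwise re-ranking, since the model may emit duplicates or skip indices. A minimal sketch of a defensive parser, assuming docs holds the candidates in the same order they were numbered in the prompt:

import re

def parse_listwise_ranking(response, num_docs):
    # Extract integers in order, keeping only the first occurrence of each valid index
    order, seen = [], set()
    for token in re.findall(r"\d+", response):
        idx = int(token)
        if 1 <= idx <= num_docs and idx not in seen:
            order.append(idx)
            seen.add(idx)
    # Append any indices the model omitted so no document is silently dropped
    order += [i for i in range(1, num_docs + 1) if i not in seen]
    return order

ranking = parse_listwise_ranking(response, num_docs=len(docs))
reranked_docs = [docs[i - 1] for i in ranking]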

Sliding Window Approach

Method: For large candidate sets, re-rank overlapping windows of documents and slide the window through the list.

# RankGPT-style sliding window (simplified)
def sliding_window_rank(query, docs, window_size=20, step=10):
    ranked = list(docs)
    # Slide the window from the bottom of the list toward the top;
    # overlapping windows let relevant documents move up across window boundaries
    end = len(ranked)
    while end > 0:
        start = max(end - window_size, 0)
        ranked[start:end] = llm.listwise_rank(query, ranked[start:end])
        end -= step
    return ranked

# Refine with multiple passes
sorted_docs = list(candidates)
for pass_num in range(3):
    sorted_docs = sliding_window_rank(query, sorted_docs, window_size=20)

Cost vs Performance

LLM Re-ranker Trade-offs

Model                     Latency (100 docs)   Cost (100 docs)   Accuracy   Zero-shot?
MiniLM Cross-encoder      ~2s                  $0.00             Good       No (needs training)
GPT-3.5 (pointwise)       ~60s                 $0.20             Better     Yes
GPT-4 (pointwise)         ~120s                $2.00             Best       Yes
GPT-3.5 (listwise)        ~10s                 $0.05             Better     Yes
GPT-4 (listwise)          ~20s                 $0.50             Best       Yes
RankLlama (self-hosted)   ~30s                 $0.00             Good       Yes

Key Insight: LLMs are 10-100x more expensive than trained cross-encoders but offer zero-shot capability.

When to Use LLM Re-rankers

Use LLM Re-rankers When:

  • No training data available (pure zero-shot)

  • Need explainability (LLM can explain why doc is relevant)

  • Domain shifts frequently (no time to retrain)

  • Budget allows ($0.10-$1.00 per query acceptable)

  • Accuracy is paramount (research, high-value queries)

Don’t Use When:

  • Serving millions of queries (cost prohibitive)

  • Latency < 5s required (LLMs too slow)

  • Have good training data (cross-encoder is better value)

  • Queries are simple (overkill)

Practical Implementation

Cost Optimization: Staged LLM Re-ranking

# Stage 1: Bi-encoder (10M → 1000)
candidates_1k = bi_encoder.search(query, corpus, top_k=1000)

# Stage 2: Fast cross-encoder (1000 → 100)
candidates_100 = cross_encoder.rerank(query, candidates_1k, top_k=100)

# Stage 3: LLM re-ranking (100 → 10)
# Only use expensive LLM on final 100
final_10 = llm_reranker.rerank(query, candidates_100, top_k=10)

Cost: 100 docs × $0.0005 = $0.05 per query (vs $0.50 if the LLM scored all 1,000 bi-encoder candidates)

Caching for Repeated Queries

import hashlib

def get_cache_key(query, doc):
    return hashlib.md5(f"{query}::{doc}".encode()).hexdigest()

cache = {}  # In production, use a persistent store (Redis, DynamoDB, etc.)

cache_key = get_cache_key(query, doc)
if cache_key in cache:
    score = cache[cache_key]  # Free!
else:
    score = llm.rank(query, doc)  # Expensive LLM call
    cache[cache_key] = score

Batch Processing to Reduce Cost

# Instead of real-time, batch queries
queries_batch = collect_queries_for_10_minutes()

# Send all to LLM in one request (cheaper bulk pricing)
all_pairs = [(q, d) for q in queries_batch for d in candidates[q]]
all_scores = llm.batch_predict(all_pairs)  # Bulk API rates

Best Practices

Prompt Engineering

# ❌ Bad prompt (ambiguous)
"Is this relevant?"

# ✅ Good prompt (clear instructions)
"You are a search quality evaluator. Given the query and document below,
determine if the document directly answers the query or provides the
information the user is looking for. Consider factual accuracy and
completeness. Respond with only 'Relevant' or 'Not Relevant'.

Query: {query}
Document: {document}

Judgment:"
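
In code, it helps to keep the improved prompt as a template and map the model's answer back to a score. A minimal sketch using the same placeholder llm.generate client:

JUDGE_TEMPLATE = """You are a search quality evaluator. Given the query and document below,
determine if the document directly answers the query or provides the
information the user is looking for. Consider factual accuracy and
completeness. Respond with only 'Relevant' or 'Not Relevant'.

Query: {query}
Document: {document}

Judgment:"""

def judge_relevance(query, document):
    prompt = JUDGE_TEMPLATE.format(query=query, document=document)
    response = llm.generate(prompt).strip().lower()
    # "Not Relevant" starts with "not", so a simple prefix check is unambiguous
    return 1.0 if response.startswith("relevant") else 0.0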

Few-Shot Examples

Include examples in prompt for better calibration:

Here are examples of relevant and irrelevant documents:

Example 1:
Query: "What is photosynthesis?"
Document: "Photosynthesis is the process by which plants convert sunlight to energy."
Judgment: Relevant

Example 2:
Query: "What is photosynthesis?"
Document: "Plants are green and grow in soil."
Judgment: Not Relevant

Now judge this pair:
Query: "{query}"
Document: "{document}"
Judgment:
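
Hard-coding examples into one long string gets awkward as you add more. A minimal sketch that assembles the few-shot prompt from a list of (query, document, judgment) tuples, so domain-specific examples can be swapped in easily:

FEW_SHOT_EXAMPLES = [
    ("What is photosynthesis?",
     "Photosynthesis is the process by which plants convert sunlight to energy.",
     "Relevant"),
    ("What is photosynthesis?",
     "Plants are green and grow in soil.",
     "Not Relevant"),
]

def build_few_shot_prompt(query, document, examples=FEW_SHOT_EXAMPLES):
    lines = ["Here are examples of relevant and irrelevant documents:", ""]
    for i, (ex_query, ex_doc, judgment) in enumerate(examples, start=1):
        lines += [
            f"Example {i}:",
            f'Query: "{ex_query}"',
            f'Document: "{ex_doc}"',
            f"Judgment: {judgment}",
            "",
        ]
    lines += [
        "Now judge this pair:",
        f'Query: "{query}"',
        f'Document: "{document}"',
        "Judgment:",
    ]
    return "\n".join(lines)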

Future Directions

Active Research Areas:

  1. Distillation: Train small cross-encoder to mimic LLM judgments

  2. Efficient Prompting: Compress documents before passing to LLM

  3. Hybrid Scoring: Combine LLM with traditional cross-encoder

  4. Explanation Generation: Use LLM to explain ranking decisions

  5. Multi-modal: LLM re-rankers for images, videos

Emerging Models:

  • RankLlama: Llama-2 fine-tuned specifically for ranking

  • RankGPT: GPT-based with specialized prompting

  • PRP: Pairwise Ranking Prompting

  • Self-Consistency: Multiple LLM calls then vote

Next Steps