LLM-Based Re-rankers
====================

Large Language Models (LLMs) can perform re-ranking through prompting, offering zero-shot capability and explainability at the cost of higher latency and compute.

The LLM Re-ranking Paradigm
---------------------------

**Traditional Cross-Encoder:**

* Requires training on labeled (query, doc, relevance) data
* Fixed to a specific task/domain
* Fast inference (~50ms per pair)
* No explanation

**LLM Re-ranker:**

* Zero-shot via prompting (no training needed)
* Generalizes across tasks
* Slower inference (~500-2000ms per pair)
* Can provide reasoning

Zero-Shot Prompting Approaches
------------------------------

Pointwise Relevance
^^^^^^^^^^^^^^^^^^^

**Method**: Ask the LLM to judge each document independently.

.. code-block:: python

   prompt = f"""
   Given the query: "{query}"
   And the document: "{document}"

   Is this document relevant to the query?
   Answer with only "Yes" or "No".
   """

   response = llm.generate(prompt)
   score = 1.0 if response.strip() == "Yes" else 0.0

**Problem**: No relative comparison; binary scores limit ranking.

Pairwise Comparison
^^^^^^^^^^^^^^^^^^^

**Method**: Ask the LLM to compare pairs of documents.

.. code-block:: python

   prompt = f"""
   Query: "{query}"

   Document A: "{doc_a}"
   Document B: "{doc_b}"

   Which document is more relevant to the query?
   Answer with "A" or "B".
   """

   response = llm.generate(prompt)
   # Build a ranking from the pairwise judgments (like bubble sort)

**Advantage**: Relative judgments are easier than absolute scores.

**Problem**: Requires O(n²) comparisons for n documents.

Listwise Ranking
^^^^^^^^^^^^^^^^

**Method**: Ask the LLM to rank the entire list at once.

.. code-block:: python

   prompt = f"""
   Query: "{query}"

   Rank the following documents by relevance:
   [1] {doc_1}
   [2] {doc_2}
   [3] {doc_3}
   ...
   [10] {doc_10}

   Provide the ranking as a list of numbers (e.g., [3, 1, 7, ...]).
   """

   response = llm.generate(prompt)
   # Parse: [3, 1, 7, ...] means doc 3 is most relevant

**Advantage**: Single LLM call; considers all documents together.

**Problem**: Performance degrades beyond ~20 documents (context length and attention issues).

Sliding Window Approach
^^^^^^^^^^^^^^^^^^^^^^^

**Method**: For large candidate sets, re-rank the list one overlapping window at a time.

.. code-block:: python

   # RankGPT-style sliding window: overlapping windows swept from the bottom
   # of the list to the top, so strong documents can move upward
   window_size, stride = 20, 10

   def sliding_window_rank(query, docs):
       ranked = docs.copy()
       start = max(len(ranked) - window_size, 0)
       while True:
           ranked[start:start + window_size] = llm.listwise_rank(
               query, ranked[start:start + window_size])
           if start == 0:
               break
           start = max(start - stride, 0)
       return ranked

   # Refine with multiple passes
   sorted_docs = candidates.copy()
   for pass_num in range(3):
       sorted_docs = sliding_window_rank(query, sorted_docs)

Cost vs Performance
-------------------

.. list-table:: LLM Re-ranker Trade-offs
   :header-rows: 1
   :widths: 20 20 20 20 20

   * - Model
     - Latency (100 docs)
     - Cost (100 docs)
     - Accuracy
     - Zero-shot?
   * - MiniLM cross-encoder
     - ~2s
     - $0.000
     - Good
     - No (needs training)
   * - GPT-3.5 (pointwise)
     - ~60s
     - $0.20
     - Better
     - Yes
   * - GPT-4 (pointwise)
     - ~120s
     - $2.00
     - Best
     - Yes
   * - GPT-3.5 (listwise)
     - ~10s
     - $0.05
     - Better
     - Yes
   * - GPT-4 (listwise)
     - ~20s
     - $0.50
     - Best
     - Yes
   * - RankLlama (self-hosted)
     - ~30s
     - $0.000
     - Good
     - Yes

**Key Insight**: LLMs are 10-100x more expensive than trained cross-encoders but offer zero-shot capability.
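The dollar figures above follow from a simple token-count estimate. Below is a minimal sketch of that arithmetic, assuming a hypothetical pointwise setup with roughly 400 prompt tokens per (query, document) pair and an illustrative price per 1K tokens (not current vendor pricing):

.. code-block:: python

   def estimate_pointwise_cost(n_docs: int,
                               tokens_per_pair: int = 400,
                               price_per_1k_tokens: float = 0.002) -> float:
       """Rough per-query cost of pointwise LLM re-ranking.

       Both defaults are illustrative assumptions, not vendor pricing.
       """
       total_tokens = n_docs * tokens_per_pair
       return total_tokens / 1000 * price_per_1k_tokens

   # 100 candidates under these assumptions come to about $0.08 per query,
   # the same order of magnitude as the pointwise rows in the table above.
   print(f"${estimate_pointwise_cost(100):.2f}")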
When to Use LLM Re-rankers
--------------------------

✅ **Use LLM Re-rankers When:**

* No training data is available (pure zero-shot)
* Explainability is needed (the LLM can explain why a document is relevant)
* The domain shifts frequently (no time to retrain)
* Budget allows ($0.10-$1.00 per query is acceptable)
* Accuracy is paramount (research, high-value queries)

❌ **Don't Use When:**

* Serving millions of queries (cost prohibitive)
* Latency under 5s is required (LLMs are too slow)
* Good training data is available (a cross-encoder is better value)
* Queries are simple (overkill)

Practical Implementation
------------------------

**Cost Optimization: Staged LLM Re-ranking**

.. code-block:: python

   # Stage 1: Bi-encoder (10M → 1000)
   candidates_1k = bi_encoder.search(query, corpus, top_k=1000)

   # Stage 2: Fast cross-encoder (1000 → 100)
   candidates_100 = cross_encoder.rerank(query, candidates_1k, top_k=100)

   # Stage 3: LLM re-ranking (100 → 10)
   # Only use the expensive LLM on the final 100
   final_10 = llm_reranker.rerank(query, candidates_100, top_k=10)

**Cost**: 100 docs × $0.0005 = $0.05 per query (vs $5.00 if the LLM scored a 10K candidate set directly)

**Caching for Repeated Queries**

.. code-block:: python

   import hashlib

   def get_cache_key(query, doc):
       return hashlib.md5(f"{query}::{doc}".encode()).hexdigest()

   cache = {}  # Persistent cache (Redis, DynamoDB, etc.)

   cache_key = get_cache_key(query, doc)
   if cache_key in cache:
       score = cache[cache_key]      # Free!
   else:
       score = llm.rank(query, doc)  # Expensive
       cache[cache_key] = score

**Batch Processing for Cost**

.. code-block:: python

   # Instead of real-time, batch queries
   queries_batch = collect_queries_for_10_minutes()

   # Send all to the LLM in one request (cheaper bulk pricing)
   all_pairs = [(q, d) for q in queries_batch for d in candidates[q]]
   all_scores = llm.batch_predict(all_pairs)  # Bulk API rates

Best Practices
--------------

**Prompt Engineering**

.. code-block:: text

   # ❌ Bad prompt (ambiguous)
   "Is this relevant?"

   # ✅ Good prompt (clear instructions)
   "You are a search quality evaluator. Given the query and document below,
   determine if the document directly answers the query or provides the
   information the user is looking for. Consider factual accuracy and
   completeness. Respond with only 'Relevant' or 'Not Relevant'.

   Query: {query}
   Document: {document}

   Judgment:"

**Few-Shot Examples**

Include examples in the prompt for better calibration:

.. code-block:: text

   Here are examples of relevant and irrelevant documents:

   Example 1:
   Query: "What is photosynthesis?"
   Document: "Photosynthesis is the process by which plants convert sunlight to energy."
   Judgment: Relevant

   Example 2:
   Query: "What is photosynthesis?"
   Document: "Plants are green and grow in soil."
   Judgment: Not Relevant

   Now judge this pair:
   Query: "{query}"
   Document: "{document}"
   Judgment:

Future Directions
-----------------

**Active Research Areas:**

1. **Distillation**: Train a small cross-encoder to mimic LLM judgments
2. **Efficient Prompting**: Compress documents before passing them to the LLM
3. **Hybrid Scoring**: Combine LLM scores with a traditional cross-encoder
4. **Explanation Generation**: Use the LLM to explain ranking decisions
5. **Multi-modal**: LLM re-rankers for images and videos

**Emerging Models:**

* RankLlama: Llama-2 fine-tuned specifically for ranking
* RankGPT: GPT-based with specialized prompting
* PRP: Pairwise Ranking Prompting
* Self-Consistency: Multiple LLM calls, then vote (see the sketch at the end of this page)

Next Steps
----------

* See :doc:`cross_encoders` for traditional trained re-rankers
* See :doc:`../stage1_retrieval/hard_mining` for improving training data
* See :doc:`../stage1_retrieval/late_interaction` for the ColBERT alternative
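As a closing illustration of the "Self-Consistency" idea listed under Emerging Models above, here is a minimal sketch, assuming a hypothetical ``llm.generate(prompt, temperature=...)`` client that returns a string; it samples several pointwise judgments and turns the vote share into a soft relevance score:

.. code-block:: python

   from collections import Counter

   def self_consistent_relevance(llm, query: str, document: str,
                                 n_samples: int = 5) -> float:
       """Score one (query, document) pair by majority vote over several
       sampled pointwise judgments (hypothetical `llm` client)."""
       prompt = (
           f'Given the query: "{query}"\n'
           f'And the document: "{document}"\n\n'
           'Is this document relevant to the query? Answer "Yes" or "No".'
       )
       votes = Counter()
       for _ in range(n_samples):
           # Non-zero temperature so repeated calls can disagree
           answer = llm.generate(prompt, temperature=0.7).strip().lower()
           votes["yes" if answer.startswith("yes") else "no"] += 1
       # The fraction of "Yes" votes doubles as a soft relevance score in [0, 1]
       return votes["yes"] / n_samples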