zELO: ELO-inspired Training Method for Rerankers and Embedding Models¶
Paper Information¶
Title: zELO: ELO-inspired Training Method for Rerankers and Embedding Models
Authors: Nicholas Pipitone, Ghita Houir Alami, Advaith Avadhanam, Anton Kaminskyi, Ashley Khoo (ZeroEntropy)
Publication: arXiv:2509.12541v1 [cs.AI], September 16, 2025
Models Released: zerank-1 (Qwen3-4B based), zerank-1-small (Qwen3-1.7B based)
Code/Benchmark: https://github.com/zeroentropy-ai/zbench
Abstract¶
zELO introduces a novel training methodology that optimizes retrieval performance by recognizing that ranking tasks are statistically equivalent to a Thurstone model. The method uses unsupervised data to train state-of-the-art open-weight reranker models (zerank-1 and zerank-1-small) that achieve the highest retrieval scores across multiple domains including finance, legal, code, and STEM—outperforming closed-source proprietary rerankers on both NDCG@10 and Recall.
Key training statistics:
112,000 queries with 100 documents per query
Over 5 million query-document pairs annotated with zELO scores
Trained end-to-end in less than 10,000 H100-hours
No human annotations required
Motivation: The Laffer Curve Problem¶
The Fundamental Constraint on Hard Negative Mining¶
The authors identify a critical limitation in existing SOTA hard negative mining techniques. Experimentally, they found that making the “hard negative miner” increasingly intelligent eventually degrades model performance rather than improving it.
The Core Problem: Manual inspection revealed that hard negatives were, on average, legitimately more relevant than human-annotated positives. This occurs because:
Humans cannot exhaustively scan an entire corpus
SOTA methods like LLM-ensemble rerankers can reason on a much larger knowledge base than even expert annotators
These methods can identify relevant documents at scale that human annotators would miss
The Laffer Curve Analogy¶
The relationship between hard negative miner intelligence and student model performance follows a Laffer curve pattern:
```
Student Model Performance
  ^
  |
  |        * Optimal Point
  |       / \
  |      /   \
  |     /     \
  |    /       \
  |   /         \
  |  /           \
  | /             \
  |/               \
  +------------------> Miner Intelligence
```
As ensemble-generated hard negatives approach and exceed the quality of human-positive annotations, the marginal benefit from distillation diminishes and eventually becomes negative.
Key Insight: The optimal point on this Laffer curve is NOT the highest performance a pointwise reranker can reach. Hard negative mining is fundamentally flawed: the training algorithm itself, not the miner, caps the achievable accuracy.
The Intractability of False Negatives¶
While one could human-annotate triplets \((q, d^-, d^+)\) to confirm \(d^-\) as a true negative relative to the positive, such a label is inherently a pairwise comparison and carries no absolute relevance signal. For pointwise models, absolute scoring via InfoNCE requires in-batch negatives, which necessitates an unsupervised negative sampling strategy and thereby reintroduces the false-negative problem.
This motivates the zELO approach: using pairwise annotations from LLM ensembles and converting them to absolute relevance scores via the Elo/Thurstone framework.
The zELO Method¶
Core Definitions¶
Pointwise Reranker: A function \(R_{point}: Q \times D \rightarrow [0, 1]\) such that for a query \(q\) and corpus \(C = \{d_1, \ldots, d_n\}\), if \(i_1, \ldots, i_n\) is the relevance ranking:

\[ R_{point}(q, d_{i_1}) > R_{point}(q, d_{i_2}) > \cdots > R_{point}(q, d_{i_n}) \]
Pairwise Reranker: A function \(R_{pair}: Q \times D \times D \rightarrow [0, 1]\) where

\[ R_{pair}(q, d_i, d_j) = P(d_i \succ d_j \mid q) \]

represents the probability that document \(d_i\) is preferred over document \(d_j\) for query \(q\).
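As a concrete reference, here are the two interfaces as a minimal Python sketch (the protocol names are illustrative, not from the paper's code):

```python
from typing import Protocol

class PointwiseReranker(Protocol):
    def __call__(self, query: str, doc: str) -> float:
        """Absolute relevance score in [0, 1]; sorting by it yields the ranking."""

class PairwiseReranker(Protocol):
    def __call__(self, query: str, doc_i: str, doc_j: str) -> float:
        """P(doc_i preferred over doc_j | query), in [0, 1]."""
```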
Multi-Stage Training Pipeline¶
The zELO method consists of four main stages; a schematic code sketch follows the list:
- Stage 1: Candidate Generation
Generate candidate documents using a first-stage retriever (e.g., hybrid search combining embeddings with BM25). Top-k = 100 documents are retrieved per query.
- Stage 2: Pairwise Preference Collection
Gather sparse pairwise preferences from an ensemble of \(|P| = 3\) frontier LLMs. For efficiency, a pairwise SLM (small language model) is distilled from the LLM ensemble.
- Stage 3: Elo Score Computation
Convert pairwise preferences to absolute relevance scores by fitting Elo ratings under the Bradley-Terry (and, ultimately, Thurstone) statistical model.
- Stage 4: Pointwise Reranker Training
Fine-tune pointwise rerankers on the computed zELO scores using MSE loss.
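How the four stages compose is sketched below. This is a schematic Python sketch only: every stage is injected as a callable, and none of the names come from the paper's codebase.

```python
def zelo_pipeline(queries, corpus, retrieve, sample_pairs, ensemble_pref, fit_elo, train):
    """Schematic composition of the four zELO stages (all stages injected)."""
    rows = []
    for q in queries:
        docs = retrieve(q, corpus, k=100)                    # Stage 1: hybrid first-stage retrieval
        prefs = {(i, j): ensemble_pref(q, docs[i], docs[j])  # Stage 2: sparse pairwise preferences
                 for i, j in sample_pairs(len(docs))}
        elos = fit_elo(len(docs), prefs)                     # Stage 3: pairwise -> absolute zELO scores
        rows.extend((q, d, e) for d, e in zip(docs, elos))
    return train(rows)                                       # Stage 4: pointwise fine-tuning (MSE)
```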
Bradley-Terry Model Connection¶
The Bradley-Terry model assumes that for documents \(d_i\) and \(d_j\) with latent abilities \(\pi_i\) and \(\pi_j\):

\[ P(d_i \succ d_j) = \frac{\pi_i}{\pi_i + \pi_j} \]

In the Elo formulation, parameterizing \(\pi_i = e^{Elo_i}\):

\[ P(d_i \succ d_j) = \frac{1}{1 + e^{Elo_j - Elo_i}} = \sigma(Elo_i - Elo_j) \]

The Elo scores are fit via gradient descent using the negative log-likelihood loss:

\[ \mathcal{L} = -\sum_{(i,j)} \left[ y_{ij} \log \sigma(Elo_i - Elo_j) + (1 - y_{ij}) \log \sigma(Elo_j - Elo_i) \right] \]

where \(y_{ij}\) is the observed preference for \(d_i\) over \(d_j\), subject to the constraint \(\sum_i Elo_i = 0\) for normalization.
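A minimal sketch of this fit in PyTorch, with the zero-sum constraint enforced by centering after optimization. The function name, step count, and learning rate are illustrative, and the targets are assumed to have already been mapped from the ensemble's averaged votes to \([0, 1]\):

```python
import torch

def fit_elo(n_docs, prefs, steps=500, lr=0.1):
    """Fit Elo scores from sparse pairwise preferences.

    prefs maps (i, j) -> y in [0, 1], the (averaged) probability that
    document i is preferred over document j for the query.
    """
    elo = torch.zeros(n_docs, requires_grad=True)
    opt = torch.optim.Adam([elo], lr=lr)
    idx_i = torch.tensor([i for i, _ in prefs])
    idx_j = torch.tensor([j for _, j in prefs])
    y = torch.tensor(list(prefs.values()), dtype=torch.float32)
    for _ in range(steps):
        opt.zero_grad()
        p = torch.sigmoid(elo[idx_i] - elo[idx_j])              # Bradley-Terry: P(d_i > d_j)
        loss = torch.nn.functional.binary_cross_entropy(p, y)  # negative log-likelihood
        loss.backward()
        opt.step()
    with torch.no_grad():
        elo -= elo.mean()                                       # enforce sum_i Elo_i = 0
    return elo.detach()
```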
Thurstone Model Extension¶
The Thurstone model provides a better fit by assuming document noise follows a normal distribution (rather than Gumbel):

\[ P(d_i \succ d_j) = \Phi(Elo_i - Elo_j) \]

where \(\Phi\) is the standard normal CDF. This is justified via the central limit theorem: each comparison aggregates multiple independent sources of noise.
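Under Thurstone, only the link function changes: the logistic sigmoid in the fit above is replaced by the normal CDF. A one-function sketch:

```python
import torch

def thurstone_prob(elo_i, elo_j):
    """P(d_i > d_j) under Thurstone: Gaussian noise gives a probit (normal-CDF) link."""
    return torch.distributions.Normal(0.0, 1.0).cdf(elo_i - elo_j)
```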
Sparse Matrix Subsampling¶
Graph Construction for Efficient Elo Estimation¶
Dense \(n \times n\) pairwise inference is prohibitively expensive. The method uses sparse sampling with \(O(n)\) pairwise comparisons while maintaining accurate Elo estimates.
Three Key Properties for Graph G:
Connectivity: The graph must be connected to establish relative Elo relationships between all documents.
Minimum Degree: No node should have very low degree, since \(\mathrm{Var}[e'_i] \propto 1/\deg(d_i)\).
Low Diameter: The maximum separation between nodes should be small, since \(\mathrm{Var}[e'_i - e'_j] \propto \mathrm{dist}_G(d_i, d_j)\).
Random k-Regular Graph via Cycle Splicing¶
The optimal solution uses \(k/2\) random n-cycles with their edge sets unioned:
Step 1: Generate \(k/2\) random n-cycles over the vertex set
Step 2: Union their edge sets to obtain a k-regular graph
Result: a k-connected graph with \(N = kn/2\) edges, low diameter \(O(\log_{k-1}(n))\), and a uniform degree distribution
For a random k-regular graph \(G\):

\[ \mathrm{diam}(G) = O(\log_{k-1}(n)) \]

with probability asymptotically 1 (Bollobás 2001).
Final Configuration: \(N = 400\) inferences with \(k = 8\) (4 random cycles), i.e. 4% of the full 100×100 comparison matrix.
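A minimal sketch of the cycle-splicing construction, assuming unordered comparisons and ignoring the rare duplicate edges that make the union only approximately k-regular:

```python
import random

def sparse_pairs(n, k=8, seed=0):
    """Union k/2 random n-cycles into an (approximately) k-regular comparison graph."""
    rng = random.Random(seed)
    edges = set()
    for _ in range(k // 2):
        perm = list(range(n))
        rng.shuffle(perm)                            # one random Hamiltonian cycle
        for a, b in zip(perm, perm[1:] + perm[:1]):  # consecutive vertices share an edge
            edges.add((min(a, b), max(a, b)))        # undirected, deduplicated
    return sorted(edges)

pairs = sparse_pairs(100, k=8)   # ~400 comparisons vs. the 10,000-entry dense matrix
```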
Training Details¶
Dataset¶
Queries: 112,000 publicly available queries across finance, law, medicine, code, and STEM
Documents: >100M publicly available web-scale documents
Initial Retrieval: Qwen3-Embedding-4B embeddings combined with lexical BM25 via reciprocal rank fusion (RRF)
Top-k: 100 documents per query
Ensemble Annotation¶
An ensemble of \(|P| = 3\) frontier LLMs generates pairwise preferences; a sketch of the aggregation follows the list:
Each LLM outputs chain-of-thought justification and preference score on [-1, 1]
Scores are clamped to {-1, 0, 1} and averaged
Document order is randomized to mitigate position bias
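A sketch of that aggregation step, where the `judge` callables stand in for frontier-LLM calls and are hypothetical:

```python
import random

def ensemble_preference(judges, query, doc_a, doc_b, rng=random):
    """Average clamped preferences from an LLM ensemble, randomizing document order."""
    votes = []
    for judge in judges:                      # |P| = 3 frontier models
        flip = rng.random() < 0.5             # randomize order to mitigate position bias
        x, y = (doc_b, doc_a) if flip else (doc_a, doc_b)
        s = judge(query, x, y)                # raw preference in [-1, 1], after CoT
        s = round(max(-1.0, min(1.0, s)))     # clamp/discretize to {-1, 0, 1}
        votes.append(-s if flip else s)       # undo the order flip
    return sum(votes) / len(votes)            # mean preference for doc_a over doc_b
```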
Key Finding: Ensembles of LLMs via zELO generate higher quality data than equivalent human annotators on average.
Binary Cross-Entropy Loss for Pairwise Training¶
The distilled pairwise SLM is trained with binary cross-entropy against the ensemble's targets:

\[ \mathcal{L}_{BCE} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right] \]

Where \(\hat{y} = R_{pair}(q, d_i, d_j)\) is the predicted preference probability and \(y \in [0, 1]\) is the target derived from the ensemble's averaged preference.
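A minimal PyTorch training step for the distilled pairwise SLM, assuming ensemble targets have been mapped from \(\{-1, 0, 1\}\) to \(\{0, 0.5, 1\}\) and using the numerically stable logits form of BCE:

```python
import torch.nn.functional as F

def pairwise_step(model, optimizer, queries, docs_i, docs_j, targets):
    """One BCE step: distill the ensemble's preferences into the pairwise SLM."""
    optimizer.zero_grad()
    logits = model(queries, docs_i, docs_j)   # unnormalized preference scores
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```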
Pointwise Reranker Training¶
Standard MSE loss for supervised fine-tuning:

\[ \mathcal{L}_{MSE} = \sum_{(q, d)} \left( R_{point}(q, d) - y_{q,d} \right)^2 \]

Where the targets \(y_{q,d}\) are the computed zELO scores.
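The pointwise stage is then a plain regression onto the zELO targets; a matching sketch:

```python
import torch.nn.functional as F

def pointwise_step(model, optimizer, queries, docs, zelo_targets):
    """One MSE step: regress the pointwise relevance score onto the zELO score."""
    optimizer.zero_grad()
    preds = model(queries, docs)       # predicted relevance scores
    loss = F.mse_loss(preds, zelo_targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```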
RLHF Refinement Stage¶
A second training pass adds data based on pointwise reranker failures:
For each query, identify \(d_{human}\) (highest human-annotated document)
Let \(r_{human}\) = rank of this document by the trained pointwise reranker
If \(r_{human} > t\) (threshold), this is a failure
Run the pairwise ensemble on \((d_{human}, d')\), where \(d'\) is the document ranked at position \(r_{human} - 1\)
Add this comparison to training data and retrain
This recaptures signal that pure LLM-ensemble distillation missed, while exploiting the high signal-to-noise ratio of the human annotations.
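A sketch of the failure-mining loop, with the threshold \(t\) and all helper names illustrative:

```python
def mine_failures(examples, rank_fn, ensemble_pref, t=10):
    """Collect extra pairwise comparisons where the trained pointwise reranker
    demotes the best human-annotated document below rank t."""
    extra = []
    for q, docs, d_human in examples:          # d_human: highest human-annotated document
        ranking = rank_fn(q, docs)             # documents ordered by pointwise score
        r_human = ranking.index(d_human) + 1   # 1-based rank of the human positive
        if r_human > t:                        # failure case
            d_prev = ranking[r_human - 2]      # document ranked one position above it
            y = ensemble_pref(q, d_human, d_prev)
            extra.append((q, d_human, d_prev, y))
    return extra                               # added to the training data before retraining
```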
Results¶
Public Benchmark Performance (NDCG@10)¶
| Task | Default (emb) | Cohere rerank-v3.5 | Salesforce/Llama-rank | zerank-1-small | zerank-1 |
|---|---|---|---|---|---|
| Code | 0.678 | 0.724 | 0.694 | 0.730 | 0.754 |
| Conversational | 0.250 | 0.571 | 0.484 | 0.556 | 0.596 |
| Finance | 0.839 | 0.824 | 0.828 | 0.861 | 0.894 |
| Legal | 0.703 | 0.804 | 0.767 | 0.817 | 0.821 |
| Medical | 0.619 | 0.750 | 0.719 | 0.773 | 0.796 |
| STEM | 0.401 | 0.510 | 0.595 | 0.680 | 0.694 |
Private Dataset Performance (NDCG@10)¶
| Task | Cohere rerank-v3.5 | Salesforce/Llama-rank | VoyageAI/rerank-2 | zerank-1-small | zerank-1 |
|---|---|---|---|---|---|
| Legal | 0.718 | 0.766 | 0.746 | 0.799 | 0.854 |
| Enterprise Search | 0.674 | 0.629 | 0.735 | 0.765 | 0.799 |
| Conversational | 0.727 | 0.653 | 0.727 | 0.747 | 0.787 |
| Healthcare | 0.706 | 0.756 | 0.749 | 0.885 | 0.898 |
Key Observation: Margins improve on private datasets, indicating high generalization and low overfitting to public benchmarks.
Latency Comparison¶
| Model | NDCG@10 | Latency (12 KB) | Latency (150 KB) |
|---|---|---|---|
| Jina m0 | 0.7279 | 547.14 ± 66.84 ms | 2,543.8 ± 2,984.9 ms |
| Cohere 3.5 | 0.7091 | 171.5 ± 106.8 ms | 459.2 ± 87.9 ms |
| zerank-1 | 0.7683 | 149.7 ± 53.1 ms | 314.4 ± 94.6 ms |
Key Contributions¶
Novel Theoretical Framework: Identification of the Laffer curve limitation in hard negative mining, explaining why increasingly sophisticated miners eventually degrade performance.
zELO Training Pipeline: A multi-stage approach that bypasses the Laffer curve by using pairwise comparisons and Elo-based scoring rather than pointwise annotations with hard negatives.
Unsupervised Data Generation: Demonstration that LLM ensembles via zELO generate higher quality training data than human annotators.
Efficient Sparse Sampling: k-regular graph construction via cycle splicing achieves accurate Elo estimation with only 0.4% of full pairwise comparisons.
State-of-the-Art Models: zerank-1 and zerank-1-small achieve best-in-class performance across multiple domains while maintaining competitive latency.
Practical Applications¶
Automated Benchmarking: zELO can benchmark internal private documents without human annotation
Domain-Specific Fine-tuning: Generate domain-specific training data automatically
Live Production Evaluation: Randomly sample live query logs, annotate via zELO, and discover/fix retrieval issues
Continuous Improvement: Use zELO annotations to fine-tune rerankers on production data
Model Availability¶
zerank-1: https://huggingface.co/zeroentropy/zerank-1 (Open-weight, ZeroEntropy license)
zerank-1-small: https://huggingface.co/zeroentropy/zerank-1-small (Apache 2.0 License)
Evaluation Pipeline: https://github.com/zeroentropy-ai/zbench