- The paper's central finding is that draws in arena-style LLM evaluations stem from query difficulty and subjectivity rather than model parity.
- Omitting rating updates on draws improves outcome prediction accuracy by a relative 1–3%, with the largest gain observed under the Elo system.
- The study encourages the development of query-aware rating models that incorporate query difficulty and objectivity to refine LLM performance comparisons.
Rethinking Draw Semantics in Arena-Style LLM Evaluation
Introduction
This paper critically examines the prevailing assumption in arena-style LLM evaluation that draws between models indicate skill parity. Arena-style evaluation, as popularized by platforms like Chatbot Arena, involves users issuing queries to two LLMs and selecting the superior response or declaring a draw. The standard approach models these interactions as two-player zero-sum games, applying rating systems such as Elo, Glicko-2, Bradley–Terry, and TrueSkill, where draws typically result in rating equalization. The authors challenge this paradigm, hypothesizing that draws reflect query properties (specifically, difficulty and subjectivity) rather than model equivalence.
Methodology
Rating System Analysis
The study evaluates four established rating systems:
- Elo: Logistic model of win probability; updates ratings after each outcome using a fixed K-factor (learning rate).
- Glicko-2: Extends Elo by modeling rating deviation and volatility, scaling updates with uncertainty.
- Bradley–Terry: Online logistic model, adopted by Chatbot Arena for stability.
- TrueSkill: Bayesian system modeling ratings as Gaussian priors, updating via message passing.
For each system, the authors introduce a draw margin ε to enable draw prediction, tuning this hyperparameter for optimal outcome prediction.
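As a concrete illustration, the draw-margin idea can be sketched in an Elo-style setting: predict a draw whenever the modeled win probability is close enough to 0.5. This is a minimal sketch, not the paper's implementation; the function names, the K-factor of 32, and the exact form of the draw rule are assumptions.

```python
def elo_expected(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo logistic win probability of player A over player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def predict_outcome(r_a: float, r_b: float, eps: float, scale: float = 400.0) -> str:
    """Predict a draw when the win probability lies within eps of 0.5."""
    p = elo_expected(r_a, r_b, scale)
    if abs(p - 0.5) < eps:
        return "draw"
    return "a" if p > 0.5 else "b"

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0, scale: float = 400.0) -> tuple[float, float]:
    """Apply one Elo update; score_a is 1.0 (A wins), 0.5 (draw), 0.0 (A loses)."""
    p = elo_expected(r_a, r_b, scale)
    return r_a + k * (score_a - p), r_b + k * ((1.0 - score_a) - (1.0 - p))
```

Note how a draw between unequally rated models pulls their ratings together (the higher-rated model loses points), which is exactly the "draws imply parity" semantics the paper questions.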
Experimental Setup
Three real-world datasets are used:
- LMArena: 106K battles among 55 text-only LLMs.
- SearchArena: 24K battles among 13 search-augmented LLM agents.
- VisionArena: 30K battles among 17 vision LLMs.
Draws constitute 30–40% of outcomes in each dataset. The primary evaluation metric is prequential battle outcome prediction accuracy, measuring the ability of current ratings to predict future outcomes.
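Prequential evaluation scores each battle using the ratings available before that battle, and only then applies the rating update. A generic sketch of the loop, with illustrative names and signatures (the `predict` and `update` callables would be instantiated per rating system):

```python
from collections import defaultdict

def prequential_accuracy(battles, predict, update, initial: float = 1000.0) -> float:
    """battles: time-ordered (model_a, model_b, outcome) triples,
    with outcome in {"a", "b", "draw"}. Predict first, update second."""
    ratings = defaultdict(lambda: initial)
    correct = 0
    for a, b, outcome in battles:
        if predict(ratings[a], ratings[b]) == outcome:
            correct += 1
        ratings[a], ratings[b] = update(ratings[a], ratings[b], outcome)
    return correct / len(battles)
```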
Ablation Study
The central experiment omits rating updates for draws, comparing prediction accuracy against the baseline (including draw updates) and a random omission control (removing updates at the same rate as draws, but randomly across all outcomes).
Draw Semantics Analysis
A sample of 3,000 LMArena queries is annotated for difficulty and subjectivity on a 0–5 scale using GPT-4.1. For each difficulty and subjectivity level, the risk ratio (RR) of observing a draw is computed. Additionally, the relationship between rating proximity and draw likelihood is analyzed across all battles.
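Under one common definition, the risk ratio for a bin compares the draw rate inside the bin with the draw rate outside it; a value above 1 means draws are over-represented for that annotation level. A minimal sketch, where the bin-versus-rest comparison is an assumption about the paper's exact RR definition:

```python
def draw_risk_ratio(outcomes, bin_labels, target_bin) -> float:
    """RR = P(draw | query in target_bin) / P(draw | query not in target_bin).
    outcomes: "a"/"b"/"draw" per battle; bin_labels: annotation level per battle."""
    in_bin = [o == "draw" for o, lab in zip(outcomes, bin_labels) if lab == target_bin]
    rest = [o == "draw" for o, lab in zip(outcomes, bin_labels) if lab != target_bin]
    if not in_bin or not rest or not any(rest):
        raise ValueError("need battles in both groups and draws outside the bin")
    return (sum(in_bin) / len(in_bin)) / (sum(rest) / len(rest))
```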
Results
Impact of Ignoring Draw Updates
Omitting draw updates yields a 1–3% relative increase in outcome prediction accuracy across all rating systems and datasets, with statistical significance in the majority of cases. The effect is most pronounced for Elo (+3.0%), followed by Bradley–Terry (+1.1%), Glicko-2 (+0.7%), and TrueSkill (+0.5%). The random omission control shows negligible impact, confirming that the improvement is not due to reduced data usage.
Notably, ignoring draw updates improves both win/loss and draw prediction accuracy, as demonstrated by Pareto-superior trade-off curves when varying the draw margin ε.
Draws and Query Properties
Draws are disproportionately associated with queries rated as very easy (difficulty = 0, RR = 1.37) and highly objective (subjectivity = 0, RR = 1.35). In contrast, rating proximity between models does not significantly predict draws; risk ratios remain near 1.0 except for the highest rating difference percentiles, which show a slight decrease (RR = 0.89–0.96).
These findings support the hypothesis that draws are primarily a function of query characteristics rather than model parity.
Implications
Practical
- Rating System Design: The results indicate that current rating systems overinterpret draws as evidence of model equivalence. Future systems should reconsider draw semantics, potentially omitting draw-based updates or incorporating query-level features (difficulty, subjectivity) into rating adjustments.
- Evaluation Protocols: Arena-style evaluation platforms should annotate or infer query properties to better contextualize draw outcomes, improving the reliability of model rankings.
- Model Comparison: Researchers and practitioners should be cautious in interpreting draws as evidence of comparable model performance, especially in aggregate evaluations.
Theoretical
- Skill Estimation: The findings challenge the analogy between LLM evaluation and competitive games like chess, where draws are meaningful indicators of skill parity. In LLM evaluation, draws are confounded by query properties, undermining the transitivity and reliability of rating systems.
- Evaluation Robustness: The study complements prior work on rating system axioms and biases, highlighting a new axis of evaluation robustness—draw semantics.
Future Directions
- Query-Aware Rating Systems: Develop rating systems that condition updates on query difficulty and subjectivity, possibly leveraging automated annotation via LLMs.
- Multi-dimensional Evaluation: Move beyond single-score ratings to multi-dimensional skill profiles, capturing model strengths across varying query types.
- Benchmark Design: Construct evaluation datasets with controlled distributions of query properties to disentangle model ability from query effects.
Conclusion
The paper demonstrates that draws in arena-style LLM evaluation are poor indicators of model parity, instead reflecting query difficulty and objectivity. Ignoring draw updates in rating systems consistently improves outcome prediction accuracy. These results call for a fundamental rethinking of draw semantics in LLM evaluation, advocating for the integration of query properties into rating updates and model comparison protocols.