- The paper's central finding is that draws in arena-style LLM evaluations stem from query difficulty and subjectivity rather than model parity.
- Omitting rating updates on draws improves outcome prediction accuracy by a relative 1–3%, with the largest gain observed under the Elo system.
- The study encourages the development of query-aware rating models that incorporate query difficulty and objectivity to refine LLM performance comparisons.
Rethinking Draw Semantics in Arena-Style LLM Evaluation
Introduction
This paper critically examines the prevailing assumption in arena-style LLM evaluation that draws between models indicate skill parity. Arena-style evaluation, as popularized by platforms like Chatbot Arena, involves users issuing queries to two LLMs and selecting the superior response or declaring a draw. The standard approach models these interactions as two-player zero-sum games, applying rating systems such as Elo, Glicko-2, Bradley–Terry, and TrueSkill, where draws typically result in rating equalization. The authors challenge this paradigm, hypothesizing that draws reflect query properties (specifically, difficulty and subjectivity) rather than model equivalence.
Methodology
Rating System Analysis
The study evaluates four established rating systems:
- Elo: Logistic model of win probability; updates ratings after each outcome using a fixed K-factor (learning rate).
- Glicko-2: Extends Elo by modeling rating deviation and volatility, scaling updates with uncertainty.
- Bradley–Terry: Online logistic model, adopted by Chatbot Arena for stability.
- TrueSkill: Bayesian system modeling ratings as Gaussian priors, updating via message passing.
For each system, the authors introduce a draw margin ε to enable draw prediction, tuning this hyperparameter for optimal outcome prediction.
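As a concrete illustration, the draw-margin idea can be sketched in an Elo-style setting: predict a draw whenever the modeled win probability is close enough to 0.5. This is a minimal sketch, not the paper's implementation; the function names, the K-factor of 32, and the exact form of the draw rule are assumptions.

```python
def elo_expected(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo logistic win probability of player A over player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def predict_outcome(r_a: float, r_b: float, eps: float, scale: float = 400.0) -> str:
    """Predict a draw when the win probability lies within eps of 0.5."""
    p = elo_expected(r_a, r_b, scale)
    if abs(p - 0.5) < eps:
        return "draw"
    return "a" if p > 0.5 else "b"

def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0, scale: float = 400.0) -> tuple[float, float]:
    """Apply one Elo update; score_a is 1.0 (A wins), 0.5 (draw), 0.0 (A loses)."""
    p = elo_expected(r_a, r_b, scale)
    return r_a + k * (score_a - p), r_b + k * ((1.0 - score_a) - (1.0 - p))
```

Note how a draw between unequally rated models pulls their ratings together (the higher-rated model loses points), which is exactly the "draws imply parity" semantics the paper questions.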
Experimental Setup
Three real-world datasets are used:
- LMArena: 106K battles among 55 text-only LLMs.
- SearchArena: 24K battles among 13 search-augmented LLM agents.
- VisionArena: 30K battles among 17 vision LLMs.
Draws constitute 30–40% of outcomes in each dataset. The primary evaluation metric is prequential battle outcome prediction accuracy, measuring the ability of current ratings to predict future outcomes.
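Prequential evaluation scores each battle using the ratings available before that battle, and only then applies the rating update. A generic sketch of the loop, with illustrative names and signatures (the `predict` and `update` callables would be instantiated per rating system):

```python
from collections import defaultdict

def prequential_accuracy(battles, predict, update, initial: float = 1000.0) -> float:
    """battles: time-ordered (model_a, model_b, outcome) triples,
    with outcome in {"a", "b", "draw"}. Predict first, update second."""
    ratings = defaultdict(lambda: initial)
    correct = 0
    for a, b, outcome in battles:
        if predict(ratings[a], ratings[b]) == outcome:
            correct += 1
        ratings[a], ratings[b] = update(ratings[a], ratings[b], outcome)
    return correct / len(battles)
```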
Ablation Study
The central experiment omits rating updates for draws, comparing prediction accuracy against the baseline (including draw updates) and a random omission control (removing updates at the same rate as draws, but randomly across all outcomes).
Draw Semantics Analysis
A sample of 3,000 LMArena queries is annotated for difficulty and subjectivity on a 0–5 scale using GPT-4.1. For each difficulty and subjectivity level, the risk ratio (RR) of observing a draw is computed. Additionally, the relationship between rating proximity and draw likelihood is analyzed across all battles.
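Under one common definition, the risk ratio for a bin compares the draw rate inside the bin with the draw rate outside it; a value above 1 means draws are over-represented for that annotation level. A minimal sketch, where the bin-versus-rest comparison is an assumption about the paper's exact RR definition:

```python
def draw_risk_ratio(outcomes, bin_labels, target_bin) -> float:
    """RR = P(draw | query in target_bin) / P(draw | query not in target_bin).
    outcomes: "a"/"b"/"draw" per battle; bin_labels: annotation level per battle."""
    in_bin = [o == "draw" for o, lab in zip(outcomes, bin_labels) if lab == target_bin]
    rest = [o == "draw" for o, lab in zip(outcomes, bin_labels) if lab != target_bin]
    if not in_bin or not rest or not any(rest):
        raise ValueError("need battles in both groups and draws outside the bin")
    return (sum(in_bin) / len(in_bin)) / (sum(rest) / len(rest))
```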
Results
Impact of Ignoring Draw Updates
Omitting draw updates yields a 1–3% relative increase in outcome prediction accuracy across all rating systems and datasets, with statistical significance in the majority of cases. The effect is most pronounced for Elo (+3.0%), followed by Bradley–Terry (+1.1%), Glicko-2 (+0.7%), and TrueSkill (+0.5%). The random omission control shows negligible impact, confirming that the improvement is not due to reduced data usage.
Notably, ignoring draw updates improves both win/loss and draw prediction accuracy, as demonstrated by Pareto-superior trade-off curves when varying the draw margin ε.
Draws and Query Properties
Draws are disproportionately associated with queries rated as very easy (difficulty = 0, RR = 1.37) and highly objective (subjectivity = 0, RR = 1.35). In contrast, rating proximity between models does not significantly predict draws; risk ratios remain near 1.0 except for the highest rating difference percentiles, which show a slight decrease (RR = 0.89–0.96).
These findings support the hypothesis that draws are primarily a function of query characteristics rather than model parity.
Implications
Practical
- Rating System Design: The results indicate that current rating systems overinterpret draws as evidence of model equivalence. Future systems should reconsider draw semantics, potentially omitting draw-based updates or incorporating query-level features (difficulty, subjectivity) into rating adjustments.
- Evaluation Protocols: Arena-style evaluation platforms should annotate or infer query properties to better contextualize draw outcomes, improving the reliability of model rankings.
- Model Comparison: Researchers and practitioners should be cautious in interpreting draws as evidence of comparable model performance, especially in aggregate evaluations.
Theoretical
- Skill Estimation: The findings challenge the analogy between LLM evaluation and competitive games like chess, where draws are meaningful indicators of skill parity. In LLM evaluation, draws are confounded by query properties, undermining the transitivity and reliability of rating systems.
- Evaluation Robustness: The study complements prior work on rating system axioms and biases, highlighting a new axis of evaluation robustness—draw semantics.
Future Directions
- Query-Aware Rating Systems: Develop rating systems that condition updates on query difficulty and subjectivity, possibly leveraging automated annotation via LLMs.
- Multi-dimensional Evaluation: Move beyond single-score ratings to multi-dimensional skill profiles, capturing model strengths across varying query types.
- Benchmark Design: Construct evaluation datasets with controlled distributions of query properties to disentangle model ability from query effects.
Conclusion
The paper demonstrates that draws in arena-style LLM evaluation are poor indicators of model parity, instead reflecting query difficulty and objectivity. Ignoring draw updates in rating systems consistently improves outcome prediction accuracy. These results call for a fundamental rethinking of draw semantics in LLM evaluation, advocating for the integration of query properties into rating updates and model comparison protocols.