Glicko-2 Rating System

Updated 1 February 2026

Glicko-2 is an advanced probabilistic model that estimates performance using dynamic rating deviation and a volatility parameter.
It updates player ratings after head-to-head encounters by applying Bayesian methods and functions like g(ϕ) to weigh match outcomes.
The system is applied in competitive gaming, machine learning benchmarks, and sports analytics to ensure fair, precise performance assessments.

The Glicko-2 rating system is a probabilistic model for estimating and updating the latent performance of agents (players in games, or classifiers in algorithmic competitions) through repeated pairwise encounters. It generalizes the Elo system by attaching to each rating a dynamic uncertainty measure (rating deviation, RD) and a dedicated volatility parameter (σ), so that it estimates not only comparative skill but also how reliable that estimate is and how consistent the agent's performance is over time. It has been applied in competitive gaming, machine learning benchmarking, and team sports ranking, with deployments in classifier tournaments (Cardoso et al., 2021, Cardoso et al., 13 Apr 2025), football club analytics (Shelopugin et al., 2023), and esports matchmaking (Bober-Irizar et al., 2024).

1. Parameterization and Internal Scaling

Every agent (player, classifier, team) is represented at time t by three primary values:

$R$ (rating): a point estimate of current ability, typically initialized to 1500.
$RD$ (rating deviation): a standard deviation describing uncertainty in $R$, initialized to 350.
$\sigma$ (volatility): quantifies temporal instability of skill, initialized to 0.06.

Glicko-2 operates on a rescaled, centered space:

$\mu = \frac{R - 1500}{173.7178}$,
$\phi = \frac{RD}{173.7178}$,

with $173.7178 = 400/\ln 10$ ensuring interpretability and comparability to Elo-derived systems. The volatility-change parameter $\tau > 0$ controls the responsiveness of volatility updates (typical range: $0.3$–$1.2$), and $q = \ln 10 / 400$, the reciprocal of the scaling constant, serves as the logistic base constant (Cardoso et al., 2021, Bober-Irizar et al., 2024, Cardoso et al., 13 Apr 2025).
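
The rescaling above is a pair of linear maps. A minimal Python sketch, in which the function names `to_internal` and `to_public` are illustrative choices rather than names from any cited implementation:

```python
import math

SCALE = 400 / math.log(10)   # approximately 173.7178
BASE_RATING = 1500.0         # centre of the public rating scale


def to_internal(rating: float, rd: float) -> tuple[float, float]:
    """Map (R, RD) on the public scale to (mu, phi) on the internal scale."""
    return (rating - BASE_RATING) / SCALE, rd / SCALE


def to_public(mu: float, phi: float) -> tuple[float, float]:
    """Map (mu, phi) back to (R, RD)."""
    return mu * SCALE + BASE_RATING, phi * SCALE


mu, phi = to_internal(1500.0, 350.0)   # a freshly initialized agent
print(round(mu, 4), round(phi, 4))     # 0.0 2.0148
```

The volatility $\sigma$ is not rescaled; it is used on the internal scale directly.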

2. Per-Period Update Mechanism

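Agent ratings are updated at the end of discrete "periods," such as a set of games, a dataset-based tournament, or a seasonal batch. Each agent $i$ plays $m$ matches against opponents $j = 1, \dots, m$; results $s_j \in \{0, 0.5, 1\}$ indicate loss, draw, or win.

The update flow for agent $i$, with current values $(\mu, \phi, \sigma)$, proceeds:

Impact Function ( $g(\phi)$ ):

$g(\phi_j) = \frac{1}{\sqrt{1 + 3\phi_j^2/\pi^2}}$

Opponents with high RD are down-weighted in the update.

Expected Score:

$E_j = E(\mu, \mu_j, \phi_j) = \frac{1}{1 + \exp\left(-g(\phi_j)(\mu - \mu_j)\right)}$

Variance and Delta:

$v = \left[\sum_{j=1}^{m} g(\phi_j)^2 \, E_j (1 - E_j)\right]^{-1}$

$\Delta = v \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$

Volatility Update ( $\sigma'$ ):

Find $\sigma'$ such that $f(\ln \sigma'^2) = 0$, where $f(x) = \frac{e^x(\Delta^2 - \phi^2 - v - e^x)}{2(\phi^2 + v + e^x)^2} - \frac{x - a}{\tau^2}$ and $a = \ln \sigma^2$. Typically solved via Brent’s or Illinois method.

Preliminary Deviation:

$\phi^* = \sqrt{\phi^2 + \sigma'^2}$

New RD and Rating:

$\phi' = \left(\frac{1}{\phi^{*2}} + \frac{1}{v}\right)^{-1/2}$

$\mu' = \mu + \phi'^2 \sum_{j=1}^{m} g(\phi_j)(s_j - E_j)$

Conversion Back:

$R' = 173.7178\,\mu' + 1500$

$RD' = 173.7178\,\phi'$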

These equations rigorously propagate observed outcomes and inferred uncertainty through the system, supporting both head-to-head and round-robin competitive structures (Cardoso et al., 2021, Bober-Irizar et al., 2024, Cardoso et al., 13 Apr 2025).

3. Algorithmic and Statistical Rationale

Glicko-2’s statistical underpinnings reflect Bayesian updating in which the prior for each agent is a normal distribution with variance $\phi^2$, and actual match outcomes provide a sequence of (possibly noisy) observations. The $g(\phi_j)$ function is a reliability dampener, ensuring that outcomes against poorly measured opponents do not induce outsized rating changes. The $\Delta$ term aggregates "surprise" (the deviation between observed and expected scores), scaled by the estimated variance $v$, which is the inverse of the information carried by the period's matches. The volatility update is driven by the likelihood of the observed $\Delta$ under the prior: when $\Delta^2$ is larger than $\phi^2 + v$ can explain, $\sigma$ increases, which lets the system absorb rating shocks (unexpected performance swings) (Bober-Irizar et al., 2024, Cardoso et al., 13 Apr 2025).
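
The dampening role of $g$ can be made concrete with a short Python sketch. It holds a 200-point rating gap fixed and varies only the opponent's RD; the specific ratings and RD values are illustrative assumptions, not figures from the cited papers.

```python
import math

SCALE = 400 / math.log(10)  # approximately 173.7178


def g(phi: float) -> float:
    """Weight applied to an opponent whose internal deviation is phi."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_score(mu: float, mu_j: float, phi_j: float) -> float:
    """Expected score of an agent at mu against an opponent at (mu_j, phi_j)."""
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


mu = (1700 - 1500) / SCALE      # agent rated 1700
mu_j = 0.0                      # opponent rated 1500
for opp_rd in (30, 100, 300):   # opponent measured with growing uncertainty
    phi_j = opp_rd / SCALE
    print(opp_rd, round(g(phi_j), 4), round(expected_score(mu, mu_j, phi_j), 4))
# g falls from about 0.9955 (RD 30) through 0.9531 (RD 100) to 0.7242 (RD 300)
```

As the opponent's RD grows, $g$ shrinks and the expected score is pulled toward 0.5, so both the predicted outcome and the resulting rating change become more conservative.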

4. Application Domains and Adaptations

Glicko-2 is implemented as follows across representative domains:

Machine Learning Classifier Benchmarking:

Each dataset in the benchmark is a rating period. Item Response Theory (IRT) estimates classifier ability per instance difficulty; classifiers "compete" via head-to-head matches (S=1/0/0.5) based on true-score comparisons. The Glicko-2 update sequence refines $(R, RD, \sigma)$ per classifier after each dataset benchmark (Cardoso et al., 2021, Cardoso et al., 13 Apr 2025).

Table: Mapping of Glicko-2 Entities in Classifier Benchmarking

| Glicko-2 Term | Competition Context | Update Basis |
|---------------|---------------------|---------------------------|
| Player | Classifier | Dataset-based tournaments |
| Match | Head-to-head eval | IRT true-score comparison |
| Period | Dataset | One round-robin session |
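
One way to turn per-dataset true-scores into the match results that feed a rating period is a simple round-robin comparison. The sketch below is an assumption about how that mapping could look; the function name `round_robin_results`, the tie tolerance `tol`, and the example scores are hypothetical and not taken from the cited papers.

```python
from itertools import combinations


def round_robin_results(true_scores: dict[str, float], tol: float = 1e-9):
    """Return (classifier_a, classifier_b, score_a) for every pair on one dataset.

    score_a is 1.0 if a has the higher true-score, 0.0 if lower, 0.5 on a tie.
    """
    results = []
    for a, b in combinations(sorted(true_scores), 2):
        diff = true_scores[a] - true_scores[b]
        if abs(diff) <= tol:
            score_a = 0.5
        elif diff > 0:
            score_a = 1.0
        else:
            score_a = 0.0
        results.append((a, b, score_a))
    return results


# Hypothetical IRT true-scores of three classifiers on one dataset.
scores = {"knn": 71.0, "svm": 78.5, "tree": 71.0}
for match in round_robin_results(scores):
    print(match)
# ('knn', 'svm', 0.0)
# ('knn', 'tree', 0.5)
# ('svm', 'tree', 1.0)
```

Each tuple is seen from the first classifier's side; the second classifier's score for the same match is one minus the value shown.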

Football League Analytics:

Each match is a rating period. Modifications include: explicit draw probabilities via a Poisson/LightGBM model; home-field advantage via an additive shift to the home side's $\mu$; league transitions via preseason parameter resets; rating inflation control via league average renormalization. The expected-score formula is adapted to model (win, draw, loss) probability triplets (Shelopugin et al., 2023).
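
As a rough illustration of the home-field adjustment only, the standard expected-score formula can be given an additive shift on the internal scale. This is a sketch under stated assumptions: the parameter `home_adv` and its value are hypothetical, and the draw model of the cited work is not reproduced here.

```python
import math


def g(phi: float) -> float:
    """Standard Glicko-2 impact function."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected_score_home(mu_home: float, mu_away: float,
                        phi_away: float, home_adv: float) -> float:
    """Expected score of the home side with an additive home-advantage shift.

    home_adv is a hypothetical tunable parameter on the internal (mu) scale.
    """
    return 1.0 / (1.0 + math.exp(-g(phi_away) * (mu_home + home_adv - mu_away)))


# Two equally rated sides: without the shift the expectation is exactly 0.5.
print(expected_score_home(0.0, 0.0, 0.5, home_adv=0.0))  # 0.5
print(expected_score_home(0.0, 0.0, 0.5, home_adv=0.2) > 0.5)  # True
```

A parameter of this kind would typically be fitted together with the other hyperparameters rather than fixed in advance.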

Esports (CS:GO):

Glicko-2 outperformed both Elo and TrueSkill at predicting non-draw professional match outcomes, with systematic gains in accuracy across training horizons and consistent parameterization using canonical defaults ($R = 1500$, $RD = 350$, $\sigma = 0.06$, and a fixed $\tau$). No domain-dependent tuning of $\tau$ or $\sigma$ is required to attain robust skill separation (Bober-Irizar et al., 2024).

5. Pseudocode and Update Example

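A canonical Glicko-2 update cycle (as used in classifier benchmarking) applies the steps of Section 2 once per rating period: convert $(R, RD)$ to $(\mu, \phi)$; compute $g(\phi_j)$ and $E_j$ for every opponent; accumulate $v$ and $\Delta$; solve for the new volatility $\sigma'$; update $\phi'$ and $\mu'$; and convert back to $(R', RD')$ (Cardoso et al., 13 Apr 2025).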
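
A minimal Python sketch of one update cycle, following the steps of Section 2. It is not a reproduction of the listing in the cited paper: the function names, the default `tau=0.5`, and the tolerance `eps` are assumptions, and the root-finding loop is the Illinois variant of regula falsi as described in Glickman's published account of Glicko-2.

```python
import math

SCALE = 400 / math.log(10)  # approximately 173.7178


def g(phi):
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)


def expected(mu, mu_j, phi_j):
    return 1.0 / (1.0 + math.exp(-g(phi_j) * (mu - mu_j)))


def new_volatility(delta, phi, v, sigma, tau, eps=1e-6):
    """Solve f(x) = 0 for x = ln(sigma'^2) and return sigma'."""
    a = math.log(sigma ** 2)

    def f(x):
        ex = math.exp(x)
        num = ex * (delta ** 2 - phi ** 2 - v - ex)
        den = 2.0 * (phi ** 2 + v + ex) ** 2
        return num / den - (x - a) / tau ** 2

    # Bracket the root between A and B.
    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau

    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0  # Illinois step: halve the stale endpoint's value
        B, fB = C, fC
    return math.exp(A / 2.0)


def update(rating, rd, sigma, opponents, tau=0.5):
    """One rating period. opponents: list of (opp_rating, opp_rd, score)."""
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    if not opponents:
        # No matches: rating and volatility stay put, uncertainty grows.
        return rating, SCALE * math.sqrt(phi ** 2 + sigma ** 2), sigma

    inv_v, surprise = 0.0, 0.0
    for opp_rating, opp_rd, score in opponents:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        e = expected(mu, mu_j, phi_j)
        inv_v += g(phi_j) ** 2 * e * (1.0 - e)
        surprise += g(phi_j) * (score - e)
    v = 1.0 / inv_v
    delta = v * surprise

    sigma_new = new_volatility(delta, phi, v, sigma, tau)
    phi_star = math.sqrt(phi ** 2 + sigma_new ** 2)
    phi_new = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    mu_new = mu + phi_new ** 2 * surprise
    return SCALE * mu_new + 1500.0, SCALE * phi_new, sigma_new


# Glickman's published example: a 1500/200 player beats 1400/30,
# then loses to 1550/100 and 1700/300.
r_new, rd_new, sigma_new = update(
    1500, 200, 0.06, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)]
)
print(round(r_new, 1), round(rd_new, 1), round(sigma_new, 4))
```

The printed values should land close to the figures Glickman reports for this example (rating about 1464, RD about 151.5, volatility just under 0.06). All matches in the period are processed together, so the order of the opponent list does not matter.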

A worked numerical example is provided in (Cardoso et al., 13 Apr 2025): two agents both begin with the default $R = 1500$, $RD = 350$, and $\sigma = 0.06$ under a shared $\tau$; one wins a head-to-head match, which increases its rating and decreases its $RD$ significantly, while its volatility $\sigma$ shows only a minute adjustment.
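
The arithmetic for the winner of such a match can be traced by hand from the Section 2 formulas. The figures below are computed here under the assumption that both agents start at $R = 1500$, $RD = 350$, $\sigma = 0.06$; they are not quoted from the cited paper. The new volatility is taken as $\sigma' \approx 0.06$, since in this symmetric case $\Delta^2 = v$ and the solver moves $\sigma$ only marginally below its starting value.

```latex
\begin{aligned}
\mu &= 0, \qquad \phi = \phi_j = 350 / 173.7178 \approx 2.0148 \\
g(\phi_j) &= \frac{1}{\sqrt{1 + 3(2.0148)^2/\pi^2}} \approx 0.6691 \\
E &= \frac{1}{1 + e^{0}} = 0.5 \\
v &= \left[ g(\phi_j)^2 \, E (1 - E) \right]^{-1} \approx \frac{1}{0.4477 \times 0.25} \approx 8.935 \\
\Delta &= v \, g(\phi_j)(1 - E) \approx 8.935 \times 0.6691 \times 0.5 \approx 2.989 \\
\phi^* &= \sqrt{\phi^2 + \sigma'^2} \approx \sqrt{4.0593 + 0.0036} \approx 2.0157 \\
\phi' &= \left( \frac{1}{\phi^{*2}} + \frac{1}{v} \right)^{-1/2} \approx (0.2461 + 0.1119)^{-1/2} \approx 1.6712 \\
\mu' &= \mu + \phi'^2 \, g(\phi_j)(1 - E) \approx 2.7929 \times 0.3345 \approx 0.9343 \\
R' &= 173.7178\,\mu' + 1500 \approx 1662.3, \qquad RD' = 173.7178\,\phi' \approx 290.3
\end{aligned}
```

By symmetry the loser moves the same distance in the opposite direction, to roughly $R' \approx 1337.7$ with the same $RD'$.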

6. Interpretability and Domain-Specific Metrics

The triplet $(R, RD, \sigma)$ supports a nuanced interpretation:

$R$: best estimate of ability integrated over observed performance.
$RD$: reflects rating confidence; smaller $RD$ corresponds to higher certainty. A 95% interval may be reported as $R \pm 1.96\,RD$ (roughly $R \pm 2\,RD$).
$\sigma$: captures inconsistency across periods; low $\sigma$ implies stable performance.
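
Reporting such an interval is a one-line computation. A small Python sketch, assuming a normal approximation with the usual 1.96 multiplier; the function name `rating_interval` and the example values are illustrative:

```python
def rating_interval(rating: float, rd: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate interval for an agent's true rating (z = 1.96 gives about 95%)."""
    return rating - z * rd, rating + z * rd


# A newly initialized agent versus a well-measured one at the same rating.
new_lo, new_hi = rating_interval(1500.0, 350.0)
old_lo, old_hi = rating_interval(1500.0, 50.0)
print(round(new_lo, 1), round(new_hi, 1))  # 814.0 2186.0
print(round(old_lo, 1), round(old_hi, 1))  # 1402.0 1598.0
```

The two agents share a point estimate, but only the second interval is narrow enough to separate the agent from competitors a couple of hundred points away.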

Glicko-2 provides statistically grounded, compact summary metrics for ability, reliability, and volatility, thereby enabling fine-grained decisions in both algorithmic benchmarking and competitive analytics (Cardoso et al., 2021, Bober-Irizar et al., 2024, Shelopugin et al., 2023, Cardoso et al., 13 Apr 2025).

7. Extensions, Modifications, and Practical Considerations

In football analytics, enhancements include probabilistic draws, explicit home/away modeling, league transitions, and log-loss-based hyperparameter optimization for improved predictive performance (Shelopugin et al., 2023; github.com/andreyshelopugin/GlickoSoccer). In machine learning benchmarking, Glicko-2 facilitates fairer classifier comparison by integrating statistical difficulty (via IRT) and standardizing head-to-head ability assessment (Cardoso et al., 2021, Cardoso et al., 13 Apr 2025).

Parameter selection is generally robust; typical defaults yield effective results, and only targeted domains (football, multi-player esports) warrant extensive hyperparameter tuning. Insufficiently informative matches (low $g(\phi_j)$) do little to reduce uncertainty, which is reflected in persistently high $RD$. Systematic sensitivity analyses have validated the stability of Glicko-2 under standard settings compared to alternative rating models (Bober-Irizar et al., 2024).

For full algorithmic details, practical deployment strategies, and curated numerical examples, see (Cardoso et al., 2021, Bober-Irizar et al., 2024, Shelopugin et al., 2023), and (Cardoso et al., 13 Apr 2025).

References (4)

Cardoso et al. (2021). Data vs classifiers, who wins?

Bober-Irizar et al. (2024). Skill Issues: An Analysis of CS:GO Skill Rating Systems

Cardoso et al. (2025). Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness

Shelopugin et al. (2023). Ratings of European and South American Football Leagues Based on Glicko-2 with Modifications
