
Ranked Reward (R2) Algorithm

Updated 14 January 2026
  • The Ranked Reward (R2) algorithm is a reinforcement learning framework that reshapes rewards based on the ranking of recent performance.
  • It uses a sliding window and percentile threshold to dynamically calibrate rewards, emulating self-play dynamics in non-adversarial settings.
  • Applications span combinatorial optimization, sparse-reward games, and imitation learning from demonstrations and passive video.

The Ranked Reward (R2) algorithm is a reinforcement learning framework developed to bring self-play techniques, which have shown success in two-player games such as Go and Chess, to single-agent environments and combinatorial optimization problems. By ranking recent agent performance and reshaping rewards according to relative, rather than absolute, achievement, R2 creates an adaptive curriculum that preserves the crucial moving-target property of self-play for problems lacking natural adversaries. Applications range from combinatorial optimization (e.g., bin packing) and sparse-reward single-player games (e.g., Morpion Solitaire) to reward modeling from demonstrations, including action-free passive video.

1. Motivation and Conceptual Underpinnings

Traditional self-play reinforcement learning leverages the adversarial setup of competitive games, ensuring that an agent's training distribution adaptively matches its own skill level, thus maintaining an effective curriculum. In single-agent environments or optimization tasks, such natural adversarial structure is absent. The R2 algorithm addresses this by introducing a sliding-window mechanism: the agent's returns on recent episodes are ranked, and subsequent episode rewards are reshaped into a binary signal (+1 or −1) based on whether current performance exceeds a percentile threshold of this buffer. This "self-competition" dynamically calibrates reward difficulty, emulating self-play's curriculum properties in the single-agent regime (Laterre et al., 2018, Wang et al., 2020).

2. Formal Algorithmic Definition

Let episode returns be denoted r_T ∈ ℝ. Maintain a buffer B = {r^(1), …, r^(L)} of the latest L raw scores. Sorting B, the α-quantile (e.g., the 75th percentile) is r_α. Define the ranked reward z(r_T; B) as:

z(r_T; B) = \begin{cases} +1 & \text{if } r_T > r_\alpha \\ -1 & \text{if } r_T < r_\alpha \\ u \sim \text{Uniform}\{-1, +1\} & \text{if } r_T = r_\alpha \end{cases}

This label z replaces the raw episodic return in both value learning and target propagation during MCTS or any RL update. For each new episode, the buffer is updated and the threshold r_α is recalculated, establishing a moving baseline. The choice of buffer length L and percentile α tunes the frequency and severity of positive reward, balancing curriculum strictness against adaptability (Laterre et al., 2018, Wang et al., 2020).
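The threshold computation and binarization above can be sketched as follows. This is a minimal Python sketch; the nearest-rank quantile convention and the deque-based sliding window are implementation assumptions, not details prescribed by the papers:

```python
import random
from collections import deque

def ranked_reward(r_T, buffer, alpha=0.75):
    """Binarize an episode return against the alpha-quantile of recent returns."""
    scores = sorted(buffer)
    # Nearest-rank index of the alpha-quantile in the sorted buffer (an assumption).
    idx = min(int(alpha * len(scores)), len(scores) - 1)
    r_alpha = scores[idx]
    if r_T > r_alpha:
        return 1
    if r_T < r_alpha:
        return -1
    return random.choice([-1, 1])  # random tie-break exactly at the threshold

# Sliding window of the last L raw episode scores.
L = 5
buffer = deque([10, 12, 8, 15, 11], maxlen=L)

print(ranked_reward(16, buffer))  # beats the 75th-percentile score (12) -> 1
print(ranked_reward(7, buffer))   # falls below the threshold -> -1
```

Appending each new raw score to the deque automatically evicts the oldest one, which is what makes the threshold a moving baseline.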

3. Integration with Deep RL and MCTS

R2 is integrated into AlphaZero-style self-play reinforcement learning. The overall loop involves alternating phases of self-play, during which games are generated using the current policy (optionally with MCTS), and learning, during which network parameters are updated. For each time step in an episode, training tuples (s_t, π_t, z), comprising the state, the improved policy (from MCTS visit counts or directly from the policy head), and the ranked reward, are collected into a replay buffer.

The combined loss function is:

l(\theta) = \mathbb{E}_{(s, \pi, z)} \left[ (v_\theta(s) - z)^2 - \pi^\top \log p_\theta(s) \right] + \text{regularization}

This approach was applied to Morpion Solitaire, where sparse, unbounded rewards (number of moves) preclude direct win/loss signals; the ranked-reward signal provides dense supervision and drives learning toward surpassing the moving performance threshold (Wang et al., 2020).
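The combined loss can be written out numerically as follows. This is a minimal NumPy sketch; the batch layout and the scalar stand-in for the regularization term are assumptions for illustration:

```python
import numpy as np

def r2_loss(v_pred, z, pi_mcts, log_p, c=1e-4, theta_sq_norm=0.0):
    """AlphaZero-style loss with the ranked reward z in place of the game outcome.

    v_pred:   predicted values v_theta(s), shape (batch,)
    z:        ranked rewards in {-1, +1}, shape (batch,)
    pi_mcts:  improved policies (e.g., from MCTS visit counts), shape (batch, actions)
    log_p:    log policy-head outputs log p_theta(s), shape (batch, actions)
    c, theta_sq_norm: scalar stand-in for the L2 regularization term.
    """
    value_loss = np.mean((v_pred - z) ** 2)                   # (v_theta(s) - z)^2
    policy_loss = -np.mean(np.sum(pi_mcts * log_p, axis=1))   # -pi^T log p_theta(s)
    return value_loss + policy_loss + c * theta_sq_norm
```

With a perfect value head (v_pred equal to z) and uniform policies over two actions, the loss reduces to the policy cross-entropy log 2.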

4. Hyperparameters and Curriculum Effects

Key parameters include:

Parameter               Typical value                         Effect
Buffer size L           200–250                               Smoothing vs. rapid adaptation
Percentile α            0.5–0.9                               Stringency of positive reward; curriculum strictness
Episodes per iteration  50                                    Sample efficiency
MCTS simulations        100–300 (self-play), 20,000 (eval)    Policy-improvement breadth and depth

A higher α (e.g., 0.9) means only the top 10% of games yield positive reward, intensifying selection pressure toward elite performance but risking sparse feedback. Smaller buffers let the threshold adapt quickly, potentially overreacting to noise or outliers, while a large L smooths the threshold but slows updates. Improper choices (e.g., |B| too small or α too high) can stall learning, as nearly all episodes are penalized (Laterre et al., 2018, Wang et al., 2020).
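The buffer-size trade-off can be illustrated with a toy simulation of a steadily improving agent. The scores and buffer sizes below are made up for illustration; only the qualitative effect (short buffers track recent performance, long buffers lag) is the point:

```python
from collections import deque

def quantile_threshold(buffer, alpha):
    """Nearest-rank alpha-quantile of the buffered scores (an assumed convention)."""
    scores = sorted(buffer)
    return scores[min(int(alpha * len(scores)), len(scores) - 1)]

# Toy agent whose episode score improves by 1 each episode.
scores = list(range(100))
small, large = deque(maxlen=10), deque(maxlen=80)
thresholds_small, thresholds_large = [], []
for s in scores:
    small.append(s)
    large.append(s)
    thresholds_small.append(quantile_threshold(small, 0.75))
    thresholds_large.append(quantile_threshold(large, 0.75))

# The short buffer's threshold hugs recent performance; the long one lags behind.
print(thresholds_small[-1], thresholds_large[-1])  # 97 80
```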

5. Extensions to Imitation Learning and Video-based Reward Shaping

The ranked-reward concept generalizes to learning from demonstrations. D-REX (Disturbance-based Reward Extrapolation) generates automatically ranked demonstrations via controlled performance degradation: injecting increasing noise into a behavioral-cloning policy yields trajectories whose expected performance decreases monotonically with the noise level. Pairwise rankings between trajectories, modeled via a logistic loss, enable learning a reward function that, when optimized with standard RL, yields policies often surpassing the best demonstrator by reducing reward ambiguity and enabling performance extrapolation (Brown et al., 2019).
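The pairwise logistic objective can be sketched in scalar form. This is a simplification: here each argument stands for the learned reward summed over one trajectory, and D-REX optimizes this loss over many sampled trajectory pairs rather than a single comparison:

```python
import math

def pairwise_ranking_loss(ret_i, ret_j):
    """Logistic (Bradley-Terry-style) loss for the preference "trajectory j over i".

    ret_i, ret_j: predicted returns of the two trajectories under the learned
    reward, where j is the less-noisy (higher-ranked) trajectory. The loss is
    low when the model assigns the preferred trajectory the higher return.
    """
    return -math.log(math.exp(ret_j) / (math.exp(ret_i) + math.exp(ret_j)))

print(pairwise_ranking_loss(0.0, 0.0))   # indifferent model: log 2
print(pairwise_ranking_loss(0.0, 10.0))  # confident, correct ranking: near 0
```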

Rank2Reward extends R2 to passive video. A ranking network u_θ(·), trained with a temporal frame-ordering loss on demonstration videos, assigns progress scores p_RF(s) = σ(u_θ(s)). This score, optionally regularized by adversarial distribution matching (KL or GAIL-style objectives), serves as a shaped reward for RL, allowing robots to learn tasks from raw video without low-level action or state data. The framework has demonstrated sample-efficient learning on both simulated and real robotic tasks, with scaling demonstrated on web-scale datasets (Yang et al., 2024).
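The two ingredients named above can be sketched with scalar network outputs. The function names are illustrative, not from the paper, and the real losses operate on learned image embeddings rather than raw scalars:

```python
import math

def progress_score(u):
    """Shaped reward: sigmoid of the ranking network's scalar output u_theta(s)."""
    return 1.0 / (1.0 + math.exp(-u))

def frame_order_loss(u_early, u_late):
    """Logistic frame-ordering loss: given two frames from the same demonstration
    video, push u_theta to score the temporally later frame higher, so that
    u_theta increases monotonically with task progress."""
    return -math.log(progress_score(u_late - u_early))

print(progress_score(0.0))           # untrained midpoint score: 0.5
print(frame_order_loss(0.0, 5.0))    # correct ordering, large margin: near 0
```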

6. Empirical Results and Application Domains

R2 has demonstrated substantial empirical gains:

  • In 2D/3D bin packing, R2 outperforms vanilla MCTS, supervised policies, domain heuristics, and commercial integer-programming solvers, with its advantage increasing for larger N (up to 15% lower costs than Gurobi under a fixed time budget) (Laterre et al., 2018).
  • For Morpion Solitaire, R2-enabled AlphaZero reaches a 67-step solution (vs. human best 68), within one week on commodity GPUs, using only the ranked-reward mechanism for dense feedback (Wang et al., 2020).
  • D-REX yields better-than-demonstrator policies in MuJoCo and Atari, outperforming BC and GAIL in most tasks (e.g., +418% above demo on Half-Cheetah) (Brown et al., 2019).
  • Rank2Reward accelerates imitation learning from passive video, solves all tasks in under 2 hours of real robot time, and demonstrates extractable, monotonic progress signals even in large, uncurated datasets (Yang et al., 2024).

7. Limitations, Theoretical Insights, and Future Directions

R2's performance is sensitive to instance difficulty heterogeneity: if the distribution of problem instances is non-uniform, reward ranking may be noisy, and poor outcomes on harder instances can be mis-ranked against easy ones. Proposed mitigations include repeated play on the same instance or explicit normalization. Theoretical analysis demonstrates that ranking induces exponential reductions in reward ambiguity, facilitating reward learning with relatively few preference pairs (Brown et al., 2019).

Extensions to continuous ranked rewards, explicit instance difficulty modeling, and further problem domains are active areas of investigation. The ranked-reward principle has proven effective across diverse settings, providing a self-play analogue for single-agent RL and data-driven imitation in settings where rewards are sparse, unnormalized, or unavailable.
