Rollout-GDRO: Dynamic Compute Allocation

Updated 28 January 2026
  • Rollout-GDRO is a compute-neutral strategy that dynamically allocates policy rollouts based on prompt difficulty to efficiently reduce gradient variance.
  • The method employs a game-theoretic formulation with a no-regret primal–dual controller that adapts rollout counts, yielding 9–11% relative accuracy gains on hard reasoning tasks.
  • Empirical results demonstrate significant robustness improvements and reduced variance proxies, making near-optimal use of a fixed mean rollout budget in LLM post-training.

Rollout-GDRO is a compute-neutral allocation strategy introduced within the Group Distributionally Robust Optimization (GDRO) framework for reinforcement learning post-training of LLMs. It dynamically reallocates the number of policy rollouts per prompt across difficulty-defined groups, maximizing gradient variance reduction for hard reasoning tasks under a strict mean rollout budget. Rollout-GDRO contrasts with standard uniform rollout strategies by adapting computation to the evolving difficulty distribution of prompts, yielding significant robustness gains on heterogeneous, heavy-tailed reasoning data (Panaganti et al., 27 Jan 2026).

1. Motivation for Dynamic Rollout Allocation

Traditional Group Relative Policy Optimization (GRPO) allocates a fixed number of rollouts $n$ per sampled prompt, maintaining uniform exploration across the dataset. In LLM reasoning, the data distribution is heavy-tailed: most prompts become “solved” (i.e., exhibit low variance) early, whereas a minority of hard prompts continue to present high estimation uncertainty. Persisting with uniform rollout allocation wastes compute on simple, well-understood prompts while undercovering the hard tail, hindering policy improvement precisely where it is most needed. Rollout-GDRO addresses this inefficiency by making the rollout count a function of the prompt's difficulty group, while strictly preserving the total mean rollout budget $\bar n$.

2. Game-Theoretic Formulation of Rollout-GDRO

At each training step, an online pass@k classifier partitions prompts into $B$ dynamic “difficulty bins.” Let $\hat q_t(b)$ denote the empirical proportion of prompts in bin $b$ for the current batch. The goal is to allocate, for each bin $b$, an integer number of rollouts $n_b \in \{n_{\min}, \dots, n_{\max}\}$ so as to maximize empirical utility under a compute constraint:

$$\max_{\{n_b\}} \sum_{b=1}^B \hat q_t(b)\, \hat J_b(\theta; n_b) \quad \text{s.t.} \quad \sum_{b=1}^B \hat q_t(b)\, n_b = \bar n,$$

where $\hat J_b(\theta; n_b)$ is the empirical bin utility (negative loss), defined as

$$\hat J_b(\theta; n_b) = -\frac{1}{|\mathcal{B}_{b,t}|} \sum_{x_i \in \mathcal{B}_{b,t}} \left[ \frac{1}{n_b} \sum_{j=1}^{n_b} \ell_{i,j}(\theta) \right],$$

with $\ell_{i,j}(\theta)$ the GRPO loss for rollout $j$ on prompt $x_i$. The corresponding Lagrangian introduces a shadow price $\mu$ (compute dual variable):

$$\mathcal{L}(\{n_b\}, \mu) = -\sum_{b=1}^B \hat q_t(b)\, \hat J_b(\theta; n_b) + \mu \left(\sum_{b=1}^B \hat q_t(b)\, n_b - \bar n \right).$$

Each bin $b$ faces a penalized bandit loss:

$$L_b(n) = -\hat J_b(\theta; n) + \mu n.$$
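As a toy illustration of these quantities, the sketch below computes an empirical bin utility $\hat J_b$ from a matrix of per-rollout GRPO losses and the penalized bandit loss the bin's primal player would observe. The loss values, bin size, and shadow price are made up for illustration; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-rollout GRPO losses ell_{i,j}(theta) for one bin:
# 5 prompts x_i, each with n_b = 4 rollouts.
n_b = 4
losses = rng.normal(loc=0.8, scale=0.2, size=(5, n_b))

# Empirical bin utility: negative mean loss over prompts and rollouts.
J_b = -losses.mean(axis=1).mean()

# Penalized bandit loss for arm n = n_b under an illustrative
# shadow price mu (the compute dual variable).
mu = 0.05
L_b = -J_b + mu * n_b

print(f"J_b = {J_b:.3f}, bandit loss L_b({n_b}) = {L_b:.3f}")
```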

3. Variance Proxy and the Square-Root Allocation Law

The effect of $n_b$ on policy improvement manifests through variance reduction in Monte Carlo gradient estimation. Under mild bounded-difference conditions, the per-prompt gradient variance for bin $b$ is bounded as

$$\mathrm{Var}[\hat g(x; \theta, n_b)] \le \frac{v_b(\theta)}{n_b},$$

where $v_b(\theta)$ is the intrinsic gradient variance for bin $b$. The overall batch-level variance proxy is

$$\mathrm{VarProxy}(\{n_b\}; \theta) = \sum_{b=1}^B \hat q_t(b)\, \frac{v_b(\theta)}{n_b}.$$

Relaxing $n_b$ to $\mathbb{R}_{>0}$ yields a convex optimization problem whose Karush–Kuhn–Tucker (KKT) solution is the "square-root law"

$$n_b^* = \bar n \cdot \frac{\sqrt{v_b(\theta)}}{\sum_{j=1}^B \hat q_t(j)\, \sqrt{v_j(\theta)}},$$

implying that bins with larger variance $v_b$ are assigned disproportionately more rollouts, scaling as $\sqrt{v_b}$ (Panaganti et al., 27 Jan 2026).
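A minimal numeric sketch of the square-root law, with made-up bin fractions and variances, confirms that the relaxed allocation meets the budget exactly and lowers the variance proxy relative to uniform allocation:

```python
import numpy as np

# Hypothetical bin statistics: fractions q_t(b) and intrinsic gradient
# variances v_b(theta) for B = 4 difficulty bins (easy -> hard).
q = np.array([0.5, 0.3, 0.15, 0.05])
v = np.array([0.1, 0.4, 1.6, 6.4])
n_bar = 4.0  # mean rollout budget

# Square-root law: n_b* proportional to sqrt(v_b), normalized so the
# q-weighted mean rollout count equals n_bar exactly.
n_star = n_bar * np.sqrt(v) / np.dot(q, np.sqrt(v))

# Variance proxies: uniform allocation vs the relaxed optimum
# (the optimum equals (sum_b q(b) sqrt(v_b))^2 / n_bar).
proxy_uniform = np.dot(q, v) / n_bar
proxy_opt = np.dot(q, v / n_star)

print(n_star.round(2), round(proxy_uniform, 4), round(proxy_opt, 4))
```

Note that the relaxed $n^*$ for the rarest, hardest bin can exceed $n_{\max}$; the discrete bandit controller of Section 4 is what handles the integer and interval constraints.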

4. No-Regret Primal–Dual Controller and Algorithmic Implementation

The allocation is realized as a discrete zero-sum game between a primal player, which distributes rollout arms via an EXP3P/entropic mirror-descent update, and a dual player, which updates the shadow price $\mu$ by projected gradient ascent on the budget violation. At each step:

  • Each bin $b$ samples a rollout arm $n_{t,b}$ from a mixture of its current distribution $p_{t,b}$ and uniform exploration.
  • The exact mean budget constraint is enforced (via dynamic programming if necessary).
  • The bin utility $\hat J_b(\theta; n_{t,b})$ is estimated from the collected rollouts.
  • Primal updates use the bandit loss $-\hat J_b(\theta; n) + \mu_t n$.
  • The dual variable is updated as $\mu_{t+1} = \mu_t + \alpha_\mu (\bar n_t - \bar n)$, projected to a prescribed interval.

The algorithm guarantees (Theorem B.11 in the paper) that the saddle-point gap of the averaged iterates is of order $O\big((\log K + \mu_{\max}^2)/\sqrt{T}\big)$ after $T$ training steps, ensuring no-regret behavior and near-optimality for the original budgeted variance-minimization problem.
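The exact-budget step above can be sketched as a small dynamic program. The construction below is illustrative (the paper's exact procedure is not reproduced here): it picks integer arms as close as possible to the sampled ones while satisfying the budget exactly.

```python
def enforce_budget(counts, sampled, n_min, n_max, n_bar):
    """Adjust per-bin rollout arms so sum_b counts[b] * n_b equals
    n_bar * sum(counts) exactly, moving each arm as little as possible
    from its sampled value (illustrative DP, not the paper's code)."""
    target = round(n_bar * sum(counts))
    # dp maps achievable partial totals -> (min cost, arms chosen so far)
    dp = {0: (0, [])}
    for c, s in zip(counts, sampled):
        nxt = {}
        for total, (cost, arms) in dp.items():
            for n in range(n_min, n_max + 1):
                t2, cost2 = total + c * n, cost + abs(n - s)
                if t2 <= target and (t2 not in nxt or cost2 < nxt[t2][0]):
                    nxt[t2] = (cost2, arms + [n])
        dp = nxt
    return dp[target][1]  # assumes the target total is reachable

# Hypothetical batch: per-bin prompt counts, sampled arms, budget n_bar = 4.
arms = enforce_budget(counts=[10, 6, 4], sampled=[3, 5, 7],
                      n_min=2, n_max=12, n_bar=4)
print(arms)  # total rollouts must equal 4 * 20 = 80
```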

5. Pseudocode Summary

A compact version of Rollout-GDRO is as follows:

initialize p_{1,b} = Uniform(N) for all b; mu_1 = 0
for t = 1 to T:
    # compute empirical bin fractions
    observe batch, compute q_t(b) for each bin
    # primal: sample arms
    for each bin b:
        n_{t,b} ~ (1-gamma_p)*p_{t,b} + gamma_p*Uniform(N)
    # enforce budget constraint
    adjust {n_{t,b}} using dynamic programming so sum_b q_t(b) n_{t,b} = n_bar
    # collect rollouts and compute GRPO losses
    for each bin b and prompt:
        run n_{t,b} rollouts and compute ell_{i,j}
    # estimate bin-loss
    compute J_b(theta; n_{t,b}) for each b
    # primal update
    for each b, n in N:
        hat_L_{t,b}(n) = -J_b(theta; n) + mu_t*n
        p_{t+1,b}(n) ∝ p_{t,b}(n) * exp(-eta_p * hat_L_{t,b}(n))  # normalize over n
    # dual update
    n_bar_t = sum_b q_t(b) n_{t,b}
    mu_{t+1} = mu_t + alpha_mu * (n_bar_t - n_bar)  # project if needed
return theta_T, p_{T,b}
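The primal and dual updates in the loop above can be made concrete for a single bin. Everything below (the utility curve, step sizes, and exploration rate) is illustrative rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
arms = np.arange(2, 13)                 # allowed rollout counts n in [2, 12]
p = np.full(len(arms), 1 / len(arms))   # current arm distribution p_{t,b}
mu, eta_p, gamma_p = 0.1, 0.5, 0.05     # shadow price, step size, exploration

# Primal: sample an arm from the exploration-mixed distribution.
mix = (1 - gamma_p) * p + gamma_p / len(arms)
n_t = rng.choice(arms, p=mix)

# Bandit loss for every arm, using a made-up diminishing-returns
# utility curve as a stand-in for hat J_b(theta; n).
J = 1.0 - 1.0 / arms
L = -J + mu * arms

# Entropic mirror-descent (exponential-weights) update, then renormalize.
p = p * np.exp(-eta_p * L)
p /= p.sum()

# Dual: gradient ascent on the budget violation, projected to mu >= 0
# (single-bin illustration of the batch-level update).
n_bar, alpha_mu = 4.0, 0.1
mu = max(0.0, mu + alpha_mu * (n_t - n_bar))

print(int(n_t), p.round(3), round(mu, 3))
```

The penalty term $\mu n$ tilts the updated distribution away from large arms whenever the shadow price is high, which is what produces the staircase allocation patterns discussed in Section 7.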

6. Empirical Evaluation and Performance

Rollout-GDRO was evaluated on the DAPO-14.1k math reasoning dataset, using pass@8 as the online difficulty classifier and Qwen3-Base models at 1.7B, 4B, and 8B parameter scales. Under a mean budget of $\bar n = 4$ (with allowed arms $n \in [2, 12]$), Rollout-GDRO achieved substantial improvements in pass@8 accuracy at no additional sampling compute:

Model Size | GRPO pass@8 | Rollout-GDRO pass@8 | Relative Gain
1.7B       | 50.74%      | 56.14%              | +10.64%
4B         | 56.31%      | 62.27%              | +10.59%
8B         | 62.04%      | 67.75%              | +9.20%

7. Emergent Allocation Patterns and Qualitative Analysis

Rollout-GDRO exhibits several qualitatively distinct behaviors:

  • Budget Frontier: The allocation decouples compute from data frequency; rare, hard bins can receive 3×–10× the rollouts assigned by uniform baselines.
  • Variance Reduction: The weighted standard-error proxy (WSE) is reduced by 37.1%, 22.6%, and 33.4% for 1.7B, 4B, and 8B models, respectively, relative to uniform rollout allocation.
  • Staircase Allocation: The shadow price $\mu$ induces abrupt transitions between discrete rollout-arm choices for bins (“staircase” patterns in allocation snapshots).
  • Multiplier Effect: At mid-training, bins containing fewer than 20% of prompts may command over 80% of the total rollout budget.

These findings collectively support the conclusion that dynamically adapting the rollout budget using a GDRO-style shadow-price controller induces an emergent curriculum, shifting computational resources toward evolving hard-task frontiers and yielding large robustness improvements for LLM reasoning, all without increasing total sampling compute (Panaganti et al., 27 Jan 2026).
