CoScale-RL: Scalable RL Methodology

Updated 28 January 2026
  • CoScale-RL is a reinforcement learning scaling methodology that synchronously augments data and computation to extend the ability boundary of large reasoning models.
  • The framework dynamically increases correct solution traces per problem to ensure nonzero reward signals and stabilize learning in sparse-reward environments.
  • By employing per-group rollout allocation and re-distillation, CoScale-RL achieves notable Pass@1 improvements across challenging benchmarks.

CoScale-RL is a reinforcement learning (RL) scaling methodology designed to improve the post-training of Large Reasoning Models (LRMs) by synchronously scaling data (solutions per problem) and computation (rollout allocation), with demonstrated gains in both data and computational efficiency for hard or “unsolvable” problems. The framework is distinguished by its approach to overcoming the ability boundary imposed by traditional supervised fine-tuning (SFT), and by stabilizing RL procedures that would otherwise fail due to sparse-reward or low-probability signal regimes (Chen et al., 21 Jan 2026).

1. Motivation and Problem Context

The training of LRMs for reasoning tasks typically combines SFT and RL, but two principal limitations persist:

  • Ability Boundary: LRMs are only reliably proficient within their "fundamental ability boundary," signified by the set of problems where SFT Pass@1 is strictly positive. Harder problems with near-zero initial solvability remain inaccessible to RL or SFT alone, as SFT on a small corpus may even degrade subsequent RL efficacy.
  • RL Instability for Hard/Weak Models: For tasks with tiny success probability $p \ll 1$ (especially on weakly initialized models), reward signals during RL rollouts are almost always zero, causing policy gradients to vanish and requiring impractically large rollout sizes $N \gg 1/p$ to yield any learning signal.

CoScale-RL addresses these by a co-scaling strategy: (i) scaling the number of correct solutions per problem in SFT so that each instance becomes solvable ($\text{Pass@1} \geq \tau_{\text{SFT}}$), and (ii) dynamically scaling the RL rollout count per problem group according to empirical difficulty. This two-axis scaling transitions all problem instances into RL-amenable regimes and stabilizes training (Chen et al., 21 Jan 2026).
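The sparse-reward failure mode motivating this strategy can be quantified directly: with per-rollout success probability $p$, the chance that all $N$ rollouts return zero reward is $(1-p)^N$. A minimal sketch of the arithmetic (helper names are illustrative, not from the paper):

```python
import math

def p_any_success(p: float, n: int) -> float:
    """Probability that at least one of n rollouts succeeds."""
    return 1.0 - (1.0 - p) ** n

def rollouts_for_signal(p: float, target: float = 0.5) -> int:
    """Smallest rollout count n with P(at least one success) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# With p = 0.001, roughly 1/p-scale rollouts are needed for even a
# 50% chance of observing a single nonzero reward.
print(rollouts_for_signal(1e-3, 0.5))  # ≈ 693
print(p_any_success(1e-3, 16))         # typical budgets yield almost no signal
```

This makes concrete why naive RL on near-zero-accuracy problems wastes compute: at realistic per-step budgets ($N \approx 16$), nearly every batch carries zero gradient.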

2. Data Co-Scaling: Dynamic Solution Augmentation

Rather than expanding the dataset by collecting a single solution per new problem, CoScale-RL augments each hard or unsolvable problem $p$ with additional correct solution traces up to a desired threshold. Denoting $S_p^{(k)}$ as the number of SFT solutions for problem $p$ after iteration $k$, the effective dataset size is $|\mathcal{D}^{(k)}| = \sum_p S_p^{(k)}$.

In the canonical uniform-augmentation regime, $S_p^{(k)} = 1 + 2^k$ for $p$ in the set of unsolvable problems $\mathcal{U}$, and progression continues until empirical Pass@1 for a problem meets or exceeds $\tau_{\text{SFT}}$ (typically $5\%$). The dataset thus grows sublinearly with each iteration, concentrating scaling efforts on the hardest instances. This ensures that every problem introduced to RL has a nonzero empirical success rate, avoiding wasted computation on zero-gradient cases (Chen et al., 21 Jan 2026).
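The augmentation schedule can be sketched as a loop that grows each unsolved problem's trace pool toward $1 + 2^k$ until the Pass@1 threshold is met. This is an illustrative reconstruction assuming hypothetical helpers `sample_solutions` (returns verified-correct traces) and `pass_at_1` (estimates empirical Pass@1 after SFT on the current pool); neither interface is from the paper:

```python
def augment_until_solvable(problems, sample_solutions, pass_at_1,
                           tau_sft=0.05, max_iters=8):
    """Sketch of the uniform-augmentation regime: S_p^(k) = 1 + 2^k
    for problems still below the Pass@1 threshold tau_sft."""
    traces = {p: sample_solutions(p, 1) for p in problems}  # S_p^(0) = 1
    for k in range(1, max_iters + 1):
        unsolvable = [p for p in problems if pass_at_1(p) < tau_sft]
        if not unsolvable:
            break  # every problem is now RL-amenable
        for p in unsolvable:
            target = 1 + 2 ** k               # S_p^(k)
            need = target - len(traces[p])
            traces[p].extend(sample_solutions(p, need))
    return traces
```

Only still-unsolved problems receive new traces each iteration, matching the paper's point that scaling effort concentrates on the hardest instances.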

3. Computation Co-Scaling: Per-Group Rollout Allocation

Within each RL phase, only problems in the "solvable" set $\mathcal{S}$ (those meeting the Pass@1 criterion) are considered. These are sorted by estimated baseline accuracy and partitioned into $m$ groups $\{\mathcal{G}_i\}_{i=1}^m$. Each group $\mathcal{G}_i$ is allocated its own rollout budget $N_i$. Initially, easier groups are assigned smaller $N_i$, and harder ones receive larger values. Should learning plateau (as indicated by groupwise reward metrics), $N_i$ for that group is doubled.
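The grouping-and-budgeting step admits a short sketch. The sort-partition-allocate structure and the plateau-doubling rule follow the description above; the concrete initial budgets (`n_min` doubling per difficulty group) are an assumed illustration, not values from the source:

```python
def allocate_rollouts(problems, baseline_acc, m=4, n_min=8):
    """Sort solvable problems by estimated baseline accuracy (easiest
    first), partition into m groups, and give harder groups larger
    initial rollout budgets N_i."""
    ordered = sorted(problems, key=baseline_acc, reverse=True)
    size = -(-len(ordered) // m)  # ceiling division
    groups = [ordered[i * size:(i + 1) * size] for i in range(m)]
    budgets = [n_min * 2 ** i for i in range(m)]  # harder -> larger N_i
    return groups, budgets

def on_plateau(budgets, i):
    """Double group i's rollout budget when its reward metric plateaus."""
    budgets[i] *= 2
    return budgets
```

Per-group budgets keep easy groups cheap while reserving large $N_i$ for groups whose reward signal would otherwise be too sparse.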

Theoretical justification is provided in terms of quadratic compute efficiency, where, for learning rate $\eta$ and rollout count $N$, the compute-efficiency metric

$$\mathcal{E} := \frac{\Delta \mathrm{Accuracy}}{N_{\text{rollout}}}$$

is maximized by optimizing the ratio $\eta/N$ near a task-specific optimum, keeping variance and drift minimal.

For problem instances with low reward probability $p$, the expected number of rollouts yielding nonzero reward variance is

$$\hat{N}(p) = N \left[ 1 - p^N - (1-p)^N \right],$$

which increases with $N$, directly motivating dynamic scaling of rollout counts for hard subsets. A plausible implication is that this targeting improves both exploration and sample efficiency under sparse-reward conditions (Chen et al., 21 Jan 2026).
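The quantity $\hat{N}(p)$ can be evaluated directly, making the growth with $N$ explicit. A minimal sketch of the formula above (the bracketed factor is the probability that a batch of $N$ rollouts is neither all-success nor all-failure, i.e. has nonzero reward variance):

```python
def expected_informative_rollouts(p: float, n: int) -> float:
    """N_hat(p) = N * [1 - p^N - (1-p)^N]: rollout count weighted by
    the probability that the batch reward has nonzero variance."""
    return n * (1.0 - p ** n - (1.0 - p) ** n)

# For a hard problem (p = 0.01), the informative signal grows with N,
# which is the case for doubling N_i on hard groups:
for n in (8, 64, 512):
    print(n, round(expected_informative_rollouts(0.01, n), 2))
```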

4. Model Merging via Re-distillation

Because problem groups may be trained with distinct rollout budgets or even separate RL runs, coalescing groupwise progress into a unified model is necessary. CoScale-RL employs a "Re-distillation" technique:

  1. For each problem $p$ in the solvable set and newly discovered problems, aggregate $\approx 100$ recent correct RL trajectories into a buffer $\mathcal{B}_p$.
  2. Fine-tune the base policy $\pi_0$ (or the previous merged checkpoint) using $\bigcup_p \mathcal{B}_p$ as an SFT dataset.

In formal terms, with $\pi_\theta$ the per-group RL policy, the re-distillation objective is

$$\min_{\phi} \sum_{p} \sum_{\tau \in \mathcal{B}_p} -\log \pi_\phi(\tau \mid p),$$

which compresses group-level RL improvement into an updated SFT policy, unifying updates while mitigating catastrophic forgetting and facilitating deployment as a single model (Chen et al., 21 Jan 2026).
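Operationally, re-distillation is supervised fine-tuning on the pooled trajectory buffers. A minimal sketch, with a hypothetical `finetune_step` standing in for one SFT gradient update on $-\log \pi_\phi(\tau \mid p)$ (this interface is an assumption, not from the paper):

```python
def redistill(base_policy, buffers, finetune_step):
    """Compress per-group RL progress into one SFT policy.

    buffers: maps each problem p to its recent correct RL
    trajectories B_p; finetune_step(policy, p, tau) applies one
    SFT update toward trajectory tau for problem p."""
    sft_dataset = [(p, tau) for p, trajs in buffers.items() for tau in trajs]
    policy = base_policy  # pi_0 or the previous merged checkpoint
    for p, tau in sft_dataset:
        policy = finetune_step(policy, p, tau)
    return policy
```

Because every group's trajectories enter one pooled SFT pass over the same base checkpoint, the merged model absorbs all groupwise gains at once rather than sequentially overwriting them, which is how the procedure mitigates catastrophic forgetting.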

5. Empirical Results and Ablations

Evaluations are performed on four math-intensive benchmarks (MATH-500, AMC12, OpenMathReasoning, OlympiadMATH) and a general reasoning benchmark (Reasoning GYM), primarily using Qwen2.5-0.5B as a base.

Main Results (Pass@1, mean ± 95% CI):

| Method | MATH-500 | AMC12 | OMR | OlyMATH | GYM |
|---|---|---|---|---|---|
| Pretrained | 24.7 ± 1.4 | 2.1 ± 0.3 | 2.6 ± 0.4 | 0.29 ± 0.09 | 6.0 ± 1.1 |
| Pretrain+RL | 20.8 ± 0.5 | 2.7 ± 0.3 | 3.5 ± 0.5 | 0.45 ± 0.11 | 8.6 ± 1.3 |
| SFT (data-scale) | 23.2 ± 1.4 | 1.4 ± 0.2 | 4.3 ± 0.5 | 0.24 ± 0.08 | 5.4 ± 1.1 |
| CoScale-RL | 40.7 ± 1.6 | 7.6 ± 0.5 | 14.2 ± 0.9 | 1.26 ± 0.18 | 6.8 ± 1.1 |

CoScale-RL delivers an average 3.76× improvement in Pass@1 relative to the pretrained baseline. No comparable gains are observed in ablated variants (e.g., SAPO, GRPO, COMPASS, Scale-RL) at identical compute, indicating the primacy of data/computation scaling over changes in the RL objective or inference-time policy (Chen et al., 21 Jan 2026).

Ablations indicate that for AIME-level problems, aggregating 50 SFT solutions can directly boost Pass@512 from 0 to 1.3%, and subsequent RL reaches 80% accuracy in 50 steps on long-chain reasoning tasks. Data efficiency studies show that, with fixed total examples, few problems with many solutions (the CoScale-RL regime) outperform many problems with only one solution each. Compute ablations across difficulty groups confirm that targeted rollout scaling achieves >90% Pass@16 on all problems under fixed compute (Chen et al., 21 Jan 2026).

6. Implications, Limitations, and Extensions

CoScale-RL's two-fold scaling paradigm—on both SFT diversity per problem and RL rollouts per difficulty group—reorganizes scaling investments for robust post-training. The approach is largely orthogonal to SFT dataset size or RL loss variant, indicating that where scaling effort is invested is itself pivotal.

Identified limitations include:

  • Group partitioning, rollout doubling, and hyperparameter settings are presently manual; a plausible direction is to optimize these via meta-RL or Bayesian bandit strategies.
  • Active problem synthesis (e.g., generating bridging problems to connect unsolvable and solvable regimes) is not yet integrated.
  • The current SDE-based theoretical analysis for computational efficiency presumes small learning rate and Gaussian noise; extensions to large batch regimes or non-Gaussian environments are open.
  • While evaluations focus on math problems, direct generalization is anticipated to other sparse-reward or instance-difficult RL tasks, including code generation and theorem proving (Chen et al., 21 Jan 2026).

7. Comparative Position and Future Directions

CoScale-RL introduces a co-scaling axis for RL with empirically verified gains distinct from prior mechanisms such as adaptive RL losses or alternative inference-time strategies (SAPO, Scale-RL). The result is a substantial broadening of the "solvable" problem regime for LRMs without extensive SFT datasets or excessive compute.

Future work is indicated along the dimensions of automatic group discovery, meta-optimization of scaling schedules, advanced problem-synthesis pipelines, non-Gaussian theoretical analyses, and application to other challenging RL settings where per-instance reward sparsity or difficulty hinders learning progress (Chen et al., 21 Jan 2026).
