CoScale-RL: Scalable RL Methodology
- CoScale-RL is a reinforcement learning scaling methodology that synchronously augments data and computation to extend the ability boundary of large reasoning models.
- The framework dynamically increases correct solution traces per problem to ensure nonzero reward signals and stabilize learning in sparse-reward environments.
- By employing per-group rollout allocation and re-distillation, CoScale-RL achieves notable Pass@1 improvements across challenging benchmarks.
CoScale-RL is a reinforcement learning (RL) scaling methodology designed to improve the post-training of Large Reasoning Models (LRMs) by synchronously scaling data (solutions per problem) and computation (rollout allocation), with demonstrated gains in both data and computational efficiency for hard or “unsolvable” problems. The framework is distinguished by its approach to overcoming the ability boundary imposed by traditional supervised fine-tuning (SFT), and by stabilizing RL procedures that would otherwise fail due to sparse-reward or low-probability signal regimes (Chen et al., 21 Jan 2026).
1. Motivation and Problem Context
The training of LRMs for reasoning tasks typically combines SFT and RL, but two principal limitations persist:
- Ability Boundary: LRMs are only reliably proficient within their "fundamental ability boundary," signified by the set of problems where SFT Pass@1 is strictly positive. Harder problems with near-zero initial solvability remain inaccessible to RL or SFT alone, as SFT on a small corpus may even degrade subsequent RL efficacy.
- RL Instability for Hard/Weak Models: For tasks with tiny success probability (especially on weakly initialized models), reward signals during RL rollouts are almost always zero, causing policy gradients to vanish and requiring impractically large rollout sizes to yield any learning signal.
CoScale-RL addresses these by a co-scaling strategy: (i) scaling the number of correct solutions per problem in SFT so that each instance attains a strictly positive empirical Pass@1, and (ii) dynamically scaling the RL rollout count per problem group according to empirical difficulty. This two-axis scaling transitions all problem instances into RL-amenable regimes and stabilizes training (Chen et al., 21 Jan 2026).
2. Data Co-Scaling: Dynamic Solution Augmentation
Rather than expanding the dataset by collecting a single solution per new problem, CoScale-RL augments each hard or unsolvable problem with additional correct solution traces up to a desired threshold. Denoting by $n_i^{(t)}$ the number of SFT solutions for problem $i$ after iteration $t$, the effective dataset size is $N^{(t)} = \sum_i n_i^{(t)}$.
In the canonical uniform-augmentation regime, the solution count $n_i$ for each problem $i$ in the unsolvable set $\mathcal{U}$ is increased by a fixed increment $\Delta$ per iteration, and augmentation continues until the problem's empirical Pass@1 meets or exceeds a preset threshold. The dataset thus grows sublinearly with each iteration, concentrating scaling efforts on the hardest instances. This ensures that every problem introduced to RL has a nonzero empirical success rate, avoiding wasted computation on zero-gradient cases (Chen et al., 21 Jan 2026).
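A minimal sketch of this augmentation loop (the solvability model and all parameter names below are illustrative stand-ins, not the paper's implementation):

```python
import random

random.seed(0)

# Toy sketch of CoScale-RL-style data co-scaling. The solvability model
# (more SFT traces -> higher solve probability) is an illustrative
# stand-in; DELTA, THRESHOLD, and ROLLOUTS are assumed parameters.
DELTA = 5          # extra correct solutions added per iteration
THRESHOLD = 0.1    # target empirical Pass@1 before a problem enters RL
ROLLOUTS = 64      # samples used to estimate Pass@1

def empirical_pass1(solve_prob, n=ROLLOUTS):
    # Monte Carlo estimate of Pass@1 from n sampled attempts.
    return sum(random.random() < solve_prob for _ in range(n)) / n

def solve_prob_from_traces(n_traces):
    # Stand-in: solvability rises with accumulated solution traces.
    return min(0.9, 0.02 * n_traces)

def co_scale_data(problems, max_iters=20):
    traces = {p: 1 for p in problems}      # start with one solution each
    unsolved = set(problems)
    for _ in range(max_iters):
        if not unsolved:
            break
        for p in list(unsolved):
            if empirical_pass1(solve_prob_from_traces(traces[p])) >= THRESHOLD:
                unsolved.discard(p)        # now RL-amenable; stop augmenting
            else:
                traces[p] += DELTA         # keep augmenting the hard cases
    return traces, unsolved

traces, unsolved = co_scale_data(["p1", "p2", "p3"])
```

Augmentation effort concentrates automatically on whichever problems remain below the threshold, mirroring the sublinear dataset growth described above.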
3. Computation Co-Scaling: Per-Group Rollout Allocation
Within each RL phase, only problems in the "solvable" set (those meeting the Pass@1 criterion) are considered. These are sorted by estimated baseline accuracy and partitioned into difficulty groups, each allocated its own rollout budget $G_g$. Initially, easier groups are assigned smaller $G_g$, and harder ones receive larger values. Should learning plateau (as indicated by groupwise reward metrics), $G_g$ for that group is doubled.
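The allocation and doubling rules can be sketched as follows (the function names, the plateau window, and the geometric initial schedule are assumptions for illustration, not the paper's exact schedule):

```python
# Sketch of per-group rollout allocation: groups sorted by baseline
# accuracy receive budgets that grow with difficulty, and a group's
# budget is doubled when its recent rewards stop improving.

def initial_budgets(group_accuracies, base=8):
    # Harder groups (lower baseline accuracy) receive larger budgets;
    # the geometric schedule here is an illustrative choice.
    ranked = sorted(range(len(group_accuracies)),
                    key=lambda g: -group_accuracies[g])
    budgets = [0] * len(group_accuracies)
    for rank, g in enumerate(ranked):
        budgets[g] = base * (2 ** rank)
    return budgets

def update_budget(budget, reward_history, window=3, eps=1e-3):
    # Double the rollout budget when the last `window` rewards plateau.
    if len(reward_history) >= window:
        recent = reward_history[-window:]
        if max(recent) - min(recent) < eps:
            return budget * 2
    return budget

budgets = initial_budgets([0.8, 0.5, 0.1])  # easy, medium, hard groups
```

With these stand-in parameters the easiest group starts at 8 rollouts and the hardest at 32, and any stalled group has its budget doubled on the next phase.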
Theoretical justification is provided in terms of quadratic compute efficiency: for learning rate $\eta$ and rollout count $G$, a compute-efficiency metric, quadratic in the policy update, is maximized by tuning the ratio $\eta/G$ near a task-specific optimum, keeping gradient variance and policy drift minimal.
For problem instances with low per-rollout success probability $p$, a group of $G$ rollouts contains at least one correct trace, and hence exhibits nonzero reward variance, with probability $1-(1-p)^G$, which increases with $G$, directly motivating dynamic scaling of rollout counts for hard subsets. A plausible implication is that this targeting improves both exploration and sample efficiency under sparse reward conditions (Chen et al., 21 Jan 2026).
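The sparse-reward argument can be checked numerically: with per-rollout success probability p, the chance that a batch of G rollouts contains at least one correct trace grows as 1 - (1 - p)^G (a small self-contained check, not code from the paper):

```python
# Numeric check (illustrative): with per-rollout success probability p,
# the chance a batch of G rollouts contains at least one correct trace,
# a prerequisite for any nonzero reward variance in the group, is
# 1 - (1 - p)**G, which rises toward 1 as G grows.

def signal_prob(p, G):
    return 1 - (1 - p) ** G

p = 0.01  # a hard problem: 1% success per rollout
probs = {G: signal_prob(p, G) for G in (8, 64, 512)}
# small G leaves the batch almost always reward-free; large G nearly
# guarantees at least one learning signal
```

At p = 0.01, a batch of 8 rollouts carries a signal under 8% of the time, while 512 rollouts almost always do, which is exactly the regime where doubling the budget pays off.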
4. Model Merging via Re-distillation
Because problem groups may be trained with distinct rollout budgets or even separate RL runs, coalescing groupwise progress into a unified model is necessary. CoScale-RL employs a "Re-distillation" technique:
- For each problem in the solvable set and newly discovered problems, aggregate recent correct RL trajectories into a re-distillation dataset $\mathcal{D}_{\text{re}}$.
- Fine-tune the base policy (or the previous merged checkpoint) using $\mathcal{D}_{\text{re}}$ as an SFT dataset.

In formal terms, with $\pi_{\theta_g}$ the per-group RL policy whose correct trajectories populate $\mathcal{D}_{\text{re}}$, the re-distillation objective is the standard negative log-likelihood

$$\mathcal{L}_{\text{re}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{re}}}\big[\log \pi_\theta(y \mid x)\big],$$

which compresses group-level RL improvement into an updated SFT policy, unifying updates while mitigating catastrophic forgetting and facilitating deployment as a single model (Chen et al., 21 Jan 2026).
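Under these definitions, the re-distillation step reduces to ordinary maximum-likelihood SFT on the aggregated correct traces; a toy sketch with a stand-in token-level policy (not the paper's training code):

```python
import math

# Illustrative re-distillation objective: the merged model is fit by
# ordinary SFT, i.e. mean negative log-likelihood of correct RL traces.
# The dict-based "policy" (token -> probability) is a toy stand-in for
# a real LRM's per-token log-probabilities.

def nll_loss(policy, dataset):
    total, count = 0.0, 0
    for trace in dataset:          # each trace is a correct RL trajectory
        for tok in trace:
            total += -math.log(policy[tok])
            count += 1
    return total / count

d_re = [["a", "b"], ["a", "c"]]                 # aggregated correct traces
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # untrained student
fitted = {"a": 0.5, "b": 0.25, "c": 0.25}       # matches trace statistics
# the policy matching the trace statistics attains the lower NLL
```

Minimizing this loss pulls a single student checkpoint toward the behavior of all per-group RL policies at once, which is what makes the merge deployable as one model.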
5. Empirical Results and Ablations
Evaluations are performed on four math-intensive benchmarks (MATH-500, AMC12, OpenMathReasoning, OlympiadMATH) and a general reasoning benchmark (Reasoning GYM), primarily using Qwen2.5-0.5B as a base.
Main Results (Pass@1, mean ± 95% CI):
| Method | MATH-500 | AMC12 | OMR | OlyMATH | GYM |
|---|---|---|---|---|---|
| Pretrained | 24.7±1.4 | 2.1±0.3 | 2.6±0.4 | 0.29±0.09 | 6.0±1.1 |
| Pretrain+RL | 20.8±0.5 | 2.7±0.3 | 3.5±0.5 | 0.45±0.11 | 8.6±1.3 |
| SFT (data-scale) | 23.2±1.4 | 1.4±0.2 | 4.3±0.5 | 0.24±0.08 | 5.4±1.1 |
| CoScale-RL | 40.7±1.6 | 7.6±0.5 | 14.2±0.9 | 1.26±0.18 | 6.8±1.1 |
CoScale-RL delivers an average 3.76× improvement in Pass@1 relative to the pretrained baseline. No comparable gains are observed in ablated variants (e.g., SAPO, GRPO, COMPASS, Scale-RL) at identical compute, indicating the primacy of data/computation co-scaling over changes in RL objective or inference-time policy (Chen et al., 21 Jan 2026).
Ablations indicate that for AIME-level problems, aggregating 50 SFT solutions can directly boost Pass@512 from 0 to 1.3%, and subsequent RL reaches 80% accuracy in 50 steps on long-chain reasoning tasks. Data efficiency studies show that, with fixed total examples, few problems with many solutions (the CoScale-RL regime) outperform many problems with only one solution each. Compute ablations across difficulty groups confirm that targeted rollout scaling achieves >90% Pass@16 on all problems under fixed compute (Chen et al., 21 Jan 2026).
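The Pass@k figures quoted here are conventionally estimated from n sampled completions per problem, of which c are correct, via the standard unbiased estimator 1 - C(n-c, k)/C(n, k); a short sketch (not from the paper):

```python
from math import comb

# Standard unbiased Pass@k estimator: the probability that at least one
# of k completions drawn without replacement from n samples (c of them
# correct) is correct.
def pass_at_k(n, c, k):
    if n - c < k:          # every size-k subset must contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(16, 4, 1))   # equals c/n = 0.25
```

For k = 1 the estimator reduces to the plain success rate c/n, matching the Pass@1 columns in the table above.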
6. Implications, Limitations, and Extensions
CoScale-RL's twofold scaling paradigm (SFT solution diversity per problem, and RL rollouts per difficulty group) reorganizes scaling investments for robust post-training. The approach is largely orthogonal to SFT dataset size and RL loss variant, highlighting that where scaling effort is invested is the pivotal choice.
Identified limitations include:
- Group partitioning, rollout doubling, and hyperparameter settings are presently manual; a plausible direction is to optimize these via meta-RL or Bayesian bandit strategies.
- Active problem synthesis (e.g., generating bridging problems to connect unsolvable and solvable regimes) is not yet integrated.
- The current SDE-based theoretical analysis for computational efficiency presumes small learning rate and Gaussian noise; extensions to large batch regimes or non-Gaussian environments are open.
- While evaluations focus on math problems, direct generalization is anticipated to other sparse-reward or instance-difficult RL tasks, including code generation and theorem proving (Chen et al., 21 Jan 2026).
7. Comparative Position and Future Directions
CoScale-RL introduces a co-scaling axis for RL with empirically verified gains distinct from prior mechanisms such as adaptive RL losses or alternative inference-time strategies (SAPO, Scale-RL). The result is a substantial broadening of the "solvable" problem regime for LRMs without extensive SFT datasets or excessive compute.
Future work is indicated along the dimensions of automatic group discovery, meta-optimization of scaling schedules, advanced problem-synthesis pipelines, non-Gaussian theoretical analyses, and application to other challenging RL settings where per-instance reward sparsity or difficulty hinders learning progress (Chen et al., 21 Jan 2026).