CoScale-RL: Scalable RL Methodology
- CoScale-RL is a reinforcement learning scaling methodology that synchronously augments data and computation to extend the ability boundary of large reasoning models.
- The framework dynamically increases correct solution traces per problem to ensure nonzero reward signals and stabilize learning in sparse-reward environments.
- By employing per-group rollout allocation and re-distillation, CoScale-RL achieves notable Pass@1 improvements across challenging benchmarks.
CoScale-RL is a reinforcement learning (RL) scaling methodology designed to improve the post-training of Large Reasoning Models (LRMs) by synchronously scaling data (solutions per problem) and computation (rollout allocation), with demonstrated gains in both data and computational efficiency for hard or “unsolvable” problems. The framework is distinguished by its approach to overcoming the ability boundary imposed by traditional supervised fine-tuning (SFT), and by stabilizing RL procedures that would otherwise fail due to sparse-reward or low-probability signal regimes (Chen et al., 21 Jan 2026).
1. Motivation and Problem Context
The training of LRMs for reasoning tasks typically combines SFT and RL, but two principal limitations persist:
- Ability Boundary: LRMs are only reliably proficient within their "fundamental ability boundary," signified by the set of problems where SFT Pass@1 is strictly positive. Harder problems with near-zero initial solvability remain inaccessible to RL or SFT alone, as SFT on a small corpus may even degrade subsequent RL efficacy.
- RL Instability for Hard/Weak Models: For tasks with tiny success probability (especially on weakly initialized models), reward signals during RL rollouts are almost always zero, causing policy gradients to vanish and requiring impractically large rollout sizes to yield any learning signal.
CoScale-RL addresses these by a co-scaling strategy: (i) scaling the number of correct solutions per problem in SFT so that each instance attains a strictly positive empirical Pass@1, and (ii) dynamically scaling the RL rollout count per problem group according to empirical difficulty. This two-axis scaling transitions all problem instances into RL-amenable regimes and stabilizes training (Chen et al., 21 Jan 2026).
2. Data Co-Scaling: Dynamic Solution Augmentation
Rather than expanding the dataset by collecting a single solution per new problem, CoScale-RL augments each hard or unsolvable problem with additional correct solution traces up to a desired threshold. Denoting by $n_i^{(t)}$ the number of SFT solutions for problem $i$ after iteration $t$, the effective dataset size is $N^{(t)} = \sum_i n_i^{(t)}$.
In the canonical uniform-augmentation regime, the solution count $n_i$ for each problem $i$ in the unsolvable set $\mathcal{U}$ is increased by a fixed increment $\Delta$ per iteration, and augmentation continues until the problem's empirical Pass@1 meets or exceeds a preset threshold. The dataset thus grows sublinearly with each iteration, concentrating scaling efforts on the hardest instances. This ensures that every problem introduced to RL has a nonzero empirical success rate, avoiding wasted computation on zero-gradient cases (Chen et al., 21 Jan 2026).
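A minimal sketch of this augmentation loop (the solvability model and all parameter names below are illustrative stand-ins, not the paper's implementation):

```python
import random

random.seed(0)

# Toy sketch of CoScale-RL-style data co-scaling. The solvability model
# (more SFT traces -> higher solve probability) is an illustrative
# stand-in; DELTA, THRESHOLD, and ROLLOUTS are assumed parameters.
DELTA = 5          # extra correct solutions added per iteration
THRESHOLD = 0.1    # target empirical Pass@1 before a problem enters RL
ROLLOUTS = 64      # samples used to estimate Pass@1

def empirical_pass1(solve_prob, n=ROLLOUTS):
    # Monte Carlo estimate of Pass@1 from n sampled attempts.
    return sum(random.random() < solve_prob for _ in range(n)) / n

def solve_prob_from_traces(n_traces):
    # Stand-in: solvability rises with accumulated solution traces.
    return min(0.9, 0.02 * n_traces)

def co_scale_data(problems, max_iters=20):
    traces = {p: 1 for p in problems}      # start with one solution each
    unsolved = set(problems)
    for _ in range(max_iters):
        if not unsolved:
            break
        for p in list(unsolved):
            if empirical_pass1(solve_prob_from_traces(traces[p])) >= THRESHOLD:
                unsolved.discard(p)        # now RL-amenable; stop augmenting
            else:
                traces[p] += DELTA         # keep augmenting the hard cases
    return traces, unsolved

traces, unsolved = co_scale_data(["p1", "p2", "p3"])
```

Augmentation effort concentrates automatically on whichever problems remain below the threshold, mirroring the sublinear dataset growth described above.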
3. Computation Co-Scaling: Per-Group Rollout Allocation
Within each RL phase, only problems in the "solvable" set (those meeting the Pass@1 criterion) are considered. These are sorted by estimated baseline accuracy and partitioned into difficulty groups, each allocated its own rollout budget $G_g$. Initially, easier groups are assigned smaller $G_g$, and harder ones receive larger values. Should learning plateau (as indicated by groupwise reward metrics), $G_g$ for that group is doubled.
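The allocation and doubling rules can be sketched as follows (the function names, the plateau window, and the geometric initial schedule are assumptions for illustration, not the paper's exact schedule):

```python
# Sketch of per-group rollout allocation: groups sorted by baseline
# accuracy receive budgets that grow with difficulty, and a group's
# budget is doubled when its recent rewards stop improving.

def initial_budgets(group_accuracies, base=8):
    # Harder groups (lower baseline accuracy) receive larger budgets;
    # the geometric schedule here is an illustrative choice.
    ranked = sorted(range(len(group_accuracies)),
                    key=lambda g: -group_accuracies[g])
    budgets = [0] * len(group_accuracies)
    for rank, g in enumerate(ranked):
        budgets[g] = base * (2 ** rank)
    return budgets

def update_budget(budget, reward_history, window=3, eps=1e-3):
    # Double the rollout budget when the last `window` rewards plateau.
    if len(reward_history) >= window:
        recent = reward_history[-window:]
        if max(recent) - min(recent) < eps:
            return budget * 2
    return budget

budgets = initial_budgets([0.8, 0.5, 0.1])  # easy, medium, hard groups
```

With these stand-in parameters the easiest group starts at 8 rollouts and the hardest at 32, and any stalled group has its budget doubled on the next phase.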
Theoretical justification is provided in terms of quadratic compute efficiency: for learning rate $\eta$ and rollout count $G$, a compute-efficiency metric, quadratic in the policy update, is maximized by tuning the ratio $\eta/G$ near a task-specific optimum, keeping gradient variance and policy drift minimal.
For problem instances with low per-rollout success probability $p$, a group of $G$ rollouts contains at least one correct trace, and hence exhibits nonzero reward variance, with probability $1-(1-p)^G$, which increases with $G$, directly motivating dynamic scaling of rollout counts for hard subsets. A plausible implication is that this targeting improves both exploration and sample efficiency under sparse reward conditions (Chen et al., 21 Jan 2026).
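The sparse-reward argument can be checked numerically: with per-rollout success probability p, the chance that a batch of G rollouts contains at least one correct trace grows as 1 - (1 - p)^G (a small self-contained check, not code from the paper):

```python
# Numeric check (illustrative): with per-rollout success probability p,
# the chance a batch of G rollouts contains at least one correct trace,
# a prerequisite for any nonzero reward variance in the group, is
# 1 - (1 - p)**G, which rises toward 1 as G grows.

def signal_prob(p, G):
    return 1 - (1 - p) ** G

p = 0.01  # a hard problem: 1% success per rollout
probs = {G: signal_prob(p, G) for G in (8, 64, 512)}
# small G leaves the batch almost always reward-free; large G nearly
# guarantees at least one learning signal
```

At p = 0.01, a batch of 8 rollouts carries a signal under 8% of the time, while 512 rollouts almost always do, which is exactly the regime where doubling the budget pays off.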
4. Model Merging via Re-distillation
Because problem groups may be trained with distinct rollout budgets or even separate RL runs, coalescing groupwise progress into a unified model is necessary. CoScale-RL employs a "Re-distillation" technique:
- For each problem in the solvable set and newly discovered problems, aggregate recent correct RL trajectories into a re-distillation dataset $\mathcal{D}_{\text{re}}$.
- Fine-tune the base policy (or the previous merged checkpoint) using $\mathcal{D}_{\text{re}}$ as an SFT dataset.

In formal terms, with $\pi_{\theta_g}$ the per-group RL policy whose correct trajectories populate $\mathcal{D}_{\text{re}}$, the re-distillation objective is the standard negative log-likelihood

$$\mathcal{L}_{\text{re}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{re}}}\big[\log \pi_\theta(y \mid x)\big],$$

which compresses group-level RL improvement into an updated SFT policy, unifying updates while mitigating catastrophic forgetting and facilitating deployment as a single model (Chen et al., 21 Jan 2026).
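Under these definitions, the re-distillation step reduces to ordinary maximum-likelihood SFT on the aggregated correct traces; a toy sketch with a stand-in token-level policy (not the paper's training code):

```python
import math

# Illustrative re-distillation objective: the merged model is fit by
# ordinary SFT, i.e. mean negative log-likelihood of correct RL traces.
# The dict-based "policy" (token -> probability) is a toy stand-in for
# a real LRM's per-token log-probabilities.

def nll_loss(policy, dataset):
    total, count = 0.0, 0
    for trace in dataset:          # each trace is a correct RL trajectory
        for tok in trace:
            total += -math.log(policy[tok])
            count += 1
    return total / count

d_re = [["a", "b"], ["a", "c"]]                 # aggregated correct traces
uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # untrained student
fitted = {"a": 0.5, "b": 0.25, "c": 0.25}       # matches trace statistics
# the policy matching the trace statistics attains the lower NLL
```

Minimizing this loss pulls a single student checkpoint toward the behavior of all per-group RL policies at once, which is what makes the merge deployable as one model.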
5. Empirical Results and Ablations
Evaluations are performed on four math-intensive benchmarks (MATH-500, AMC12, OpenMathReasoning, OlympiadMATH) and a general reasoning benchmark (Reasoning GYM), primarily using Qwen2.5-0.5B as a base.
Main Results (Pass@1, mean ± 95% CI):
| Method | MATH-500 | AMC12 | OMR | OlyMATH | GYM |
|---|---|---|---|---|---|
| Pretrained | 24.7±1.4 | 2.1±0.3 | 2.6±0.4 | 0.29±0.09 | 6.0±1.1 |
| Pretrain+RL | 20.8±0.5 | 2.7±0.3 | 3.5±0.5 | 0.45±0.11 | 8.6±1.3 |
| SFT (data-scale) | 23.2±1.4 | 1.4±0.2 | 4.3±0.5 | 0.24±0.08 | 5.4±1.1 |
| CoScale-RL | 40.7±1.6 | 7.6±0.5 | 14.2±0.9 | 1.26±0.18 | 6.8±1.1 |
CoScale-RL delivers an average 3.76× improvement in Pass@1 relative to the pretrained baseline. No comparable gains are observed in ablated variants (e.g., SAPO, GRPO, COMPASS, Scale-RL) at identical compute, indicating the primacy of data/computation co-scaling over changes in RL objective or inference-time policy (Chen et al., 21 Jan 2026).
Ablations indicate that for AIME-level problems, aggregating 50 SFT solutions can directly boost Pass@512 from 0 to 1.3%, and subsequent RL reaches 80% accuracy in 50 steps on long-chain reasoning tasks. Data efficiency studies show that, with fixed total examples, few problems with many solutions (the CoScale-RL regime) outperform many problems with only one solution each. Compute ablations across difficulty groups confirm that targeted rollout scaling achieves >90% Pass@16 on all problems under fixed compute (Chen et al., 21 Jan 2026).
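The Pass@k figures quoted here are conventionally estimated from n sampled completions per problem, of which c are correct, via the standard unbiased estimator 1 - C(n-c, k)/C(n, k); a short sketch (not from the paper):

```python
from math import comb

# Standard unbiased Pass@k estimator: the probability that at least one
# of k completions drawn without replacement from n samples (c of them
# correct) is correct.
def pass_at_k(n, c, k):
    if n - c < k:          # every size-k subset must contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(16, 4, 1))   # equals c/n = 0.25
```

For k = 1 the estimator reduces to the plain success rate c/n, matching the Pass@1 columns in the table above.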
6. Implications, Limitations, and Extensions
CoScale-RL's twofold scaling paradigm (SFT solution diversity per problem, and RL rollouts per difficulty group) reorganizes scaling investments for robust post-training. The approach is largely orthogonal to SFT dataset size and RL loss variant, highlighting that where scaling effort is invested is the pivotal choice.
Identified limitations include:
- Group partitioning, rollout doubling, and hyperparameter settings are presently manual; a plausible direction is to optimize these via meta-RL or Bayesian bandit strategies.
- Active problem synthesis (e.g., generating bridging problems to connect unsolvable and solvable regimes) is not yet integrated.
- The current SDE-based theoretical analysis for computational efficiency presumes small learning rate and Gaussian noise; extensions to large batch regimes or non-Gaussian environments are open.
- While evaluations focus on math problems, direct generalization is anticipated to other sparse-reward or instance-difficult RL tasks, including code generation and theorem proving (Chen et al., 21 Jan 2026).
7. Comparative Position and Future Directions
CoScale-RL introduces a co-scaling axis for RL with empirically verified gains distinct from prior mechanisms such as adaptive RL losses or alternative inference-time strategies (SAPO, Scale-RL). The result is a substantial broadening of the "solvable" problem regime for LRMs without extensive SFT datasets or excessive compute.
Future work is indicated along the dimensions of automatic group discovery, meta-optimization of scaling schedules, advanced problem-synthesis pipelines, non-Gaussian theoretical analyses, and application to other challenging RL settings where per-instance reward sparsity or difficulty hinders learning progress (Chen et al., 21 Jan 2026).