
Scaf-GRPO: Scaffolded RL for LLMs

Updated 3 February 2026
  • Scaf-GRPO is a reinforcement learning framework that injects minimal, tiered hints to overcome sparse-reward issues in training large language models.
  • It employs a two-phase, on-policy process where initial unassisted exploration transitions to hierarchical hint-guided exploration to restore learning signals.
  • Empirical evaluations show significant performance gains on benchmarks like AIME24, with ablation studies underscoring the importance of the progressive hint mechanism.

Scaf-GRPO (Scaffolded Group Relative Policy Optimization) is a progressive reinforcement learning framework designed to address the shortcomings of sparse-reward policy optimization algorithms when training LLMs on complex, verifiably-scored reasoning tasks. It systematically injects minimal, tiered in-prompt hints only when independent learning has plateaued, thereby restoring learning signal and enabling models to progress on problem instances previously beyond their autonomous reach (Zhang et al., 22 Oct 2025).

1. Motivation: Overcoming the Learning Cliff in GRPO

The central motivation for Scaf-GRPO is the “learning cliff” phenomenon observed in Group Relative Policy Optimization (GRPO) applied to LLMs. In standard GRPO, for each query $q$, a group of $N$ model rollouts $\mathcal{G} = \{o_1, \dots, o_N\}$ is produced, with each solution string $o_i$ receiving a sparse binary reward $R(o_i) \in \{0, 1\}$ from a verifier. The normalized advantage for each trajectory is then

$$\hat{A}_i = \frac{R(o_i) - \mu_{\mathcal{G}}}{\sigma_{\mathcal{G}} + \varepsilon_{\text{std}}}$$

where $\mu_{\mathcal{G}}$ and $\sigma_{\mathcal{G}}$ are the group mean and standard deviation of rewards. If all $R(o_i) = 0$ (i.e., every attempt fails on a hard problem), then $\mu_{\mathcal{G}} = \sigma_{\mathcal{G}} = 0$ and $\hat{A}_i = 0$ for all trajectories, causing the policy gradient to vanish. In this regime, dubbed the learning cliff, “true-hard” examples become invisible to the learning signal, stalling progress.
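The vanishing-signal failure mode is easy to demonstrate numerically. The following is a minimal sketch (plain Python, not the paper's code) of the group-relative advantage computation:

```python
# Group-relative advantage normalization, illustrating the "learning cliff":
# when every rollout in the group fails, every advantage (and hence the
# policy gradient contribution) is exactly zero.
from statistics import pstdev

def group_advantages(rewards, eps_std=1e-6):
    mu = sum(rewards) / len(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps_std) for r in rewards]

mixed = group_advantages([1, 0, 0, 0])  # some successes: nonzero signal
hard = group_advantages([0, 0, 0, 0])   # all failures: every advantage is 0
```

With a mixed group, the single success receives a positive advantage and the failures negative ones; with an all-zero group, every advantage is zero and the update degenerates.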

Off-policy prefix guidance methods (e.g., seeding the model with gold prefixes) avoid zero rewards but create unwanted distribution shifts, high-variance gradient corrections, and restrict the exploration essential for robust skill acquisition.

2. Formal Framework and Algorithm

Solution generation is formulated as a Markov Decision Process (MDP):

  • State: $s_t = (q, o_{<t})$, the prompt and the tokens generated so far
  • Action: $a_t = o_t$, the next token
  • Policy: $\pi_\theta(a_t \mid s_t)$, parameterized by the LLM
  • Reward: terminal $R(o) \in \{0, 1\}$ from a verifier

Within-group normalization yields:

  • $\mu_{\mathcal{G}} = \frac{1}{N}\sum_{i} R(o_i)$
  • $\sigma_{\mathcal{G}} = \sqrt{\frac{1}{N}\sum_{i}\left(R(o_i) - \mu_{\mathcal{G}}\right)^2}$
  • GRPO objective (with clipping):

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{i,t}\left[\min\left(r_{i,t}\hat{A}_i,\ \text{clip}(r_{i,t}, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\right)\right]$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid o_{i,<t}, q)}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid o_{i,<t}, q)}$.
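The per-token clipped surrogate term can be sketched in a few lines of plain Python (an illustration of the standard PPO-style clipping, not the paper's implementation):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    # min( r * A, clip(r, 1 - eps, 1 + eps) * A ), computed per token.
    # Clipping caps how far a single update can push the policy ratio.
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, ratios above $1+\varepsilon$ are capped (e.g. ratio 1.5 with advantage 1.0 contributes 1.2, not 1.5); for a negative advantage, the `min` keeps the more pessimistic of the two terms.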

3. The Scaffolded Training Process

Scaf-GRPO is implemented as a two-phase, on-policy extension of GRPO:

Phase 1: Guidance Exemption

For the first 15% of training steps, no hints are injected. This phase lets the model resolve queries whose failures stem from formatting or other surface-level deficiencies—a regime dominated by “pseudo-hard” examples. Zero-reward statistics are monitored per batch; queries whose zero-reward count plateaus are flagged as “true-hard.”

Phase 2: Hierarchical Hint-Guided Exploration

A three-tier hint hierarchy $H = \{H_{\text{knowledge}}, H_{\text{planning}}, H_{\text{solution}}\}$ is pre-generated per problem:

  • Knowledge hints: key formula or core concept
  • Planning hints: high-level solution outline
  • Solution hints: concrete calculation step

Each tier is decomposed into four incrementally detailed steps (generated via an offline teacher-model prompt). For each true-hard failure, the algorithm incrementally injects hints into the prompt, starting from the most abstract and progressing to the most detailed. If a rollout with the augmented prompt achieves reward $R(o_{h^*}) > 0$, one random failed trajectory is swapped for $o_{h^*}$, yielding $\mathcal{G}_{\text{final}}$ with nonzero reward variance.
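The abstract-to-detailed search over the hint hierarchy can be sketched as follows. This is a simplified illustration: the function names, prompt template, and the `rollout`/`verifier` callables are hypothetical stand-ins, not the paper's interfaces.

```python
# Sketch of the tiered hint search: try hints from the most abstract tier and
# level to the most detailed, stopping at the first hint whose augmented
# prompt yields a verified success.
TIERS = ["knowledge", "planning", "solution"]

def search_hierarchical_hints(query, hints, rollout, verifier):
    """hints: dict mapping tier -> list of 4 incrementally detailed strings.
    rollout: samples one completion from the current policy for a prompt.
    verifier: returns 1 for a correct solution, 0 otherwise."""
    for tier in TIERS:
        for level, hint in enumerate(hints[tier]):
            prompt = f"{query}\n[Hint ({tier}, level {level + 1})]: {hint}"
            o = rollout(prompt)      # still sampled from the current policy
            if verifier(o) > 0:      # first success ends the search
                return (tier, level), o
    return None, None                # even the most detailed hint failed
```

Because the search stops at the first (most abstract) success, the model is credited for solving the problem with the least assistance it can manage.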

The same on-policy update is then performed:

$$J_{\text{Scaf-GRPO}}(\theta) = \mathbb{E}\left[\min\left(r'_{i,t}\hat{A}'_i,\ \text{clip}(r'_{i,t}, 1-\varepsilon, 1+\varepsilon)\,\hat{A}'_i\right)\right]$$

Notably, all trajectories remain on-policy relative to $\pi_\theta$, as hints only augment the prompt.
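The group-repair step that restores reward variance can be sketched directly (a minimal illustration; the function name and signature are my own):

```python
# Swap one random failed rollout for the hint-guided success. A single
# nonzero reward makes the group's reward variance positive again, so the
# normalized advantages no longer vanish.
import random

def repair_group(group, rewards, hint_rollout, hint_reward):
    fails = [i for i, r in enumerate(rewards) if r == 0]
    j = random.choice(fails)                 # pick one random failure
    group, rewards = list(group), list(rewards)
    group[j], rewards[j] = hint_rollout, hint_reward
    return group, rewards

g_final, r_final = repair_group(["o1", "o2", "o3", "o4"], [0, 0, 0, 0], "o_h", 1)
```

After the swap, the group contains one success and three failures, which is exactly the mixed-reward regime in which the group-relative advantage is informative.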

Algorithmic Outline:

for each training step t:
    sample batch of queries {q}
    for each q:
        roll out N trajectories 𝒢 under π_θ(·|q)
        if any R(o) > 0 or t < T_exempt:
            𝒢_final ← 𝒢
        else:
            (h*, o_h*) ← SearchHierarchicalHints(q, π_θ)
            if o_h* found:
                𝒢_final ← 𝒢 with one random failed trajectory replaced by o_h*
            else:
                𝒢_final ← 𝒢
    compute normalized advantages Â_i on 𝒢_final
    update θ by maximizing the clipped surrogate on 𝒢_final

4. Key Hyperparameters and Implementation

Primary hyperparameters and settings include:

  • Guidance exemption fraction $T_{\text{exempt}} = 15\%$ of total training steps
  • Rollouts per query $N = 8$
  • GRPO clip $\varepsilon = 0.2$; KL penalty $\beta = 0.0$ (favours exploration)
  • Hint hierarchy: Three tiers, four levels per tier, incrementally injected
  • Dataset filtering: From DeepScaleR-Preview (40K math problems), “too easy” dropped, “potentially solvable” subsampled, “too hard” retained
  • Model families: Qwen2.5-Math-1.5B/7B, Qwen2.5-7B, Llama-3.2-3B, DeepSeek-R1-Distill-Qwen-1.5B LongCoT
  • Training: 10 epochs, AdamW learning rate $1 \times 10^{-6}$, global batch size 256, PPO mini-batch 64, max generation length 2048 (8192 for LongCoT)
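The reported settings can be collected into a single configuration mapping for reference (the key names are my own shorthand; the values are the ones listed above):

```python
# Reported Scaf-GRPO training settings, gathered as a flat config dict.
SCAF_GRPO_CONFIG = {
    "guidance_exemption_frac": 0.15,  # no hints for first 15% of steps
    "rollouts_per_query": 8,          # N
    "clip_eps": 0.2,                  # GRPO clip epsilon
    "kl_beta": 0.0,                   # KL penalty off, to favour exploration
    "hint_tiers": 3,                  # knowledge / planning / solution
    "levels_per_tier": 4,             # incrementally detailed steps
    "epochs": 10,
    "lr": 1e-6,                       # AdamW
    "global_batch_size": 256,
    "ppo_mini_batch": 64,
    "max_gen_len": 2048,              # 8192 for LongCoT models
}
```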

5. Effects and Mechanisms of Scaffolded Guidance

Minimal, progressive scaffolding enables the following properties:

  • Restoration of Reward Variance: Inserting a single successful trajectory into the group disrupts reward degeneracy, restoring a usable learning signal even on true-hard queries.
  • On-Policy Integrity: Since all augmented runs are sampled from the current policy with controlled prompts, there is no requirement for high-variance off-policy corrections.
  • Skill Internalization: Forcing the model to succeed with the most abstract hint possible fosters true generalization, rather than rote memorization.
  • Unhampered Exploration: Hints serve as “signposts” rather than “rails,” allowing the model to deviate and discover novel solution paths.

A plausible implication is that this approach can systematically elevate LLM performance ceilings on tasks that otherwise yield persistent zero rewards under standard on-policy RL protocols.

6. Empirical Evaluation and Results

Performance was benchmarked on seven challenging math tasks (pass@1, greedy decoding), notably on the AIME24 benchmark and several out-of-distribution settings. Key quantitative results for Qwen2.5-Math-7B:

| Method | AIME24 | Avg. over 7 benchmarks |
|---|---|---|
| Vanilla GRPO | 30.0 | 45.2 |
| SimpleRL-Zero | 23.3 | 42.6 |
| Oat-Zero | 30.0 | 46.5 |
| LUFFY (prefix) | 33.3 | 46.6 |
| Scaf-GRPO | 43.3 (+44.3%) | 50.9 (+12.6%) |

Additional highlights:

  • Qwen2.5-1.5B on AIME24: $7.2 \rightarrow 20.0$ (+177.8% relative)
  • Llama-3.2-3B average: $26.1 \rightarrow 28.8$ (+10.3% relative)
  • DeepSeek-R1 LongCoT: $50.6 \rightarrow 53.6$ (+5.9% relative)
  • Out-of-distribution GPQA-Diamond: vanilla GRPO $= 32.3\%$, Scaf-GRPO $= 37.3\%$

Ablation studies reveal:

  • No guidance exemption reduces performance by $-9.2\%$ (relative)
  • Solution-only hints (no progressive hierarchy): $-4.9\%$
  • Removing any hint tier: up to $-5.7\%$
  • No incremental hint chunking (all hints at once): $-6.3\%$

7. Limitations and Extensions

Scaf-GRPO is model-agnostic, applicable to various LLM architectures, sizes, and reasoning styles, and fosters stable skill acquisition. Nonetheless, it requires:

  • Offline (teacher model) hint generation
  • Reward functions that are sparse but verifiable

Potential extensions include:

  • Automated or online hint synthesis
  • Adapting scaffolding schedules dynamically
  • Application to program synthesis, logic puzzles, and other sequential-decision LLM domains subject to sparse reward structures

These directions suggest a broad future applicability for the underlying pedagogy-inspired principle of incrementally restoring learning variance with minimal, on-policy interventions (Zhang et al., 22 Oct 2025).
