
Scaf-GRPO: Scaffolded RL for LLMs

Updated 3 February 2026
  • Scaf-GRPO is a reinforcement learning framework that injects minimal, tiered hints to overcome sparse-reward issues in training large language models.
  • It employs a two-phase, on-policy process where initial unassisted exploration transitions to hierarchical hint-guided exploration to restore learning signals.
  • Empirical evaluations show significant performance gains on benchmarks like AIME24, with ablation studies underscoring the importance of the progressive hint mechanism.

Scaf-GRPO (Scaffolded Group Relative Policy Optimization) is a progressive reinforcement learning framework designed to address the shortcomings of sparse-reward policy optimization algorithms when training LLMs on complex, verifiably-scored reasoning tasks. It systematically injects minimal, tiered in-prompt hints only when independent learning has plateaued, thereby restoring learning signal and enabling models to progress on problem instances previously beyond their autonomous reach (Zhang et al., 22 Oct 2025).

1. Motivation: Overcoming the Learning Cliff in GRPO

The central motivation for Scaf-GRPO is the “learning cliff” phenomenon observed in Group Relative Policy Optimization (GRPO) applied to LLMs. In standard GRPO, for each query $q$, a group of $N$ model rollouts $\mathcal{G} = \{o_1, \dots, o_N\}$ is produced, with each solution string $o_i$ receiving a sparse binary reward $R(o_i) \in \{0, 1\}$ from a verifier. The normalized advantage for each trajectory is then

$$\hat{A}_i = \frac{R(o_i) - \mu_{\mathcal{G}}}{\sigma_{\mathcal{G}} + \varepsilon_{\text{std}}}$$

where $\mu_{\mathcal{G}}$ and $\sigma_{\mathcal{G}}$ are the group mean and standard deviation of rewards. If all $R(o_i) = 0$ (i.e., every attempt fails on a hard problem), then $\mu_{\mathcal{G}} = \sigma_{\mathcal{G}} = 0$ and $\hat{A}_i = 0$ for all trajectories, causing the policy gradient to vanish. In this regime, dubbed the learning cliff, “true-hard” examples become invisible to the learning signal, stalling progress.
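The vanishing-signal failure mode is easy to demonstrate numerically. The following is a minimal sketch (plain Python, not the paper's code) of the group-relative advantage computation:

```python
# Group-relative advantage normalization, illustrating the "learning cliff":
# when every rollout in the group fails, every advantage (and hence the
# policy gradient contribution) is exactly zero.
from statistics import pstdev

def group_advantages(rewards, eps_std=1e-6):
    mu = sum(rewards) / len(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps_std) for r in rewards]

mixed = group_advantages([1, 0, 0, 0])  # some successes: nonzero signal
hard = group_advantages([0, 0, 0, 0])   # all failures: every advantage is 0
```

With a mixed group, the single success receives a positive advantage and the failures negative ones; with an all-zero group, every advantage is zero and the update degenerates.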

Off-policy prefix guidance methods (e.g., seeding the model with gold prefixes) avoid zero rewards but create unwanted distribution shifts, high-variance gradient corrections, and restrict the exploration essential for robust skill acquisition.

2. Formal Framework and Algorithm

Solution generation is formulated as a Markov Decision Process (MDP):

  • State: $s_t = (q, o_{<t})$, the prompt and the tokens generated so far
  • Action: $a_t = o_t$, the next token
  • Policy: $\pi_\theta(a_t \mid s_t)$, parameterized by the LLM
  • Reward: terminal $R(o) \in \{0, 1\}$ from a verifier

Within-group normalization yields:

  • $\mu_{\mathcal{G}} = \frac{1}{N}\sum_{i} R(o_i)$
  • $\sigma_{\mathcal{G}} = \sqrt{\frac{1}{N}\sum_{i}\left(R(o_i) - \mu_{\mathcal{G}}\right)^2}$
  • GRPO objective (with clipping):

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{i,t}\left[\min\left(r_{i,t}\hat{A}_i,\ \text{clip}(r_{i,t}, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\right)\right]$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid o_{i,<t}, q)}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid o_{i,<t}, q)}$.
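The per-token clipped surrogate term can be sketched in a few lines of plain Python (an illustration of the standard PPO-style clipping, not the paper's implementation):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    # min( r * A, clip(r, 1 - eps, 1 + eps) * A ), computed per token.
    # Clipping caps how far a single update can push the policy ratio.
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a positive advantage, ratios above $1+\varepsilon$ are capped (e.g. ratio 1.5 with advantage 1.0 contributes 1.2, not 1.5); for a negative advantage, the `min` keeps the more pessimistic of the two terms.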

3. The Scaffolded Training Process

Scaf-GRPO is implemented as a two-phase, on-policy extension of GRPO:

Phase 1: Guidance Exemption

For the first 15% of training steps, no hints are injected. This phase lets the model resolve queries whose failures stem from formatting or other surface-level deficiencies—a regime dominated by “pseudo-hard” examples. Zero-reward statistics are monitored per batch; queries whose zero-reward count plateaus are flagged as “true-hard.”

Phase 2: Hierarchical Hint-Guided Exploration

A three-tier hint hierarchy $H = \{H_{\text{knowledge}}, H_{\text{planning}}, H_{\text{solution}}\}$ is pre-generated per problem:

  • Knowledge hints: key formula or core concept
  • Planning hints: high-level solution outline
  • Solution hints: concrete calculation step

Each tier is decomposed into four incrementally detailed steps (generated via an offline teacher-model prompt). For each true-hard failure, the algorithm incrementally injects hints into the prompt, starting from the most abstract and progressing to the most detailed. If a rollout with the augmented prompt achieves reward $R(o_{h^*}) > 0$, one random failed trajectory is swapped for $o_{h^*}$, yielding $\mathcal{G}_{\text{final}}$ with nonzero reward variance.
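The abstract-to-detailed search over the hint hierarchy can be sketched as follows. This is a simplified illustration: the function names, prompt template, and the `rollout`/`verifier` callables are hypothetical stand-ins, not the paper's interfaces.

```python
# Sketch of the tiered hint search: try hints from the most abstract tier and
# level to the most detailed, stopping at the first hint whose augmented
# prompt yields a verified success.
TIERS = ["knowledge", "planning", "solution"]

def search_hierarchical_hints(query, hints, rollout, verifier):
    """hints: dict mapping tier -> list of 4 incrementally detailed strings.
    rollout: samples one completion from the current policy for a prompt.
    verifier: returns 1 for a correct solution, 0 otherwise."""
    for tier in TIERS:
        for level, hint in enumerate(hints[tier]):
            prompt = f"{query}\n[Hint ({tier}, level {level + 1})]: {hint}"
            o = rollout(prompt)      # still sampled from the current policy
            if verifier(o) > 0:      # first success ends the search
                return (tier, level), o
    return None, None                # even the most detailed hint failed
```

Because the search stops at the first (most abstract) success, the model is credited for solving the problem with the least assistance it can manage.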

The same on-policy update is then performed:

$$J_{\text{Scaf-GRPO}}(\theta) = \mathbb{E}\left[\min\left(r'_{i,t}\hat{A}'_i,\ \text{clip}(r'_{i,t}, 1-\varepsilon, 1+\varepsilon)\,\hat{A}'_i\right)\right]$$

Notably, all trajectories remain on-policy relative to $\pi_\theta$, as hints only augment the prompt.
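The group-repair step that restores reward variance can be sketched directly (a minimal illustration; the function name and signature are my own):

```python
# Swap one random failed rollout for the hint-guided success. A single
# nonzero reward makes the group's reward variance positive again, so the
# normalized advantages no longer vanish.
import random

def repair_group(group, rewards, hint_rollout, hint_reward):
    fails = [i for i, r in enumerate(rewards) if r == 0]
    j = random.choice(fails)                 # pick one random failure
    group, rewards = list(group), list(rewards)
    group[j], rewards[j] = hint_rollout, hint_reward
    return group, rewards

g_final, r_final = repair_group(["o1", "o2", "o3", "o4"], [0, 0, 0, 0], "o_h", 1)
```

After the swap, the group contains one success and three failures, which is exactly the mixed-reward regime in which the group-relative advantage is informative.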

Algorithmic Outline:

for each training step t:
    sample batch of queries {q}
    for each q:
        roll out N trajectories 𝒢 under π_θ(·|q)
        if any R(o) > 0 or t < T_exempt:
            𝒢_final ← 𝒢
        else:
            (h*, o_h*) ← SearchHierarchicalHints(q, π_θ)
            if o_h* found:
                𝒢_final ← 𝒢 with one random failed trajectory replaced by o_h*
            else:
                𝒢_final ← 𝒢
    compute normalized advantages Â_i on 𝒢_final
    update θ by maximizing the clipped surrogate on 𝒢_final

4. Key Hyperparameters and Implementation

Primary hyperparameters and settings include:

  • Guidance exemption fraction $T_{\text{exempt}} = 15\%$ of total training steps
  • Rollouts per query $N = 8$
  • GRPO clip $\varepsilon = 0.2$; KL penalty $\beta = 0.0$ (favours exploration)
  • Hint hierarchy: Three tiers, four levels per tier, incrementally injected
  • Dataset filtering: From DeepScaleR-Preview (40K math problems), “too easy” dropped, “potentially solvable” subsampled, “too hard” retained
  • Model families: Qwen2.5-Math-1.5B/7B, Qwen2.5-7B, Llama-3.2-3B, DeepSeek-R1-Distill-Qwen-1.5B LongCoT
  • Training: 10 epochs, AdamW learning rate $1 \times 10^{-6}$, global batch size 256, PPO mini-batch 64, max generation length 2048 (8192 for LongCoT)
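The reported settings can be collected into a single configuration mapping for reference (the key names are my own shorthand; the values are the ones listed above):

```python
# Reported Scaf-GRPO training settings, gathered as a flat config dict.
SCAF_GRPO_CONFIG = {
    "guidance_exemption_frac": 0.15,  # no hints for first 15% of steps
    "rollouts_per_query": 8,          # N
    "clip_eps": 0.2,                  # GRPO clip epsilon
    "kl_beta": 0.0,                   # KL penalty off, to favour exploration
    "hint_tiers": 3,                  # knowledge / planning / solution
    "levels_per_tier": 4,             # incrementally detailed steps
    "epochs": 10,
    "lr": 1e-6,                       # AdamW
    "global_batch_size": 256,
    "ppo_mini_batch": 64,
    "max_gen_len": 2048,              # 8192 for LongCoT models
}
```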

5. Effects and Mechanisms of Scaffolded Guidance

Minimal, progressive scaffolding enables the following properties:

  • Restoration of Reward Variance: Inserting a single successful trajectory into the group disrupts reward degeneracy, restoring a usable learning signal even on true-hard queries.
  • On-Policy Integrity: Since all augmented runs are sampled from the current policy with controlled prompts, there is no requirement for high-variance off-policy corrections.
  • Skill Internalization: Forcing the model to succeed with the most abstract hint possible fosters true generalization, rather than rote memorization.
  • Unhampered Exploration: Hints serve as “signposts” rather than “rails,” allowing the model to deviate and discover novel solution paths.

A plausible implication is that this approach can systematically elevate LLM performance ceilings on tasks that otherwise yield persistent zero rewards under standard on-policy RL protocols.

6. Empirical Evaluation and Results

Performance was benchmarked on seven challenging math tasks (pass@1, greedy decoding), notably on the AIME24 benchmark and several out-of-distribution settings. Key quantitative results for Qwen2.5-Math-7B:

| Method | AIME24 | Avg. over 7 benchmarks |
|---|---|---|
| Vanilla GRPO | 30.0 | 45.2 |
| SimpleRL-Zero | 23.3 | 42.6 |
| Oat-Zero | 30.0 | 46.5 |
| LUFFY (prefix) | 33.3 | 46.6 |
| Scaf-GRPO | 43.3 (+44.3%) | 50.9 (+12.6%) |

Additional highlights:

  • Qwen2.5-1.5B on AIME24: $7.2 \rightarrow 20.0$ (+177.8% relative)
  • Llama-3.2-3B average: $26.1 \rightarrow 28.8$ (+10.3% relative)
  • DeepSeek-R1 LongCoT: $50.6 \rightarrow 53.6$ (+5.9% relative)
  • Out-of-distribution GPQA-Diamond: vanilla GRPO $= 32.3\%$, Scaf-GRPO $= 37.3\%$

Ablation studies reveal:

  • No guidance exemption reduces performance by $-9.2\%$ (relative)
  • Solution-only hints (no progressive hierarchy): $-4.9\%$
  • Removing any hint tier: up to $-5.7\%$
  • No incremental hint chunking (all hints at once): $-6.3\%$

7. Limitations and Extensions

Scaf-GRPO is model-agnostic, applicable to various LLM architectures, sizes, and reasoning styles, and fosters stable skill acquisition. Nonetheless, it requires:

  • Offline (teacher model) hint generation
  • Reward functions that are sparse but verifiable

Potential extensions include:

  • Automated or online hint synthesis
  • Adapting scaffolding schedules dynamically
  • Application to program synthesis, logic puzzles, and other sequential-decision LLM domains subject to sparse reward structures

These directions suggest a broad future applicability for the underlying pedagogy-inspired principle of incrementally restoring learning variance with minimal, on-policy interventions (Zhang et al., 22 Oct 2025).
