iGRPO: Iterative Group Relative Policy Optimization
- The central contribution is replacing a learned value function with group-based advantage estimation, yielding robust performance improvements.
- iGRPO combines iterative group normalization with self-feedback-driven two-stage updates to enhance both on- and off-policy training regimes.
- Empirical results show iGRPO achieves consistent improvements in mathematical reasoning tasks with stable convergence and scalability.
Iterative Group Relative Policy Optimization (iGRPO) is a reinforcement learning framework developed for post-training alignment of LLMs using verifiable, group-normalized rewards. iGRPO generalizes and addresses key biases in Group Relative Policy Optimization (GRPO), providing both theoretical guarantees and enhanced empirical performance across mathematical reasoning and other structured tasks. The method combines group-based advantage estimation with iterative updates or self-feedback-driven conditioning, supporting both on- and off-policy training regimes and yielding robust and scalable improvements without reliance on a learned value function.
1. Mathematical Foundations of GRPO and iGRPO
At its foundation, GRPO operates on sampled groups of completions per prompt, using their rewards to derive centered or standardized advantages for each sample. Let $q$ denote the prompt and $\{o_1, \dots, o_G\}$ the group of candidate responses with rewards $r_1, \dots, r_G$. The group-centered (mean-removed) advantage is
$$A_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j,$$
or, in standardized form with groupwise mean $\mu$ and std $\sigma$,
$$A_i = \frac{r_i - \mu}{\sigma + \epsilon},$$
where $\epsilon > 0$ prevents division by zero (Hatamizadeh et al., 9 Feb 2026).
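The group-normalized advantage is simple to compute in practice. The sketch below is a minimal plain-Python illustration (function and argument names are ours, not from the cited papers):

```python
import math

def group_advantages(rewards, standardize=True, eps=1e-8):
    """Group-relative advantages for one prompt's G sampled completions.

    rewards: list of scalar rewards, one per completion in the group.
    Returns centered (or standardized) advantages; no value network is needed.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    centered = [r - mean for r in rewards]
    if not standardize:
        return centered
    var = sum(c * c for c in centered) / g
    std = math.sqrt(var)
    # eps prevents division by zero when all rewards in the group are equal
    return [c / (std + eps) for c in centered]
```

For example, binary rewards `[1.0, 0.0, 0.0, 1.0]` standardize to approximately `[1, -1, -1, 1]`: every advantage is expressed relative to the group, so a uniformly rewarded group contributes no learning signal.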
The general surrogate objective for GRPO-style methods is formulated as
$$\mathcal{J}(\theta) = \mathbb{E}\Big[\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} w_{i,t}\, \min\big(\rho_{i,t}(\theta)\, A_i,\ \mathrm{clip}(\rho_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon)\, A_i\big)\Big] - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}),$$
with $w_{i,t}$ denoting per-token weights, $\rho_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ the importance ratio between new and old policy, and $\beta$ typically a KL-regularization penalty (Fontana et al., 8 Jan 2026).
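The clipped part of this objective can be sketched numerically as follows (a minimal NumPy illustration of the standard clipped-surrogate computation, not the cited papers' code; the KL term is omitted):

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantages, weights, clip_eps=0.2):
    """Clipped GRPO-style surrogate for one group of G completions.

    logp_new, logp_old: arrays [G, T] of per-token log-probs under new/old policy.
    advantages: array [G] of group-normalized advantages (broadcast over tokens).
    weights:    array [G, T] of per-token weights w_{i,t}.
    Returns the surrogate negated for minimization.
    """
    ratio = np.exp(logp_new - logp_old)              # importance ratio rho_{i,t}
    adv = advantages[:, None]                        # broadcast A_i over tokens
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    # pessimistic (min) combination, as in PPO-style clipping
    per_token = weights * np.minimum(unclipped, clipped)
    return -per_token.sum()
```

With `logp_new == logp_old` the ratio is 1 and the loss reduces to the weighted advantage sum; a ratio of 2 with positive advantage is clipped to `1 + clip_eps`.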
iGRPO builds on this by (a) replacing the value function with a group-based advantage as above, (b) supporting iterative application, and (c) extending to two-stage or off-policy scenarios (Mroueh et al., 28 May 2025, Mroueh, 9 Mar 2025, Hatamizadeh et al., 9 Feb 2026).
2. Theoretical Properties and Bias Corrections
GRPO exhibits several structural objective mismatches:
- Non-uniform group weighting, via non-constant per-token weights $w_{i,t}$, introduces systematic gradient bias on prefixes shared among group sequences. For length-normalized weights ($w_{i,t} = 1/|o_i|$), shorter sequences can disproportionately influence shared-prefix updates, causing a form of structural length bias independent of reward structure (Fontana et al., 8 Jan 2026).
- The surrogate objective’s token-level gradient in the unclipped region is:
$$\nabla_\theta \mathcal{J} = \mathbb{E}\Big[\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} w_{i,t}\, \rho_{i,t}(\theta)\, A_i\, \nabla_\theta \log \pi_\theta(o_{i,t} \mid q, o_{i,<t})\Big].$$
- To ensure unbiasedness, the group weights $w_{i,t}$ must sum to a constant across the group, or explicit bias-correction terms must be applied to ensure cancellation over any set of shared prefixes.
A central design principle for iGRPO is to enforce unbiased group weighting, correct or account for reward scaling (especially when $\beta = 0$ and the optimizer is AdamW), and address optimizer-driven momentum "overshoot" when using clipped objectives. Notably, AdamW’s updates are invariant to global reward scaling when $\beta = 0$, but this property breaks when KL regularization is enabled ($\beta > 0$) (Fontana et al., 8 Jan 2026).
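The structural length bias can be seen in a toy calculation. The sketch below (our illustration, with hypothetical sequence lengths) sums the weight that tokens of a shared prefix receive across a group, under length-normalized versus constant per-token weights:

```python
def prefix_weight(seq_lens, prefix_len, length_normalize):
    """Total weight received by a shared prefix, summed over group sequences.

    With w_{i,t} = 1/|o_i| (length-normalized), a short sequence contributes far
    more weight per shared-prefix token than a long one, so the total depends on
    the mix of lengths in the group; with a constant weight it does not.
    Toy illustration, not from the cited papers.
    """
    total = 0.0
    for T in seq_lens:
        w = 1.0 / T if length_normalize else 1.0 / max(seq_lens)
        total += w * prefix_len  # prefix tokens appear in every sequence
    return total
```

For a 5-token prefix shared by sequences of lengths 10 and 100, length-normalized weighting yields total prefix weight 0.55, versus 0.10 for two length-100 sequences; with a constant weight both groups give 0.10, so the prefix contribution no longer depends on the length mix.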
3. Iterative and Two-Stage iGRPO Algorithms
Standard Iterative iGRPO
The canonical iGRPO loop is as follows (Mroueh et al., 28 May 2025, Mroueh, 9 Mar 2025):
- Sample a batch of prompts $\{q_b\}_{b=1}^{B}$.
- For each prompt $q_b$, sample $G$ outputs $\{o_i\}_{i=1}^{G}$ from the current or lagged policy.
- Compute the group mean and std of the rewards; normalize each $o_i$'s advantage.
- Construct the clipped, KL-regularized surrogate loss using the group-based advantage.
- Update the policy via one or more gradient steps; iterate as needed, with options for on-policy (samples from $\pi_\theta$) or off-policy (samples from a lagged $\pi_{\theta_{\mathrm{old}}}$) sample reuse.
- Repeat the steps above, plugging the new policy back in as the "old" policy for the next round.
Two-Stage (Self-Feedback) iGRPO
A major variant is the self-feedback-driven, two-stage iGRPO (Hatamizadeh et al., 9 Feb 2026):
- Stage 1 (Exploration): For each prompt $q$, sample $m$ drafts from the frozen policy; pick the highest-reward draft $o^{\star}$.
- Stage 2 (Refinement): Augment $q$ with $o^{\star}$ as an in-context example (yielding $\tilde{q} = [q; o^{\star}]$), then sample $G$ completions from $\pi_\theta(\cdot \mid \tilde{q})$, normalize as above, and apply a GRPO-style update on these samples.
- Only Stage 2 gradients are used for learning; Stage 1 influences learning indirectly via the structure and difficulty of $\tilde{q}$.
This architecture can be summarized by the following pseudo-algorithm: for each prompt, (i) sample $m$ drafts from the frozen policy; (ii) select the highest-reward draft; (iii) build the augmented prompt $\tilde{q}$; (iv) sample $G$ completions conditioned on $\tilde{q}$; (v) compute group-normalized advantages; and (vi) take a clipped, KL-regularized policy-gradient step on the Stage-2 samples only.
This dynamic introduces a bootstrapped, policy-coupled feedback that empirically delays entropy collapse and improves exploration (Hatamizadeh et al., 9 Feb 2026).
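The two-stage step can be sketched concretely as below. As before, `sample`, `reward_fn`, and `grpo_update` are placeholder callables (our names), and the tuple standing in for prompt augmentation is an assumption, not the paper's prompt format:

```python
def two_stage_igrpo_step(sample, reward_fn, grpo_update, prompts, m=4, G=8):
    """One two-stage iGRPO step (schematic sketch)."""
    for q in prompts:
        # Stage 1 (exploration): frozen policy; no gradients flow from here
        drafts = [sample(q) for _ in range(m)]
        best = max(drafts, key=lambda o: reward_fn(q, o))
        # Stage 2 (refinement): condition on the best draft as an in-context example
        q_aug = (q, best)                       # stand-in for prompt augmentation
        group = [sample(q_aug) for _ in range(G)]
        # GRPO-style update uses only the Stage-2 group
        grpo_update(q_aug, group, [reward_fn(q, o) for o in group])
```

Note that the per-prompt rollout budget is $m + G$, which is the quantity held fixed in the matched-budget comparisons reported below.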
4. Policy-Improvement Guarantees and Convergence
For both on- and off-policy iGRPO, theoretical lower bounds on expected reward improvement can be established (Mroueh et al., 28 May 2025):
$$\mathbb{E}_{\pi_{\mathrm{new}}}[r] - \mathbb{E}_{\pi_{\mathrm{old}}}[r] \;\geq\; \mathcal{L}(\pi_{\mathrm{new}}) - C \cdot \mathrm{TV}(\pi_{\mathrm{new}}, \pi_{\mathrm{old}}),$$
where $\mathcal{L}$ denotes the value of the clipped, whitened surrogate, $\mathrm{TV}$ is total variation, and $C$ is a constant depending on the reward range.
For binary verifiable rewards, the iGRPO recursion admits a closed form in terms of the distribution's empirical success probability $p$, the group weights, and the KL-regularization strength $\beta$:
$$p_{t+1} = h_{\beta}(p_t).$$
This induces a scalar map $h_{\beta}$ whose unique fixed point $p^{\star}$ amplifies the probability of success, $p^{\star} > p_0$ (Mroueh, 9 Mar 2025). Local contraction and monotonic convergence to $p^{\star}$ are guaranteed for appropriate $\beta$.
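The fixed-point dynamics can be illustrated numerically. The map below is a hypothetical functional form of our own choosing with the qualitative properties the analysis requires (reference-anchored reweighting by an advantage-like exponent); the exact map derived in the paper differs:

```python
import math

def h(p, p_ref=0.3, beta=1.0):
    """Illustrative success-probability map (hypothetical form, not the paper's):
    reweight the reference success mass by an exponentiated whitened advantage."""
    gain = math.sqrt((1.0 - p) / p)             # whitened advantage of a success
    num = p_ref * math.exp(gain / beta)
    return num / (num + (1.0 - p_ref))

def fixed_point(p0, iters=500):
    """Iterate the map to (numerical) convergence."""
    p = p0
    for _ in range(iters):
        p = h(p)
    return p

p_star = fixed_point(0.3)
```

Starting from a reference success rate of 0.3, the iteration settles at a fixed point strictly above 0.3, illustrating the success-amplification property.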
5. Empirical Performance and Practical Configurations
Empirical studies on math and reasoning tasks demonstrate that iGRPO, in both vanilla and two-stage forms, reliably matches or outperforms single-step GRPO and self-verification baselines under matched rollout budgets (Mroueh et al., 28 May 2025, Hatamizadeh et al., 9 Feb 2026). Key findings include:
- On GSM8K, math challenge benchmarks, and DeepScaleR-Preview, iGRPO is consistently as stable as or more stable than on-policy GRPO (e.g., Pass@1: on-policy, 45% ± 3%; iGRPO, 50% ± 1%) (Mroueh et al., 28 May 2025).
- In large-scale settings (Nemotron-8B, DeepSeek-7B, OpenMath-7B/14B), iGRPO delivers a +1–4 point gain over GRPO with the same or reduced serve-side inference cost, as the staged setup induces better learning signals under group-based normalization (Hatamizadeh et al., 9 Feb 2026).
- Ablations confirm that the two-stage wrapper is optimizer-agnostic and that the use of a generative judge (GPT-5) brings further improvements. Entropy collapse occurs more slowly, preserving exploration (Hatamizadeh et al., 9 Feb 2026).
A summary of key hyperparameters and their rationale:

| Parameter | Default/Range | Significance |
|---------------------|----------------------|--------------------------------------------------------------|
| Draft count $m$, group size $G$ | — | Budget split between drafts and refinements. |
| Learning rate | — | Empirically validated. |
| KL penalty ($\beta$) | $0$ or small $\beta > 0$ | No regularization if value-insensitive; otherwise tune. |
| Decoding temperature / top-$p$ | — | Supports moderate exploration. |
6. Algorithmic Recommendations and Practical Caveats
To address the documented hidden biases and inefficiencies in GRPO, effective iGRPO design incorporates the following recommendations (Fontana et al., 8 Jan 2026):
- Unbiased group weighting: Enforce per-token weights $w_{i,t}$ that sum to a constant across the group when weighting by the centered advantage, especially over shared prefixes.
- Reward scaling and optimizer configuration: If using AdamW without KL regularization, reward scaling becomes irrelevant for parameter updates due to moment cancellation. When regularization is required ($\beta > 0$), tune $\beta$ jointly with the reward scale, as both impact update magnitude.
- Momentum overshoot control: Use single-step updates (no inner-loop SGD), or apply moment resets/repositioning for AdamW if more steps are needed.
- Metrics: Monitor held-out reward distributions, not the surrogate objective, for true progress.
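The moment-reset recommendation can be made concrete with a toy Adam-style optimizer (our minimal sketch; scalar parameters only, and without AdamW's decoupled weight decay):

```python
class ToyAdam:
    """Minimal Adam-style optimizer with an explicit moment reset, sketching the
    'moment reset between outer rounds' recommendation (toy, not AdamW-complete)."""

    def __init__(self, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m, self.v, self.t = 0.0, 0.0, 0

    def step(self, param, grad):
        # standard bias-corrected Adam update on a scalar parameter
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return param - self.lr * m_hat / (v_hat ** 0.5 + self.eps)

    def reset_moments(self):
        # Discard stale momentum so the next round's clipped gradients are not
        # dragged along the previous round's direction (overshoot control).
        self.m, self.v, self.t = 0.0, 0.0, 0
```

Calling `reset_moments()` at each outer-iteration boundary is one way to implement the recommendation when more than one inner gradient step is taken.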
A practical implication is that, for LLMs trained with group-based objectives, subtle choices in weighting, batching, and normalization are critical for unbiased, efficient policy improvement; iGRPO methods explicitly address these issues.
7. Broader Impact and Generality
iGRPO frameworks have been adopted for LLM post-training in math reasoning, code synthesis, and other domains with verifiable or scalar rewards (Hatamizadeh et al., 9 Feb 2026, Mroueh et al., 28 May 2025). The group-relative, critic-free approach provides a modular, scalable alternative to value-based RL, with theoretical guarantees and extensibility to both on-policy and off-policy regimes. The two-stage iGRPO wrapper generalizes beyond GRPO surrogates and can be fruitfully combined with diverse RL and reward modeling strategies.
Key empirical and theoretical results demonstrate that iGRPO
- Offers consistent multi-point improvements on competitive reasoning datasets.
- Incurs negligible additional computation or inference budget compared to GRPO.
- Enables self-feedback-driven learning dynamics, delaying mode collapse and improving sampling efficiency.
- Amplifies success probability with guaranteed convergence under mild regularization settings, for both binary and scalar rewards.
By correctly applying group normalization, reward shaping, optimizer control, and iterative bootstrapped learning, iGRPO sets a new standard for scalable, verifiable reward-driven LLM alignment (Fontana et al., 8 Jan 2026, Hatamizadeh et al., 9 Feb 2026, Mroueh et al., 28 May 2025, Mroueh, 9 Mar 2025).