- The paper introduces iGRPO, a two-stage reinforcement learning framework that leverages dynamic self-feedback to iteratively improve LLM reasoning.
- It refines policy updates by selecting the highest-reward draft as an in-context guide and normalizing rewards within each sampled group, replacing the learned value function with a more efficient group-based baseline.
- Empirical evaluations on mathematical benchmarks demonstrate significant accuracy gains with minimal computational overhead, highlighting its scalability and practical impact.
Iterative Group Relative Policy Optimization (iGRPO) for LLM Reasoning
Introduction
This work introduces Iterative Group Relative Policy Optimization (iGRPO), a two-stage RL framework designed to enhance the reasoning capabilities of LLMs via dynamic self-feedback. iGRPO extends Group Relative Policy Optimization (GRPO)—a value-function-free variant of PPO—by adding a self-improvement loop that leverages model-generated drafts as in-context guides. The motivation is to align LLM training protocols more closely with human problem-solving, which typically involves iterative refinement based on feedback from prior attempts. Existing RL approaches, including GRPO and its variants, lack explicit mechanisms for such self-conditioning, often limiting models to single-pass optimization or auxiliary self-critique tasks. iGRPO operationalizes bootstrapped policy refinement while maintaining computational efficiency, advancing the state of the art on complex, verifiable mathematical reasoning tasks.
Methodology
Group Relative Policy Optimization (GRPO) Background
GRPO replaces the value function in standard PPO with group-based normalization. Given a prompt q, the model samples G completions, evaluates a scalar reward for each, and computes advantages by normalizing within the group. The update maximizes a clipped surrogate objective, incorporating token-level KL regularization to a reference policy. This formulation is computationally efficient and obviates the need for a learned critic but treats each sampled completion as independent, ignoring potential improvements from iterative refinement.
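The group-based normalization can be sketched as follows; the function name and the exact stabilizing constant (the `eps` term) are illustrative choices, not taken from the paper:

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's G sampled completions:
    A_i = (r_i - mean(r)) / (std(r) + eps), replacing a learned critic."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a full implementation these advantages would then feed the PPO-style clipped surrogate, with a token-level KL penalty toward the reference policy.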
iGRPO: Dynamic Self-Conditioning Pipeline
iGRPO operates in two stages per optimization step:
- Stage 1—Exploratory Draft Generation: Multiple drafts are sampled from the current policy. Each is evaluated with a scalar reward, and the highest-scoring draft is selected as a "first-draft" candidate.
- Stage 2—Conditioned Refinement: The model appends the best draft to the original prompt, forming an augmented context. A group of completions is generated conditioned on this prompt, and a GRPO-style update is applied to these refinements.
This bootstrapped approach ensures that as the policy improves, the quality of conditioning signals (best drafts) increases, amplifying the refinement loop. The process is computationally matched to GRPO—iGRPO simply reallocates the same rollout budget across both stages.
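The two-stage loop can be sketched as below; `policy.sample`, `policy.grpo_update`, the prompt-augmentation template, and the reward function are all placeholder interfaces for illustration, not the paper's actual implementation:

```python
def igrpo_step(policy, prompt, reward_fn, G=8):
    """One iGRPO optimization step (schematic).

    The 2*G rollouts here (G per stage) match the budget of a plain GRPO
    step that samples 2*G completions directly from the bare prompt.
    """
    # Stage 1: exploratory draft generation from the current policy
    drafts = [policy.sample(prompt) for _ in range(G)]
    best_draft = max(drafts, key=reward_fn)  # highest-scoring "first draft"

    # Stage 2: conditioned refinement on the augmented context
    augmented = prompt + "\n\nPrevious attempt:\n" + best_draft
    refinements = [policy.sample(augmented) for _ in range(G)]
    rewards = [reward_fn(r) for r in refinements]
    policy.grpo_update(augmented, refinements, rewards)  # GRPO-style update
    return best_draft
```

As the policy improves, `best_draft` improves as well, which is the bootstrapping loop described above.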

Figure 1: Average training reward.
Theoretical Analysis
Monotonic improvement is established under binary rewards: as the policy's success probability increases, the expected quality of selected drafts rises, creating a positive feedback loop. Self-conditioning in iGRPO is policy-dependent—unlike traditional ICL with static demonstrations—so the learning dynamics form a coupled system that facilitates progressive curriculum generation.
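As a concrete illustration of the binary-reward argument (an informal sketch, not the paper's formal proof): if each of G independently sampled drafts succeeds with probability p, the best-of-G selection succeeds with probability 1 - (1 - p)^G, which is strictly increasing in p, so any policy improvement directly strengthens the conditioning signal:

```python
def best_draft_success_prob(p: float, G: int) -> float:
    """P(at least one of G i.i.d. binary-reward drafts is correct)."""
    return 1.0 - (1.0 - p) ** G
```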
Empirical Evaluation
A thorough suite of experiments on mathematical reasoning benchmarks (AIME24, AIME25, MATH500, AMC, GSM8K, Minerva Math) is conducted across multiple backbone families and scales (7B, 8B, 14B). iGRPO consistently outperforms vanilla GRPO, Self-Verification, and Critique-GRPO under identical training protocols and sampling budgets. Gains are most pronounced in settings where the base model's single-shot performance is suboptimal and self-feedback provides a salient scaffold for refinement.
Figure 2: Performance of iOpenMath-Nemotron-14B across various pass@N settings for AIME24 and AIME25. Both benchmarks exhibit increasing accuracy with higher N, though AIME24 quickly stabilizes at 93.33% by N=16, whereas AIME25 continues to rise until reaching 96.67% at N=256.
For OpenReasoning-Nemotron-7B trained on AceReason-Math, state-of-the-art results of 85.62% and 79.64% are achieved on AIME24/AIME25, respectively, with clear transferability to broader tasks (e.g., GPQA, MMLU-Pro). Ablations demonstrate that the self-feedback stage is a portable refinement layer: applying the same mechanism atop DAPO and GSPO yields +1.1 to +1.2 points in average Pass@1 accuracy, indicating that the iterative wrapper is orthogonal to GRPO-specific objective details.
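For reference, pass@N figures like those above are conventionally reported with the standard unbiased pass@k estimator used in most code and math evaluation suites; whether this paper uses that estimator or a direct best-of-N evaluation is not stated, so the snippet below reflects only the common convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given c correct completions out of n samples per problem."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```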
Learning Dynamics and Resource Efficiency
Policy entropy trajectories reveal that iGRPO sustains higher mid-training entropy, delaying premature mode collapse and preserving exploratory diversity. Memory and throughput measurements confirm that the two-stage mechanism introduces negligible overhead: peak memory remains unchanged, and throughput decreases only modestly (roughly 13% more wall-clock training time in exchange for substantially improved accuracy). Increasing the number of completions yields diminishing returns beyond 8, suggesting the rollout budget can be managed according to practical constraints.
Practical and Theoretical Implications
The key implication is that iterative self-feedback mechanisms—modeled after human refinement behavior—can be systematically exploited in RL fine-tuning to advance multi-step reasoning. iGRPO’s design provides a practical, computationally efficient method for bootstrapping policy improvement, preserving alignment with verifiable rewards without requiring auxiliary critique or verification behaviors. The consistent empirical gains across architectures and parameter sizes highlight the method’s generality and scalability. From a theoretical perspective, the bootstrapping effect can guide future development of RL objectives that adapt dynamically with evolving policy strengths rather than relying on static examples or external demonstrations.
Practically, iGRPO is poised for adoption in high-throughput, resource-constrained environments where single-pass inference cannot reliably solve complex reasoning tasks. Its compatibility with generative scalar judges enables refinement for tasks without strict ground-truth solutions, broadening its applicability to domains beyond mathematics. Future directions include integrating richer reward shaping, more sophisticated draft selection, and scaling to even larger models with multi-modal or open-ended reasoning objectives.
Conclusion
Iterative Group Relative Policy Optimization (iGRPO) exemplifies efficient, self-feedback-driven RL for LLM reasoning, achieving robust improvements on mathematical benchmarks and demonstrating portability across group-based PPO variants. The methodology bridges the gap between human iterative problem-solving and LLM training, suggesting iterative refinement as a core principle for verifiable reasoning. With negligible computational overhead and strong empirical performance, iGRPO marks a substantive step toward scalable, self-improving LLMs for complex task domains.