
iGRPO: Self-Feedback-Driven LLM Reasoning

Published 9 Feb 2026 in cs.AI | (2602.09000v1)

Abstract: LLMs have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a framework for aligning these models with task-specific rewards, improving overall quality and reliability. Group Relative Policy Optimization (GRPO) is an efficient, value-function-free alternative to Proximal Policy Optimization (PPO) that leverages group-relative reward normalization. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. In Stage 1, iGRPO samples multiple exploratory drafts and selects the highest-reward draft using the same scalar reward signal used for optimization. In Stage 2, it appends this best draft to the original prompt and applies a GRPO-style update on draft-conditioned refinements, training the policy to improve beyond its strongest prior attempt. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models (e.g., Nemotron-H-8B-Base-8K and DeepSeek-R1 Distilled), validating its effectiveness on diverse reasoning benchmarks. Moreover, applying iGRPO to OpenReasoning-Nemotron-7B trained on AceReason-Math achieves new state-of-the-art results of 85.62% and 79.64% on AIME24 and AIME25, respectively. Ablations further show that the refinement wrapper generalizes beyond GRPO variants, benefits from a generative judge, and alters learning dynamics by delaying entropy collapse. These results underscore the potential of iterative, self-feedback-based RL for advancing verifiable mathematical reasoning.

Summary

  • The paper introduces iGRPO, a two-stage reinforcement learning framework that leverages dynamic self-feedback to iteratively improve LLM reasoning.
  • It refines policy updates by selecting high-quality drafts through group normalization, replacing traditional value functions with a more efficient evaluation mechanism.
  • Empirical evaluations on mathematical benchmarks demonstrate significant accuracy gains with minimal computational overhead, highlighting its scalability and practical impact.

Iterative Group Relative Policy Optimization (iGRPO) for LLM Reasoning

Introduction

This work introduces Iterative Group Relative Policy Optimization (iGRPO), a two-stage RL framework designed to enhance the reasoning capabilities of LLMs via dynamic self-feedback. iGRPO extends Group Relative Policy Optimization (GRPO)—a value-function-free variant of PPO—by adding a self-improvement loop that leverages model-generated drafts as in-context guides. The motivation is to align LLM training protocols more closely with human problem-solving, which typically involves iterative refinement based on feedback from prior attempts. Existing RL approaches, including GRPO and its variants, lack explicit mechanisms for such self-conditioning, often limiting models to single-pass optimization or auxiliary self-critique tasks. iGRPO operationalizes bootstrapped policy refinement while maintaining computational efficiency, pushing benchmarks on complex verifiable mathematical reasoning tasks.

Methodology

Group Relative Policy Optimization (GRPO) Background

GRPO replaces the value function in standard PPO with group-based normalization. Given a prompt q, the model samples G completions, evaluates a scalar reward for each, and computes advantages by normalizing within the group. The update maximizes a clipped surrogate objective, incorporating token-level KL regularization toward a reference policy. This formulation is computationally efficient and obviates the need for a learned critic, but it treats each sampled completion as independent, ignoring potential improvements from iterative refinement.
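The group-relative advantage computation described above can be sketched in a few lines. This is a minimal illustration of the normalization step only (reward minus group mean, divided by group standard deviation), not the full clipped surrogate objective or KL term:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize scalar rewards within one sampled group (GRPO-style).

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation. This per-prompt statistic
    replaces PPO's learned value baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

By construction the advantages sum to (approximately) zero within each group, so above-average completions are reinforced and below-average ones suppressed.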

iGRPO: Dynamic Self-Conditioning Pipeline

iGRPO operates in two stages per optimization step:

  • Stage 1—Exploratory Draft Generation: Multiple drafts are sampled from the current policy. Each is evaluated with a scalar reward, and the highest-scoring draft is selected as a "first-draft" candidate.
  • Stage 2—Conditioned Refinement: The model appends the best draft to the original prompt, forming an augmented context. A group of completions is generated conditioned on this prompt, and a GRPO-style update is applied to these refinements.

This bootstrapped approach ensures that as the policy improves, the quality of the conditioning signals (best drafts) increases, amplifying the refinement loop. The process is computationally matched to GRPO: iGRPO simply reallocates the same rollout budget across the two stages.

Figure 1: Average training reward.
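The two-stage loop can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: `sample_fn`, `reward_fn`, and `update_fn` are caller-supplied placeholders (sampling from the policy, scoring a completion, and applying a GRPO-style update), and the draft-appending prompt template is an assumption:

```python
def igrpo_step(prompt, sample_fn, reward_fn, update_fn,
               n_drafts=4, group_size=4):
    """One iGRPO optimization step (hedged sketch).

    sample_fn(text, n) -> list of n completions from the current policy
    reward_fn(text)    -> scalar reward
    update_fn(prompt, completions, rewards) -> GRPO-style policy update
    """
    # Stage 1: exploratory drafts; keep the highest-reward one.
    drafts = sample_fn(prompt, n_drafts)
    best_draft = max(drafts, key=reward_fn)

    # Stage 2: condition on the best draft and refine. The exact
    # prompt wording below is hypothetical.
    augmented = f"{prompt}\n\nBest prior draft:\n{best_draft}"
    refinements = sample_fn(augmented, group_size)
    rewards = [reward_fn(r) for r in refinements]
    update_fn(augmented, refinements, rewards)
    return best_draft
```

Note that the total rollout count per step is `n_drafts + group_size`, which is how the same budget can be split across the two stages to match vanilla GRPO's sampling cost.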

Theoretical Analysis

Monotonic improvement is established under binary rewards: as the policy's success probability increases, the expected quality of selected drafts rises, creating a positive feedback loop. Self-conditioning in iGRPO is policy-dependent—unlike traditional ICL with static demonstrations—so the learning dynamics form a coupled system that facilitates progressive curriculum generation.
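The positive feedback loop admits a one-line numeric illustration under the binary-reward assumption: if each of n independent drafts succeeds with probability p, the chance that the selected best draft is correct is 1 - (1 - p)^n, which increases in both p and n. This sketch states only that monotonicity claim, not the paper's full analysis:

```python
def best_draft_success(p, n):
    """P(best of n independent drafts is correct) under binary rewards,
    given per-draft success probability p. Increasing in both p and n:
    as the policy improves (p rises), Stage 1 supplies better
    conditioning drafts, which is the claimed feedback loop."""
    return 1.0 - (1.0 - p) ** n
```

For example, a policy with a 30% single-draft success rate already yields a correct best draft about 76% of the time with four drafts.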

Empirical Evaluation

Comparative Model Performance

Experiments on mathematical reasoning benchmarks (AIME24, AIME25, MATH500, AMC, GSM8K, Minerva Math) are conducted across multiple backbone families and scales (7B, 8B, 14B). iGRPO consistently outperforms vanilla GRPO, Self-Verification, and Critique-GRPO under identical training protocols and sampling budgets. Gains are most pronounced in settings where the base model's single-shot performance is weak and self-feedback provides a salient scaffold for refinement.

Figure 2: Performance of iOpenMath-Nemotron-14B across various pass@N settings for AIME24 and AIME25. Both benchmarks exhibit increasing accuracy with higher N, though AIME24 quickly stabilizes at 93.33% by N=16, whereas AIME25 continues to rise until reaching 96.67% at N=256.
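The pass@N numbers in Figure 2 are typically computed with the standard unbiased estimator from code-generation evaluation; whether this paper uses exactly this protocol is an assumption, but the estimator itself is well established:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, solves the problem. Assumed (not confirmed) to match
    the paper's pass@N protocol."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Generating n > k samples and applying this estimator gives a lower-variance pass@k than naively drawing exactly k samples per problem.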

For OpenReasoning-Nemotron-7B trained on AceReason-Math, state-of-the-art results of 85.62% and 79.64% are achieved on AIME24 and AIME25, respectively, with clear transferability to broader tasks (e.g., GPQA, MMLU-Pro). Ablations demonstrate that the self-feedback stage is a portable refinement layer: applying the same mechanism atop DAPO and GSPO yields +1.1 to +1.2 points in average pass@1 accuracy, indicating that the iterative wrapper is orthogonal to GRPO-specific objective details.

Learning Dynamics and Resource Efficiency

Policy entropy trajectories reveal that iGRPO sustains higher mid-training entropy, delaying premature mode collapse and preserving exploratory diversity. Memory and throughput measurements confirm that the two-stage mechanism introduces negligible overhead: peak memory remains unchanged, and wall-clock training time increases only modestly (~13%) in exchange for substantially improved accuracy. Increasing the number of completions yields diminishing returns beyond 8, suggesting the budget can be managed according to practical constraints.

Practical and Theoretical Implications

The key implication is that iterative self-feedback mechanisms—modeled after human refinement behavior—can be systematically exploited in RL fine-tuning to advance multi-step reasoning. iGRPO’s design provides a practical, computationally efficient method for bootstrapping policy improvement, preserving alignment with verifiable rewards without requiring auxiliary critique or verification behaviors. The consistent empirical gains across architectures and parameter sizes highlight the method’s generality and scalability. From a theoretical perspective, the bootstrapping effect can guide future development of RL objectives that adapt dynamically with evolving policy strengths rather than relying on static examples or external demonstrations.

Practically, iGRPO is poised for adoption in high-throughput, resource-constrained environments where single-pass inference cannot reliably solve complex reasoning tasks. Its compatibility with generative scalar judges enables refinement for tasks without strict ground-truth solutions, broadening its applicability to domains beyond mathematics. Future directions include integrating richer reward shaping, more sophisticated draft selection, and scaling to even larger models with multi-modal or open-ended reasoning objectives.

Conclusion

Iterative Group Relative Policy Optimization (iGRPO) exemplifies efficient, self-feedback-driven RL for LLM reasoning, achieving robust improvements on mathematical benchmarks and demonstrating portability across group-based PPO variants. The methodology bridges the gap between human iterative problem-solving and LLM training, suggesting iterative refinement as a core principle for verifiable reasoning. With negligible computational overhead and strong empirical performance, iGRPO marks a substantive step toward scalable, self-improving LLMs for complex task domains.
