Grouped-ROLIE PPO (GRPO) Overview

Updated 16 February 2026
  • The paper introduces a critic-free, group-normalized advantage estimator that replaces per-step critics with group-level Monte Carlo returns.
  • It employs per-episode returns to compute group mean and variance, stabilizing learning in short-horizon tasks without a learned value function.
  • Empirical results indicate GRPO matches or slightly outperforms PPO in environments like CartPole, while struggling in long-horizon or complex tasks.

Grouped-ROLIE PPO (GRPO) is a reinforcement learning (RL) algorithm that eliminates the learned value function (“critic”) and instead estimates advantages by comparing Monte Carlo returns within small groups of trajectories. Originally developed to stabilize RL-from-human-feedback (RLHF) for LLMs via prompt-grouping, GRPO has since been applied to classical control and other domains, often under the name Group Relative Policy Optimization. It is most naturally viewed as a “critic-free, group-normalized” alternative to Proximal Policy Optimization (PPO): it preserves PPO’s clipped surrogate but derives all advantage estimates from per-group statistics.

1. Algorithmic Structure and Mathematical Formulation

GRPO replaces the critic-based, per-step advantage estimator of PPO with a scalar episodic signal normalized within each group of trajectories. At each update, the agent:

  • Rolls out G complete episodes {τ_1, …, τ_G} using G parallel environments or by collecting repeated generations per prompt.
  • For each τ_i, computes the (possibly discounted) Monte Carlo return:

R(\tau_i) = \sum_{t=0}^{H_i - 1} \gamma^t r_t^i

  • Computes group mean and variance:

\mu_G = \frac{1}{G} \sum_{j=1}^{G} R(\tau_j), \qquad \sigma_G^2 = \frac{1}{G} \sum_{j=1}^{G} \big( R(\tau_j) - \mu_G \big)^2

  • For each trajectory, assigns a constant advantage, shared by every time step:

\widehat{A}_t^{\mathrm{GRPO}}(\tau_i) = \frac{R(\tau_i) - \mu_G}{\sigma_G + \epsilon}

  • The standard PPO-style clipped objective is then constructed:

L(\theta) = \mathbb{E}_{i,t} \Big[ \min\big\{ r_t(\theta)\, \widehat{A}_t^{\mathrm{GRPO}},\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon_{\mathrm{clip}},\, 1+\epsilon_{\mathrm{clip}}\big)\, \widehat{A}_t^{\mathrm{GRPO}} \big\} \Big]

with r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t); the clipping radius ϵ_clip is distinct from the small constant ϵ used in the advantage normalization.

No auxiliary value-function is learned or applied; all variance reduction derives from groupwise normalization.
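The update above can be sketched in a few lines of NumPy. The function names and the toy returns are illustrative, not from the paper:

```python
import numpy as np

def grpo_advantages(returns, eps=1e-8):
    """Group-normalized advantages: one scalar per episode,
    broadcast to every time step of that episode."""
    returns = np.asarray(returns, dtype=np.float64)
    mu = returns.mean()
    sigma = returns.std()  # population std, matching sigma_G above
    return (returns - mu) / (sigma + eps)

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over samples."""
    ratios = np.asarray(ratios, dtype=np.float64)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy group of G = 4 episodic returns
adv = grpo_advantages([10.0, 12.0, 8.0, 10.0])
print(adv)  # zero-mean, unit-variance (up to eps)
```

At the first gradient step after a rollout the ratios are all 1, so the surrogate reduces to the mean group-normalized advantage, which is (approximately) zero by construction; the signal comes entirely from how the ratios move.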

2. Comparison to Standard PPO and Classical Baselines

Key distinctions between GRPO and PPO are summarized below:

|                    | PPO                                   | GRPO                                  |
|--------------------|---------------------------------------|---------------------------------------|
| Advantage          | Stepwise GAE or TD(λ) estimate        | Per-episode scalar (group-normalized) |
| Critic             | Learned value network V_φ(s)          | None                                  |
| Baseline           | V_φ(s_t) (state-dependent)            | μ_G (group mean, independent of t)    |
| Credit assignment  | Time-varying, state-dependent         | Uniform per episode                   |
| Surrogate          | Clipped PPO (with optional value loss)| Clipped PPO (no value loss)           |
| Rollout            | Fixed n-step trajectories             | Full episodes                         |

PPO’s learned critic enables low-variance, fine-grained credit assignment, which matters most for long-horizon or continuous control. GRPO forfeits this temporal granularity, accepting bias in exchange for simplicity and often higher sample efficiency in short-horizon, terminating tasks (Oliveira et al., 5 Nov 2025).
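To make the credit-assignment contrast concrete, the sketch below computes a PPO-style GAE advantage for a sparse-reward episode, using hand-supplied stand-in critic values; GRPO would instead assign one group-normalized scalar to all three steps. All numbers are illustrative:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: a different advantage at every
    time step, using a (here hand-supplied) value estimate as baseline.
    The episode is assumed to terminate, so the bootstrap value after
    the last step is 0."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

rewards = [0.0, 0.0, 1.0]   # sparse reward at the end of the episode
values  = [0.3, 0.5, 0.9]   # stand-in critic predictions, not learned here
print(gae(rewards, values))  # three different per-step advantages
# GRPO would assign one identical number to all three steps of this episode.
```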

3. Empirical Behavior: Hyperparameters, Trade-offs, and Failure Modes

Extensive ablation studies in classical control establish the following findings (Oliveira et al., 5 Nov 2025):

  • Group size (G): Smaller groups (e.g., G = 8) yield the best sample efficiency. Larger G reduces baseline noise but drastically limits the frequency of policy updates and, counterintuitively, dilutes the utility of the baseline due to disparate initial states within the group.
  • Discount factor (γ): High values (γ = 0.99 or 1) are optimal in tasks with early termination, because the episode boundary itself provides signal separation. In very long, non-terminating environments (e.g., HalfCheetah), a moderate γ ≈ 0.9 localizes credit assignment to earlier rewards and prevents learning from stalling.
  • Baselines: All tested critic-free baselines (group mean, exponential moving average, even a random Gaussian baseline) fail to match PPO in long-horizon environments. Only in short-horizon settings (e.g., CartPole) does GRPO match or outperform PPO, likely because PPO tends to overfit or overtrain in the toy regime.
  • Credit assignment: No explicit temporal credit is possible. The assigned advantage is constant across all time steps of a given trajectory, which is inefficient when rewards are temporally sparse or require stepwise attribution.
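The γ effect in non-terminating tasks follows directly from the discounted return: with γ = 0.9, roughly only the first 1/(1−γ) ≈ 10 steps carry appreciable weight, so late-episode behavior barely changes a trajectory's return and credit is effectively localized early. A minimal illustration, with unit rewards standing in for a long non-terminating episode:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one episode."""
    t = np.arange(len(rewards))
    return float(np.sum(gamma**t * np.asarray(rewards, dtype=np.float64)))

# A 1000-step "non-terminating" episode with reward 1 at every step
rewards = np.ones(1000)
for gamma in (1.0, 0.99, 0.9):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 1.0  -> 1000 (every step counts equally)
# gamma = 0.9  -> ~10  (effectively only the first ~10 steps matter)
```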

Identified limitations include sensitivity to group structure (naive grouping can dilute learning signals), heavy hyperparameter dependence (notably G and γ), and instability or collapse in high-dimensional, partially observed, or non-terminating MDPs.

4. Theoretical Properties and Algorithmic Insights

GRPO’s surrogate can be interpreted as a Monte Carlo approximation of the REINFORCE estimator using a group-standardized baseline, and under mild assumptions, it estimates the policy gradient at the old policy. In practice, regular refreshing of the “old” policy prevents divergent bias. There is no learned value-function, so the algorithm is simple and memory-efficient.
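This interpretation can be checked numerically on a toy one-parameter Bernoulli policy: at θ = θ_old the ratios equal 1, the clip is inactive, and a finite-difference gradient of the clipped surrogate matches the REINFORCE estimator with the same advantages. The policy, actions, and (pretend group-normalized) advantages below are all illustrative:

```python
import numpy as np

def logp(theta, a):
    # Bernoulli policy with logit theta: pi(a=1) = sigmoid(theta)
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.log(p1 if a == 1 else 1.0 - p1)

def surrogate(theta, theta_old, actions, advs, clip_eps=0.2):
    """Clipped PPO surrogate with fixed (pretend group-normalized) advantages."""
    vals = []
    for a, A in zip(actions, advs):
        r = np.exp(logp(theta, a) - logp(theta_old, a))
        vals.append(min(r * A, np.clip(r, 1 - clip_eps, 1 + clip_eps) * A))
    return float(np.mean(vals))

theta_old = 0.3
actions = [1, 0, 1, 1]
advs = np.array([0.8, -1.2, 0.1, 0.3])

# Finite-difference gradient of the surrogate at theta = theta_old
h = 1e-6
g_surr = (surrogate(theta_old + h, theta_old, actions, advs)
          - surrogate(theta_old - h, theta_old, actions, advs)) / (2 * h)

# REINFORCE gradient with the same baseline-adjusted advantages:
# d/dtheta log pi(a) = (1 - p1) if a == 1 else -p1 for this policy
def dlogp(theta, a):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return (1.0 - p1) if a == 1 else -p1

g_reinforce = float(np.mean([A * dlogp(theta_old, a)
                             for a, A in zip(actions, advs)]))
print(g_surr, g_reinforce)  # agree to finite-difference accuracy
```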

Hybrid extensions introduce limited bootstrapping. For example, Hybrid GRPO replaces the fully empirical advantage with a combination of group-sampled returns and a critic-based value difference, yielding (Sane, 30 Jan 2025):

A_t^{\mathrm{group}} = \frac{1}{N} \sum_{i=1}^{N} f\big( r(s_t, a_t^{(i)}) \big) + \gamma V(s_{t+1}) - V(s_t)

where ff is a reward transformation. This structured approach rebalances bias and variance and can restore learning stability in challenging environments.
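A direct transcription of this advantage, with stand-in critic values and an arbitrary choice of f (the formula leaves the reward transformation abstract, so tanh here is purely illustrative):

```python
import numpy as np

def hybrid_grpo_advantage(sampled_rewards, v_next, v_now, gamma=0.99, f=np.tanh):
    """A_t = mean_i f(r(s_t, a_t^(i))) + gamma * V(s_{t+1}) - V(s_t).
    `f` is the reward transformation from the formula; np.tanh is an
    illustrative default, not prescribed by the source."""
    transformed = f(np.asarray(sampled_rewards, dtype=np.float64))
    return float(np.mean(transformed) + gamma * v_next - v_now)

# N = 4 rewards sampled for alternative actions at the same state s_t,
# with stand-in critic values for V(s_t) and V(s_{t+1})
print(hybrid_grpo_advantage([0.5, 1.0, -0.2, 0.3], v_next=2.0, v_now=1.8))
```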

5. Applicability and Empirical Results

GRPO achieves its best performance relative to PPO in short, early-termination environments such as CartPole, where trajectory-level returns suffice for reliable credit assignment. In intermediate domains (Acrobot, MountainCarContinuous), GRPO makes nonzero but slower progress, never matching PPO's sample efficiency. In long-horizon or continuous tasks (HalfCheetah, Humanoid), GRPO learning is unstable or fails altogether when γ = 1; PPO's learned critic is indispensable there.

| Environment                     | PPO performance | GRPO performance         | Optimal γ              | Optimal G |
|---------------------------------|-----------------|--------------------------|------------------------|-----------|
| CartPole                        | High            | Matches/slightly exceeds | 0.99–1                 | 8         |
| Acrobot, MountainCarContinuous  | High            | Lower, slow              | 0.99–1                 | 8         |
| HalfCheetah, Humanoid           | Highest         | Fails (with γ = 1)       | 0.9–0.95 (HalfCheetah) | 8         |

6. Relationship to Preference-Based and Multi-Sample RL

GRPO’s group-normalization is especially well-matched to settings with many parallel rollouts from a common context, as in RLHF for LLMs or batched prompt sampling. In RLHF, it avoids training a value network, making large-scale deployment feasible.

However, the group-relative baseline introduces subtleties. For ordinal rewards or partial credit, GRPO may positively reinforce failed or substandard solutions if their return exceeds a (possibly negative) group mean, leading to undesirable policies (Garg et al., 6 Nov 2025). Corrections such as Correctness-relative Policy Optimization (CoRPO) enforce minimum quality thresholds within the baseline to circumvent this pathology.
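The pathology is easy to reproduce: in a group where every rollout fails, the least-bad failure still receives a positive group-normalized advantage and is reinforced. The threshold rule below is a rough sketch of the CoRPO idea, not its exact formulation:

```python
import numpy as np

def grpo_advantages(returns, eps=1e-8):
    r = np.asarray(returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A group where every rollout failed (all returns negative): the
# least-bad failure still gets a POSITIVE advantage under GRPO.
failed_group = [-10.0, -8.0, -9.0, -2.0]
print(grpo_advantages(failed_group))  # last entry is positive

def thresholded_advantages(returns, min_return=0.0, eps=1e-8):
    """Illustrative CoRPO-style correction: only rollouts clearing a
    minimum-quality threshold may receive positive advantage."""
    r = np.asarray(returns, dtype=np.float64)
    adv = grpo_advantages(r, eps)
    return np.where(r >= min_return, adv, np.minimum(adv, 0.0))

print(thresholded_advantages(failed_group))  # no failure is reinforced
```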

Hybrid and extension methods, e.g., Hybrid GRPO or Regressive-GRPO, embed a Monte Carlo-based baseline within a value-function-predicted structure or recast the surrogate as a regression on group-normalized advantages, addressing gradient starvation and hyperparameter sensitivity (Sane, 30 Jan 2025, Park et al., 9 Jun 2025).

7. Practical Considerations and Recommendations

  • When to use GRPO: Appropriate where episodes are short, rewards are dense, or resource constraints preclude training a critic. Particularly effective with clear episode boundaries and batched or grouped data.
  • When to avoid GRPO: In tasks with long, non-terminating horizons, complex temporal dependencies, or partial observability; the absence of temporally- and state-resolved credit assignment leads to poor performance.
  • Tuning: Prioritize small group sizes (G = 8 or the minimum practical), γ = 0.99 where episodes terminate, γ ≈ 0.9 where they do not. Avoid excessively large batch or group sizes, which dilute baseline fidelity.
  • Safety: Be aware that critic-free, group-based methods can reinforce undesired outcomes if reward scales are not properly aligned or if the group normalization is inappropriately configured.

GRPO’s critic-free, group-based approach offers compelling engineering and computational advantages, but at the cost of limited applicability in complex or long-horizon control problems. Its strength is simplicity and resource efficiency for the subset of RL problems where these design choices are justified (Oliveira et al., 5 Nov 2025, Sane, 30 Jan 2025).
