Grouped-ROLIE PPO (GRPO) Overview

Updated 16 February 2026
  • The paper introduces a critic-free, group-normalized advantage estimator that replaces per-step critics with group-level Monte Carlo returns.
  • It employs per-episode returns to compute group mean and variance, stabilizing learning in short-horizon tasks without a learned value function.
  • Empirical results indicate GRPO matches or slightly outperforms PPO in environments like CartPole, while struggling in long-horizon or complex tasks.

Grouped-ROLIE PPO (GRPO) is a reinforcement learning (RL) algorithm that eliminates the learned value function (“critic”) and instead estimates advantages by comparing Monte Carlo returns within small groups of trajectories. Originally developed to stabilize RL-from-human-feedback (RLHF) for LLMs via prompt-grouping, GRPO has since been applied to classical control and other domains, often under the name Group Relative Policy Optimization. It is most naturally viewed as a “critic-free, group-normalized” alternative to Proximal Policy Optimization (PPO): it preserves PPO’s clipped surrogate but derives all advantage estimates from per-group statistics.

1. Algorithmic Structure and Mathematical Formulation

GRPO replaces the critic-based, per-step advantage estimator of PPO with a scalar episodic signal normalized within each group of trajectories. At each update, the agent:

  • Rolls out G complete episodes {τ_1, …, τ_G} using G parallel environments or by collecting repeated generations per prompt.
  • For each τ_i, computes the (possibly discounted) Monte Carlo return:

R(\tau_i) = \sum_{t=0}^{H_i - 1} \gamma^t r_t^i

  • Computes group mean and variance:

\mu_G = \frac{1}{G} \sum_{j=1}^{G} R(\tau_j), \qquad \sigma_G^2 = \frac{1}{G} \sum_{j=1}^{G} \big( R(\tau_j) - \mu_G \big)^2

  • For each trajectory, assigns a constant advantage, shared by every time step:

\widehat{A}_t^{\mathrm{GRPO}}(\tau_i) = \frac{R(\tau_i) - \mu_G}{\sigma_G + \epsilon}

  • The standard PPO-style clipped objective is then constructed:

L(\theta) = \mathbb{E}_{i,t} \Big[ \min\big\{ r_t(\theta)\, \widehat{A}_t^{\mathrm{GRPO}},\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon_{\mathrm{clip}},\, 1+\epsilon_{\mathrm{clip}}\big)\, \widehat{A}_t^{\mathrm{GRPO}} \big\} \Big]

with r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t); the clipping radius ϵ_clip is distinct from the small constant ϵ used in the advantage normalization.

No auxiliary value-function is learned or applied; all variance reduction derives from groupwise normalization.
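The update above can be sketched in a few lines of NumPy. The function names and the toy returns are illustrative, not from the paper:

```python
import numpy as np

def grpo_advantages(returns, eps=1e-8):
    """Group-normalized advantages: one scalar per episode,
    broadcast to every time step of that episode."""
    returns = np.asarray(returns, dtype=np.float64)
    mu = returns.mean()
    sigma = returns.std()  # population std, matching sigma_G above
    return (returns - mu) / (sigma + eps)

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized), averaged over samples."""
    ratios = np.asarray(ratios, dtype=np.float64)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy group of G = 4 episodic returns
adv = grpo_advantages([10.0, 12.0, 8.0, 10.0])
print(adv)  # zero-mean, unit-variance (up to eps)
```

At the first gradient step after a rollout the ratios are all 1, so the surrogate reduces to the mean group-normalized advantage, which is (approximately) zero by construction; the signal comes entirely from how the ratios move.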

2. Comparison to Standard PPO and Classical Baselines

Key distinctions between GRPO and PPO are summarized below:

|                    | PPO                                   | GRPO                                  |
|--------------------|---------------------------------------|---------------------------------------|
| Advantage          | Stepwise GAE or TD(λ) estimate        | Per-episode scalar (group-normalized) |
| Critic             | Learned value network V_φ(s)          | None                                  |
| Baseline           | V_φ(s_t) (state-dependent)            | μ_G (group mean, independent of t)    |
| Credit assignment  | Time-varying, state-dependent         | Uniform per episode                   |
| Surrogate          | Clipped PPO (with optional value loss)| Clipped PPO (no value loss)           |
| Rollout            | Fixed n-step trajectories             | Full episodes                         |

PPO’s learned critic enables low-variance, fine-grained credit assignment, which matters most for long-horizon or continuous control. GRPO forfeits this temporal granularity, accepting bias in exchange for simplicity and often higher sample efficiency in short-horizon, terminating tasks (Oliveira et al., 5 Nov 2025).
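To make the credit-assignment contrast concrete, the sketch below computes a PPO-style GAE advantage for a sparse-reward episode, using hand-supplied stand-in critic values; GRPO would instead assign one group-normalized scalar to all three steps. All numbers are illustrative:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: a different advantage at every
    time step, using a (here hand-supplied) value estimate as baseline.
    The episode is assumed to terminate, so the bootstrap value after
    the last step is 0."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

rewards = [0.0, 0.0, 1.0]   # sparse reward at the end of the episode
values  = [0.3, 0.5, 0.9]   # stand-in critic predictions, not learned here
print(gae(rewards, values))  # three different per-step advantages
# GRPO would assign one identical number to all three steps of this episode.
```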

3. Empirical Behavior: Hyperparameters, Trade-offs, and Failure Modes

Extensive ablation studies in classical control establish the following findings (Oliveira et al., 5 Nov 2025):

  • Group size (G): Smaller groups (e.g., G = 8) yield the best sample efficiency. Larger G reduces baseline noise but drastically limits the frequency of policy updates and, counterintuitively, dilutes the utility of the baseline due to disparate initial states within the group.
  • Discount factor (γ): High values (γ = 0.99 or 1) are optimal in tasks with early termination, because the episode boundary itself provides signal separation. In very long, non-terminating environments (e.g., HalfCheetah), a moderate γ ≈ 0.9 localizes credit assignment to earlier rewards and prevents learning from stalling.
  • Baselines: All tested critic-free baselines (group mean, exponential moving average, even a random Gaussian baseline) fail to match PPO in long-horizon environments. Only in short-horizon settings (e.g., CartPole) does GRPO match or outperform PPO, likely because PPO tends to overfit or overtrain in the toy regime.
  • Credit assignment: No explicit temporal credit is possible. The assigned advantage is constant across all time steps of a given trajectory, which is inefficient when rewards are temporally sparse or require stepwise attribution.
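The γ effect in non-terminating tasks follows directly from the discounted return: with γ = 0.9, roughly only the first 1/(1−γ) ≈ 10 steps carry appreciable weight, so late-episode behavior barely changes a trajectory's return and credit is effectively localized early. A minimal illustration, with unit rewards standing in for a long non-terminating episode:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one episode."""
    t = np.arange(len(rewards))
    return float(np.sum(gamma**t * np.asarray(rewards, dtype=np.float64)))

# A 1000-step "non-terminating" episode with reward 1 at every step
rewards = np.ones(1000)
for gamma in (1.0, 0.99, 0.9):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 1.0  -> 1000 (every step counts equally)
# gamma = 0.9  -> ~10  (effectively only the first ~10 steps matter)
```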

Identified limitations include sensitivity to group structure (naive grouping can dilute learning signals), heavy hyperparameter dependence (notably G and γ), and instability or collapse in high-dimensional, partially observed, or non-terminating MDPs.

4. Theoretical Properties and Algorithmic Insights

GRPO’s surrogate can be interpreted as a Monte Carlo approximation of the REINFORCE estimator using a group-standardized baseline, and under mild assumptions, it estimates the policy gradient at the old policy. In practice, regular refreshing of the “old” policy prevents divergent bias. There is no learned value-function, so the algorithm is simple and memory-efficient.
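This interpretation can be checked numerically on a toy one-parameter Bernoulli policy: at θ = θ_old the ratios equal 1, the clip is inactive, and a finite-difference gradient of the clipped surrogate matches the REINFORCE estimator with the same advantages. The policy, actions, and (pretend group-normalized) advantages below are all illustrative:

```python
import numpy as np

def logp(theta, a):
    # Bernoulli policy with logit theta: pi(a=1) = sigmoid(theta)
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return np.log(p1 if a == 1 else 1.0 - p1)

def surrogate(theta, theta_old, actions, advs, clip_eps=0.2):
    """Clipped PPO surrogate with fixed (pretend group-normalized) advantages."""
    vals = []
    for a, A in zip(actions, advs):
        r = np.exp(logp(theta, a) - logp(theta_old, a))
        vals.append(min(r * A, np.clip(r, 1 - clip_eps, 1 + clip_eps) * A))
    return float(np.mean(vals))

theta_old = 0.3
actions = [1, 0, 1, 1]
advs = np.array([0.8, -1.2, 0.1, 0.3])

# Finite-difference gradient of the surrogate at theta = theta_old
h = 1e-6
g_surr = (surrogate(theta_old + h, theta_old, actions, advs)
          - surrogate(theta_old - h, theta_old, actions, advs)) / (2 * h)

# REINFORCE gradient with the same baseline-adjusted advantages:
# d/dtheta log pi(a) = (1 - p1) if a == 1 else -p1 for this policy
def dlogp(theta, a):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    return (1.0 - p1) if a == 1 else -p1

g_reinforce = float(np.mean([A * dlogp(theta_old, a)
                             for a, A in zip(actions, advs)]))
print(g_surr, g_reinforce)  # agree to finite-difference accuracy
```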

Hybrid extensions introduce limited bootstrapping. For example, Hybrid GRPO replaces the fully empirical advantage with a combination of group-sampled returns and a critic-based value difference, yielding (Sane, 30 Jan 2025):

A_t^{\mathrm{group}} = \frac{1}{N} \sum_{i=1}^{N} f\big( r(s_t, a_t^{(i)}) \big) + \gamma V(s_{t+1}) - V(s_t)

where ff is a reward transformation. This structured approach rebalances bias and variance and can restore learning stability in challenging environments.
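A direct transcription of this advantage, with stand-in critic values and an arbitrary choice of f (the formula leaves the reward transformation abstract, so tanh here is purely illustrative):

```python
import numpy as np

def hybrid_grpo_advantage(sampled_rewards, v_next, v_now, gamma=0.99, f=np.tanh):
    """A_t = mean_i f(r(s_t, a_t^(i))) + gamma * V(s_{t+1}) - V(s_t).
    `f` is the reward transformation from the formula; np.tanh is an
    illustrative default, not prescribed by the source."""
    transformed = f(np.asarray(sampled_rewards, dtype=np.float64))
    return float(np.mean(transformed) + gamma * v_next - v_now)

# N = 4 rewards sampled for alternative actions at the same state s_t,
# with stand-in critic values for V(s_t) and V(s_{t+1})
print(hybrid_grpo_advantage([0.5, 1.0, -0.2, 0.3], v_next=2.0, v_now=1.8))
```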

5. Applicability and Empirical Results

GRPO achieves its best performance relative to PPO in short, early-termination environments such as CartPole, where trajectory-level returns suffice for reliable credit assignment. In intermediate domains (Acrobot, MountainCarContinuous), GRPO makes nonzero but slower progress, never matching PPO's sample efficiency. In long-horizon or continuous tasks (HalfCheetah, Humanoid), GRPO learning is unstable or fails altogether when γ = 1; PPO's learned critic is indispensable there.

| Environment                     | PPO performance | GRPO performance         | Optimal γ              | Optimal G |
|---------------------------------|-----------------|--------------------------|------------------------|-----------|
| CartPole                        | High            | Matches/slightly exceeds | 0.99–1                 | 8         |
| Acrobot, MountainCarContinuous  | High            | Lower, slow              | 0.99–1                 | 8         |
| HalfCheetah, Humanoid           | Highest         | Fails (with γ = 1)       | 0.9–0.95 (HalfCheetah) | 8         |

6. Relationship to Preference-Based and Multi-Sample RL

GRPO’s group-normalization is especially well-matched to settings with many parallel rollouts from a common context, as in RLHF for LLMs or batched prompt sampling. In RLHF, it avoids training a value network, making large-scale deployment feasible.

However, the group-relative baseline introduces subtleties. For ordinal rewards or partial credit, GRPO may positively reinforce failed or substandard solutions if their return exceeds a (possibly negative) group mean, leading to undesirable policies (Garg et al., 6 Nov 2025). Corrections such as Correctness-relative Policy Optimization (CoRPO) enforce minimum quality thresholds within the baseline to circumvent this pathology.
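The pathology is easy to reproduce: in a group where every rollout fails, the least-bad failure still receives a positive group-normalized advantage and is reinforced. The threshold rule below is a rough sketch of the CoRPO idea, not its exact formulation:

```python
import numpy as np

def grpo_advantages(returns, eps=1e-8):
    r = np.asarray(returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A group where every rollout failed (all returns negative): the
# least-bad failure still gets a POSITIVE advantage under GRPO.
failed_group = [-10.0, -8.0, -9.0, -2.0]
print(grpo_advantages(failed_group))  # last entry is positive

def thresholded_advantages(returns, min_return=0.0, eps=1e-8):
    """Illustrative CoRPO-style correction: only rollouts clearing a
    minimum-quality threshold may receive positive advantage."""
    r = np.asarray(returns, dtype=np.float64)
    adv = grpo_advantages(r, eps)
    return np.where(r >= min_return, adv, np.minimum(adv, 0.0))

print(thresholded_advantages(failed_group))  # no failure is reinforced
```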

Hybrid and extension methods, e.g., Hybrid GRPO or Regressive-GRPO, embed a Monte Carlo-based baseline within a value-function-predicted structure or recast the surrogate as a regression on group-normalized advantages, addressing gradient starvation and hyperparameter sensitivity (Sane, 30 Jan 2025, Park et al., 9 Jun 2025).

7. Practical Considerations and Recommendations

  • When to use GRPO: Appropriate where episodes are short, rewards are dense, or resource constraints preclude training a critic. Particularly effective with clear episode boundaries and batched or grouped data.
  • When to avoid GRPO: In tasks with long, non-terminating horizons, complex temporal dependencies, or partial observability; the absence of temporally- and state-resolved credit assignment leads to poor performance.
  • Tuning: Prioritize small group sizes (G = 8 or the minimum practical), γ = 0.99 where episodes terminate, γ ≈ 0.9 where they do not. Avoid excessively large batch or group sizes, which dilute baseline fidelity.
  • Safety: Be aware that critic-free, group-based methods can reinforce undesired outcomes if reward scales are not properly aligned or if the group normalization is inappropriately configured.

GRPO’s critic-free, group-based approach offers compelling engineering and computational advantages, but at the cost of limited applicability in complex or long-horizon control problems. Its strength is simplicity and resource efficiency for the subset of RL problems where these design choices are justified (Oliveira et al., 5 Nov 2025, Sane, 30 Jan 2025).
