Group Relative Policy Optimization

Updated 9 February 2026
  • GRPO is a group-wise reinforcement learning method that normalizes rewards within a set of rollouts to compute relative advantages.
  • The median-centered variant (MC-GRPO) employs a robust median-based estimator to reduce variance and prevent sign flips in low-rollout settings.
  • Empirical results demonstrate improved accuracy and stability across language, vision, and multimodal tasks, particularly under resource constraints.

Group Relative Policy Optimization (GRPO) is a family of policy gradient algorithms for reinforcement learning (RL) that replaces the conventional value-function-based advantage estimator of Proximal Policy Optimization (PPO) with a group-wise, relative normalization over reward samples. Initially developed for LLM fine-tuning under reinforcement learning from human feedback (RLHF), GRPO’s simplicity and variance reduction have led to wide adoption and a proliferation of theoretical and applied variants across language, vision, and multimodal domains. This article details the standard formulation of GRPO, motivating background, the core design principles, key algorithmic and empirical findings, and the diverse extensions developed to address the method's limitations and adapt it to practical constraints.

1. Core Principles and Standard Objective

GRPO operates by generating a group of $G$ trajectories ("rollouts") for each prompt or initial state, scoring each with a scalar reward, and normalizing these rewards within the group to compute relative advantages. Specifically, given a prompt $q$, a policy $\pi_\theta$ generates $G$ outputs $\{o_1, \dots, o_G\}$, each scored $r_i = R(q, o_i)$. The group mean baseline is

$$b(q) = \frac{1}{G} \sum_{j=1}^{G} r_j,$$

and the advantage estimator is

$$A_i = \frac{r_i - b(q)}{s_r(q) + \epsilon},$$

where $s_r(q)$ is the empirical standard deviation of $\{r_j\}$ and $\epsilon$ is a numerical stabilizer. The policy gradient update is then given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ A(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right],$$

where $A(\tau)$ is the computed advantage for trajectory $\tau$ (Kim, 30 Jan 2026; Li et al., 26 Mar 2025).

GRPO enforces purely relative learning: the policy is updated to prefer above-average completions within each prompt's group, making the update invariant to both the scale and the offset of the reward.
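The advantage computation above can be sketched in a few lines of NumPy (the reward values here are hypothetical, chosen only to illustrate the normalization):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center by the group mean,
    scale by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    baseline = r.mean()        # b(q): group mean baseline
    scale = r.std() + eps      # s_r(q) + epsilon
    return (r - baseline) / scale

# Hypothetical rewards for G = 4 rollouts of one prompt:
# above-average rollouts get positive advantages, below-average negative.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because of the centering and scaling, shifting or rescaling all rewards for a prompt (e.g. `[15, 5, 15, 5]` instead of `[1, 0, 1, 0]`) leaves the advantages essentially unchanged, which is the scale/offset invariance noted above.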

2. Limitations of Mean Baseline at Small Group Size

In the low-rollout regime ($G \leq 4$), mean-baselined GRPO is prone to high variance due to outlier sensitivity. When the group size is small, a single outlier can dominate the mean, producing advantage sign flips: trajectories that are actually of above-average quality may be treated as below-average (and vice versa), leading to incorrect gradient directions. Empirically, with $G = 2$, the rate of disagreement between the sign of $A_i$ and the "oracle" sign (computed from a large group) can exceed 15%, and even sign-flip rates as low as 5% translate to empirical task accuracy drops of several percentage points (Kim, 30 Jan 2026).
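The sign-flip phenomenon is easy to reproduce with a small Monte Carlo experiment. The sketch below uses a synthetic reward distribution (Gaussian noise with occasional large outliers, chosen purely for illustration and not taken from the paper) and compares the sign of the $G = 2$ mean-centered advantage against the sign induced by an "oracle" baseline estimated from a very large group:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rewards(size):
    # Gaussian rewards with occasional large positive outliers
    base = rng.normal(0.0, 1.0, size=size)
    outlier = (rng.random(size) < 0.1) * rng.normal(8.0, 1.0, size=size)
    return base + outlier

# Large-group "oracle" baseline
oracle_mean = sample_rewards(100_000).mean()

trials = 20_000
r = sample_rewards((trials, 2))                # G = 2 rollouts per prompt
small_adv = r - r.mean(axis=1, keepdims=True)  # mean-centered within the pair
oracle_adv = r - oracle_mean                   # sign against the oracle baseline

flip_rate = np.mean(np.sign(small_adv) != np.sign(oracle_adv))
print(f"sign-disagreement rate at G=2: {flip_rate:.1%}")
```

Under this contaminated distribution the disagreement rate is substantial: whenever both rollouts in a pair land on the same side of the true baseline, the within-pair mean forces one of them to the wrong sign.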

3. Median-Centered GRPO (MC-GRPO) for Robust Advantage Estimation

MC-GRPO addresses the failure mode of the sample mean baseline under small group sizes by replacing it with a robust median-based estimator. The procedure:

  • Draw $G+1$ samples per prompt, each scored $r_i$.
  • Let $b_\mathrm{med}(q) = \operatorname{median}\{r_1, \dots, r_{G+1}\}$.
  • Define the median absolute deviation (MAD): $\mathrm{MAD}(r) = \operatorname{median}\{|r_i - b_\mathrm{med}|\}$.
  • Compute advantages as $A_i = (r_i - b_\mathrm{med}) / (\mathrm{MAD}(r) + \epsilon)$.

By construction, one rollout has $A_{i^\ast} = 0$ (the "pivot" at the group median): this sample is excluded from the policy gradient computation, preserving $G$ active gradient contributors and maintaining computational efficiency (Kim, 30 Jan 2026).
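The median/MAD procedure can be sketched directly (hypothetical reward values; the sketch assumes $G+1$ is odd, so the median coincides with an actual sample and the pivot's advantage is exactly zero):

```python
import numpy as np

def mc_grpo_advantages(rewards, eps=1e-6):
    """Median-centered, MAD-scaled advantages over G+1 rewards.

    Returns (advantages, pivot_index); the pivot is the rollout at the
    group median (advantage exactly zero) and is dropped from the update.
    """
    r = np.asarray(rewards, dtype=float)
    assert len(r) % 2 == 1, "draw an odd number (G+1) of samples"
    b = np.median(r)                     # robust baseline
    mad = np.median(np.abs(r - b))       # robust scale (MAD)
    adv = (r - b) / (mad + eps)
    pivot = int(np.argmin(np.abs(adv)))  # the median sample
    return adv, pivot

# Hypothetical G+1 = 3 rewards with one large outlier
adv, pivot = mc_grpo_advantages([0.0, 1.0, 10.0])
```

Note that with these rewards the mean baseline would be $\approx 3.67$, assigning the middle-quality rollout a negative advantage; the median baseline instead marks it as the zero-advantage pivot and keeps the other two signs sensible.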

MC-GRPO's mechanics are summarized in the table below:

| Step | Vanilla GRPO | MC-GRPO |
| --- | --- | --- |
| Baseline | Mean of rewards | Median of rewards |
| Advantage scale | Empirical standard deviation | Median absolute deviation (MAD) |
| Samples used | All $G$ rollouts | $G+1$ drawn; median sample discarded, $G$ used |
| Update cost | $G$ backprops per prompt | $G$ backprops per prompt (1 extra forward pass) |

The median-based estimator is significantly less sensitive to outliers than the mean; it reduces the rate of sign flips and thus yields more stable optimization and improved generalization in resource-constrained settings.
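A minimal numeric illustration of this robustness (hypothetical rewards): with one outlier in a group of four, the mean baseline flips the sign of a genuinely above-median rollout, while the median baseline does not.

```python
import numpy as np

r = np.array([0.2, 0.3, 0.4, 10.0])  # one outlier dominates the mean
mean_adv = r - r.mean()              # mean baseline: 2.725
med_adv = r - np.median(r)           # median baseline: 0.35

# The rollout with reward 0.4 beats two of its three peers, yet the
# mean baseline assigns it a negative advantage; the median baseline
# keeps its sign positive.
print(mean_adv[2], med_adv[2])
```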

4. Empirical Findings and Benchmark Results

MC-GRPO demonstrates substantial empirical gains in the small-rollout regime:

  • On GSM8K with Qwen3-1.7B at $G = 2$, MC-GRPO improves exact-match accuracy from 78.9% (mean-baseline GRPO) to 83.5% (+4.6 points absolute). At $G = 4$, it improves from 81.3% to 84.0% (+2.7 points). The accuracy gap between $G = 2$ and $G = 8$ shrinks from 5.6 points to 1.0 point (Kim, 30 Jan 2026).
  • Robustness to outlier-driven sign flips is preserved across model scales (Qwen3-1.7B, Llama-3.2-3B, Qwen2.5-Math variants) and datasets (GSM8K, Math-500, OOD contests).
  • Training curves are smoother and converge faster; marginal cost of the extra rollout is negligible in common high-throughput setups.

These improvements are observed not only in the base GRPO but also in MC-variants of related group-based policy optimization methods such as DAPO and DR-GRPO.

5. Algorithmic Implementation and Pseudocode

Algorithmic steps for MC-GRPO:

  1. Sampling: for each prompt $q$, sample $G+1$ completions $\{o_1, \dots, o_{G+1}\}$ from $\pi_{\theta_\text{old}}$.
  2. Reward computation: score completions $r_i = R(q, o_i)$.
  3. Baseline: set $b = \operatorname{median}(r_1, \dots, r_{G+1})$ and $\sigma = \mathrm{MAD}(r) + \epsilon$.
  4. Advantage calculation: $A_i = (r_i - b) / \sigma$ for all $i$.
  5. Pivot identification: find $i^\ast$ such that $A_{i^\ast} = 0$; drop this sample from updates.
  6. Policy update: compute the PPO-style clipped surrogate loss over the remaining $G$ rollouts.

This structure ensures the effective batch size for backpropagation is unchanged relative to vanilla GRPO.
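The steps above can be condensed into a single loss function. This is a torch-free NumPy sketch under stated assumptions: sequence log-probabilities under the current and behavior policies are taken as given arrays, the function name `mc_grpo_loss` is hypothetical, and the sequence-level clipping shown here is one common choice rather than the paper's exact implementation:

```python
import numpy as np

def mc_grpo_loss(logp_new, logp_old, rewards, eps=1e-6, clip=0.2):
    """One MC-GRPO loss evaluation for a single prompt's G+1 rollouts.

    Returns the negated clipped surrogate, averaged over the G
    non-pivot rollouts.
    """
    r = np.asarray(rewards, dtype=float)
    b = np.median(r)                              # step 3: robust baseline
    sigma = np.median(np.abs(r - b)) + eps        # step 3: MAD scale
    adv = (r - b) / sigma                         # step 4: advantages
    keep = np.ones(len(r), dtype=bool)
    keep[int(np.argmin(np.abs(adv)))] = False     # step 5: drop the pivot
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))[keep]
    a = adv[keep]
    surrogate = np.minimum(ratio * a,             # step 6: PPO-style clip
                           np.clip(ratio, 1 - clip, 1 + clip) * a)
    return -surrogate.mean()                      # minimize the negative objective

# With identical policies (ratio = 1), the loss reduces to minus the
# mean advantage of the kept rollouts.
loss = mc_grpo_loss([-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0], [0.0, 1.0, 10.0])
```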

6. Practical Recommendations and Extensions

  • MC-GRPO is recommended whenever the rollout sampling budget restricts $G$ to 2–4 per prompt. The method is a drop-in replacement for within-prompt mean baselines in GRPO-style code paths.
  • The cost of sampling the one extra rollout is negligible for batched high-throughput inference.
  • At larger $G$ (8 or more), the mean provides a sufficiently accurate baseline; median centering gives diminishing returns.
  • MC-GRPO is orthogonal to regularization (KL penalties), sequence-level PPO variants, curriculum selection, and reward shaping, and can be combined freely with these additional techniques (Kim, 30 Jan 2026).

7. Summary and Theoretical Significance

MC-GRPO constitutes a minimal yet effective modification to GRPO, targeting the instability that arises from mean-baseline variance in the small-batch regime. By introducing a robust group-centered baseline, it suppresses the primary failure mode, sign flips in the advantage estimate, conferring notable stability and accuracy gains in compute-constrained reinforcement learning. This shift improves the reliability of group-relative RL pipelines for LLM alignment and generalizes readily to other group-based advantage frameworks (Kim, 30 Jan 2026).
