Group Relative Policy Optimization
- GRPO is a group-wise reinforcement learning method that normalizes rewards within a set of rollouts to compute relative advantages.
- The median-centered variant (MC-GRPO) employs a robust median-based estimator to reduce variance and prevent sign flips in low-rollout settings.
- Empirical results demonstrate improved accuracy and stability across language, vision, and multimodal tasks, particularly under resource constraints.
Group Relative Policy Optimization (GRPO) is a family of policy gradient algorithms for reinforcement learning (RL) that replaces the conventional value-function-based advantage estimator of Proximal Policy Optimization (PPO) with a group-wise, relative normalization over reward samples. Initially developed for LLM fine-tuning under reinforcement learning from human feedback (RLHF), GRPO's simplicity and variance reduction have led to wide adoption and a proliferation of theoretical and applied variants across language, vision, and multimodal domains. This article details GRPO's standard formulation, its motivating background, core design principles, key algorithmic and empirical findings, and the diverse extensions developed to address the method's limitations and adapt it to practical constraints.
1. Core Principles and Standard Objective
GRPO operates by generating a group of trajectories ("rollouts") for each prompt or initial state, scoring each with a scalar reward, and normalizing these rewards within the group to compute relative advantages. Specifically, given a prompt $x$, a policy $\pi_\theta$ generates $G$ outputs $y_1, \dots, y_G$, each scored $r_i = R(x, y_i)$. The group mean baseline is

$$\bar{r} = \frac{1}{G} \sum_{i=1}^{G} r_i,$$

and the advantage estimator is

$$A_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon},$$

where $\sigma_r$ is the empirical standard deviation of $r_1, \dots, r_G$ and $\epsilon$ is a numerical stabilizer. The policy gradient update is then given by

$$\nabla_\theta J(\theta) = \mathbb{E}_x\!\left[\frac{1}{G} \sum_{i=1}^{G} A_i \, \nabla_\theta \log \pi_\theta(y_i \mid x)\right],$$

where $A_i$ is the computed advantage for trajectory $y_i$ (Kim, 30 Jan 2026; Li et al., 26 Mar 2025).
GRPO enforces purely relative learning: the policy is updated to prefer above-average completions within each prompt's group, making the update invariant to both the scale and the offset of the reward.
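The normalization above can be sketched in a few lines of NumPy. The function name and the `eps` default here are illustrative, not taken from any reference implementation:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Vanilla GRPO: mean-centered, std-normalized advantages for one group.

    rewards: scalar rewards r_1..r_G for one prompt's rollouts.
    eps: the numerical stabilizer from the advantage formula.
    """
    r = np.asarray(rewards, dtype=float)
    baseline = r.mean()           # group mean baseline
    scale = r.std() + eps         # empirical standard deviation
    return (r - baseline) / scale

# Advantages are invariant to reward scale and offset:
a = grpo_advantages([1.0, 2.0, 3.0, 6.0])
b = grpo_advantages([10.0, 20.0, 30.0, 60.0])  # rescaled rewards, same result
```

Because the baseline and scale are recomputed per group, rescaling or shifting all rewards for a prompt leaves the advantages (essentially) unchanged.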
2. Limitations of Mean Baseline at Small Group Size
In the low-rollout regime (small group size $G$, e.g. $G \le 4$), mean-baselined GRPO is prone to high variance due to outlier sensitivity. When the group size is small, a single outlier can dominate the mean, producing advantage sign flips: trajectories that are actually of above-average quality may be treated as below-average (and vice versa), leading to incorrect gradient directions. Empirically, at small $G$ the rate of disagreement between the sign of $A_i$ and the "oracle" sign (computed from a large group) can exceed 15%, and even sign-flip rates as low as 5% translate to empirical task accuracy drops of several percentage points (Kim, 30 Jan 2026).
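A minimal numerical illustration of the sign-flip failure mode; the reward values below are made up for illustration:

```python
import numpy as np

# Hypothetical rewards for a group of G = 4 rollouts; one outlier (the
# last reward) drags the group mean far above the typical reward.
rewards = np.array([0.6, 0.7, 0.8, 5.0])

mean_adv = rewards - rewards.mean()        # mean baseline = 1.775
median_adv = rewards - np.median(rewards)  # median baseline = 0.75

# Under the mean baseline, all three ordinary rollouts get negative
# advantages, even though the third one beats the group median.
print(np.sign(mean_adv))    # [-1. -1. -1.  1.]
print(np.sign(median_adv))  # [-1. -1.  1.  1.]
```

The third rollout's advantage flips sign depending on the baseline: the outlier corrupts the mean but barely moves the median.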
3. Median-Centered GRPO (MC-GRPO) for Robust Advantage Estimation
MC-GRPO addresses the failure mode of the sample mean baseline under small group sizes by replacing it with a robust median-based estimator. The procedure:
- Draw $G+1$ samples $y_1, \dots, y_{G+1}$ per prompt, each scored $r_i = R(x, y_i)$.
- Let $m = \operatorname{median}(r_1, \dots, r_{G+1})$.
- Define the Median Absolute Deviation (MAD): $\mathrm{MAD} = \operatorname{median}_i\, |r_i - m|$.
- Compute advantages as $A_i = \dfrac{r_i - m}{\mathrm{MAD} + \epsilon}$.
By construction, exactly one rollout has $A_i = 0$ (the "pivot" at the group median): this sample is excluded from the policy gradient computation, preserving $G$ active gradient contributors and maintaining computational efficiency (Kim, 30 Jan 2026).
MC-GRPO's mechanics are summarized in the table below:
| Step | Vanilla GRPO | MC-GRPO |
|---|---|---|
| Baseline | Mean of rewards | Median of rewards |
| Advantage scale | Empirical standard deviation | Median Absolute Deviation (MAD) |
| Samples used | All $G$ rollouts | $G+1$ drawn; median sample discarded, $G$ used |
| Update cost | $G$ backprops per prompt | $G$ backprops per prompt (1 extra forward per prompt) |
The median-based estimator is significantly less sensitive to outliers than the mean, reducing the rate of sign flips, and thus leads to more stable optimization and improved generalization in resource-constrained settings.
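The median/MAD procedure above can be sketched as follows; the helper name and the tie-breaking choice for the pivot are assumptions of this sketch, not the paper's reference code:

```python
import numpy as np

def mc_grpo_advantages(rewards, eps=1e-8):
    """Median-centered GRPO advantages for one group of G+1 rollouts.

    Returns (advantages, active_mask). The rollout sitting at the group
    median gets A = 0 and is masked out of the policy update, leaving G
    active gradient contributors. Assumes an odd group size (G+1 with G
    even) so the median coincides with exactly one sample.
    """
    r = np.asarray(rewards, dtype=float)
    m = np.median(r)                       # robust baseline
    mad = np.median(np.abs(r - m))         # Median Absolute Deviation
    adv = (r - m) / (mad + eps)
    pivot = int(np.argmin(np.abs(r - m)))  # sample at the median (first, on ties)
    active = np.ones_like(r, dtype=bool)
    active[pivot] = False                  # drop the zero-advantage pivot
    return adv, active
```

For example, with rewards `[0.1, 0.4, 0.5, 0.9, 3.0]` ($G+1 = 5$), the median is 0.5 and the MAD is 0.4; the middle sample gets zero advantage and is masked out, and the outlier 3.0 inflates neither the baseline nor the scale.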
4. Empirical Findings and Benchmark Results
MC-GRPO demonstrates substantial empirical gains in the small-rollout regime:
- On GSM8K with Qwen3-1.7B at $G=2$, MC-GRPO improves exact-match accuracy from 78.9% (mean-baseline GRPO) to 83.5% (+4.6 points absolute). At $G=4$, it improves from 81.3% to 84.0% (+2.7 points). The accuracy gap between the small- and large-group settings shrinks from 5.6% to 1.0% (Kim, 30 Jan 2026).
- Robustness to outlier-driven sign flips is preserved across model scales (Qwen3-1.7B, Llama-3.2-3B, Qwen2.5-Math variants) and datasets (GSM8K, Math-500, OOD contests).
- Training curves are smoother and converge faster; marginal cost of the extra rollout is negligible in common high-throughput setups.
These improvements are observed not only in the base GRPO but also in MC-variants of related group-based policy optimization methods such as DAPO and DR-GRPO.
5. Algorithmic Implementation and Pseudocode
Algorithmic steps for MC-GRPO:
- Sampling: For each prompt $x$, sample $G+1$ completions $y_1, \dots, y_{G+1}$ from $\pi_\theta(\cdot \mid x)$.
- Reward computation: Score completions to obtain $r_i = R(x, y_i)$.
- Baseline: Set $m = \operatorname{median}(r_1, \dots, r_{G+1})$ and $\mathrm{MAD} = \operatorname{median}_i\, |r_i - m|$.
- Advantage calculation: $A_i = (r_i - m) / (\mathrm{MAD} + \epsilon)$ for all $i$.
- Pivot identification: Find $i^*$ such that $r_{i^*} = m$ (hence $A_{i^*} = 0$); drop this sample from updates.
- Policy update: Compute the PPO-style clipped surrogate loss over the remaining $G$ rollouts.
This structure ensures the effective batch size for backpropagation is unchanged relative to vanilla GRPO.
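The final update step can be sketched as follows, assuming per-rollout log-probabilities are already available. The helper name, the `clip_eps` default, and the NumPy (non-autograd) formulation are all illustrative simplifications; a real trainer would compute this on per-token log-probs with autograd:

```python
import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, adv, active, clip_eps=0.2):
    """PPO-style clipped surrogate over the active (non-pivot) rollouts.

    logp_new / logp_old: per-rollout log-probabilities under the current
    and behavior policies; adv: MC-GRPO advantages; active: boolean mask
    excluding the median pivot.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_sample = np.minimum(unclipped, clipped)  # pessimistic (clipped) bound
    return -per_sample[active].mean()            # negate: we maximize the surrogate
```

Averaging only over the `active` mask is what keeps the effective batch size at $G$, matching vanilla GRPO.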
6. Practical Recommendations and Extensions
- MC-GRPO is recommended whenever the rollout sampling budget restricts $G$ to 2–4 per prompt. The method is a drop-in replacement for within-prompt mean baselines in GRPO-style code paths.
- The cost of sampling the one extra rollout is negligible for batched high-throughput inference.
- At larger group sizes ($G \geq 8$), the mean provides a sufficiently accurate baseline; median centering gives diminishing returns.
- MC-GRPO is orthogonal to regularization (KL penalties), sequence-level PPO variants, curriculum selection, and reward shaping, and can be combined freely with these additional techniques (Kim, 30 Jan 2026).
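As a sketch of the drop-in recommendation, a single helper can switch between the two baselines by group size. The `use_median` heuristic and the threshold of 8 are assumptions drawn from the recommendations above, not from the paper's code:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8, use_median=None):
    """Drop-in baseline switch: median/MAD centering for small groups,
    mean/std centering otherwise (threshold of 8 is an assumed heuristic)."""
    r = np.asarray(rewards, dtype=float)
    if use_median is None:
        use_median = r.size < 8        # small-rollout regime
    if use_median:
        center = np.median(r)
        scale = np.median(np.abs(r - center))  # MAD
    else:
        center = r.mean()
        scale = r.std()
    return (r - center) / (scale + eps)
```

This sketch omits the extra-rollout/pivot-drop bookkeeping for brevity; it only illustrates that the centering statistic is the sole change to the advantage computation.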
7. Summary and Theoretical Significance
MC-GRPO constitutes a minimal yet powerful modification to GRPO, targeting the instability that arises from mean-baseline variance in the small-batch regime. By introducing a robust group-centered baseline, it suppresses the primary failure mode of the small-group estimator, sign flips in the advantage function, conferring notable stability and accuracy enhancements in compute-constrained reinforcement learning. This shift improves the reliability of group-relative RL pipelines for LLM alignment and generalizes easily to other group-based advantage frameworks (Kim, 30 Jan 2026).