MC-GRPO: Reliable Policy Optimization for LMs
- MC-GRPO is a reinforcement learning method that replaces mean reward normalization with a robust median-based approach to reduce sign flips in policy gradients.
- It leverages median absolute deviation (MAD) for advantage computation, ensuring stable and reliable updates even with small rollout budgets.
- Empirical evaluations demonstrate 2–5% accuracy gains across diverse language models and datasets, with minimal extra inference overhead.
Median-Centered Group Relative Policy Optimization (MC-GRPO) is a reinforcement learning method for LLM training that addresses instability and degraded performance in small-rollout regimes within the Group Relative Policy Optimization (GRPO) framework. By replacing the sample mean reward baseline with a group median, MC-GRPO provides robust advantage normalization, notably reducing the frequency of sign flips in policy gradients induced by outlier rewards. This approach enables efficient and stable learning with tight computational budgets and demonstrates empirical gains across a range of LLM sizes and evaluation tasks (Kim, 30 Jan 2026).
1. Group Relative Policy Optimization (GRPO) and Mean-Baseline Limitations
Group Relative Policy Optimization (GRPO) is a family of PPO-style reinforcement learning objectives designed for batched policy improvement in LLMs. For each prompt $x$, GRPO samples $G$ completions (rollouts) $y_1, \dots, y_G$ from the current policy $\pi_{\theta_{\text{old}}}$, obtaining corresponding scalar rewards $r_1, \dots, r_G$. Rather than depending on a learned value function, GRPO estimates the advantage for each trajectory by centering its reward relative to the group mean:

$$\hat{A}_i = \frac{r_i - \bar{r}}{s + \varepsilon}, \qquad \bar{r} = \frac{1}{G} \sum_{j=1}^{G} r_j,$$

where $s$ is the sample standard deviation of $r_1, \dots, r_G$ and $\varepsilon$ is a small constant. Omitting the scaling yields the core mean-baseline formulation $\hat{A}_i = r_i - \bar{r}$. This mean normalization stabilizes training when $G$ is large, but for small $G$, the sample mean is highly sensitive to outlier rewards. This sensitivity induces "advantage sign flips," in which the sign of $\hat{A}_i$ is reversed for good or bad rollouts, potentially leading to incorrect update directions and degraded policy learning (Kim, 30 Jan 2026).
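As a concrete illustration, the mean-baseline advantage can be sketched in a few lines. This is a minimal sketch of the formulation above, not the authors' code; the `eps` value is an assumption:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Mean-baseline GRPO advantages: center each reward on the group
    mean and scale by the sample standard deviation (sketch only)."""
    baseline = statistics.mean(rewards)
    scale = statistics.stdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]

# A single extreme reward drags the group mean far below the other
# rewards, so the three ordinary rollouts (reward 1.0) receive large
# positive advantages driven entirely by the outlier.
adv = grpo_advantages([1.0, 1.0, 1.0, -10.0])
```

With $G = 4$ and one outlier, the baseline shifts from 1.0 to -1.75, which is exactly the small-group sensitivity discussed above.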
2. Median-Centered Advantage: Robustness to Outliers
To mitigate sign flips in scenarios where only a small number of rollouts per prompt is feasible, MC-GRPO replaces the baseline mean with the baseline median. For each prompt, $G+1$ rollouts are sampled; the median reward over these is denoted $m = \operatorname{median}(r_1, \dots, r_{G+1})$, and its associated median absolute deviation is

$$\operatorname{MAD} = \operatorname{median}_{i} \, \lvert r_i - m \rvert.$$

The advantage for each trajectory is computed by

$$\hat{A}_i = \frac{r_i - m}{\operatorname{MAD} + \varepsilon}$$

for $i = 1, \dots, G+1$. With $G+1$ (odd) rollouts, exactly one sample coincides with the median baseline and receives zero advantage; this sample is excluded from the backward pass. The median and MAD are classical robust statistics, yielding high breakdown points and substantial insensitivity to single outlier rewards. This robustness sharply lowers the probability of group-induced sign-flip events, stabilizing the advantage estimate when $G$ is small (Kim, 30 Jan 2026).
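The median/MAD computation can be sketched analogously (an illustrative snippet, not the paper's implementation; `eps` is again an assumed stabilizer):

```python
import statistics

def mc_grpo_advantages(rewards, eps=1e-8):
    """Median-centered advantages with MAD scaling, following the
    definitions above (sketch only)."""
    med = statistics.median(rewards)
    mad = statistics.median(abs(r - med) for r in rewards)
    return [(r - med) / (mad + eps) for r in rewards]

# With an odd number of rollouts, the rollout whose reward equals the
# median gets exactly zero advantage, and the outlier no longer
# inflates the advantages of the ordinary rollouts.
adv = mc_grpo_advantages([0.9, 1.0, 1.1, -10.0, 1.2])
```

Here the median baseline is 1.0 regardless of the outlier, so the ordinary rollouts keep small advantages of the correct sign.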
3. MC-GRPO Algorithmic Procedure
The training algorithm introduces minimal changes relative to standard GRPO, with the key modification occurring in the computation of per-sample advantages and the exclusion of the median rollout from gradient calculation. The steps are as follows:
- For each prompt $x$:
  - Sample rollouts $y_i \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$ for $i = 1, \dots, G+1$.
  - Compute scalar rewards $r_1, \dots, r_{G+1}$.
  - Determine the baseline median $m$ and $\operatorname{MAD}(r_1, \dots, r_{G+1})$.
  - Calculate $\hat{A}_i = (r_i - m)/(\operatorname{MAD} + \varepsilon)$ for each $i$.
  - Identify $i^{\ast}$ such that $r_{i^{\ast}} = m$, with $\hat{A}_{i^{\ast}} = 0$.
  - Define the gradient-contributing index set $\mathcal{I} = \{1, \dots, G+1\} \setminus \{i^{\ast}\}$, so that $\lvert\mathcal{I}\rvert = G$.
  - Formulate the clipped surrogate GRPO loss over $\mathcal{I}$ and update parameters.
Crucially, this procedure introduces only one extra forward pass (inference) per prompt; the number of samples used in backpropagation remains $G$, thus retaining standard update costs (Kim, 30 Jan 2026).
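The steps above can be combined into a single per-prompt routine. The sketch below is illustrative only: `sample_fn` and `reward_fn` are hypothetical hooks for rollout generation and scoring, and the clipped surrogate update itself is framework-specific and omitted.

```python
import statistics

def mc_grpo_step(prompt, sample_fn, reward_fn, G, eps=1e-8):
    """Collect (rollout, advantage) pairs for one prompt under MC-GRPO
    (sketch; sample_fn/reward_fn are assumed hooks, not the paper's API)."""
    # Sample G+1 rollouts: one extra forward pass vs. standard GRPO.
    rollouts = [sample_fn(prompt) for _ in range(G + 1)]
    rewards = [reward_fn(prompt, y) for y in rollouts]
    # Median baseline and MAD scale over the G+1 rewards.
    med = statistics.median(rewards)
    mad = statistics.median(abs(r - med) for r in rewards)
    adv = [(r - med) / (mad + eps) for r in rewards]
    # Drop the rollout attaining the median (zero advantage), so the
    # backward pass covers exactly G samples.
    i_star = min(range(G + 1), key=lambda i: abs(rewards[i] - med))
    return [(rollouts[i], adv[i]) for i in range(G + 1) if i != i_star]

# Hypothetical usage with stub hooks: a "rollout" is a random float
# and the reward is the rollout value itself.
import random
rng = random.Random(0)
batch = mc_grpo_step("2+2=?", lambda p: rng.random(), lambda p, y: y, G=4)
```

The returned batch always contains exactly $G$ samples, matching the fixed backward-pass budget described above.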
4. Theoretical and Empirical Analysis of Sign-Flip Reduction
MC-GRPO fundamentally addresses the instability of small-group normalization by leveraging the properties of order statistics. Empirically, standard mean-baseline GRPO exhibits sign-flip rates in excess of $20\%$ for small $G$ due to mean sensitivity, compared to a near-zero rate with median centering. Experiments demonstrate that artificially injecting random sign flips at a controlled rate yields a roughly proportional accuracy degradation, establishing a causal link between baseline-estimator robustness and final model performance. By design, median centering nearly eliminates group-induced flips even under adversarial reward outliers (Kim, 30 Jan 2026).
The underlying robust statistics (median and MAD) each possess a $50\%$ breakdown point, whereas the mean and standard deviation have a breakdown point of zero: a single extreme value can corrupt them arbitrarily. This statistical property ensures that for almost all practical reward distributions encountered in LLM RLHF or similar settings, MC-GRPO maintains correct update-sign assignment in the low-$G$ regime.
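A small Monte Carlo sketch can make the breakdown-point argument concrete. The setup below is a simplified illustration of this section's claim, not the paper's experiment: groups of Gaussian rewards with one injected extreme outlier, where a "flip" means a clean rollout's advantage sign under the contaminated center disagrees with its sign under the outlier-free center.

```python
import random
import statistics

def flip_rate(center_fn, trials=2000, G=4, seed=0):
    """Estimate how often an injected outlier reverses the advantage
    sign of an ordinary rollout (illustrative toy setup)."""
    rng = random.Random(seed)
    flips, total = 0, 0
    for _ in range(trials):
        clean = [rng.gauss(0.0, 1.0) for _ in range(G - 1)]
        group = clean + [-25.0]                  # one extreme outlier
        center = center_fn(group)                # contaminated baseline
        true_center = statistics.median(clean)   # outlier-free reference
        for r in clean:
            total += 1
            if (r - center) * (r - true_center) < 0:
                flips += 1
    return flips / total

mean_rate = flip_rate(statistics.mean)
median_rate = flip_rate(statistics.median)
```

In this toy setting the mean baseline flips the sign of roughly a third of the ordinary rollouts, while the median baseline flips essentially none, mirroring the robustness argument above.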
5. Experimental Evaluation: Performance under Small Rollout Budgets
MC-GRPO was assessed on five model–dataset configurations: GSM8K with Qwen3-1.7B and Llama-3.2-3B, and Math-500 with Qwen2.5-Math-1.5B, Qwen3-4B, and Qwen2.5-7B. Across all settings, MC-GRPO demonstrated substantial absolute gains in validation accuracy for small rollout budgets ($G = 2$ or $G = 4$), with the gap to large-budget ($G = 8$) training largely closed. Detailed results are summarized below:
| Model | Dataset | G | GRPO Accuracy (%) | MC-GRPO Accuracy (%) | Absolute Gain |
|---|---|---|---|---|---|
| Qwen3-1.7B | GSM8K | 2 | 78.92 | 83.54 | +4.62 |
| Qwen3-1.7B | GSM8K | 4 | 81.34 | 84.01 | +2.67 |
| Qwen3-1.7B | GSM8K | 8 | 84.53 | 84.60 | +0.07 |
| Llama-3.2-3B | GSM8K | 2 | 76.57 | 77.93 | +1.36 |
| Llama-3.2-3B | GSM8K | 4 | 77.33 | 79.68 | +2.35 |
| Llama-3.2-3B | GSM8K | 8 | 79.00 | 79.02 | +0.02 |
| Qwen2.5-Math-1.5B | Math-500 | 2 | 65.01 | 68.62 | +3.61 |
| Qwen2.5-Math-1.5B | Math-500 | 4 | 67.80 | 70.15 | +2.35 |
| Qwen2.5-Math-1.5B | Math-500 | 8 | 70.00 | 70.60 | +0.60 |
| Qwen3-4B | Math-500 | 2 | 78.80 | 81.92 | +3.12 |
| Qwen3-4B | Math-500 | 4 | 80.21 | 82.80 | +2.59 |
| Qwen3-4B | Math-500 | 8 | 80.21 | 82.83 | +2.62 |
| Qwen2.5-7B | Math-500 | 2 | 72.20 | 76.80 | +4.60 |
| Qwen2.5-7B | Math-500 | 4 | 74.40 | 76.60 | +2.20 |
| Qwen2.5-7B | Math-500 | 8 | 75.80 | 77.00 | +1.20 |
The additional inference cost from an extra rollout per prompt is modest (5–10% wall-clock overhead), as the computational cost is dominated by backpropagation.
6. Practical Considerations and Summary
MC-GRPO offers a minimal and practical modification to GRPO-style reinforcement learning for LLMs: replace within-prompt reward mean and standard deviation normalization with the median and MAD over $G+1$ rollouts, drop the median completion (zero advantage) from the gradient update, and keep the backward batch size pinned at $G$. This approach ensures robust advantage estimation in low-resource settings where only small rollout budgets are available. It dramatically reduces sign flips and empirically translates into $2$–$5\%$ accuracy gains for small $G$, with the deficit relative to high-budget ($G = 8$) training nearly eliminated (Kim, 30 Jan 2026).
Code and additional resources are available at [https://github.com/lotusroot-kim/MC-GRPO].