MC-GRPO: Reliable Policy Optimization for LMs

Updated 6 February 2026
  • MC-GRPO is a reinforcement learning method that replaces mean reward normalization with a robust median-based approach to reduce sign flips in policy gradients.
  • It leverages median absolute deviation (MAD) for advantage computation, ensuring stable and reliable updates even with small rollout budgets.
  • Empirical evaluations demonstrate 2–5% accuracy gains across diverse language models and datasets, with minimal extra inference overhead.

Median-Centered Group Relative Policy Optimization (MC-GRPO) is a reinforcement learning method for LLM training that addresses instability and degraded performance in small-rollout regimes within the Group Relative Policy Optimization (GRPO) framework. By replacing the sample mean reward baseline with a group median, MC-GRPO provides robust advantage normalization, notably reducing the frequency of sign flips in policy gradients induced by outlier rewards. This approach enables efficient and stable learning with tight computational budgets and demonstrates empirical gains across a range of LLM sizes and evaluation tasks (Kim, 30 Jan 2026).

1. Group Relative Policy Optimization (GRPO) and Mean-Baseline Limitations

Group Relative Policy Optimization (GRPO) is a family of PPO-style reinforcement learning objectives designed for batched policy improvement in LLMs. For each prompt $q$, GRPO samples $G$ completions (rollouts) $o_1, \ldots, o_G$ from the old policy $\pi_{\theta_\text{old}}$, obtaining corresponding scalar rewards $r_i = R(q, o_i)$. Rather than depending on a learned value function, GRPO estimates the advantage for each trajectory by centering its reward relative to the group mean:

$$\bar{r}(q) = \frac{1}{G} \sum_{j=1}^{G} r_j$$

$$A_i = \frac{r_i - \bar{r}(q)}{s_r(q) + \epsilon}$$

where $s_r(q)$ is the sample standard deviation of the group rewards and $\epsilon$ is a small constant. Omitting the scaling yields the core mean-baseline formulation:

$$A_i = r_i - \frac{1}{G} \sum_{j=1}^{G} r_j$$

This mean normalization stabilizes training when $G$ is large, but for small $G$ the sample mean is highly sensitive to outlier rewards. This sensitivity induces "advantage sign flips," in which the sign of $A_i$ is reversed for good or bad rollouts, potentially leading to incorrect update directions and degraded policy learning (Kim, 30 Jan 2026).
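A minimal numerical sketch (illustrative only, not the paper's implementation) shows how a single outlier reward can flip the advantage sign of otherwise good rollouts under the mean baseline:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Mean-baseline GRPO advantages: (r_i - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    # Population std as a simple choice; the paper's exact estimator may differ.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Two good rollouts (rewards near 1) plus one extreme outlier: the outlier
# drags the mean above the good rewards, so their advantages turn negative,
# i.e. the good completions are pushed down (a sign flip).
print(grpo_advantages([1.0, 0.9, 10.0]))
```

With a large group the outlier's pull on the mean shrinks, which is why this failure mode is specific to the small-$G$ regime.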

2. Median-Centered Advantage: Robustness to Outliers

To mitigate sign flips in scenarios where only a small number of rollouts per prompt is feasible, MC-GRPO replaces the baseline mean with the baseline median. For each prompt, $G+1$ rollouts are sampled; the median reward over these is denoted $b(q)$, with its associated median absolute deviation (MAD):

$$b(q) = \mathrm{median}(r_1, \ldots, r_{G+1})$$

$$\mathrm{MAD}(r) = \mathrm{median}(|r_i - b(q)|), \quad i = 1, \ldots, G+1$$

The advantage for each trajectory is computed by:

$$A_i = \frac{r_i - b(q)}{\mathrm{MAD}(r) + \varepsilon}, \quad i = 1, \ldots, G+1$$

With $G+1$ (odd) rollouts, exactly one sample coincides with the median baseline and receives zero advantage; it is excluded from the backward pass. The median and MAD are classical robust statistics with high breakdown points and substantial insensitivity to single outlier rewards. This robustness sharply lowers the probability of group-induced sign-flip events, stabilizing the advantage estimate when $G$ is small (Kim, 30 Jan 2026).
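A minimal sketch of this median/MAD advantage computation (illustrative, not the reference implementation):

```python
import statistics

def mc_grpo_advantages(rewards, eps=1e-8):
    """Median-centered advantages: (r_i - median) / (MAD + eps)."""
    b = statistics.median(rewards)
    mad = statistics.median(abs(r - b) for r in rewards)
    return [(r - b) / (mad + eps) for r in rewards]

# G+1 = 3 rollouts with one extreme outlier (r = 10): the r = 1.0 rollout
# sits exactly at the median and gets zero advantage, and the outlier cannot
# drag the baseline past the remaining good reward.
print(mc_grpo_advantages([1.0, 0.9, 10.0]))
```

Because the median of an odd-sized sample is itself one of the observations, the zero-advantage rollout always exists, which is what licenses dropping it from the gradient update.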

3. MC-GRPO Algorithmic Procedure

The training algorithm introduces minimal changes relative to standard GRPO, with the key modification occurring in the computation of per-sample advantages and the exclusion of the median rollout from gradient calculation. The steps are as follows:

  1. For each prompt $q$:
    • Sample $G+1$ rollouts $o_i \sim \pi_{\theta_\text{old}}(\cdot \mid q)$ for $i = 1, \ldots, G+1$.
    • Compute scalar rewards $r_i$.
    • Determine the baseline median $b(q)$ and $\mathrm{MAD}(r)$.
    • Calculate $A_i = (r_i - b(q)) / (\mathrm{MAD}(r) + \varepsilon)$ for each $i$.
    • Identify $i^*$ such that $r_{i^*} = b(q)$, and set $A_{i^*} = 0$.
    • Define the gradient-contributing index set $I(q) = \{1, \ldots, G+1\} \setminus \{i^*\}$.
    • Formulate the clipped surrogate GRPO loss over $i \in I(q)$ and update parameters.

Crucially, this procedure introduces only one extra forward pass (inference) per prompt; the number of samples used in backpropagation remains $G$, thus retaining standard update costs (Kim, 30 Jan 2026).
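The per-prompt advantage steps above can be sketched as follows (a hedged illustration: the reward values are placeholders, and the surrogate-loss update itself is omitted):

```python
import statistics

def mc_grpo_prompt_step(rewards, eps=1e-8):
    """Advantage computation for one prompt, given its G+1 scalar rewards r_i.

    Returns (advantages, I_q): the rollout whose reward equals the median b(q)
    gets zero advantage and is dropped from the backward index set I(q), so
    exactly G samples contribute gradients."""
    assert len(rewards) % 2 == 1, "MC-GRPO draws an odd number (G+1) of rollouts"
    b = statistics.median(rewards)                        # baseline median b(q)
    mad = statistics.median(abs(r - b) for r in rewards)  # MAD(r)
    advs = [(r - b) / (mad + eps) for r in rewards]       # A_i
    i_star = rewards.index(b)                             # median rollout, A = 0
    I_q = [i for i in range(len(rewards)) if i != i_star]
    return advs, I_q

advs, I_q = mc_grpo_prompt_step([0.2, 1.0, 0.5])  # G+1 = 3 placeholder rewards
```

In a full trainer, `I_q` would index the completions whose token log-probabilities enter the clipped surrogate loss; the median completion is still sampled (one extra forward pass) but never backpropagated.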

4. Theoretical and Empirical Analysis of Sign-Flip Reduction

MC-GRPO fundamentally addresses the instability in small-group normalization by leveraging the properties of order statistics. Empirically, standard mean-baseline GRPO exhibits sign-flip rates of 20–30% for $G=2$ due to mean sensitivity, compared to below 5% with median centering. Experiments demonstrate that artificially injecting random sign flips at rate $\rho$ yields a proportional accuracy degradation, establishing a causal link between baseline-estimator robustness and final model performance. By design, median centering nearly eliminates group-induced flips even under adversarial reward outliers (Kim, 30 Jan 2026).

The underlying robust statistics, the median and MAD, each possess a 50% breakdown point, whereas the mean and standard deviation can be corrupted arbitrarily by a single extreme value. This property ensures that, for almost all practical reward distributions encountered in LLM RLHF and similar settings, MC-GRPO maintains correct update-sign assignment in the low-$G$ regime.
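A small Monte Carlo sketch illustrates the breakdown-point argument. The reward model here is synthetic and of my own choosing (two good rewards plus one heavy-tailed outlier), not the paper's experimental setup, so the rates it produces are not the paper's 20–30% figures:

```python
import random
import statistics

def flip_rate(n_trials=10_000, seed=0):
    """Estimate how often one outlier flips the advantage sign of a clearly
    good rollout under mean centering vs. median centering (G+1 = 3)."""
    rng = random.Random(seed)
    mean_flips = median_flips = 0
    for _ in range(n_trials):
        # Two "good" rollouts plus one heavy-tailed outlier reward.
        rewards = [1.0, 0.9 + 0.1 * rng.random(), rng.gauss(0.0, 5.0)]
        good = rewards[0]
        # Positive scaling (std or MAD) never changes the sign, so it suffices
        # to check the sign of the unnormalized centered reward.
        if good - sum(rewards) / 3 < 0:            # mean baseline flips
            mean_flips += 1
        if good - statistics.median(rewards) < 0:  # median baseline flips
            median_flips += 1
    return mean_flips / n_trials, median_flips / n_trials

mean_rate, median_rate = flip_rate()
```

Under this toy distribution the mean baseline flips the good rollout's sign whenever the outlier is large enough to pull the group mean above 1.0, while the median baseline, needing a majority of corrupted samples, never does.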

5. Experimental Evaluation: Performance under Small Rollout Budgets

MC-GRPO was assessed on five model–dataset configurations: GSM8K with Qwen3-1.7B and Llama-3.2-3B, and Math-500 with Qwen2.5-Math-1.5B, Qwen3-4B, and Qwen2.5-7B. Across all settings, MC-GRPO demonstrated substantial absolute gains in validation accuracy for small rollout budgets ($G=2$ or $G=4$), with the gap to large-$G$ ($G=8$) training closing to within roughly 1%. Detailed results are summarized below:

| Model | Dataset | G | GRPO Accuracy (%) | MC-GRPO Accuracy (%) | Absolute Gain |
|---|---|---|---|---|---|
| Qwen3-1.7B | GSM8K | 2 | 78.92 | 83.54 | +4.62 |
| Qwen3-1.7B | GSM8K | 4 | 81.34 | 84.01 | +2.67 |
| Qwen3-1.7B | GSM8K | 8 | 84.53 | 84.60 | +0.07 |
| Llama-3.2-3B | GSM8K | 2 | 76.57 | 77.93 | +1.36 |
| Llama-3.2-3B | GSM8K | 4 | 77.33 | 79.68 | +2.35 |
| Llama-3.2-3B | GSM8K | 8 | 79.00 | 79.02 | +0.02 |
| Qwen2.5-Math-1.5B | Math-500 | 2 | 65.01 | 68.62 | +3.61 |
| Qwen2.5-Math-1.5B | Math-500 | 4 | 67.80 | 70.15 | +2.35 |
| Qwen2.5-Math-1.5B | Math-500 | 8 | 70.00 | 70.60 | +0.60 |
| Qwen3-4B | Math-500 | 2 | 78.80 | 81.92 | +3.12 |
| Qwen3-4B | Math-500 | 4 | 80.21 | 82.80 | +2.59 |
| Qwen3-4B | Math-500 | 8 | 80.21 | 82.83 | +2.62 |
| Qwen2.5-7B | Math-500 | 2 | 72.20 | 76.80 | +4.60 |
| Qwen2.5-7B | Math-500 | 4 | 74.40 | 76.60 | +2.20 |
| Qwen2.5-7B | Math-500 | 8 | 75.80 | 77.00 | +1.20 |

The additional inference cost from one extra rollout per prompt is modest (≈5–10% wall-clock overhead), as training cost is dominated by backpropagation.

6. Practical Considerations and Summary

MC-GRPO offers a minimal and practical modification to GRPO-style reinforcement learning for LLMs: replace within-prompt mean and standard-deviation normalization with the median and MAD over $G+1$ rollouts, drop the median completion (zero advantage) from the gradient update, and keep the backward batch size pinned at $G$. This approach ensures robust advantage estimation in low-resource settings, where only small rollout budgets are available. It dramatically reduces sign flips and empirically translates into 2–5% accuracy gains for small $G$, with the deficit compared to high-budget ($G=8$) training nearly eliminated (Kim, 30 Jan 2026).

Code and additional resources are available at https://github.com/lotusroot-kim/MC-GRPO.

References (1)
