
AMIR-GRPO: Advanced RL Alignment Algorithm

Updated 14 January 2026
  • AMIR-GRPO is a reinforcement learning alignment algorithm for LLMs that leverages implicit pairwise preferences to optimize group-level rollouts.
  • It employs a DPO-style contrastive regularizer to densify supervision, reduce gradient variance, and enforce clear decision boundaries.
  • Empirical results demonstrate notable improvements in Pass@1 scores and sample efficiency across diverse mathematical reasoning benchmarks.

AMIR-GRPO (Augmented Merit-guided Implicit-preference Regularization for Group Relative Policy Optimization) is an advanced reinforcement learning alignment algorithm for LLMs focused on complex reasoning tasks. It extends standard GRPO by leveraging implicit pairwise preference signals latent in group-level rollout rewards to construct a contrastive regularizer, thereby amplifying policy supervision and mitigating structural biases.

1. Foundations and Motivation

Group Relative Policy Optimization (GRPO) aligns LLM policies through reinforcement learning using group-based evaluation. For each prompt $q$, GRPO samples a group of $G$ completions $\{o_i\}_{i=1}^G$ from the current policy $\pi_{\theta_{\rm old}}$, independently assigns scalar rewards $r_i$, and computes a group-normalized sequence-level advantage for each completion:

$$A_i = \frac{r_i - \mu_r}{\sigma_r} \quad \text{where} \quad \mu_r = \frac{1}{G}\sum_{j=1}^G r_j, \quad \sigma_r = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j - \mu_r)^2}$$

These advantages modulate PPO-style surrogate rewards in policy optimization.
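The normalization above is straightforward to sketch in NumPy; the small `eps` guard against all-equal reward groups is an implementation assumption, not from the text:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized sequence-level advantages A_i = (r_i - mu_r) / sigma_r."""
    r = np.asarray(rewards, dtype=float)
    mu = r.mean()
    sigma = r.std()  # population std: 1/G inside the square root
    return (r - mu) / (sigma + eps)  # eps avoids division by zero for all-tied groups

# Example: a group of G = 4 rollouts with binary correctness rewards
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards the advantages are symmetric around zero, which is exactly why many incorrect rollouts each receive only a weak negative signal after normalization.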

GRPO's scalar rewards and normalization induce critical limitations:

  • Sequence-length bias: Positive advantages disproportionately encourage shorter completions; penalties on long low-reward trajectories are diluted.
  • Sparse-reward inefficacy: Negative advantages assigned to many incorrect rollouts become numerically weak after normalization, limiting effective suppression.
  • Preference information compression: Each group potentially contains $O(G^2)$ pairwise ordering constraints between completions, yet only $O(G)$ scalar signals are utilized.

AMIR-GRPO exploits these implicit intra-group preferences to densify the supervision signal and improve policy discrimination without requiring added human labeling (Yari et al., 7 Jan 2026).

2. Objective Formulation and Implicit-Preference Construction

2.1 GRPO Surrogate Objective

Let $\pi_\theta$ denote the current model, $\pi_{\rm ref}$ a reference policy, and $\epsilon$ the PPO clipping threshold. The GRPO objective maximizes:

$$J_{\rm GRPO}(\theta) = \mathbb{E}_{q,\{o_i\}} \sum_{i,t} \min\left[ r_{i,t}(\theta)\, A_i,\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i \right] - \gamma\, D_{\rm KL}(\pi_\theta \,\|\, \pi_{\rm ref})$$

with

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{\rm old}}(o_{i,t}\mid q,\, o_{i,<t})}$$

where $\gamma$ weights the KL penalty toward the reference policy.
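The per-token ratio and clipped surrogate can be sketched as follows (NumPy; the sequence-level advantage is broadcast over tokens, and names are illustrative):

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped term for one completion: sum_t min(r_t * A, clip(r_t) * A).

    logp_new/logp_old: per-token log-probabilities under pi_theta / pi_theta_old.
    advantage: the single group-normalized advantage A_i for this completion.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # r_{i,t}(theta)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).sum()

# A two-token completion whose tokens each became ~35% more likely, A_i = 1:
val = clipped_surrogate(logp_new=[-1.0, -2.0], logp_old=[-1.3, -2.3], advantage=1.0)
```

Because the ratio exceeds $1+\epsilon$ at both tokens, the clipped branch is taken and each token contributes exactly $1.2 \cdot A_i$.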

2.2 Implicit Preference Set

From the reward group $\{r_i\}$, AMIR-GRPO infers the set of robust pairwise preferences:

$$S(q) = \{(i,j) \mid r_i > r_j + \delta\}$$

where $\delta$ is a reward-margin threshold excluding near-ties. Each pair encodes the supervision signal that $o_i$ is strictly preferable to $o_j$.
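Constructing $S(q)$ is a simple filter over ordered index pairs; a minimal sketch (the `delta` value here is illustrative):

```python
from itertools import permutations

def implicit_preferences(rewards, delta=0.1):
    """S(q) = {(i, j) : r_i > r_j + delta}; near-ties within delta are excluded."""
    return [(i, j) for (i, ri), (j, rj) in permutations(enumerate(rewards), 2)
            if ri > rj + delta]

# r_0 and r_2 are near-tied (gap 0.05 < delta), so neither dominates the other:
prefs = implicit_preferences([1.0, 0.0, 0.95, 0.0], delta=0.1)
```

Note how the margin turns four scalar rewards into four pairwise constraints while discarding the uninformative near-tie between indices 0 and 2.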

2.3 DPO-Style Contrastive Regularizer

For every pair $(i,j) \in S(q)$, a DPO-style contrastive logit is computed:

$$Z_{i,j}(\theta) = \beta\left[\big(\log\pi_\theta(o_i\mid q) - \log\pi_{\rm ref}(o_i\mid q)\big) - \big(\log\pi_\theta(o_j\mid q) - \log\pi_{\rm ref}(o_j\mid q)\big)\right]$$

with temperature $\beta$. The regularization loss is:

$$L_{\rm contrastive}(\theta) = -\frac{1}{|S|} \sum_{(i,j)\in S} \log\sigma\big(Z_{i,j}(\theta)\big)$$

where $\sigma$ denotes the sigmoid function.
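A minimal sketch of the pairwise loss over sequence-level log-probabilities (pure Python; function and argument names are illustrative, not from the paper):

```python
import math

def contrastive_loss(logp_theta, logp_ref, pairs, beta=0.5):
    """L_contrastive = -(1/|S|) * sum over (i, j) in S of log sigmoid(Z_ij)."""
    total = 0.0
    for i, j in pairs:
        z = beta * ((logp_theta[i] - logp_ref[i]) - (logp_theta[j] - logp_ref[j]))
        total += -math.log1p(math.exp(-z))  # log sigmoid(z); stable for z >= 0
    return -total / len(pairs)

# Two completions; the preferred one (index 0) has gained log-probability
# versus the reference, while the dispreferred one has lost it:
loss = contrastive_loss(logp_theta=[-5.0, -9.0], logp_ref=[-6.0, -8.0],
                        pairs=[(0, 1)], beta=0.5)
```

Here $Z_{0,1} = 0.5 \cdot (1 - (-1)) = 1$, so the loss is $-\log\sigma(1) \approx 0.313$, already below the chance-level value $\log 2$. A production version would use a fully stable log-sigmoid for large negative logits.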

2.4 Combined AMIR-GRPO Objective

The final learning objective is:

$$J_{\rm AMIR\text{-}GRPO}(\theta) = J_{\rm GRPO}(\theta) + \alpha_{\rm reg}\, J_{\rm pref}(\theta)$$

where $J_{\rm pref}(\theta) = -L_{\rm contrastive}(\theta)$ (so maximizing the combined objective minimizes the contrastive loss) and $\alpha_{\rm reg}$ adaptively balances the regularizer according to the batchwise ratio $\rho = |J_{\rm pref}/J_{\rm GRPO}|$.
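The text specifies only that $\alpha_{\rm reg}$ tracks the batchwise ratio $\rho$; one plausible rule, shown purely as an assumption, rescales $\alpha_{\rm reg}$ so the regularizer contributes a fixed fraction of the surrogate's magnitude:

```python
def adaptive_alpha(j_pref, j_grpo, target_ratio=0.1, eps=1e-8):
    """Assumed adaptation rule (not from the source): choose alpha_reg so that
    alpha_reg * |J_pref| equals target_ratio * |J_GRPO| on this batch."""
    rho = abs(j_pref) / (abs(j_grpo) + eps)  # batchwise ratio rho = |J_pref / J_GRPO|
    return target_ratio / (rho + eps)

# If the regularizer is currently a quarter of the surrogate's magnitude,
# alpha is scaled so its weighted contribution becomes one tenth of it:
alpha = adaptive_alpha(j_pref=0.5, j_grpo=2.0, target_ratio=0.1)
```

Any rule that keeps the weighted regularizer a stable fraction of the surrogate would serve the same purpose; `target_ratio` is a hypothetical knob.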

3. Algorithmic Procedure

AMIR-GRPO is implemented iteratively with the following steps:

  • Sample minibatches of prompts and generate $G$ completions per prompt under $\pi_{\theta_{\rm old}}$.
  • Compute normalized advantages $A_i$ and construct the implicit preference set $S(q)$ for each prompt.
  • Calculate the clipped GRPO surrogate objective.
  • Compute the contrastive regularizer across all pairwise preferences using DPO-style logits.
  • Dynamically adjust $\alpha_{\rm reg}$ to maintain a target ratio between the preference regularizer and the surrogate objective.
  • Combine all terms and backpropagate gradients to update $\theta$; periodically sync $\pi_{\theta_{\rm old}}$.
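The steps above (minus sampling and the backward pass) can be strung together on a single group. A loss-only NumPy sketch under the paper's notation; the adaptive-weight rule is an assumption, since the source specifies only that $\alpha_{\rm reg}$ tracks the ratio $\rho$:

```python
import numpy as np

def amir_grpo_step(rewards, logp_theta, logp_old, logp_ref,
                   beta=0.5, delta=0.1, eps_clip=0.2, target_ratio=0.1):
    """One loss-only AMIR-GRPO step on a single group, mirroring the bullet
    list above; sequence-level log-probs, no sampling or autograd."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)            # normalized advantages A_i
    ratio = np.exp(np.asarray(logp_theta) - np.asarray(logp_old))
    j_grpo = np.minimum(ratio * adv,                   # clipped GRPO surrogate
                        np.clip(ratio, 1 - eps_clip, 1 + eps_clip) * adv).sum()
    pairs = [(i, j) for i in range(len(r)) for j in range(len(r))
             if r[i] > r[j] + delta]                   # implicit preference set S(q)
    def log_sigmoid_z(i, j):                           # DPO-style pairwise logit
        z = beta * ((logp_theta[i] - logp_ref[i]) - (logp_theta[j] - logp_ref[j]))
        return -np.log1p(np.exp(-z))
    j_pref = float(np.mean([log_sigmoid_z(i, j) for i, j in pairs])) if pairs else 0.0
    # Assumed rule: scale alpha so the regularizer contributes target_ratio of
    # |J_GRPO| (the source states only that alpha_reg tracks rho).
    alpha = target_ratio * abs(j_grpo) / (abs(j_pref) + 1e-8)
    return j_grpo + alpha * j_pref

total = amir_grpo_step(rewards=[1.0, 0.0], logp_theta=[-5.0, -9.0],
                       logp_old=[-5.3, -8.7], logp_ref=[-6.0, -8.0])
```

In a real implementation the log-probabilities would come from the model, the combined objective would be maximized by gradient ascent, and $\pi_{\theta_{\rm old}}$ would be refreshed periodically.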

Stabilization measures include margin filtering ($\delta$), length normalization of log-probabilities, PPO-style trust-region clipping ($\epsilon$), and online adaptation of the regularization strength.

4. Theoretical Properties

AMIR-GRPO leverages implicit preference relations, yielding increased supervision density compared to vanilla GRPO:

  • Supervision richness: Each group instantiates up to $O(G^2)$ preference constraints (vs. $O(G)$ scalar advantages).
  • Gradient variance reduction: Amplified penalties on low-reward completions via multiple contrastive judgments sharpen learning and suppress poor trajectories.
  • Margin enforcement: Contrastive DPO-style updates enforce separation in log-probability space, tightening boundaries between high-reward and low-reward completions.
  • The core GRPO component, with PPO-style clipping and KL, inherits trust-region convergence properties.

AMIR-GRPO’s alignment mechanism fundamentally diverges from exponential/logarithmic pooling approaches (see classical RLHF):

  • Standard RLHF maximizes an objective of the form $J_{\rm RLHF}(\theta) = \mathbb{E}[r_\phi(o\mid q)] - \beta\, D_{\rm KL}(\pi_\theta \,\|\, \pi_{\rm ref})$, with stationary solution

$$\pi_\theta(o\mid q) \propto \pi_{\rm ref}(o\mid q)\, \exp[r_\phi(o\mid q)/\beta]$$

(i.e., geometric pooling).

  • By contrast, GRPO and AMIR-GRPO yield a rational multiplicative update, not exponential, resulting in non-logarithmic aggregation of reference policy and group-based preference signals (Vojnovic et al., 25 Feb 2025).

5. Empirical Evaluation

5.1 Datasets and Metrics

Benchmarking spans in-distribution and out-of-distribution mathematical reasoning task suites:

  • Training on GSM8K (grade-school math), evaluation on GSM8K test, AIME 2025, OlympiadBench, AMC23, MinervaMath, AQUA-RAT, and LiveMathBench.
  • Main metric is Pass@k, computed with the unbiased combinatorial estimator from $n = 8$ rollouts per question.
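The unbiased combinatorial estimator commonly used for Pass@k, with $n$ rollouts per question of which $c$ are correct, can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate: 1 - C(n - c, k) / C(n, k),
    i.e., one minus the chance that k draws without replacement
    from n rollouts (c correct) contain no correct rollout."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)

# n = 8 rollouts, 2 of them correct:
p1 = pass_at_k(8, 2, 1)  # reduces to the plain fraction correct, 0.25
p4 = pass_at_k(8, 2, 4)
```

For $k = 1$ the estimator reduces to the empirical accuracy; for larger $k$ it captures the coverage gains the results below report.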

5.2 Main Results

AMIR-GRPO achieves consistent improvements over GRPO across multiple architectures (Qwen2.5-3B, Qwen2.5-7B, Gemma-4B). Representative gains:

| Dataset / Model | GRPO Pass@1 | AMIR-GRPO Pass@1 | Δ Pass@1 |
|---|---|---|---|
| AMC23 / 7B | 40.5% | 43.2% | +2.7 |
| LiveMath / 4B | 19.0% | 23.1% | +4.1 |

Pass@2 and Pass@4 improvements are proportionally larger, indicating enhanced sample efficiency and expanded coverage: AMIR-GRPO solves 8.8% more problems exclusively than GRPO on AMC23 (Yari et al., 7 Jan 2026).

5.3 Ablation and Trajectory Analysis

  • Temperature ($\beta$) effects: Lower $\beta$ stabilizes smaller models; an intermediate $\beta \approx 0.5$ is optimal at the 7B scale; higher $\beta$ further benefits larger models.
  • Adaptive contrastive weight ($\alpha_{\rm reg}$): The adaptive schedule robustly outperforms static values.
  • Group size ($G$): Gains saturate beyond $G = 8$.
  • Preference margin ($\delta$): Excluding too many near-ties depletes supervision; $\delta \approx 0.1\times$ the reward standard deviation is optimal.

Analysis shows a sharper log-probability margin between correct and incorrect rollouts (2.7×) and greater diversity (wider tail spread) under AMIR-GRPO, indicative of reduced collapse and stronger decision boundaries.

6. Qualitative Examples, Limitations, and Future Work

Qualitative inspection reveals AMIR-GRPO suppresses short, incomplete reasoning chains favored by GRPO’s normalization, instead promoting longer, coherent solutions. Contrasts between high-reward and near-miss completions correct errors (e.g., sign or case misclassifications) that scalar advantage signals fail to rectify.

Limitations include:

  • Current scope is restricted to text-based mathematical reasoning; generalization to code or vision-language modalities may require reward redesign.
  • Policy learning occurs only at trajectory-level granularity; finer credit assignment (e.g., stepwise, tree-structured rollout analysis) remains an open direction.
  • Hyperparameter tuning (particularly $\beta$, $\delta$, $\epsilon$, $\gamma$) requires per-domain adjustment; automated adaptation is a potential area for improvement.

A plausible implication is that AMIR-GRPO’s contrastive augmentation could benefit other reinforcement learning decoders relying on groupwise batch supervision, especially where dense preference signals are available with no annotation cost.

7. Summary and Significance

AMIR-GRPO augments GRPO by explicitly extracting and enforcing $O(G^2)$ intra-group preference signals via DPO-inspired contrastive regularization. This yields amplified suppression of low-reward completions, reduced response-length bias, sharper separation between solution and error trajectories, and consistent coverage improvements across reasoning-intensive mathematical task distributions (Yari et al., 7 Jan 2026; Vojnovic et al., 25 Feb 2025). The methodology highlights the role of implicit rank supervision in scalable LLM alignment, making AMIR-GRPO a notable variant for robust reasoning alignment.
