AMIR-GRPO: Advanced RL Alignment Algorithm
- AMIR-GRPO is a reinforcement learning alignment algorithm for LLMs that leverages implicit pairwise preferences to optimize group-level rollouts.
- It employs a DPO-style contrastive regularizer to densify supervision, reduce gradient variance, and enforce clear decision boundaries.
- Empirical results demonstrate notable improvements in Pass@1 scores and sample efficiency across diverse mathematical reasoning benchmarks.
AMIR-GRPO (Augmented Merit-guided Implicit-preference Regularization for Group Relative Policy Optimization) is an advanced reinforcement learning alignment algorithm for LLMs focused on complex reasoning tasks. It extends standard GRPO by leveraging implicit pairwise preference signals latent in group-level rollout rewards to construct a contrastive regularizer, thereby amplifying policy supervision and mitigating structural biases.
1. Foundations and Motivation
Group Relative Policy Optimization (GRPO) aligns LLM policies through reinforcement learning using group-based evaluation. For each prompt $q$, GRPO samples a group of $G$ completions $\{o_1, \dots, o_G\}$ from the current policy $\pi_{\theta_{\mathrm{old}}}$, independently assigns scalar rewards $r_1, \dots, r_G$, and computes a group-normalized sequence-level advantage for each completion:

$$\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$$

These advantages modulate PPO-style surrogate rewards in policy optimization.
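The group normalization above can be sketched as a small helper; the epsilon guard (an implementation detail not stated in the source) avoids division by zero when all rewards in a group tie:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: A_i = (r_i - mean(r)) / (std(r) + eps).

    `eps` guards against zero variance when every reward in the group ties.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```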
GRPO's scalar rewards and normalization induce critical limitations:
- Sequence-length bias: Positive advantages disproportionately encourage shorter completions; penalties on long low-reward trajectories are diluted.
- Sparse-reward inefficacy: Negative advantages assigned to many incorrect rollouts become numerically weak after normalization, limiting effective suppression.
- Preference information compression: Each group potentially contains pairwise ordering constraints between completions, yet only scalar signals are utilized.
AMIR-GRPO exploits these implicit intra-group preferences to densify the supervision signal and improve policy discrimination without requiring added human labeling (Yari et al., 7 Jan 2026).
2. Objective Formulation and Implicit-Preference Construction
2.1 GRPO Surrogate Objective
Let $\pi_\theta$ denote the current model, $\pi_{\mathrm{ref}}$ a reference policy, and $\epsilon$ the PPO clipping threshold. The GRPO objective maximizes:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i,\; \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\right)\right] - \beta_{\mathrm{KL}}\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$

with importance ratio

$$\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}$$

and a KL penalty against the reference policy.
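A minimal sequence-level sketch of the clipped surrogate (omitting the KL term, and treating log-probabilities and advantages as precomputed scalars per completion):

```python
import numpy as np

def grpo_surrogate(logp_new, logp_old, advantages, eps_clip=0.2):
    """Clipped PPO-style surrogate on sequence-level log-probs (sketch).

    ratio_i = exp(logp_new_i - logp_old_i); the objective takes the
    elementwise min of the unclipped and clipped terms, then averages.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv
    return np.minimum(unclipped, clipped).mean()
```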
2.2 Implicit Preference Set
From the reward group $\{r_1, \dots, r_G\}$, AMIR-GRPO infers the set of robust pairwise preferences:

$$\mathcal{P} = \{(i, j) : r_i - r_j > \delta\},$$

where $\delta > 0$ is a reward margin threshold excluding near-ties. Each pair $(i, j)$ encodes a supervision signal that $o_i$ is strictly preferable to $o_j$.
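Constructing the preference set amounts to one pass over reward pairs with the margin filter:

```python
from itertools import combinations

def preference_pairs(rewards, delta):
    """Ordered pairs (i, j) with r_i - r_j > delta, i.e. i preferred to j.

    Near-ties (|r_i - r_j| <= delta) are excluded from supervision.
    """
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] - rewards[j] > delta:
            pairs.append((i, j))
        elif rewards[j] - rewards[i] > delta:
            pairs.append((j, i))
    return pairs
```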
2.3 DPO-Style Contrastive Regularizer
For every pair $(i, j) \in \mathcal{P}$, a DPO-style contrastive logit is computed:

$$z_{ij} = \beta \left[\log \pi_\theta(o_i \mid q) - \log \pi_\theta(o_j \mid q)\right],$$

with temperature $\beta$. The regularization loss is:

$$\mathcal{L}_{\mathrm{pref}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \sigma(z_{ij}),$$

where $\sigma$ denotes the sigmoid function.
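A numerically stable sketch of this loss, assuming (length-normalized) sequence log-probabilities are available per completion; note that unlike textbook DPO this logit is formed from current-policy log-probabilities only, as the formulation above describes:

```python
import math

def contrastive_loss(logp, pairs, beta=0.1):
    """-mean over pairs of log sigmoid(beta * (logp_winner - logp_loser)).

    `logp` holds sequence log-probabilities under the current policy;
    each pair (i, j) asserts completion i beats completion j.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        z = beta * (logp[i] - logp[j])
        # log sigmoid(z) computed stably as -log(1 + exp(-z))
        total += -math.log1p(math.exp(-z))
    return -total / len(pairs)
```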
2.4 Combined AMIR-GRPO Objective
The final learning objective is:

$$\mathcal{L}_{\mathrm{AMIR}}(\theta) = -\mathcal{J}_{\mathrm{GRPO}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{pref}},$$

where $\lambda$ adaptively balances the regularizer according to the batchwise ratio between $\mathcal{L}_{\mathrm{pref}}$ and the surrogate objective.
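One plausible balancing scheme (an assumption; the source specifies only that $\lambda$ tracks the batchwise ratio between the two terms) picks $\lambda$ so the weighted regularizer is a fixed fraction of the surrogate's magnitude:

```python
def adaptive_lambda(loss_grpo, loss_pref, target_ratio=0.1, eps=1e-8):
    """Illustrative schedule: choose lambda so that
    lambda * |loss_pref| == target_ratio * |loss_grpo| on this batch.
    `target_ratio` and the exact rule are assumptions, not from the paper.
    """
    return target_ratio * abs(loss_grpo) / (abs(loss_pref) + eps)
```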
3. Algorithmic Procedure
AMIR-GRPO is implemented iteratively with the following steps:
- Sample minibatches of prompts and generate $G$ completions per prompt under $\pi_{\theta_{\mathrm{old}}}$.
- Compute normalized advantages and construct the implicit preference set $\mathcal{P}$ for each prompt.
- Calculate the clipped GRPO surrogate objective.
- Compute the contrastive regularizer across all pairwise preferences using DPO-style logits.
- Dynamically adjust $\lambda$ to maintain a target ratio between the preference regularizer and the surrogate objective.
- Combine all terms and backpropagate gradients to update $\pi_\theta$; periodically sync $\pi_{\theta_{\mathrm{old}}}$.
Stabilization measures include margin filtering ($\delta$), length normalization in log-probabilities, PPO-style trust-region clipping ($\epsilon$), and online adaptation of the regularization strength $\lambda$.
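The per-prompt update above can be sketched end-to-end as a single loss function (a toy sequence-level sketch; helper names and default hyperparameters are illustrative, not the paper's):

```python
import math
from itertools import combinations

import numpy as np

def amir_grpo_loss(logp_new, logp_old, rewards,
                   delta=0.1, beta=0.1, eps_clip=0.2, lam=0.1):
    """AMIR-GRPO loss for one group of rollouts on a single prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    # group-normalized advantages
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # clipped PPO-style surrogate on sequence-level importance ratios
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    surrogate = np.minimum(
        ratio * adv,
        np.clip(ratio, 1.0 - eps_clip, 1.0 + eps_clip) * adv,
    ).mean()
    # implicit preferences: i beats j when r_i - r_j > delta
    pairs = [(i, j) if r[i] > r[j] else (j, i)
             for i, j in combinations(range(len(r)), 2)
             if abs(r[i] - r[j]) > delta]
    # DPO-style regularizer: -mean log sigmoid(beta * (logp_w - logp_l))
    pref = 0.0
    if pairs:
        pref = sum(math.log1p(math.exp(-beta * (logp_new[i] - logp_new[j])))
                   for i, j in pairs) / len(pairs)
    # minimize: negative surrogate plus weighted regularizer
    return -surrogate + lam * pref
```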
4. Theoretical Properties
AMIR-GRPO leverages implicit preference relations, yielding increased supervision density compared to vanilla GRPO:
- Supervision richness: Each group instantiates up to $\binom{G}{2} = G(G-1)/2$ pairwise preference constraints (vs. $G$ scalar advantages).
- Gradient variance reduction: Amplified penalties on low-reward completions via multiple contrastive judgments sharpen learning and suppress poor trajectories.
- Margin enforcement: Contrastive DPO-style updates enforce separation in log-probability space, tightening boundaries between high-reward and low-reward completions.
- The core GRPO component, with PPO-style clipping and KL, inherits trust-region convergence properties.
AMIR-GRPO’s alignment mechanism fundamentally diverges from exponential/logarithmic pooling approaches (see classical RLHF):
- Standard RLHF uses an objective of the form $\max_\pi \mathbb{E}_{y \sim \pi}\!\left[r(x, y)\right] - \beta_{\mathrm{KL}}\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi \,\|\, \pi_{\mathrm{ref}}\right]$, with stationary solution
$$\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(r(x, y) / \beta_{\mathrm{KL}}\right)$$
(i.e., geometric pooling).
- By contrast, GRPO and AMIR-GRPO yield a rational multiplicative update, not exponential, resulting in non-logarithmic aggregation of reference policy and group-based preference signals (Vojnovic et al., 25 Feb 2025).
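The geometric-pooling solution of the KL-regularized RLHF objective can be made concrete on a toy discrete action set, which is the baseline that GRPO-style rational updates diverge from:

```python
import numpy as np

def rlhf_stationary(pi_ref, rewards, beta):
    """Stationary RLHF policy: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    w = np.asarray(pi_ref, dtype=np.float64) * np.exp(np.asarray(rewards) / beta)
    return w / w.sum()
```

With a uniform reference over two actions and rewards (1, 0) at beta = 1, the preferred action gets mass e / (e + 1), an exponential tilt of the reference rather than a rational update.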
5. Empirical Evaluation
5.1 Datasets and Metrics
Benchmarking spans in-distribution and out-of-distribution mathematical reasoning task suites:
- Training on GSM8K (grade-school math), evaluation on GSM8K test, AIME 2025, OlympiadBench, AMC23, MinervaMath, AQUA-RAT, and LiveMathBench.
- Main metric is Pass@k, estimated by the unbiased combinatorial estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, with $n$ rollouts per question and $c$ correct ones.
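The unbiased Pass@k estimator is a one-liner over binomial coefficients:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k: 1 - C(n - c, k) / C(n, k),
    with n rollouts per question and c correct among them."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct rollout
    return 1.0 - comb(n - c, k) / comb(n, k)
```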
5.2 Main Results
AMIR-GRPO achieves consistent improvements over GRPO across multiple architectures (Qwen2.5-3B, Qwen2.5-7B, Gemma-4B). Representative gains:
| Dataset/Model | GRPO Pass@1 | AMIR-GRPO Pass@1 | Δ Pass@1 |
|---|---|---|---|
| AMC23/7B | 40.5% | 43.2% | +2.7 |
| LiveMath/4B | 19.0% | 23.1% | +4.1 |
Pass@2 and Pass@4 improvements are proportionally larger, indicating enhanced sample efficiency and expanded coverage: AMIR-GRPO solves 8.8% more problems exclusively than GRPO on AMC23 (Yari et al., 7 Jan 2026).
5.3 Ablation and Trajectory Analysis
- Temperature ($\beta$): Lower $\beta$ stabilizes smaller models; an intermediate $\beta$ is optimal at the 7B scale; a higher $\beta$ further benefits larger models.
- Adaptive contrastive weight ($\lambda$): The adaptive schedule robustly outperforms static values.
- Group size ($G$): Gains saturate beyond moderate group sizes.
- Preference margin ($\delta$): Excluding too many near-ties depletes supervision; the optimal $\delta$ scales with the reward standard deviation.
Analysis shows a 2.7× sharper log-probability margin between correct and incorrect rollouts and greater diversity (tail spread) under AMIR-GRPO, indicative of reduced collapse and stronger decision boundaries.
6. Qualitative Examples, Limitations, and Future Work
Qualitative inspection reveals AMIR-GRPO suppresses short, incomplete reasoning chains favored by GRPO’s normalization, instead promoting longer, coherent solutions. Contrasts between high-reward and near-miss completions correct errors (e.g., sign or case misclassifications) that scalar advantage signals fail to rectify.
Limitations include:
- Current scope is restricted to text-based mathematical reasoning; generalization to code or vision-language modalities may require reward redesign.
- Policy learning occurs only at trajectory-level granularity; finer credit assignment (e.g., stepwise, tree-structured rollout analysis) remains an open direction.
- Hyperparameter tuning (particularly $\beta$, $\delta$, $\lambda$, $G$) requires per-domain adjustment; automated adaptation is a potential area for improvement.
A plausible implication is that AMIR-GRPO's contrastive augmentation could benefit other reinforcement learning methods that rely on groupwise batch supervision, especially where dense preference signals are available at no annotation cost.
7. Summary and Significance
AMIR-GRPO augments GRPO by explicitly extracting and enforcing intra-group preference signals via DPO-inspired contrastive regularization. This yields amplified suppression of low-reward completions, abated response-length bias, sharper separation between solution- and error-trajectories, and consistent coverage improvements across reasoning-intensive mathematical task distributions (Yari et al., 7 Jan 2026, Vojnovic et al., 25 Feb 2025). The methodology highlights the critical role of implicit rank-supervision in scalable LLM alignment, rendering AMIR-GRPO an influential variant for robust reasoning alignment.