
Process-Aware Group Relative Policy Optimization

Updated 19 November 2025
  • PA-GRPO is a reinforcement learning framework that combines process mining with group relative policy optimization to enhance multi-step reasoning in large reasoning models.
  • It supplements standard correctness and formatting rewards with a conformance signal derived from event log alignment between student and teacher models.
  • Empirical evaluations show PA-GRPO outperforms conventional approaches on mathematical reasoning benchmarks, with optimal beta tuning yielding improved performance.

Process-Aware Group Relative Policy Optimization (PA-GRPO), interchangeably referred to as PM4GRPO, is a reinforcement learning (RL) post-training framework for enhancing large reasoning models (LRMs) on multi-step tasks. In contrast to outcome-centric approaches, PA-GRPO integrates process mining techniques to supplement standard correctness and format-driven rewards with an additional conformance signal reflecting how closely the model's reasoning procedure matches that of a pretrained teacher. This scalar conformance reward leverages event log analysis and process alignment metrics to quantify and incentivize expert-like reasoning traces. Empirical results demonstrate that PA-GRPO outperforms conventional GRPO post-training methods, particularly on challenging mathematical reasoning benchmarks (Park et al., 29 Oct 2025).

1. Formal Foundations: From PPO to GSPO and PA-GRPO

Standard Proximal Policy Optimization (PPO) maximizes a clipped surrogate objective at the token level:

$$\mathcal{L}^{\rm PPO}(\theta) = \mathbb{E}_{x,y \sim \pi_{\theta_{\rm old}}}\Big[\min\big(r(\theta)A,\ \mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)A\big)\Big]$$

where $r(\theta)$ is the likelihood ratio and $A$ is the advantage estimate. Optionally, PPO can be regularized using a KL-divergence constraint.
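The clipped surrogate can be sketched in plain Python; this is a minimal illustration, and the function name and list-based interface are assumptions, not part of any reference implementation:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO clipped surrogate, averaged over tokens.

    logp_new/logp_old: per-token log-probabilities under the current
    and behavior policies; advantages: per-token advantage estimates.
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)                   # likelihood ratio r(theta)
        r_clip = min(max(r, 1 - eps), 1 + eps)  # clip(r, 1-eps, 1+eps)
        total += min(r * a, r_clip * a)         # pessimistic (clipped) surrogate
    return total / len(advantages)
```

With identical policies the ratio is 1 everywhere and the objective reduces to the mean advantage; when the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, the clipped term caps the incentive to move further.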

Group Sequence Policy Optimization (GSPO), inspired by DeepSeek-R1 and related work, generalizes PPO by operating at the sequence rather than the token level. For a query $x$, $G$ sampled reasoning sequences $\{y_i\}_{i=1}^G$ receive group-wise importance ratios:

$$r_i(\theta) = \left( \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\rm old}}(y_i \mid x)} \right)^{1 / |y_i|}$$

The corresponding objective uses a group-relative advantage $\widehat{A}_i$:

$$\widehat{A}_i = R(x, y_i) - \frac{1}{G} \sum_{j=1}^{G} R(x, y_j)$$

GSPO's surrogate is:

$$\mathcal{J}_{\rm GSPO}(\theta) = \mathbb{E}_{x, \{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \min \left( r_i(\theta)\widehat{A}_i,\ \mathrm{clip}(r_i(\theta), 1 - \epsilon, 1 + \epsilon)\widehat{A}_i \right) \right]$$

By moving optimization to the sequence level, GSPO aligns the reward structure with long reasoning chains, improving stability and properly weighting off-policy samples.
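The sequence-level surrogate for a single group can be sketched as follows; this is an illustrative sketch with an assumed interface (total sequence log-probabilities and lengths passed as lists), not the paper's implementation:

```python
import math

def gspo_surrogate(logp_new, logp_old, lengths, rewards, eps=0.2):
    """Sequence-level GSPO surrogate for one group of G responses.

    logp_new/logp_old: total log-probability of each full sequence under
    the current/old policy; lengths: |y_i|; rewards: scalar R(x, y_i).
    """
    G = len(rewards)
    baseline = sum(rewards) / G                      # group mean reward
    total = 0.0
    for lp_n, lp_o, n, r in zip(logp_new, logp_old, lengths, rewards):
        adv = r - baseline                           # group-relative advantage
        ratio = math.exp((lp_n - lp_o) / n)          # length-normalized ratio
        ratio_c = min(max(ratio, 1 - eps), 1 + eps)  # clipped ratio
        total += min(ratio * adv, ratio_c * adv)
    return total / G
```

Note the two departures from token-level PPO: the importance ratio is normalized by sequence length $1/|y_i|$, and the advantage is the reward's deviation from the group mean rather than a learned value baseline.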

2. Process Mining Integration and Conformance Reward Construction

Conventional GRPO methods use reward signals based only on final correctness or format. PA-GRPO integrates a conformance reward by comparing the chain-of-thought (CoT) traces of student and teacher models using process mining.

Given a query $x$, the student and teacher models each produce a reasoning trace, and both traces are treated as event logs. The Inductive Miner (IM) extracts a process model from the student log, and conformance checking (CC) aligns this model with the teacher's log.

Alignment-based conformance yields two metrics per sequence: fitness and precision. Fitness quantifies how accurately the reference traces are reproduced, and precision penalizes extra, unreferenced behavior allowed by the mined process model. These are merged using an F1-style harmonic mean,

$$R_{\rm conf} = \frac{2 \cdot \mathrm{fitness} \cdot \mathrm{precision}}{\mathrm{fitness} + \mathrm{precision}},$$

which forms the core process-aware signal in PA-GRPO.
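A minimal sketch of the F1-style combination, assuming fitness and precision have already been computed (e.g. by alignment-based conformance checking in a process-mining toolkit such as pm4py):

```python
def conformance_reward(fitness, precision):
    """F1-style (harmonic-mean) combination of alignment fitness and
    precision, producing the scalar process-aware reward.

    Both inputs are assumed to lie in [0, 1], as returned by
    alignment-based conformance checking.
    """
    if fitness + precision == 0:
        return 0.0  # degenerate case: no fit and no precision
    return 2 * fitness * precision / (fitness + precision)
```

As with F1 in classification, the harmonic mean rewards sequences only when fitness and precision are *both* high: a trace that replays the teacher perfectly but permits much extra behavior (or vice versa) scores low.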

3. Combined Reward Formulation and Training Workflow

The complete PM4GRPO reward for each generated reasoning sequence combines three elements: a format reward, an answer-correctness reward, and the process-conformance reward $R_{\rm conf}$, with the conformance term scaled by a weight $\beta$:

$$R(x, y_i) = R_{\rm format}(y_i) + R_{\rm correct}(x, y_i) + \beta \, R_{\rm conf}(x, y_i)$$

The format and correctness terms use fixed weights in the reported experiments, while $\beta$ can be varied; performance was found robust across a range of $\beta$ values, with slight gains at the upper end and overfitting to teacher behavior when $\beta$ is excessive.
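The combined reward can be sketched as below; the weight names `w_format` and `w_correct` and their default values are illustrative assumptions, not the paper's reported settings:

```python
def pm4grpo_reward(fmt_ok, answer_ok, conf_score, beta=1.0,
                   w_format=1.0, w_correct=1.0):
    """Sequence-level reward: format + correctness + beta * conformance.

    fmt_ok/answer_ok: booleans from the format and answer checkers;
    conf_score: process-conformance score in [0, 1];
    beta: weight on the process-aware term.
    """
    return (w_format * float(fmt_ok)
            + w_correct * float(answer_ok)
            + beta * conf_score)
```

Because the conformance term is a dense score in $[0, 1]$ rather than a binary check, it can differentiate between two incorrect (or two correct) answers by how teacher-like their reasoning was.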

Training Loop (High-Level Pseudocode)

For each training query, the policy samples a group of $G$ reasoning sequences; each sequence is scored with the combined format, correctness, and conformance reward; group-relative advantages are formed against the group mean; and the clipped GSPO surrogate is optimized. The conformance reward is computed post-sequence, and all rewards remain at the sequence level.
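The reward and advantage computation of this loop can be sketched as follows. All callables (`sample`, `check_format`, `check_answer`, `conformance`) are hypothetical stand-ins injected as parameters, and the actual clipped policy update is omitted:

```python
def pa_grpo_step(sample, check_format, check_answer, conformance,
                 query, G=8, beta=1.0):
    """One PA-GRPO reward/advantage computation (schematic).

    sample(query) draws one reasoning sequence; the three checkers
    return scalar rewards. Returns the sampled group and its
    group-relative advantages, which would feed the GSPO update.
    """
    group = [sample(query) for _ in range(G)]          # G rollouts per query
    rewards = [check_format(y)
               + check_answer(query, y)
               + beta * conformance(y)                 # process-aware term
               for y in group]
    baseline = sum(rewards) / G                        # group mean reward
    advantages = [r - baseline for r in rewards]       # group-relative
    return group, advantages
```

Keeping conformance as a post-hoc, per-sequence scalar means the process-mining step never has to run inside token-level decoding; it only touches completed traces.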

4. Empirical Results Across Mathematical Reasoning Benchmarks

PA-GRPO was evaluated on five established mathematical reasoning benchmarks: MATH500, OlympiadBench, MinervaMath, AIME24, and AIME25. Models at the 7B and 1.5B parameter scales were compared against contemporary baselines, including R1-Distill-Qwen, DeepMath-Zero, Skywork-OR1, LEAD, DRGRPO, PRIME, P-GRPO, Graph-R1, STILL-3, and EXGRPO.

Held-out test accuracy (exact answer match) is reported below:

7B-Model Performance (accuracy %; "–" marks scores not reported):

Model             MATH500  Olympiad  Minerva  AIME24  AIME25
R1-Distill-Qwen      90.0      58.5     49.6    42.5    33.1
DeepMath-Zero        81.6      47.3     40.4    13.3    10.0
Skywork-OR1          87.1      51.9     46.0    36.0    27.1
LEAD                 84.6      52.3     47.4    40.0    26.7
DRGRPO               80.2      42.5     43.0    30.0     6.7
PRIME                79.2      38.6     26.7       –       –
P-GRPO               83.0      38.2     33.3       –       –
PM4GRPO (ours)       91.1      61.1     49.3    45.6    35.0

1.5B-Model Performance (accuracy %):

Model             MATH500  Olympiad  Minerva  AIME24  AIME25
R1-Distill-Qwen      80.4      46.1     33.1    22.9    21.5
Graph-R1             42.1      15.5     13.9     1.2     1.0
STILL-3              83.4      51.0     36.5    29.2    23.5
EXGRPO               69.6      34.0     30.4    10.6     8.3
PM4GRPO (ours)       83.9      52.7     37.9    26.7    21.7

PM4GRPO demonstrates superior or near-best performance across all benchmarks, notably on the most challenging problem sets (AIME24/25).

5. Ablation and Sensitivity Analyses

Systematic ablations tested the contribution of the conformance reward. Disabling process alignment (setting the conformance weight $\beta$ to zero) resulted in a performance drop of 1.8–3.2 percentage points on MATH500 and OlympiadBench, evidencing the benefit of process-aware signals.

Sensitivity sweeps over the conformance weight $\beta$ produced stable plateaus in performance, with:

  • lower $\beta$: 90.2% (–0.9 pp vs. default)
  • default $\beta$: 91.1%
  • higher $\beta$: 91.4% (+0.3 pp, with observed overfitting to teacher behavior and increased reasoning-chain length)

These results suggest that tuning $\beta$ within a moderate range achieves robust performance without compromising generalization.

6. Limitations and Future Extensions

Conformance checking in PA-GRPO introduces computational overhead proportional to the square of trace length, presenting scalability challenges for very long reasoning chains. The dependency on a pretrained teacher for process mining constrains the reward to teacher-aligned reasoning, thus failing to incentivize novel but correct reasoning strategies outside the teacher’s style.

Proposed future directions include:

  1. Learnable Process Models: Transitioning from fixed IM + CC pipelines to differentiable process-model learners.
  2. Hierarchical Conformance: Applying process-alignment rewards at granular reasoning step levels, such as sub-theorem validation within proofs.
  3. Multi-teacher Aggregation: Incorporating multiple teacher traces to encourage diverse yet valid reasoning.

A plausible implication is that continued refinement of conformance metrics and process models could further broaden the expressive and generalization capabilities of LRMs under RL paradigms. PA-GRPO establishes the methodological value of aligning chain-of-thought with expert process logs in rigorous reasoning domains (Park et al., 29 Oct 2025).
