
Process-Aware Group Relative Policy Optimization

Updated 19 November 2025
  • PA-GRPO is a reinforcement learning framework that combines process mining with group relative policy optimization to enhance multi-step reasoning in large reasoning models.
  • It supplements standard correctness and formatting rewards with a conformance signal derived from event log alignment between student and teacher models.
  • Empirical evaluations show PA-GRPO outperforms conventional approaches on mathematical reasoning benchmarks, with optimal beta tuning yielding improved performance.

Process-Aware Group Relative Policy Optimization (PA-GRPO), interchangeably referred to as PM4GRPO, is a reinforcement learning (RL) post-training framework for enhancing large reasoning models (LRMs) on multi-step tasks. In contrast to outcome-centric approaches, PA-GRPO integrates process mining techniques to supplement standard correctness and format-driven rewards with an additional conformance signal reflecting how closely the model's reasoning procedure matches that of a pretrained teacher. This scalar conformance reward leverages event log analysis and process alignment metrics to quantify and incentivize expert-like reasoning traces. Empirical results demonstrate that PA-GRPO outperforms conventional GRPO post-training methods, particularly on challenging mathematical reasoning benchmarks (Park et al., 29 Oct 2025).

1. Formal Foundations: From PPO to GSPO and PA-GRPO

Standard Proximal Policy Optimization (PPO) maximizes a clipped surrogate objective at the token level:

$$\mathcal{L}^{\rm PPO}(\theta) = \mathbb{E}_{x,y \sim \pi_{\theta_{\rm old}}}\Big[\min\big(r(\theta)A,\ \mathrm{clip}(r(\theta),1-\epsilon,1+\epsilon)A\big)\Big]$$

where $r(\theta)$ is the likelihood ratio and $A$ is the advantage estimate. Optionally, PPO can be regularized using a KL-divergence constraint.
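The clipped surrogate can be sketched in plain Python; this is a minimal illustration, and the function name and list-based interface are assumptions, not part of any reference implementation:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO clipped surrogate, averaged over tokens.

    logp_new/logp_old: per-token log-probabilities under the current
    and behavior policies; advantages: per-token advantage estimates.
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)                   # likelihood ratio r(theta)
        r_clip = min(max(r, 1 - eps), 1 + eps)  # clip(r, 1-eps, 1+eps)
        total += min(r * a, r_clip * a)         # pessimistic (clipped) surrogate
    return total / len(advantages)
```

With identical policies the ratio is 1 everywhere and the objective reduces to the mean advantage; when the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, the clipped term caps the incentive to move further.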

Group Sequence Policy Optimization (GSPO), inspired by DeepSeek-R1 and related work, generalizes PPO by operating at the sequence rather than the token level. For a query $x$, $G$ sampled reasoning sequences $\{y_i\}_{i=1}^G$ receive group-wise importance ratios:

$$r_i(\theta) = \left( \frac{\pi_{\theta}(y_i \mid x)}{\pi_{\theta_{\rm old}}(y_i \mid x)} \right)^{1 / |y_i|}$$

The corresponding objective uses a group-relative advantage $\widehat{A}_i$:

$$\widehat{A}_i = R(x, y_i) - \frac{1}{G} \sum_{j=1}^{G} R(x, y_j)$$

GSPO's surrogate is:

$$\mathcal{J}_{\rm GSPO}(\theta) = \mathbb{E}_{x, \{y_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \min \left( r_i(\theta)\widehat{A}_i,\ \mathrm{clip}(r_i(\theta), 1 - \epsilon, 1 + \epsilon)\widehat{A}_i \right) \right]$$

By moving optimization to the sequence level, GSPO aligns the reward structure with long reasoning chains, improving stability and properly weighting off-policy samples.
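The sequence-level surrogate for a single group can be sketched as follows; this is an illustrative sketch with an assumed interface (total sequence log-probabilities and lengths passed as lists), not the paper's implementation:

```python
import math

def gspo_surrogate(logp_new, logp_old, lengths, rewards, eps=0.2):
    """Sequence-level GSPO surrogate for one group of G responses.

    logp_new/logp_old: total log-probability of each full sequence under
    the current/old policy; lengths: |y_i|; rewards: scalar R(x, y_i).
    """
    G = len(rewards)
    baseline = sum(rewards) / G                      # group mean reward
    total = 0.0
    for lp_n, lp_o, n, r in zip(logp_new, logp_old, lengths, rewards):
        adv = r - baseline                           # group-relative advantage
        ratio = math.exp((lp_n - lp_o) / n)          # length-normalized ratio
        ratio_c = min(max(ratio, 1 - eps), 1 + eps)  # clipped ratio
        total += min(ratio * adv, ratio_c * adv)
    return total / G
```

Note the two departures from token-level PPO: the importance ratio is normalized by sequence length $1/|y_i|$, and the advantage is the reward's deviation from the group mean rather than a learned value baseline.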

2. Process Mining Integration and Conformance Reward Construction

Conventional GRPO methods use reward signals based only on final correctness or format. PA-GRPO integrates a conformance reward by comparing the chain-of-thought (CoT) traces of student and teacher models using process mining.

Given a query $x$, the student and teacher models each produce a reasoning trace, and both traces are treated as event logs. The Inductive Miner (IM) extracts a process model from the student log, and conformance checking (CC) aligns this model with the teacher's log.

Alignment-based conformance yields two metrics per sequence: fitness and precision. Fitness quantifies how accurately the reference traces are reproduced, and precision penalizes extra, unreferenced behavior allowed by the mined process model. These are merged using an F1-style harmonic mean,

$$R_{\rm conf} = \frac{2 \cdot \mathrm{fitness} \cdot \mathrm{precision}}{\mathrm{fitness} + \mathrm{precision}},$$

which forms the core process-aware signal in PA-GRPO.
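A minimal sketch of the F1-style combination, assuming fitness and precision have already been computed (e.g. by alignment-based conformance checking in a process-mining toolkit such as pm4py):

```python
def conformance_reward(fitness, precision):
    """F1-style (harmonic-mean) combination of alignment fitness and
    precision, producing the scalar process-aware reward.

    Both inputs are assumed to lie in [0, 1], as returned by
    alignment-based conformance checking.
    """
    if fitness + precision == 0:
        return 0.0  # degenerate case: no fit and no precision
    return 2 * fitness * precision / (fitness + precision)
```

As with F1 in classification, the harmonic mean rewards sequences only when fitness and precision are *both* high: a trace that replays the teacher perfectly but permits much extra behavior (or vice versa) scores low.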

3. Combined Reward Formulation and Training Workflow

The complete PM4GRPO reward for each generated reasoning sequence combines three elements: a format reward, an answer-correctness reward, and the process-conformance reward $R_{\rm conf}$, with the conformance term scaled by a weight $\beta$:

$$R(x, y_i) = R_{\rm format}(y_i) + R_{\rm correct}(x, y_i) + \beta \, R_{\rm conf}(x, y_i)$$

The format and correctness terms use fixed weights in the reported experiments, while $\beta$ can be varied; performance was found robust across a range of $\beta$ values, with slight gains at the upper end and overfitting to teacher behavior when $\beta$ is excessive.
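The combined reward can be sketched as below; the weight names `w_format` and `w_correct` and their default values are illustrative assumptions, not the paper's reported settings:

```python
def pm4grpo_reward(fmt_ok, answer_ok, conf_score, beta=1.0,
                   w_format=1.0, w_correct=1.0):
    """Sequence-level reward: format + correctness + beta * conformance.

    fmt_ok/answer_ok: booleans from the format and answer checkers;
    conf_score: process-conformance score in [0, 1];
    beta: weight on the process-aware term.
    """
    return (w_format * float(fmt_ok)
            + w_correct * float(answer_ok)
            + beta * conf_score)
```

Because the conformance term is a dense score in $[0, 1]$ rather than a binary check, it can differentiate between two incorrect (or two correct) answers by how teacher-like their reasoning was.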

Training Loop (High-Level Pseudocode)

For each training query, the policy samples a group of $G$ reasoning sequences; each sequence is scored with the combined format, correctness, and conformance reward; group-relative advantages are formed against the group mean; and the clipped GSPO surrogate is optimized. The conformance reward is computed post-sequence, and all rewards remain at the sequence level.
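The reward and advantage computation of this loop can be sketched as follows. All callables (`sample`, `check_format`, `check_answer`, `conformance`) are hypothetical stand-ins injected as parameters, and the actual clipped policy update is omitted:

```python
def pa_grpo_step(sample, check_format, check_answer, conformance,
                 query, G=8, beta=1.0):
    """One PA-GRPO reward/advantage computation (schematic).

    sample(query) draws one reasoning sequence; the three checkers
    return scalar rewards. Returns the sampled group and its
    group-relative advantages, which would feed the GSPO update.
    """
    group = [sample(query) for _ in range(G)]          # G rollouts per query
    rewards = [check_format(y)
               + check_answer(query, y)
               + beta * conformance(y)                 # process-aware term
               for y in group]
    baseline = sum(rewards) / G                        # group mean reward
    advantages = [r - baseline for r in rewards]       # group-relative
    return group, advantages
```

Keeping conformance as a post-hoc, per-sequence scalar means the process-mining step never has to run inside token-level decoding; it only touches completed traces.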

4. Empirical Results Across Mathematical Reasoning Benchmarks

PA-GRPO was evaluated on five established mathematical reasoning benchmarks: MATH500, OlympiadBench, MinervaMath, AIME24, and AIME25. Models at the 7B and 1.5B parameter scales were compared against contemporary baselines, including R1-Distill-Qwen, DeepMath-Zero, Skywork-OR1, LEAD, DRGRPO, PRIME, P-GRPO, Graph-R1, STILL-3, and EXGRPO.

Held-out test accuracy (exact answer match) is reported below:

7B-Model Performance (accuracy %; "–" marks scores not reported):

Model             MATH500  Olympiad  Minerva  AIME24  AIME25
R1-Distill-Qwen      90.0      58.5     49.6    42.5    33.1
DeepMath-Zero        81.6      47.3     40.4    13.3    10.0
Skywork-OR1          87.1      51.9     46.0    36.0    27.1
LEAD                 84.6      52.3     47.4    40.0    26.7
DRGRPO               80.2      42.5     43.0    30.0     6.7
PRIME                79.2      38.6     26.7       –       –
P-GRPO               83.0      38.2     33.3       –       –
PM4GRPO (ours)       91.1      61.1     49.3    45.6    35.0

1.5B-Model Performance (accuracy %):

Model             MATH500  Olympiad  Minerva  AIME24  AIME25
R1-Distill-Qwen      80.4      46.1     33.1    22.9    21.5
Graph-R1             42.1      15.5     13.9     1.2     1.0
STILL-3              83.4      51.0     36.5    29.2    23.5
EXGRPO               69.6      34.0     30.4    10.6     8.3
PM4GRPO (ours)       83.9      52.7     37.9    26.7    21.7

PM4GRPO demonstrates superior or near-best performance across all benchmarks, notably on the most challenging problem sets (AIME24/25).

5. Ablation and Sensitivity Analyses

Systematic ablations tested the contribution of the conformance reward. Disabling process alignment (setting the conformance weight $\beta$ to zero) resulted in a performance drop of 1.8–3.2 percentage points on MATH500 and OlympiadBench, evidencing the benefit of process-aware signals.

Sensitivity sweeps over the conformance weight $\beta$ produced stable plateaus in performance, with:

  • lower $\beta$: 90.2% (–0.9 pp vs. default)
  • default $\beta$: 91.1%
  • higher $\beta$: 91.4% (+0.3 pp, with observed overfitting to teacher behavior and increased reasoning-chain length)

These results suggest that tuning $\beta$ within a moderate range achieves robust performance without compromising generalization.

6. Limitations and Future Extensions

Conformance checking in PA-GRPO introduces computational overhead proportional to the square of trace length, presenting scalability challenges for very long reasoning chains. The dependency on a pretrained teacher for process mining constrains the reward to teacher-aligned reasoning, thus failing to incentivize novel but correct reasoning strategies outside the teacher’s style.

Proposed future directions include:

  1. Learnable Process Models: Transitioning from fixed IM + CC pipelines to differentiable process-model learners.
  2. Hierarchical Conformance: Applying process-alignment rewards at granular reasoning step levels, such as sub-theorem validation within proofs.
  3. Multi-teacher Aggregation: Incorporating multiple teacher traces to encourage diverse yet valid reasoning.

A plausible implication is that continued refinement of conformance metrics and process models could further broaden the expressive and generalization capabilities of LRMs under RL paradigms. PA-GRPO establishes the methodological value of aligning chain-of-thought with expert process logs in rigorous reasoning domains (Park et al., 29 Oct 2025).
