Process-Aware Group Relative Policy Optimization
- PA-GRPO is a reinforcement learning framework that combines process mining with group relative policy optimization to enhance multi-step reasoning in large reasoning models.
- It supplements standard correctness and formatting rewards with a conformance signal derived from event log alignment between student and teacher models.
- Empirical evaluations show PA-GRPO outperforms conventional approaches on mathematical reasoning benchmarks, with optimal beta tuning yielding improved performance.
Process-Aware Group Relative Policy Optimization (PA-GRPO), interchangeably referred to as PM4GRPO, is a reinforcement learning (RL) post-training framework targeting the enhancement of large reasoning models (LRMs) for multi-step tasks. Distinct from outcome-centric approaches, PA-GRPO integrates process mining techniques to supplement standard correctness and format-driven rewards with an additional conformance signal reflecting the procedural similarity of model reasoning to a pretrained teacher. This scalar conformance reward leverages event log analysis and process alignment metrics to quantify and incentivize expert-like reasoning traces. Empirical results demonstrate that PA-GRPO outperforms conventional GRPO post-training methods, particularly on challenging mathematical reasoning benchmarks (Park et al., 29 Oct 2025).
1. Formal Foundations: From PPO to GSPO and PA-GRPO
Standard Proximal Policy Optimization (PPO) maximizes a clipped surrogate objective at the token level:

$$\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],$$

where $r_t(\theta) = \pi_\theta(y_t \mid x, y_{<t}) / \pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})$ is the token-level likelihood ratio and $\hat{A}_t$ is the advantage estimate. Optionally, PPO can be regularized with a KL-divergence constraint against a reference policy.
Group Sequence Policy Optimization (GSPO), inspired by DeepSeek-R1 and related work, generalizes PPO by operating at the sequence rather than the token level. For a query $x$, a group of $G$ sampled reasoning sequences $\{y_i\}_{i=1}^{G}$ receives length-normalized, sequence-level importance ratios:

$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}.$$

The corresponding objective uses a group-relative advantage

$$\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})},$$

and GSPO's surrogate is

$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}(s_i(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\right].$$

By moving optimization to the sequence level, GSPO aligns the reward structure with long reasoning chains, improving stability and properly weighting off-policy samples.
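To make the sequence-level objective concrete, the following sketch computes the length-normalized ratios $s_i(\theta)$, the group-relative advantages $\hat{A}_i$, and the clipped surrogate. The PyTorch framing and the padding convention are illustrative assumptions; the paper's implementation is not reproduced here.

```python
import torch

def gspo_surrogate(logp_new, logp_old, rewards, clip_eps=0.2):
    """Sketch of a sequence-level, group-relative clipped surrogate loss.

    logp_new, logp_old: (G, T) per-token log-probs for G sampled sequences
                        under the current and behavior policies (padding = 0).
    rewards:            (G,) scalar sequence rewards R_i.
    """
    seq_lens = (logp_old != 0).sum(dim=1).clamp(min=1)

    # Length-normalized sequence-level importance ratio s_i(theta):
    # exp of the mean per-token log-ratio equals (pi_new / pi_old)^(1/|y_i|).
    log_ratio = (logp_new - logp_old).sum(dim=1) / seq_lens
    s = torch.exp(log_ratio)

    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate, averaged over the group (negated to give a loss).
    unclipped = s * adv
    clipped = torch.clamp(s, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```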
2. Process Mining Integration and Conformance Reward Construction
Conventional GRPO methods use reward signals based only on final correctness or format. PA-GRPO integrates a conformance reward by comparing the chain-of-thought (CoT) traces of student and teacher models using process mining.
Given a query $q$, the student model produces a reasoning trace $\sigma^{s}$ and the teacher model a reference trace $\sigma^{t}$. Both are treated as event logs, with each reasoning step an event. The Inductive Miner (IM) extracts a process model $M^{s}$ from the student log, and conformance checking (CC) aligns this model with the teacher's log.
Alignment-based conformance yields two metrics per sequence: $\mathrm{fitness}(\sigma^{t}, M^{s})$ and $\mathrm{precision}(\sigma^{t}, M^{s})$. Here, fitness quantifies accurate reproduction of the reference trace, and precision penalizes extra, unreferenced behavior allowed by $M^{s}$. These are merged using an F1-style metric,

$$r_{\mathrm{conf}} = \frac{2 \cdot \mathrm{fitness} \cdot \mathrm{precision}}{\mathrm{fitness} + \mathrm{precision}},$$

which forms the core process-aware signal in PA-GRPO.
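The source describes this pipeline conceptually (event logs, Inductive Miner, alignment-based fitness and precision) without code. The sketch below approximates it with the pm4py library, assuming reasoning steps have already been segmented into discrete event labels; the event-log encoding and function name are assumptions.

```python
import pandas as pd
import pm4py

def conformance_reward(student_steps, teacher_steps):
    """Sketch: F1-style conformance between a student trace and a teacher trace.

    student_steps, teacher_steps: lists of reasoning-step labels (strings);
    how CoT text is segmented into events is an implementation choice.
    """
    def to_log(steps, case_id):
        # Build a minimal single-case event log with standard XES column names.
        df = pd.DataFrame({
            "case:concept:name": [case_id] * len(steps),
            "concept:name": steps,
            "time:timestamp": pd.Timestamp("2025-01-01")
                              + pd.to_timedelta(range(len(steps)), unit="s"),
        })
        return pm4py.convert_to_event_log(df)

    student_log = to_log(student_steps, "student")
    teacher_log = to_log(teacher_steps, "teacher")

    # Inductive Miner on the student log -> process model M^s (a Petri net).
    net, im, fm = pm4py.discover_petri_net_inductive(student_log)

    # Alignment-based conformance of the teacher log against M^s.
    fitness = pm4py.fitness_alignments(teacher_log, net, im, fm)["average_trace_fitness"]
    precision = pm4py.precision_alignments(teacher_log, net, im, fm)

    # F1-style merge of fitness and precision.
    return 2 * fitness * precision / (fitness + precision + 1e-8)
```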
3. Combined Reward Formulation and Training Workflow
The complete PM4GRPO reward for each generated reasoning sequence $y_i$ combines three elements:

$$R_i = r_{\mathrm{format}}(y_i) + r_{\mathrm{answer}}(y_i) + r_{\mathrm{conf}}(y_i),$$

where $r_{\mathrm{format}}$ is a format reward, $r_{\mathrm{answer}}$ is answer correctness, and $r_{\mathrm{conf}}$ is process conformance. Generalizing, the conformance term can be weighted:

$$R_i = r_{\mathrm{format}}(y_i) + r_{\mathrm{answer}}(y_i) + \beta \, r_{\mathrm{conf}}(y_i).$$

The format and correctness terms are held fixed in the reported experiments, while $\beta$ can be varied; performance is robust across a moderate range of $\beta$, with slight gains at the upper end and overfitting to teacher behavior when $\beta$ is excessive.
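As a concrete illustration of the combined reward, a minimal sketch follows; the binary 0/1 scoring of format and answer and the default $\beta$ are assumptions for illustration, not values reported in the paper.

```python
def pm4grpo_reward(format_ok: bool, answer_correct: bool, conf: float,
                   beta: float = 1.0) -> float:
    """Combined sequence-level reward: format + correctness + beta * conformance.

    `conf` is the F1-style conformance score in [0, 1]; `beta` is the tunable
    conformance weight (default here is illustrative only).
    """
    r_format = 1.0 if format_ok else 0.0
    r_answer = 1.0 if answer_correct else 0.0
    return r_format + r_answer + beta * conf
```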
Training Loop (High-Level Pseudocode)
The conformance reward is computed after the full sequence is generated, and all rewards remain at the sequence level.
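A pseudo-Python sketch of this loop, reconstructed from the description above, is given below: sample a group per query, score each sequence with the three rewards, then take a sequence-level clipped GSPO-style update. Helper names such as `sample_group`, `extract_steps`, and `seq_logprobs` are hypothetical stand-ins, and `gspo_surrogate`, `conformance_reward` refer to the sketches in earlier sections.

```python
# Hypothetical high-level training loop; helper functions are illustrative,
# not taken from the original paper.
for query, reference_answer, teacher_trace in dataloader:
    # 1. Sample a group of G reasoning sequences from the current policy.
    group = sample_group(policy, query, G)

    # 2. Score each sequence: format + answer correctness + process conformance.
    rewards = torch.tensor([
        format_reward(y)
        + answer_reward(y, reference_answer)
        + beta * conformance_reward(extract_steps(y), teacher_trace)
        for y in group
    ])

    # 3. Sequence-level clipped update with group-relative advantages.
    loss = gspo_surrogate(seq_logprobs(policy, query, group),
                          seq_logprobs(old_policy, query, group),
                          rewards)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```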
4. Empirical Results Across Mathematical Reasoning Benchmarks
PA-GRPO was evaluated on five established mathematical reasoning benchmarks: MATH500, OlympiadBench, MinervaMath, AIME24, and AIME25. Models of 7B and 1.5B parameters were compared against contemporary baselines, including R1-Distill-Qwen, DeepMath-Zero, Skywork-OR1, LEAD, DRGRPO, PRIME, P-GRPO, Graph-R1, STILL-3, and EXGRPO.
Held-out test accuracy (problem solved exactly) is reported below:
7B-Model Performance (accuracy %):
| Model | MATH500 | Olympiad | Minerva | AIME24 | AIME25 |
|---|---|---|---|---|---|
| R1-Distill-Qwen | 90.0 | 58.5 | 49.6 | 42.5 | 33.1 |
| DeepMath-Zero | 81.6 | 47.3 | 40.4 | 13.3 | 10.0 |
| Skywork-OR1 | 87.1 | 51.9 | 46.0 | 36.0 | 27.1 |
| LEAD | 84.6 | 52.3 | 47.4 | 40.0 | 26.7 |
| DRGRPO | 80.2 | 42.5 | 43.0 | 30.0 | 6.7 |
| PRIME | 79.2 | – | 38.6 | 26.7 | – |
| P-GRPO | 83.0 | – | 38.2 | 33.3 | – |
| PM4GRPO (ours) | 91.1 | 61.1 | 49.3 | 45.6 | 35.0 |
1.5B-Model Performance (accuracy %):
| Model | MATH500 | Olympiad | Minerva | AIME24 | AIME25 |
|---|---|---|---|---|---|
| R1-Distill-Qwen | 80.4 | 46.1 | 33.1 | 22.9 | 21.5 |
| Graph-R1 | 42.1 | 15.5 | 13.9 | 1.2 | 1.0 |
| STILL-3 | 83.4 | 51.0 | 36.5 | 29.2 | 23.5 |
| EXGRPO | 69.6 | 34.0 | 30.4 | 10.6 | 8.3 |
| PM4GRPO (ours) | 83.9 | 52.7 | 37.9 | 26.7 | 21.7 |
PM4GRPO demonstrates superior or near-best performance across all benchmarks, notably on the most challenging problem sets (AIME24/25).
5. Ablation and Sensitivity Analyses
Systematic ablations tested the contribution of the conformance reward. Disabling process alignment (setting $\beta = 0$, i.e., dropping $r_{\mathrm{conf}}$) resulted in a performance drop of 1.8–3.2 percentage points on MATH500 and OlympiadBench, evidencing the benefit of process-aware signals.
Sensitivity sweeps over $\beta$ produced stable plateaus in performance, with:
- a lower $\beta$ setting: 90.2% (–0.9 pp vs. default)
- the default $\beta$: 91.1%
- a higher $\beta$ setting: 91.4% (+0.3 pp, with observed overfitting to teacher behavior and increased reasoning-chain length)
These results suggest that tuning $\beta$ within a moderate range achieves robust performance without compromising generalization.
6. Limitations and Future Extensions
Conformance checking in PA-GRPO introduces computational overhead proportional to the square of trace length, presenting scalability challenges for very long reasoning chains. The dependency on a pretrained teacher for process mining constrains the reward to teacher-aligned reasoning, thus failing to incentivize novel but correct reasoning strategies outside the teacher’s style.
Proposed future directions include:
- Learnable Process Models: Transitioning from fixed IM + CC pipelines to differentiable process-model learners.
- Hierarchical Conformance: Applying process-alignment rewards at granular reasoning step levels, such as sub-theorem validation within proofs.
- Multi-teacher Aggregation: Incorporating multiple teacher traces to encourage diverse yet valid reasoning.
A plausible implication is that continued refinement of conformance metrics and process models could further broaden the expressive and generalization capabilities of LRMs under RL paradigms. PA-GRPO establishes the methodological value of aligning chain-of-thought with expert process logs in rigorous reasoning domains (Park et al., 29 Oct 2025).