Efficient Process Reward Modeling via Contrastive Mutual Information

Published 12 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.10660v1)

Abstract: Recent research has devoted considerable effort to verifying the intermediate reasoning steps of chain-of-thought (CoT) trajectories using process reward models (PRMs) and other verifier models. However, training a PRM typically requires human annotators to assign reward scores to each reasoning step, which is both costly and time-consuming. Existing automated approaches, such as Monte Carlo (MC) estimation, also demand substantial computational resources due to repeated LLM rollouts. To overcome these limitations, we propose contrastive pointwise mutual information (CPMI), a novel automatic reward labeling method that leverages the model's internal probability to infer step-level supervision while significantly reducing the computational burden of annotating a dataset. CPMI quantifies how much a reasoning step increases the mutual information between the step and the correct target answer relative to hard-negative alternatives. This contrastive signal serves as a proxy for the step's contribution to the final solution and yields a reliable reward. The experimental results show that CPMI-based labeling reduces dataset construction time by 84% and token generation by 98% compared to MC estimation, while achieving higher accuracy on process-level evaluations and mathematical reasoning benchmarks.

Authors (3)

Summary

The paper introduces a CPMI framework that quantifies belief shifts to efficiently assign step-level rewards in chain-of-thought reasoning.
It leverages model log-likelihood differences and contrastive negatives to reduce annotation time by 84% compared to Monte Carlo methods.
The approach achieves superior performance in mathematical reasoning benchmarks, demonstrating its scalability and robust discriminative power.

Introduction

This paper introduces Contrastive Pointwise Mutual Information (CPMI), a framework for training process reward models (PRMs) for LLM reasoning, targeting step-level supervision in chain-of-thought (CoT) mathematical problem solving (2604.10660). PRMs evaluate the correctness of intermediate steps in reasoning trajectories, supporting both fine-grained evaluation and reward optimization. High-quality, large-scale step-level labels are hard to obtain: previous approaches rely either on costly human annotation or on expensive Monte Carlo (MC) estimation via repeated LLM rollouts. CPMI instead uses the model’s own log-likelihoods to automate reward assignment, which substantially improves annotation efficiency and scalability.

Figure 1: Main framework of reward modeling and PRM training leveraging CPMI-based step-level rewards for efficient, automated supervision.

Methods

CPMI Reward Definition

CPMI quantifies how much a given reasoning step shifts the model’s beliefs toward the correct (gold) answer and away from a set of hard negative (incorrect, plausible) answers. Formally, for each step $s_i$ and $M$ contrastive negatives, the CPMI reward is given by:

$r^i_{\mathrm{CPMI}} = \left[ \log p_\theta(A|q, s_i) - \log p_\theta(A|q) \right] - \frac{1}{M} \sum_{m=1}^M \left[ \log p_\theta(\tilde{A}_m|q, s_i) - \log p_\theta(\tilde{A}_m|q) \right].$

This design enables highly efficient, single-forward-pass reward annotation that emphasizes both positive evidence for the ground-truth answer and negative evidence against plausible distractors, providing a dense local reward signal analogous to temporal-difference, TD(0), bootstrapping in reinforcement learning (RL). Theoretical analysis shows that CPMI approximates the Jeffreys divergence between the pre- and post-step model conditional distributions, yielding strong discriminative signals.
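As an illustration of the reward definition above, the following minimal sketch computes $r^i_{\mathrm{CPMI}}$ for one step. It assumes the answer log-likelihoods have already been obtained by scoring the gold answer and the $M$ negatives with and without the step in context. The function name and argument names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def cpmi_reward(
    gold_with_step: float,
    gold_without_step: float,
    neg_with_step: Sequence[float],
    neg_without_step: Sequence[float],
) -> float:
    """CPMI reward for one step from precomputed answer log-likelihoods.

    gold_with_step     : log p(A | q, s_i)
    gold_without_step  : log p(A | q)
    neg_with_step[m]   : log p(A~_m | q, s_i)
    neg_without_step[m]: log p(A~_m | q)
    """
    if len(neg_with_step) != len(neg_without_step) or not neg_with_step:
        raise ValueError("need M >= 1 negatives with matching lengths")
    # Belief shift toward the gold answer caused by the step.
    gold_shift = gold_with_step - gold_without_step
    # Mean belief shift toward the M hard negatives caused by the same step.
    neg_shift = sum(
        w - wo for w, wo in zip(neg_with_step, neg_without_step)
    ) / len(neg_with_step)
    # The reward is positive when the step favors the gold answer more than
    # it favors the distractors.
    return gold_shift - neg_shift


reward = cpmi_reward(-1.0, -2.0, [-3.0, -2.5], [-2.0, -2.5])
print(reward)
```

In this toy call the step raises the gold log-likelihood by 1.0 and lowers the mean negative log-likelihood by 0.5, so the reward is 1.5.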

Dataset Construction

The CPMI-80k dataset is constructed by automating step-level reward annotation over reasoning trajectories from the Math-Shepherd corpus, aligned to GSM8K and MATH. Hard negative answers are sampled via the model itself, with additional heuristic perturbations (operator swaps, sign changes) to ensure contrastiveness and informativeness. Four prompt templates are used for reward stabilization via prompt diversification.
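The heuristic perturbations mentioned above might look roughly like the sketch below. This is purely illustrative: the paper's actual perturbation rules are not given here, and the function `perturb_answer` and its exact behavior are assumptions.

```python
from typing import List


def perturb_answer(answer: str) -> List[str]:
    """Return plausible-but-wrong variants of a gold answer string.

    Hypothetical helper: applies one sign change and one operator swap.
    """
    stripped = answer.strip()
    candidates = []
    # Sign change: toggle a leading minus sign.
    flipped = stripped[1:] if stripped.startswith("-") else "-" + stripped
    candidates.append(flipped)
    # Operator swap: replace the first non-leading operator with its counterpart.
    swaps = {"+": "-", "-": "+", "*": "/", "/": "*"}
    for i, ch in enumerate(stripped):
        if i > 0 and ch in swaps:
            candidates.append(stripped[:i] + swaps[ch] + stripped[i + 1:])
            break
    # Keep only variants that differ from the gold string, without duplicates.
    seen = {stripped}
    out = []
    for c in candidates:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out


print(perturb_answer("3+4"))
```

A real pipeline would also need to check that each variant is actually incorrect (for example, a sign flip of "0" changes the string but not the value), which this sketch does not do.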

PRM Training

PRMs are trained with Qwen3-4B-Base as backbone, attaching a two-layer reward head and optimizing normalized CPMI rewards with a binary cross-entropy (BCE) objective. For robustness, reward values are z-score normalized and mapped to $[0,1]$ via sigmoid. Baselines include MC and process-advantage (PAV) reward models. Hybrid CPMI-Merge variants are also explored, where MC is used for early steps and CPMI for subsequent steps, balancing bias-variance and computational cost.
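The normalization and training target described above can be sketched as follows. The sketch assumes that z-scores are computed over the whole list passed in using the population standard deviation, and that the sigmoid outputs serve as soft BCE targets; the paper may normalize over a different grouping. Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize_rewards(raw: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Z-score the raw CPMI rewards, then squash them into (0, 1) via sigmoid."""
    mean = sum(raw) / len(raw)
    var = sum((r - mean) ** 2 for r in raw) / len(raw)  # population variance
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are equal.
    return [1.0 / (1.0 + math.exp(-(r - mean) / (std + eps))) for r in raw]


def soft_bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted probability and a soft target."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


targets = normalize_rewards([-1.0, 0.0, 1.0])
print(targets)
print(soft_bce(0.5, targets[1]))
```

The mean reward maps to a target of 0.5, and rewards above or below the mean map to targets above or below 0.5, so the reward head is trained toward a relative rather than absolute notion of step quality.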

Experimental Results

Annotation Efficiency

Empirical results highlight the significant efficiency gains of CPMI-based reward annotation. CPMI reduces dataset construction time by 84% and generated tokens by 98% compared to MC estimation, without sacrificing reward quality or downstream performance.

Discriminative Power in Step Quality Assessment

CPMI-labeled PRMs show improved discriminative power over MC and PAV baselines in step-level error detection and process error identification as measured by ROC-AUC, ProcessBench F1, and PRMBench metrics. The explicit contrastive term in CPMI is critical; ablations show that removal of the contrastive negative term leads to marked drops in step-level and downstream performance.
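For readers unfamiliar with the ROC-AUC metric used here, it can be computed as the probability that a randomly chosen correct step receives a higher PRM score than a randomly chosen incorrect one. The pairwise sketch below is a generic definition of the metric, not the paper's evaluation code, and the function name is hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC via pairwise comparison (ties count as half a win).

    labels: 1 for a correct step, 0 for an incorrect step.
    Runs in O(P * N) time, which is fine for illustration only.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```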

Figure 2: PRM probability distributions on ProcessBench Math split, showing clear separation between correct (blue) and incorrect (orange) steps for CPMI-trained PRMs.

Downstream Impact: Mathematical Reasoning and Sampling

On math reasoning benchmarks (MATH, GSM8K, MMLU, Omni-MATH), CPMI-trained PRMs consistently yield higher best-of-$N$ (BoN) accuracy for larger $N$, reflecting their stronger trajectory ranking capability. Unlike MC, for which gains saturate or diminish with increasing $N$, CPMI PRMs continue to scale.
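Best-of-$N$ selection with a PRM can be sketched as follows. The sketch assumes each candidate trajectory already has per-step PRM scores and that a trajectory is scored by its weakest step (the minimum). This aggregation rule is a common choice but an assumption here, since the paper's rule is not stated above. The function name is hypothetical.

```python
from typing import Sequence


def best_of_n(step_scores: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate whose weakest step scores highest.

    step_scores[k] holds the PRM scores of every step in candidate k.
    """
    if not step_scores:
        raise ValueError("need at least one candidate")
    # Assumed aggregation: a trajectory is only as good as its worst step.
    trajectory_scores = [min(steps) for steps in step_scores]
    return max(range(len(trajectory_scores)), key=trajectory_scores.__getitem__)


candidates = [
    [0.9, 0.2, 0.8],    # one weak step drags this trajectory down
    [0.7, 0.6, 0.65],   # uniformly decent
    [0.95, 0.5],
]
print(best_of_n(candidates))
```

Here the second candidate (index 1) is selected because its minimum step score, 0.6, is the highest among the three.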

Figure 3: Best-of-$N$ reasoning accuracy on math benchmarks, demonstrating consistent scaling of CPMI and CPMI-Merge PRMs with increased sampling.

Ablation: Influence of Hard-Negative Pool Size

Variation of the number of contrastive negatives $M$ indicates that inclusion of even one hard negative is essential; however, performance is relatively robust for moderate values of $M$. Overly large $M$ dilutes signal strength; defaulting to a moderate $M$ balances stability and efficiency.

Figure 4: Ablation on number of hard negatives $M$ in CPMI objective, highlighting stable process and downstream task performance for moderate $M$.

Robustness and Generalization

CPMI-based reward labeling generalizes across backbone model sizes (down to 1B-parameter LLMs) and to logical reasoning benchmarks, outperforming MC in all tested regimes at much lower annotation cost. Its advantage also holds across the downstream training objectives tested, under both regression (MSE) and pairwise ranking losses.

Implications and Future Directions

The CPMI framework demonstrates that contrastive, model-intrinsic belief shifts can effectively and efficiently supervise process-level reward models for reasoning tasks. This result has direct implications for large-scale PRM dataset construction, making fine-grained process supervision feasible within both academic and industrial constraints. The framework’s efficiency is particularly relevant in resource-constrained environments and for rapid reward dataset development.

Theoretically, CPMI offers an explicitly contrastive, symmetric information-theoretic reward that mitigates the reward-hacking risk associated with non-contrastive likelihood-based objectives. Its design admits straightforward extension to online reinforcement learning frameworks, model-based RL, or integration with self-improving verifier approaches.

Future work directions include scaling CPMI-annotated process data to larger domains (including multi-hop and non-math tasks), further diversification of hard-negative construction strategies, and systematic benchmarking with larger PRM architectures and online RL settings.

Conclusion

CPMI enables efficient, highly discriminative process reward modeling for LLMs by leveraging the model’s own local evidence shifts over gold and contrastive hypotheses. This yields superior process supervision signals at a fraction of the computational budget required by MC-based approaches. CPMI-trained PRMs not only achieve stronger process-level assessment but also provide more effective control for mathematical reasoning via best-of-$N$ selection, with consistently better sample efficiency and generalization. This framework provides a practical, theoretically-grounded tool for the next generation of process supervision in structured LLM reasoning (2604.10660).