
Agent Process Reward Models

Updated 5 January 2026
  • Agent Process Reward Models are a family of methods that provide stepwise, decision-level feedback to tackle the temporal credit assignment bottleneck in complex tasks.
  • They employ explicit regression, implicit log-likelihood ratios, and zero-shot judgments to generate dense supervision from Monte Carlo targets and preference-based objectives.
  • AgentPRMs integrate into planning techniques like Best-of-N sampling and beam search, demonstrating improved performance in web navigation, GUI automation, and multi-agent collaboration.

Agent Process Reward Models (AgentPRM) designate a family of methods for providing step-wise, decision-level feedback—termed process rewards—to LLM agents trained for long-horizon, partially observable tasks. Unlike classical outcome reward models that only deliver scalar feedback at episode termination, AgentPRMs generate dense supervision by evaluating each agent decision (state–action or reasoning step) according to expected trajectory utility, local progress, or correctness. Designed to address the temporal credit assignment bottleneck in sequence modeling, these models have become central to recent advances in LLM-based agents for domains including web navigation, embodied planning, tool use, GUI automation, information seeking, and multi-agent collaboration (Xia et al., 25 Feb 2025, Choudhury, 14 Feb 2025, Xi et al., 11 Nov 2025, Xiong et al., 2024, Lee et al., 24 Nov 2025, Xiong et al., 27 Sep 2025, Lin et al., 21 Oct 2025, Zhou et al., 4 Mar 2025, Gandhi et al., 2 Sep 2025, Chen et al., 17 Feb 2025, 2505.20737, Liu et al., 23 Sep 2025, Peng et al., 26 Feb 2025).

1. Formalization and Core Variants

AgentPRMs are formulated for agents acting in (partially observable) Markov decision processes or related sequential environments. At each step $t$ in a trajectory, the agent policy $\pi_\theta$ produces an action $a_t$ given state $s_t$ (possibly incorporating prior instructions, observations, or thoughts). The process reward model $r_\phi(s_t, a_t)$ estimates the quality, value, or progress associated with $(s_t, a_t)$, supporting both policy improvement and test-time decision selection.
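To make the contrast with outcome reward models concrete, here is a minimal sketch of sparse versus dense feedback over a toy trajectory. All function names, states, and values below are illustrative, not from any cited system:

```python
# Outcome RM: one scalar at episode termination.
# Process RM: a score for every (s_t, a_t) decision along the way.

def outcome_feedback(trajectory, outcome_reward):
    """Sparse signal: zeros everywhere except the final step."""
    return [0.0] * (len(trajectory) - 1) + [outcome_reward]

def process_feedback(trajectory, r_phi):
    """Dense signal: one r_phi score per (state, action) pair."""
    return [r_phi(s, a) for s, a in trajectory]

# Toy web-navigation trajectory; the failed outcome gives an all-zero signal,
# while a PRM still distinguishes useful steps from the harmful final one.
traj = [("login_page", "type_user"), ("login_page", "type_pass"), ("home", "logout")]
sparse = outcome_feedback(traj, 0.0)
dense = process_feedback(traj, lambda s, a: 0.5 if s == "login_page" else -0.5)
```

The dense signal is what makes per-step credit assignment tractable for long horizons.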

Three major AgentPRM instantiations have been delineated:

| Variant | Key Mechanism | Training/Derivation |
|---|---|---|
| Explicit PRM | Supervised regression onto MC values | $r_\phi$ fits step values via rollouts |
| Implicit PRM | Advantage via log-likelihood ratios | $r_\phi(s,a) = \beta \log \frac{\pi_\phi}{\pi_{\text{ref}}}$ |
| LLM-as-Judge PRM | Zero-shot judgment by LLM over candidates | No training; prompt-based trajectory ranking |
  • Explicit PRM: Trains $r_\phi$ to regress Monte Carlo estimates of future return or fractional completion for each observed $(s_t, a_t)$, usually via a value head on an LM, as in MCTS-inspired pipelines (Xia et al., 25 Feb 2025, Choudhury, 14 Feb 2025).
  • Implicit PRM: Defines step-wise advantage from log-probability ratios between the current policy and a reference, often interpreted as a Q-value; matches the "Free Process Rewards" approach (Xia et al., 25 Feb 2025, Liu et al., 23 Sep 2025).
  • LLM-as-Judge PRM: Leverages an LLM to directly evaluate and select preferred full trajectories post hoc at test time, bypassing explicit model training (Xia et al., 25 Feb 2025).

These variants are often complementary and can be evaluated jointly (Xia et al., 25 Feb 2025).
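As a concrete sketch of the implicit variant, the step reward reduces to a scaled log-likelihood ratio between the trained policy and a frozen reference. The function name, the value of $\beta$, and the example probabilities below are illustrative assumptions, not taken from any of the cited systems:

```python
import math

def implicit_step_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Implicit PRM step reward: beta * log(pi_phi(a|s) / pi_ref(a|s)).

    logp_policy: log-probability of action a under the trained policy pi_phi.
    logp_ref:    log-probability of the same action under the reference pi_ref.
    """
    return beta * (logp_policy - logp_ref)

# Actions the fine-tuned policy has come to prefer over the reference earn a
# positive process reward; actions it has learned to avoid earn a negative one.
r_up = implicit_step_reward(math.log(0.6), math.log(0.3))
r_down = implicit_step_reward(math.log(0.1), math.log(0.3))
```

No separate reward network is trained; the "free" process reward falls out of the policy and reference log-probabilities already computed during fine-tuning.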

2. Training Objectives and Data Construction

AgentPRM methods rely on various training objectives depending on access to outcome rewards, demonstrations, or preference data:

  • Supervised Regression (Explicit PRM): Using a corpus of trajectories, process rewards are regressed onto Monte Carlo, TD($\lambda$), or Generalized Advantage Estimation (GAE) targets:

$$\mathcal{L}_{\text{exp}}(\phi) = \frac{1}{N} \sum_{t=1}^{N} \big( r_\phi(s_t, a_t) - V(s_t) \big)^2$$

where $V(s_t)$ is estimated via environment rollouts or value backups (Xia et al., 25 Feb 2025, Xi et al., 11 Nov 2025).
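The regression targets come from discounted returns over rollouts. A minimal sketch under stated assumptions (the helper names, $\gamma = 0.9$, and the toy sparse-reward sequence are illustrative, not from the cited papers):

```python
def monte_carlo_targets(rewards, gamma=0.99):
    """Discounted return for every step, computed by a single backward sweep."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return targets[::-1]

def explicit_prm_loss(predictions, targets):
    """Mean squared error between r_phi outputs and the MC value targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# A sparse outcome reward (1.0 only at episode end) becomes a dense per-step
# target that the explicit PRM's value head can regress onto.
targets = monte_carlo_targets([0.0, 0.0, 1.0], gamma=0.9)  # ~ [0.81, 0.9, 1.0]
loss = explicit_prm_loss([0.5, 0.7, 0.9], targets)
```

In practice the predictions come from a value head on an LM and the loss is minimized by gradient descent; the arithmetic above is the core of the objective.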

  • Preference-based/DPO (Implicit PRM): When trajectory preferences $(\tau^+, \tau^-)$ are available, process rewards are shaped to minimize the DPO loss:

$$\mathcal{L}_{\text{PRM}}(\phi) = -\mathbb{E}_{(\tau^+,\tau^-)} \log \sigma \Big( \sum_t r_\phi(s^+_t, a^+_t) - \sum_t r_\phi(s^-_t, a^-_t) \Big)$$

with $r_\phi$ defined via log-ratio of step likelihoods (2505.20737, Liu et al., 23 Sep 2025).
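For a single preference pair, this objective can be sketched in a few lines; the function name and the toy step-reward lists are illustrative assumptions:

```python
import math

def prm_dpo_loss(step_rewards_pos, step_rewards_neg):
    """-log sigma(sum_t r(s+_t, a+_t) - sum_t r(s-_t, a-_t)) for one pair."""
    margin = sum(step_rewards_pos) - sum(step_rewards_neg)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred trajectory's summed step rewards pull ahead
# of the dispreferred one's; at a tie it equals log 2.
loss_separated = prm_dpo_loss([0.5, 0.4, 0.3], [0.1, 0.0, -0.2])
loss_tied = prm_dpo_loss([0.0], [0.0])
```

Because only the trajectory-level sum is supervised, per-step credit emerges implicitly from how the summed margin is apportioned across steps during optimization.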

  • Contrastive and Stepwise Preference Mining: Iterative mining of contrastive (good/bad) action pairs at each step using reward differentials expands coverage of potential errors and ambiguities (Xiong et al., 2024).
  • Synthetic Annotation: In data-limited scenarios, stepwise or trajectory preference data are synthesized via LLM annotation or minimal perturbations (Chen et al., 17 Feb 2025, Zhou et al., 4 Mar 2025).
  • Auxiliary Criteria: In multi-agent or task-decomposition regimes, collaborative reward models can use correctness, cost, and historical success, especially for subtask assignment (Zhou et al., 4 Mar 2025).

3. Integration into Search, Planning, and Inference

AgentPRMs are deployed in several inference-time algorithms for action selection and guidance, including Best-of-N sampling over candidate trajectories and reward-guided beam search (Xia et al., 25 Feb 2025).
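As one example, Best-of-N selection with a PRM reduces to scoring each sampled candidate trajectory by its summed step rewards and keeping the argmax. A hedged sketch with a toy scorer (all names and the dummy PRM are assumptions for illustration):

```python
def best_of_n(candidates, prm_score):
    """Best-of-N: sample N candidate trajectories, keep the PRM-preferred one.

    candidates: list of trajectories, each a list of (state, action) steps.
    prm_score:  callable scoring a single step; a trajectory's score is the sum.
    """
    return max(candidates, key=lambda traj: sum(prm_score(s, a) for s, a in traj))

# Toy PRM standing in for a learned r_phi: it rewards steps whose action
# matches the one the task needs.
score = lambda s, a: 1.0 if a == "click" else 0.0
trajs = [[("home", "scroll"), ("home", "click")],
         [("home", "scroll"), ("home", "scroll")]]
best = best_of_n(trajs, score)
```

Beam search applies the same scoring incrementally, pruning partial trajectories at each step rather than waiting for N full rollouts.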

4. Empirical Results and Evaluation Protocols

AgentPRMs show consistent, significant improvements across a variety of agentic benchmarks. Representative quantitative findings include:

| System | Task Domains | Best-of-N Gain | Notable Findings | Reference |
|---|---|---|---|---|
| AgentRM | Web, Embodied | +8.8 avg, +12.6 on large | Outperforms state of the art, generalizes to new tasks | (Xia et al., 25 Feb 2025) |
| AgentPRM | Language, Web, RL | Up to +11 pp (Llama-70B) | Strong improvements vs. SFT, scalable for search | (Chen et al., 17 Feb 2025) |
| PRInTS | Information Seeking | +9.3 pts (base), +3–4 pts | Outperforms strong agent and reward baselines | (Lee et al., 24 Nov 2025) |
| SWE-PRM | Software Engineering | +10.6 pp, shorter runs | Taxonomy-guided feedback most effective | (Gandhi et al., 2 Sep 2025) |
| CUAReward | GUI/OS | UPE: 81.7% step-P | Ensembles outperform specialized CUA reward models | (Lin et al., 21 Oct 2025) |
| GUI-PRA | GUI Automation | +9–20 pp | Dynamic memory and adaptive perception key | (Xiong et al., 27 Sep 2025) |

Key metrics include success rate, progress rate, downstream accuracy under "LLM-as-judge," and per-step precision/NPV in process reward assessment.

Evaluation typically controls for both policy improvement and process reward accuracy, with ablations of data scale, architecture, state representation, and test-time search parameters (Xia et al., 25 Feb 2025, Xi et al., 11 Nov 2025, Lin et al., 21 Oct 2025).

5. Extensions: Multi-Agent, Domain-Specific, and Theoretical Innovations

AgentPRMs have served as the foundation for:

  • Multi-agent Collaboration: CRMs (collaborative reward models) for agent assignment in decomposed task graphs (Zhou et al., 4 Mar 2025).
  • Long-Horizon Information Seeking: Summarization-augmented PRMs (e.g., PRInTS) compress growing context, enabling dense, multi-dimensional reward for navigation and tool use (Lee et al., 24 Nov 2025).
  • Software and GUI Agents: Taxonomy-driven process feedback (Gandhi et al., 2 Sep 2025), context-adaptive scoring, and dynamic memory or UI-state grounding (Xiong et al., 27 Sep 2025, Lin et al., 21 Oct 2025).
  • Online Process Reward Learning: Implicit stepwise rewards from trajectory-level preference via DPO, yielding potential-based shaping and improved RL stability (Liu et al., 23 Sep 2025).
  • Agentic Reward Modeling: Composite reward systems combining human preference with verifiable correctness checks (factuality, instruction-following) (Peng et al., 26 Feb 2025).

Novel theoretical contributions include analyses showing that implicit process rewards via log-ratio or preference-based shaping respect optimality conditions and provide bounded, variance-reducing gradients in RL policy optimization (Liu et al., 23 Sep 2025, Xi et al., 11 Nov 2025).
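The potential-based-shaping connection can be made concrete: if the process reward contributes a term of the form $\gamma \Phi(s') - \Phi(s)$ on top of the environment reward, the shaping terms telescope over a trajectory, so total return is altered only at the boundaries and the optimal policy is preserved. A sketch under that assumption (names and numbers illustrative):

```python
def shaped_reward(r, phi_s, phi_next, gamma=0.99):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s).

    By the classic potential-based shaping result, this transformation leaves
    the optimal policy unchanged; Phi plays the role of the implicit process
    reward's per-state value estimate in this sketch.
    """
    return r + gamma * phi_next - phi_s

# With gamma = 1 the shaping terms telescope, so the total shaped return
# differs from the raw return only by Phi(s_T) - Phi(s_0).
raw = [0.0, 0.0, 1.0]
phis = [0.2, 0.5, 0.9, 0.0]  # potentials for s_0..s_3 (terminal potential 0)
shaped = [shaped_reward(r, phis[t], phis[t + 1], gamma=1.0) for t, r in enumerate(raw)]
```

This telescoping is the mechanism behind the bounded, variance-reducing gradient claims: shaping redistributes credit across steps without changing which policies are optimal.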

6. Challenges and Open Directions

AgentPRM systems, while sample-efficient and flexible, still face open challenges.

Future directions include richer world models for model-predictive AgentPRMs, multi-criteria or hierarchical process rewards, advanced ensemble techniques, and unified architectures coupling policy and reward via shared backbones or joint online optimization (Choudhury, 14 Feb 2025, Xi et al., 11 Nov 2025, Lin et al., 21 Oct 2025, Xiong et al., 27 Sep 2025).
