AgentRewardBench: Unified Reward Evaluation
- AgentRewardBench is a unified benchmark suite that systematically evaluates agent reward models with fine-grained, step-level assessments.
- It focuses on perception, planning, and safety across diverse real-world-inspired scenarios such as web navigation, autonomous driving, and embodied tasks.
- The benchmark standardizes reward model evaluation, highlights performance gaps, especially in safety-critical tasks, and guides improvements in multimodal agent learning.
AgentRewardBench is a unified benchmark suite designed to systematically evaluate the reward modeling capabilities of multimodal LLMs (MLLMs) and agent-centric reward models. It targets the growing need for reliable, fine-grained external feedback to guide agents in real-world tasks—especially in settings with perception, sequential planning, and safety-critical elements—by providing step-level assessments across diverse, multimodal scenarios. The suite allows for standardized, reproducible evaluation and comparison of candidate reward models, filling a crucial methodological and empirical gap in agent learning beyond imitation learning paradigms (Men et al., 26 Jun 2025).
1. Motivation and Scope
AgentRewardBench addresses limitations in prevailing agent training methodologies, which have largely relied on supervised imitation learning with scarce and costly expert trajectories. This regime restricts advances in agent self-correction and generalization, especially for tasks demanding intricate, multi-step reasoning or proactive risk management. By leveraging reward models (RMs) as external feedback providers—either for reward-guided reinforcement learning or as search heuristics—there is potential to unlock more flexible, scalable, and autonomous agent behaviors.
Despite this promise, robust selection and evaluation of agent reward models have been hampered by the absence of domain-appropriate, fine-grained benchmarks. AgentRewardBench was developed to fill this void. It provides three essential dimensions of evaluation—perception, planning, and safety—spanning real-world-like agent scenarios (web navigation, desktop and mobile environments, autonomous driving, embodied tasks, and open-ended games like Minecraft) and exposes granular, step-level feedback structures (Men et al., 26 Jun 2025).
2. Benchmark Structure and Scenarios
AgentRewardBench is organized along three application dimensions, each represented by concrete, real-world-inspired agent settings:
| Dimension | Scenario | Input Modality & Task |
|---|---|---|
| Perception | Web GUI Perception | Screenshot + grounding instruction → bounding box localization |
| Perception | Embodied Perception | Egocentric image with boxes → ordered object list |
| Planning | Web Navigation Planning | GUI screenshot + history → next GUI action (CLICK/TYPE/SELECT) |
| Planning | Embodied Driving Planning | Driving image + speed → legal, safe maneuver selection |
| Planning | Embodied Household Planning | Virtual home image + goal/inventory → next logical subtask |
| Planning | Minecraft Planning | Game screenshot + inventory → next survival step selection |
| Safety | Web Safety | Screenshot of risky scenario → safest response (reject, localize, warn) |
| Safety | Embodied Safety | Dangerous egocentric scene → safe action plan and warning if unsafe |
Agents are evaluated by comparing two candidate next-step responses—one correct, one incorrect—given these multimodal contextual inputs. For each scenario, step-level decisions are central: reward models must distinguish which of the pair deserves the higher reward, as judged against human annotations. Samples are systematically curated from ten diverse models (commercial and open-source, 7B–70B scale), with an emphasis on controlled difficulty and domain coverage (Men et al., 26 Jun 2025).
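A single evaluation instance can be pictured as a small record pairing the multimodal context with the two candidate responses. The following is a minimal sketch; the field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class StepEvalPair:
    """One step-level evaluation instance (field names illustrative)."""
    dimension: str     # "perception" | "planning" | "safety"
    scenario: str      # e.g. "web_navigation_planning"
    context: str       # textual context: instruction, history, inventory, etc.
    image_path: str    # the multimodal input (screenshot / egocentric frame)
    response_pos: str  # human-verified correct next-step response (r+)
    response_neg: str  # incorrect response (r-)

# hypothetical example of a web-navigation planning pair
pair = StepEvalPair(
    dimension="planning",
    scenario="web_navigation_planning",
    context="Goal: add item to cart. History: [CLICK search box]",
    image_path="screens/step_03.png",
    response_pos="CLICK(add_to_cart_button)",
    response_neg="TYPE(search_box, 'checkout')",
)
```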
3. Step-Level Reward Evaluation and Data Quality
The benchmark constructs step-level evaluation pairs as follows: for each task instance, two responses (r⁺ = correct, r⁻ = incorrect) are drawn from the candidate set Sᵣ, with each pair presented in both possible orders to minimize positional bias. Formally, the RM M must select the superior answer for each ordered pair i, and step-level accuracy is defined as

Acc = (1/N) Σᵢ₌₁ᴺ 𝟙ᵢ,

where 𝟙ᵢ = 1 if the positive response rᵢ⁺ is correctly preferred (0 otherwise), and N is the total number of ordered pairs.
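This accuracy computation, including the both-orders presentation, can be sketched as follows. The `judge` callable is a hypothetical stand-in for the reward model M, returning the index of its preferred answer; the toy judge at the end exists only to exercise the function.

```python
from typing import Callable

def step_level_accuracy(
    pairs: list[tuple[str, str]],        # (r_plus, r_minus) per task instance
    judge: Callable[[str, str], int],    # 0 = prefers first answer, 1 = prefers second
) -> float:
    """Pairwise accuracy over both orderings of every (r+, r-) pair.

    Presenting each pair in both orders cancels positional bias, as the
    benchmark prescribes; each instance contributes two ordered pairs to N.
    """
    hits, total = 0, 0
    for r_plus, r_minus in pairs:
        hits += judge(r_plus, r_minus) == 0  # r+ shown first: correct pick is index 0
        hits += judge(r_minus, r_plus) == 1  # r+ shown second: correct pick is index 1
        total += 2
    return hits / total

# toy judge that always prefers the longer answer (for illustration only)
longer = lambda a, b: 0 if len(a) >= len(b) else 1
acc = step_level_accuracy(
    [("CLICK(add_to_cart)", "TYPE(q)"), ("stop", "wander off")], longer
)
```

Because each instance is scored in both orders, a judge with pure position bias lands at exactly 50%, the chance baseline.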
Difficulty control is effected via multi-stage filtering: candidate step-response pairs are first scored by three distinct small models to retain only those in the “mid” or “hard” bands, preventing trivial judgments. Final sets undergo rigorous human verification to eliminate ambiguous or label-noise-inducing examples. Of 1,443 initial pairs, 1,136 passed two rounds of expert review, constituting the high-quality evaluation backbone (Men et al., 26 Jun 2025).
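The filtering stage can be sketched as bucketing each candidate pair by how many of the three small filter models judge it correctly. The exact band thresholds below are assumptions for illustration; the source specifies only that trivially easy pairs are discarded before human review.

```python
def difficulty_band(small_model_correct: list[bool]) -> str:
    """Bucket a candidate pair by the number of small filter models that get it right.

    Assumed mapping (illustrative): 3/3 correct -> "easy" (dropped),
    2/3 -> "mid", <=1/3 -> "hard".
    """
    k = sum(small_model_correct)
    if k == 3:
        return "easy"
    if k == 2:
        return "mid"
    return "hard"

# three candidate pairs, each scored by three small models
candidates = [[True, True, True], [True, False, True], [False, False, False]]
kept = [c for c in candidates if difficulty_band(c) in ("mid", "hard")]
```

Only the retained "mid" and "hard" pairs then proceed to the two rounds of expert verification described above.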
4. Metrics, Results, and Practical Insights
AgentRewardBench defines the following core metrics:
- Pairwise Accuracy (step-level): Percentage of ordered pairs where the RM identifies the positive response.
- Dimension Averages: Mean accuracy for each primary task axis (perception, planning, safety).
- Overall Score: Arithmetic mean of the three dimension averages.
- Downstream Correlation: Pearson correlation between a model's score on AgentRewardBench and its real-world impact, e.g., its policy quality when deployed in reward-guided A* search (for VisualWebArena web navigation, ρ = 0.981, p = 0.003).
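The downstream-correlation metric is an ordinary Pearson r between benchmark scores and a downstream quality measure. A self-contained sketch, with made-up numbers standing in for real model scores:

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between benchmark scores and downstream policy quality."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# illustrative (made-up) numbers: overall benchmark scores vs. downstream success rates
bench = [50.0, 55.0, 61.4, 61.6]
downstream = [0.20, 0.28, 0.40, 0.41]
r = pearson_r(bench, downstream)
```

In practice one would also report a p-value (e.g. via `scipy.stats.pearsonr`), as the source does for the VisualWebArena result.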
Empirical results demonstrate fundamental limitations. Even the strongest commercial RMs reached only ~61% overall: GPT-4o-2024-08-06 scored 61.4% (65.9% perception, 73.2% planning, 39.2% safety) and Gemini-1.5-Pro 61.6%, with a persistent safety gap in both. Open-source models lagged further behind (50–55% overall). Larger models consistently outperformed smaller variants, particularly on planning-intensive scenarios, but no evaluated RM achieved robust safety discrimination, marking safety as a key research frontier. This pattern underscores the need for task-specific RM pre-training and augmented data collection in underrepresented dimensions (Men et al., 26 Jun 2025).
5. Usage Protocols and Best Practices
Robust application of AgentRewardBench mandates explicit experimental protocols:
- Prompt Standardization: Use the “Compare Template” with zero temperature and unambiguous forced-choice instruction.
- Bias Mitigation: Every pair (r⁺, r⁻) is presented in both orderings; final accuracy is the average, minimizing label-side artifacts.
- Difficulty Calibration: Filter pairs with small reward models to maintain moderate challenge.
- Manual QC: Involve annotators trained in relevant domains, critically important for subtle or safety-related failure modes.
- Aggregation: Report per-scenario, per-dimension, and overall statistics for comprehensive comparison.
- Downstream Validation: Optionally validate the practical transfer of RM quality by integrating with learning or search algorithms.
- Training Regimens: For new RMs, fine-tune on AgentRewardBench pairs, prioritizing safety cases where data is sparse (Men et al., 26 Jun 2025).
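The prompt-standardization and bias-mitigation protocols combine into a simple evaluation step per pair. The template wording and the `llm` interface below are hypothetical stand-ins; the actual "Compare Template" is not reproduced here, only its forced-choice structure.

```python
# Hypothetical forced-choice template (structure only, not the real wording).
COMPARE_TEMPLATE = (
    "Context:\n{context}\n\n"
    "Answer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Which answer is better? Reply with exactly 'A' or 'B'."
)

def judge_pair(llm, context: str, r_plus: str, r_minus: str) -> float:
    """Score one pair in both orders with a temperature-0 forced choice.

    `llm` is assumed to be a callable prompt -> text. Returns the fraction
    of the two orderings where r+ was preferred (0.0, 0.5, or 1.0).
    """
    hits = 0
    for a, b, good in ((r_plus, r_minus, "A"), (r_minus, r_plus, "B")):
        prompt = COMPARE_TEMPLATE.format(context=context, answer_a=a, answer_b=b)
        hits += llm(prompt).strip().upper().startswith(good)
    return hits / 2

always_a = lambda prompt: "A"  # degenerate judge with pure position bias
score = judge_pair(always_a, "goal: close popup", "CLICK(close)", "TYPE(q)")
```

Note how the order swap exposes positional bias: a judge that always answers "A" scores exactly 0.5 regardless of content.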
6. Positioning Among Related Benchmarks
AgentRewardBench is distinct from prior benchmarks with related goals: the earlier, identically named "AgentRewardBench" (Lù et al., 11 Apr 2025), which tests LLM-judge validity on full-task web navigation trajectories, and CUARewardBench (Lin et al., 21 Oct 2025), which evaluates computer-using agents with both outcome (trajectory-level) and process (step-level) assessment. Specifically:
- Trajectory-based Benchmarks: The original “AgentRewardBench” (Lù et al., 11 Apr 2025) evaluates LLMs as judges over entire web agent trajectories, measuring agreement with expert labels on success, side-effects, and repetition. Its judgments are at the trajectory level and do not expose the step-level granularity or fine discrimination desired in reward model training and evaluation. Rule-based evaluation baselines from this benchmark systematically underestimate agent performance due to rigid or incomplete success criteria.
- CUARewardBench: Expands to desktop software evaluation, introducing ORM (outcome reward models) and PRM (process reward models) and ensemble scoring. It emphasizes precision, negative predictive value, and coverage of UI diversity, but centers on computer-using (desktop) agents (Lin et al., 21 Oct 2025).
AgentRewardBench instead provides a unified, step-focused testbed for multimodal agents covering web, mobile, autonomous driving, virtual homes, and open-ended game-like tasks, with special attention to safety, making it critical for agent R&D that aims to move beyond imitation learning into robust learning-from-feedback (Men et al., 26 Jun 2025).
7. Significance, Limitations, and Future Directions
AgentRewardBench stands as a standardized suite enabling rigorous, reproducible evaluation of agent-specific reward models across perception, planning, and safety. It exposes significant gaps in RM performance—especially in safety and generalization—and motivates the development of richer, scenario-specific reward modeling methods. The benchmark’s emphasis on carefully curated, human-validated, step-level distinctions establishes a high bar for model selection, diagnostics, and future RM pretraining data.
A plausible implication is that progress on AgentRewardBench may translate into measurable gains in downstream reinforcement learning efficiency and real-world agent reliability, though systematic assessment of this link remains an avenue for follow-up experimental work. Open directions include expanding scenario diversity, increasing safety sample coverage, integrating auxiliary signals (e.g., logs, accessibility metadata) into both the dataset and the evaluation pipeline, and tailoring RM architectures explicitly for nuanced multimodal grounding and sequential decision support (Men et al., 26 Jun 2025).