
RoBench25: Evaluating Robotic Cognition

Updated 12 January 2026
  • RoBench25 is a robotics benchmark that evaluates multimodal LLMs' high-level reasoning across 25 distinct tasks spanning instruction comprehension, planning, perception, affordance prediction, and failure analysis.
  • It organizes tasks into five cognitive dimensions, including explicit and implicit goal translation, multi-view perception, and spatiotemporal planning, ensuring fine-grained evaluation.
  • The benchmark comprises over 6,000 QA pairs curated from real-world robotic scenarios, covering diverse, attribute-rich environments.

RoBench25 denotes the twenty-five-task core of RoboBench, a comprehensive evaluation benchmark specifically designed to probe the high-level reasoning and cognitive capabilities of multimodal LLMs (MLLMs) as they function as the “embodied brain” (System 2) within robotic manipulation pipelines. Distinct from benchmarks quantifying low-level control or execution success, RoBench25 rigorously tests MLLMs’ facility for instruction interpretation, perception-driven reasoning, planning in diverse physical/cognitive spaces, affordance prediction, and failure diagnosis under conditions reflective of real-world, attribute-rich, multi-view robotics data (Luo et al., 20 Oct 2025).

1. Design Rationale and Scope

RoBench25 is underpinned by the observation that truly systematic evaluation of the cognitive core (System 2) in robotics requires both broad and fine-grained task coverage. It organizes 25 distinct tasks across five principal cognitive dimensions:

  1. Instruction Comprehension: Explicit and implicit goal translation to executable plans.
  2. Perception Reasoning: Multi-modal reasoning over object, scene, temporal, causal, and referential information.
  3. Generalized Planning: Long-horizon, next-step, and state-estimation challenges across embodiments, objects, and camera perspectives.
  4. Affordance Prediction: Static contact points, dynamic trajectories, and navigation base-poses for manipulation and mobility.
  5. Failure Analysis: Diagnoses covering both execution and planning errors.

These tasks span 14 mapped sub-capabilities, enabling granular measurement and comparison of embodied cognition. The dataset comprises 6,092 QA pairs, curated from both open-source and in-house robotic video/caption sources to maximize realism and coverage.

2. Task Taxonomy and Input/Output Modalities

Each of the 25 tasks is parametrized to stress distinct cognitive mechanisms:

| Cognitive Dimension | # Tasks | Input Modalities | Typical Output |
|---|---|---|---|
| Instruction Comprehension | 2 | Image + instruction | Ordered function sequence |
| Perception Reasoning | 8 | Image (+ bbox) | Multiple choice |
| Generalized Planning | 10 | RGB, text, history | Function sequence, discrete |
| Affordance Prediction | 3 | Image (+ subgoal) | 2D point / trajectory / base-pose |
| Failure Analysis | 2 | Video or plan | Multiple choice |

Instruction Comprehension tasks require the transformation of explicit (“Put the apple on the plate”) or implicit (“I’m thirsty”) goals into structured action plans (e.g., “pick_up(cup), fill(cup, water), deliver(cup, user)”). Perception Reasoning covers object and robot identification, spatial/temporal/causal inference, and referential grounding from static frames with or without bounding box contextualization. Generalized Planning assesses both the extraction of long-horizon plans and the prediction of next steps or milestone state transitions, sometimes invoking egocentric and multi-camera observations. Affordance tasks demand outputs in the form of spatial coordinates or point trajectories, facilitating evaluation of both grasp point prediction and navigation feasibility. Failure Analysis includes both execution and planning error identification via multiple-choice responses to failed execution videos or plan diagnostics.
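As a concrete illustration of the structured action plans described above, a function-sequence string such as "pick_up(cup), fill(cup, water), deliver(cup, user)" can be parsed into typed actions. This is a minimal sketch, not the benchmark's actual tooling; the `Action` representation and `parse_plan` helper are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Action:
    skill: str          # e.g. "pick_up"
    args: tuple         # e.g. ("cup",)

def parse_plan(text: str) -> list[Action]:
    """Parse a comma-separated function sequence like
    'pick_up(cup), fill(cup, water)' into Action objects."""
    actions = []
    for call in text.split("),"):
        call = call.strip().rstrip(")")
        skill, _, argstr = call.partition("(")
        args = tuple(a.strip() for a in argstr.split(",") if a.strip())
        actions.append(Action(skill.strip(), args))
    return actions

plan = parse_plan("pick_up(cup), fill(cup, water), deliver(cup, user)")
# plan[0].skill == "pick_up", plan[1].args == ("cup", "water")
```

A structured form like this is what makes plan-level scoring (matching skills, objects, and parameters) possible downstream.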

3. Evaluation Methodology and Scoring Frameworks

Success criteria in RoBench25 are precisely defined and interpreted in the context of each task group:

  • Multiple-choice tasks (Perception, Failure Analysis): Raw accuracy.
  • Affordance Prediction: Scores computed as $100 \times (1-d)^{2.5}$, where $d$ is the normalized Euclidean error on a $[0, 1]$ scale.
  • Long-Horizon Planning: Combined correctness of action nodes and simulated object-state milestones, normalized as $\text{LongHorizon} = \frac{\text{NodeCorrectness} + \text{TaskCompletion}}{20}$.
  • Next-Step Planning: Aggregation of skill, object, and parameter matches.
  • State Estimation: Standard binary accuracy.
  • World-Simulator Rollout: For generalized planning, proposed plans are virtually executed using a model-recursive “MLLM-as-world-simulator” pipeline to test for feasibility and correct object-state progression rather than mere string similarity.

This comprehensive approach penalizes plans that are correct in structure but physically implausible, and rewards answers aligning with ground-truth causal/temporal scene structure.
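The affordance and long-horizon formulas above can be sketched directly. Note that the 0–10 component scales assumed in `long_horizon_score` are an interpretation; the source states only the division by 20:

```python
def affordance_score(d: float) -> float:
    """Affordance score 100 * (1 - d)^2.5, with d the normalized
    Euclidean error in [0, 1]: 100 at zero error, 0 at maximal error."""
    return 100.0 * (1.0 - d) ** 2.5

def long_horizon_score(node_correctness: float, task_completion: float) -> float:
    """Long-horizon score (NodeCorrectness + TaskCompletion) / 20.
    Assumes each component is reported on a 0-10 scale so the result
    lies in [0, 1]; only the /20 normalization is given in the source."""
    return (node_correctness + task_completion) / 20.0

print(affordance_score(0.2))   # ~57.2: small errors are penalized gently
print(affordance_score(0.8))   # ~1.8: large errors collapse toward zero
```

The exponent 2.5 makes the score drop super-linearly with error, so near-misses retain most of the credit while badly placed predictions score near zero.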

4. Dataset Construction and Realism Protocols

The RoBench25 dataset integrates data from major robotic collections (RH2.0-T, RT-X, RDT, RoboMIND, Open X Embodiment, RHD, EGO4D) as well as bespoke in-house video and image captures of dual-arm, mobile, and humanoid scenarios. Object catalogs are enhanced with attribute annotations (e.g., material, deformability) and world-knowledge distractors. Task prompts are generated by both LLM/VLM passes and refined by expert review, with majority-vote filtering to ensure exclusion of ambiguous or trivial cases. The resulting dataset spans cross-embodiment cases (single- vs. dual-arm, mobile), multi-view navigation (egocentric/memory-driven), and attribute-rich scenes.
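The majority-vote filtering step might look like the following sketch; the boolean reviewer-vote representation and the `majority_vote_keep` threshold are assumptions for illustration, not the authors' documented protocol:

```python
from collections import Counter

def majority_vote_keep(reviews: list[bool], threshold: float = 0.5) -> bool:
    """Keep a QA item only if a strict majority of reviewers marked it
    valid (i.e., neither ambiguous nor trivial)."""
    votes = Counter(reviews)
    return votes[True] / len(reviews) > threshold

# Hypothetical QA items with per-reviewer validity votes.
items = [("q1", [True, True, False]), ("q2", [False, True, False])]
kept = [qid for qid, reviews in items if majority_vote_keep(reviews)]
# kept == ["q1"]: q2 is dropped because only 1 of 3 reviewers approved it
```

Filtering at the item level like this is a simple way to exclude ambiguous or trivial prompts before the dataset is frozen.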

5. Empirical Results and Model Analysis

Evaluations of 14 mainstream MLLMs (closed- and open-source, including domain-specialized models) reveal the following trends (normalized 0–100 scale):

  • Instruction Comprehension: Explicit tasks average ~45; implicit tasks ~15.
  • Perception Reasoning: Closed-source MLLMs achieve ~40–60; humans ~74.
  • Long-Horizon Planning: Closed-source MLLMs ~34–42; humans ~54.
  • Next-Step Planning and State Estimation: Model scores vary from 40–70; humans ~70.
  • Affordance Prediction: Static tasks are easier (~50–65); dynamic and navigation tasks score lower than static; humans ~82.
  • Failure Analysis: Planning-level ~40–60; execution-level ~10–20; humans ~47/81.

Key limitations are observed in implicit instruction following (−30 point drop vis-à-vis explicit instructions), spatiotemporal perception (<44% for robot view, <50% for temporal grounding), physically feasible dual-arm and rare-object planning (<30%), and execution failure identification.

6. Significance, Weaknesses, and Outlook

RoBench25 establishes that current MLLMs, despite strong priors in language and vision, lack robust System 2-level grounding in several critical areas: implicit intent interpretation, robust long-horizon and dual-arm planning, embodiment-aware perception over time and space, fine-grained affordance mapping, and nuanced failure diagnosis. These findings point to a set of unaddressed research frontiers:

  • Learning robust, spatiotemporal world models
  • Integrating action-affordance knowledge with language/vision
  • Real-robot self-supervision for sim2real transfer
  • Development of meta-reasoning modules for implicit intent ambiguity and reflective diagnosis

A plausible implication is that progress on these axes will require explicitly multimodal, physically grounded, and meta-cognitive learning objectives, as well as further innovation in scalable, realistic benchmark design. RoBench25 provides the scaffold—broad task variety, simulation-driven planning evaluation, and comprehensive coverage—for researchers to measure, compare, and drive advances towards embodied MLLMs demonstrating expert-level cognitive capabilities in real-world robotic environments (Luo et al., 20 Oct 2025).
