Embodied Reasoning Intelligence (ERIQ)
- ERIQ is a benchmark that quantifies embodied reasoning by evaluating visual, spatial, and collaborative planning capabilities in artificial agents.
- It differentiates high-level cognitive planning from low-level motor control using deterministic scoring over diverse simulated and real-world tasks.
- Empirical findings reveal a strong correlation between ERIQ scores and open-world performance while highlighting bottlenecks in tool selection, planning, and collaboration.
The Embodied Reasoning Intelligence Quotient (ERIQ) is a quantitative metric and diagnostic benchmark developed to assess the embodied reasoning capabilities of artificial agents, specifically in the context of vision-language and embodied tasks. ERIQ formalizes the evaluation of models' physical reasoning, planning, adaptation, and collaborative capacities within simulated and real-world robotic manipulation environments, decoupling high-level cognitive reasoning from low-level motor control and execution. It serves as a principled foundation for evaluating embodied intelligence beyond traditional linguistic benchmarks, enabling systematic identification of reasoning bottlenecks in vision-language-action (VLA) models and embodied language agents (Liu et al., 30 Dec 2025, Wang et al., 7 Aug 2025).
1. Formal Definition and Scoring Functions
ERIQ provides a deterministic, reproducible framework for measuring embodied reasoning accuracy. In both major instantiations—robotic manipulation-focused ERIQ (Liu et al., 30 Dec 2025) and agent-centric OmniEAR (Wang et al., 7 Aug 2025)—the core metric is an average over binary success indicators across a benchmark suite:
- Let $\mathcal{Q}$ denote the set of question–answer (QA) pairs for ERIQ (Liu et al., 30 Dec 2025), and $\mathcal{S}$ the set of evaluation scenarios for OmniEAR (Wang et al., 7 Aug 2025).
- For ERIQ, the indicator $c_i = 1$ if the model's answer to QA pair $i$ is correct, $0$ otherwise:

$$\mathrm{ERIQ} = \frac{1}{|\mathcal{Q}|} \sum_{i \in \mathcal{Q}} c_i$$

- In OmniEAR, the success binary $s_k$ for the agent(s) in scenario $k$ is $1$ if the final world-state satisfies all target predicates, and $0$ otherwise:

$$\mathrm{SR} = \frac{1}{|\mathcal{S}|} \sum_{k \in \mathcal{S}} s_k$$
Scoring is further decomposed hierarchically:
- Per-dimension ERIQ, restricting the average to the set $\mathcal{Q}_d$ of questions in dimension $d$: $\mathrm{ERIQ}_d = \frac{1}{|\mathcal{Q}_d|} \sum_{i \in \mathcal{Q}_d} c_i$.
- Per-subtask ERIQ, similarly constrained.
- In OmniEAR, per-category success rates $\mathrm{SR}_c$ aggregate into single-agent ($\mathrm{ERIQ}_{\mathrm{single}}$) and multi-agent ($\mathrm{ERIQ}_{\mathrm{multi}}$) sub-quotients, which are combined by a weighted sum according to the benchmark split:

$$\mathrm{ERIQ} = w_{\mathrm{single}}\, \mathrm{ERIQ}_{\mathrm{single}} + w_{\mathrm{multi}}\, \mathrm{ERIQ}_{\mathrm{multi}}$$

with weights $w_{\mathrm{single}}$ and $w_{\mathrm{multi}}$ reflecting scenario proportions (Wang et al., 7 Aug 2025).
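These scoring functions can be sketched directly in Python. The following is a minimal illustration, not either paper's released evaluation code; the record fields (`dimension`, `correct`) and the use of scenario counts as weights are assumptions for the sketch:

```python
from collections import defaultdict

def eriq_score(results):
    """Aggregate ERIQ: mean of binary correctness indicators c_i over all QA pairs."""
    return sum(r["correct"] for r in results) / len(results)

def per_dimension_scores(results):
    """Per-dimension ERIQ_d: restrict the average to each reasoning dimension d."""
    by_dim = defaultdict(list)
    for r in results:
        by_dim[r["dimension"]].append(r["correct"])
    return {d: sum(v) / len(v) for d, v in by_dim.items()}

def omniear_eriq(single_sr, multi_sr, n_single, n_multi):
    """Weighted sum of sub-quotients; weights are scenario proportions."""
    total = n_single + n_multi
    return (n_single / total) * single_sr + (n_multi / total) * multi_sr

results = [
    {"dimension": "spatial", "correct": 1},
    {"dimension": "spatial", "correct": 0},
    {"dimension": "planning", "correct": 1},
]
print(eriq_score(results))            # aggregate accuracy over all three items
print(per_dimension_scores(results))  # accuracy per reasoning dimension
```

Because every indicator is binary and the aggregation is a plain mean, the same pass over results recovers the global, per-dimension, and per-subtask scores by changing only the grouping key.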
2. Reasoning Dimensions and Task Structure
ERIQ explicitly covers multiple axes of embodied reasoning through a diverse set of tasks. In the robotic ERIQ benchmark (Liu et al., 30 Dec 2025), four critical reasoning dimensions are assessed, each instantiated by carefully designed sub-tasks:
| Dimension | Example Subtasks | Characteristic Challenge |
|---|---|---|
| Spatial Perception & Grounding | Scene understanding, object referencing, view matching | Identifying referents and relations in clutter |
| Task Planning & Execution Monitoring | Multi-step decomposition, trajectory analysis, task progress | Reasoning over plans, monitoring success |
| Error Detection & Recovery | Mistake existence/classification, recovery action | Diagnosing and prescribing correction |
| Human Intent Understanding | Intention prediction, collaboration response | Inferring goals from context and interaction |
This multidimensional organization enables hierarchical aggregation of performance, such that researchers can isolate reasoning deficits at the global, dimension, or subtask level.
OmniEAR’s implementation divides 1,500 scenarios across seven task categories spanning single-agent (Direct Command, Tool Use, Attribute Reasoning, Compound Reasoning) and multi-agent (Explicit Collaboration, Implicit Collaboration, Compound Collaboration) axes, at increasing levels of cognitive complexity (L1-L3). Each scenario mandates grounding in continuous physical properties, tool affordances, and multi-agent processes (Wang et al., 7 Aug 2025).
3. Benchmark Design and Evaluation Protocols
ERIQ’s methodology is predicated on isolating the reasoning ability of vision-language models from action execution:
- In ERIQ (Liu et al., 30 Dec 2025): All questions are presented in a fixed multiple-choice or binary (Yes/No) format based on real robot egocentric imagery or text, not requiring trajectory rollout or motor control. Ground truth is deterministically specified, and model outputs are compared in a scoring script. Input modalities encompass single images (53%), sequential clips (26%), and image–text dialogue (21%), distributed across five domains (household, restaurant, supermarket, industrial, office).
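A deterministic scoring pass for this fixed-format protocol might look like the sketch below. The answer-extraction regex and option labels are assumptions for illustration, not the benchmark's released scoring script:

```python
import re

def extract_choice(model_output):
    """Pull the first isolated option letter (A-D) or Yes/No token
    from free-form model text; unparseable outputs return None."""
    m = re.search(r"\b(Yes|No|[A-D])\b", model_output.strip(), re.IGNORECASE)
    if not m:
        return None
    token = m.group(1)
    return token.capitalize() if token.lower() in ("yes", "no") else token.upper()

def score_item(model_output, ground_truth):
    """Binary indicator c_i: exact match against the deterministic label."""
    return int(extract_choice(model_output) == ground_truth)

print(score_item("The answer is B.", "B"))  # correct -> 1
print(score_item("unsure", "C"))            # unparseable -> 0
```

Treating unparseable outputs as incorrect keeps the metric deterministic: every model response maps to exactly one binary indicator with no human judgment in the loop.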
- In OmniEAR (Wang et al., 7 Aug 2025): Tasks are simulated through a directed-graph environment (nodes for agents/objects/rooms, rich continuous properties, and dynamic edges), and successful plans must result in world-states that fulfill all logical predicates. Auxiliary operational metrics—average step count, relative trajectory length, token usage—quantify planning efficiency and reasoning process overhead.
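The goal check over the world graph can be illustrated with a minimal sketch. Encoding predicates as (subject, relation, object) triples over directed edges is an assumption made here for clarity, not OmniEAR's exact internal representation:

```python
class WorldGraph:
    """Directed-graph world model: nodes are agents/objects/rooms,
    edges are relations between them."""
    def __init__(self):
        self.edges = set()    # (subject, relation, object) triples
        self.properties = {}  # node -> {property: value}, e.g. mass, temperature

    def add_edge(self, subj, rel, obj):
        self.edges.add((subj, rel, obj))

    def satisfies(self, goal_predicates):
        """Success binary s_k: 1 only if every target predicate
        holds in the final world-state."""
        return all(p in self.edges for p in goal_predicates)

# A plan succeeds only if executing it leaves the world in a state
# that entails all goal predicates.
world = WorldGraph()
world.add_edge("cup", "on", "table")
world.add_edge("agent_1", "in", "kitchen")
goals = [("cup", "on", "table"), ("agent_1", "in", "kitchen")]
success = int(world.satisfies(goals))  # s_k = 1
```

This makes the "state transition, not text match" distinction concrete: the scorer never inspects the model's plan text, only the world-state the plan produces.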
This decoupling of reasoning from execution distinguishes ERIQ from end-to-end robotic evaluation, allowing systematic, large-scale benchmarking with clear interpretability.
4. Empirical Findings and Model Analyses
Empirical results across both benchmarks highlight the diagnostic power and granularity of the ERIQ framework:
- Correlational Validity: ERIQ score exhibits a "strong positive correlation" with downstream Vision-Language-Action task performance; models with higher ERIQ reliably achieve better real-world success rates, particularly in generalization to novel pick-and-place settings (Liu et al., 30 Dec 2025).
- Performance Breakdown (OmniEAR Table 1):
- GPT-4o: 96.6% direct, 80.0% tool, 32.0% compound collaboration
- Gemini-2.5: 90.5% direct, 82.3% tool, 40.5% compound collaboration
- Fine-tuning significantly improves single-agent performance (e.g., Qwen2.5-3B: 0.6%→76.3% on direct), but delivers minimal multi-agent gains (implicit collaboration: 1.5%→5.5%)
- Failure Modes: Common deficits include tool selection (31.2% exploration failures in tool use), compound-plan retention (28.7% planning degradation), and coordination timing (35.8% failures in implicit collaboration).
A strong implication is that current architectures remain brittle in settings requiring emergent collaboration, abstract constraint filtering, and flexible tool use, with substantial performance gaps relative to human-level embodied reasoning (Wang et al., 7 Aug 2025).
5. Distinctions from Traditional LLM Benchmarks
ERIQ diverges fundamentally from text-only LLM benchmarks (e.g., GLUE, SuperGLUE, BIG-bench) in several ways:
- Evaluation is grounded in authentic physical settings, either through real-sensor data (robot egocentric views) or rich world-graph simulation, rather than static text.
- Reasoning demands include spatial perception, causal logic, intent inference, dynamic tool acquisition, and emergent collaboration, not mere linguistic inference or single-step logical entailment.
- Final scoring is predicated on environment state transitions and goal predicate satisfaction, not only prediction accuracy or F1.
- Auxiliary metrics (step count, token usage) directly quantify planning efficiency and process complexity.
- ERIQ’s aggregate and per-dimension scoring reveal model sensitivity to emergent reasoning demands (tool inference, constraint filtering, dynamic teaming) not captured by prior benchmarks (Wang et al., 7 Aug 2025).
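The auxiliary efficiency metrics reduce to simple ratios. In this sketch, the optimal-step baseline is treated as a given input; how the benchmark derives it is not specified here:

```python
def relative_trajectory_length(agent_steps, optimal_steps):
    """Relative trajectory length: 1.0 means the executed plan
    was step-optimal; larger values indicate wasted actions."""
    return agent_steps / optimal_steps

def tokens_per_step(total_tokens, agent_steps):
    """Reasoning-process overhead: tokens consumed per executed step."""
    return total_tokens / agent_steps

print(relative_trajectory_length(12, 10))  # 20% longer than optimal
print(tokens_per_step(3000, 12))           # average token cost per action
```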
6. Broader Significance and Future Directions
ERIQ provides an extensible and reproducible foundation for embodied vision-language reasoning benchmarks. Its core contributions are:
- Formal quantification of embodied reasoning via average deterministic accuracy.
- Comprehensive coverage of spatial, causal, error-recovery, and collaborative reasoning dimensions.
- Decoupling of cognitive evaluation from low-level execution bottlenecks.
- Empirical demonstration that reasoning capability, as isolated by ERIQ, is a primary driver of open-world robotic generalization.
These properties make ERIQ a critical tool for tracking progress in generalist robotic AI and identifying the limitations of vision-language-action models. A plausible implication is that further progress in embodied AI will require addressing the reasoning bottlenecks and failure modes that ERIQ exposes, particularly with respect to emergent collaboration, abstract constraint satisfaction, and tool-centric adaptation (Liu et al., 30 Dec 2025, Wang et al., 7 Aug 2025).