Embodied Reasoning Intelligence (ERIQ)
- ERIQ is a benchmark that quantifies embodied reasoning by evaluating visual, spatial, and collaborative planning capabilities in artificial agents.
- It differentiates high-level cognitive planning from low-level motor control using deterministic scoring over diverse simulated and real-world tasks.
- Empirical findings reveal a strong correlation between ERIQ scores and open-world performance while highlighting bottlenecks in tool selection, planning, and collaboration.
The Embodied Reasoning Intelligence Quotient (ERIQ) is a quantitative metric and diagnostic benchmark developed to assess the embodied reasoning capabilities of artificial agents, specifically in the context of vision-language and embodied tasks. ERIQ formalizes the evaluation of models' physical reasoning, planning, adaptation, and collaborative capacities within simulated and real-world robotic manipulation environments, decoupling high-level cognitive reasoning from low-level motor control and execution. It serves as a principled foundation for evaluating embodied intelligence beyond traditional linguistic benchmarks, enabling systematic identification of reasoning bottlenecks in vision-language-action (VLA) models and embodied language agents (Liu et al., 30 Dec 2025, Wang et al., 7 Aug 2025).
1. Formal Definition and Scoring Functions
ERIQ provides a deterministic, reproducible framework for measuring embodied reasoning accuracy. In both major instantiations—robotic manipulation-focused ERIQ (Liu et al., 30 Dec 2025) and agent-centric OmniEAR (Wang et al., 7 Aug 2025)—the core metric is an average over binary success indicators across a benchmark suite:
- Let $\mathcal{Q}$ denote the set of question–answer (QA) pairs for ERIQ (Liu et al., 30 Dec 2025), and $\mathcal{S}$ the set of evaluation scenarios for OmniEAR (Wang et al., 7 Aug 2025).
- For ERIQ, the indicator $c_i = 1$ if the model's answer to QA pair $i$ is correct, $0$ otherwise:

$$\mathrm{ERIQ} = \frac{1}{|\mathcal{Q}|} \sum_{i \in \mathcal{Q}} c_i$$

- In OmniEAR, the success binary $s_k$ for the agent(s) in scenario $k$ is $1$ if the final world-state satisfies all target predicates, and $0$ otherwise:

$$\mathrm{SR} = \frac{1}{|\mathcal{S}|} \sum_{k \in \mathcal{S}} s_k$$
Scoring is further decomposed hierarchically:
- Per-dimension ERIQ, restricting the average to the set $\mathcal{Q}_d$ of questions in dimension $d$: $\mathrm{ERIQ}_d = \frac{1}{|\mathcal{Q}_d|} \sum_{i \in \mathcal{Q}_d} c_i$.
- Per-subtask ERIQ, similarly constrained.
- In OmniEAR, per-category success rates $\mathrm{SR}_c$ aggregate into single-agent ($\mathrm{ERIQ}_{\mathrm{single}}$) and multi-agent ($\mathrm{ERIQ}_{\mathrm{multi}}$) sub-quotients, which are combined by a weighted sum according to the benchmark split:

$$\mathrm{ERIQ} = w_{\mathrm{single}}\, \mathrm{ERIQ}_{\mathrm{single}} + w_{\mathrm{multi}}\, \mathrm{ERIQ}_{\mathrm{multi}}$$

with weights $w_{\mathrm{single}}$ and $w_{\mathrm{multi}}$ reflecting scenario proportions (Wang et al., 7 Aug 2025).
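These scoring functions can be sketched directly in Python. The following is a minimal illustration, not either paper's released evaluation code; the record fields (`dimension`, `correct`) and the use of scenario counts as weights are assumptions for the sketch:

```python
from collections import defaultdict

def eriq_score(results):
    """Aggregate ERIQ: mean of binary correctness indicators c_i over all QA pairs."""
    return sum(r["correct"] for r in results) / len(results)

def per_dimension_scores(results):
    """Per-dimension ERIQ_d: restrict the average to each reasoning dimension d."""
    by_dim = defaultdict(list)
    for r in results:
        by_dim[r["dimension"]].append(r["correct"])
    return {d: sum(v) / len(v) for d, v in by_dim.items()}

def omniear_eriq(single_sr, multi_sr, n_single, n_multi):
    """Weighted sum of sub-quotients; weights are scenario proportions."""
    total = n_single + n_multi
    return (n_single / total) * single_sr + (n_multi / total) * multi_sr

results = [
    {"dimension": "spatial", "correct": 1},
    {"dimension": "spatial", "correct": 0},
    {"dimension": "planning", "correct": 1},
]
print(eriq_score(results))            # aggregate accuracy over all three items
print(per_dimension_scores(results))  # accuracy per reasoning dimension
```

Because every indicator is binary and the aggregation is a plain mean, the same pass over results recovers the global, per-dimension, and per-subtask scores by changing only the grouping key.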
2. Reasoning Dimensions and Task Structure
ERIQ explicitly covers multiple axes of embodied reasoning through a diverse set of tasks. In the robotic ERIQ benchmark (Liu et al., 30 Dec 2025), four critical reasoning dimensions are assessed, each instantiated by carefully designed sub-tasks:
| Dimension | Example Subtasks | Characteristic Challenge |
|---|---|---|
| Spatial Perception & Grounding | Scene understanding, object referencing, view matching | Identifying referents and relations in clutter |
| Task Planning & Execution Monitoring | Multi-step decomposition, trajectory analysis, task progress | Reasoning over plans, monitoring success |
| Error Detection & Recovery | Mistake existence/classification, recovery action | Diagnosing and prescribing correction |
| Human Intent Understanding | Intention prediction, collaboration response | Inferring goals from context and interaction |
This multidimensional organization enables hierarchical aggregation of performance, such that researchers can isolate reasoning deficits at the global, dimension, or subtask level.
OmniEAR’s implementation divides 1,500 scenarios across seven task categories spanning single-agent (Direct Command, Tool Use, Attribute Reasoning, Compound Reasoning) and multi-agent (Explicit Collaboration, Implicit Collaboration, Compound Collaboration) axes, at increasing levels of cognitive complexity (L1-L3). Each scenario mandates grounding in continuous physical properties, tool affordances, and multi-agent processes (Wang et al., 7 Aug 2025).
3. Benchmark Design and Evaluation Protocols
ERIQ’s methodology is predicated on isolating the reasoning ability of vision-language models from action execution:
- In ERIQ (Liu et al., 30 Dec 2025): All questions are presented in a fixed multiple-choice or binary (Yes/No) format based on real robot egocentric imagery or text, not requiring trajectory rollout or motor control. Ground truth is deterministically specified, and model outputs are compared in a scoring script. Input modalities encompass single images (53%), sequential clips (26%), and image–text dialogue (21%), distributed across five domains (household, restaurant, supermarket, industrial, office).
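A deterministic scoring pass for this fixed-format protocol might look like the sketch below. The answer-extraction regex and option labels are assumptions for illustration, not the benchmark's released scoring script:

```python
import re

def extract_choice(model_output):
    """Pull the first isolated option letter (A-D) or Yes/No token
    from free-form model text; unparseable outputs return None."""
    m = re.search(r"\b(Yes|No|[A-D])\b", model_output.strip(), re.IGNORECASE)
    if not m:
        return None
    token = m.group(1)
    return token.capitalize() if token.lower() in ("yes", "no") else token.upper()

def score_item(model_output, ground_truth):
    """Binary indicator c_i: exact match against the deterministic label."""
    return int(extract_choice(model_output) == ground_truth)

print(score_item("The answer is B.", "B"))  # correct -> 1
print(score_item("unsure", "C"))            # unparseable -> 0
```

Treating unparseable outputs as incorrect keeps the metric deterministic: every model response maps to exactly one binary indicator with no human judgment in the loop.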
- In OmniEAR (Wang et al., 7 Aug 2025): Tasks are simulated through a directed-graph environment (nodes for agents/objects/rooms, rich continuous properties, and dynamic edges), and successful plans must result in world-states that fulfill all logical predicates. Auxiliary operational metrics—average step count, relative trajectory length, token usage—quantify planning efficiency and reasoning process overhead.
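The goal check over the world graph can be illustrated with a minimal sketch. Encoding predicates as (subject, relation, object) triples over directed edges is an assumption made here for clarity, not OmniEAR's exact internal representation:

```python
class WorldGraph:
    """Directed-graph world model: nodes are agents/objects/rooms,
    edges are relations between them."""
    def __init__(self):
        self.edges = set()    # (subject, relation, object) triples
        self.properties = {}  # node -> {property: value}, e.g. mass, temperature

    def add_edge(self, subj, rel, obj):
        self.edges.add((subj, rel, obj))

    def satisfies(self, goal_predicates):
        """Success binary s_k: 1 only if every target predicate
        holds in the final world-state."""
        return all(p in self.edges for p in goal_predicates)

# A plan succeeds only if executing it leaves the world in a state
# that entails all goal predicates.
world = WorldGraph()
world.add_edge("cup", "on", "table")
world.add_edge("agent_1", "in", "kitchen")
goals = [("cup", "on", "table"), ("agent_1", "in", "kitchen")]
success = int(world.satisfies(goals))  # s_k = 1
```

This makes the "state transition, not text match" distinction concrete: the scorer never inspects the model's plan text, only the world-state the plan produces.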
This decoupling of reasoning from execution distinguishes ERIQ from end-to-end robotic evaluation, allowing systematic, large-scale benchmarking with clear interpretability.
4. Empirical Findings and Model Analyses
Empirical results across both benchmarks highlight the diagnostic power and granularity of the ERIQ framework:
- Correlational Validity: ERIQ score exhibits a "strong positive correlation" with downstream Vision-Language-Action task performance; models with higher ERIQ reliably achieve better real-world success rates, particularly in generalization to novel pick-and-place settings (Liu et al., 30 Dec 2025).
- Performance Breakdown (OmniEAR Table 1):
- GPT-4o: 96.6% direct, 80.0% tool, 32.0% compound collaboration
- Gemini-2.5: 90.5% direct, 82.3% tool, 40.5% compound collaboration
- Fine-tuning significantly improves single-agent performance (e.g., Qwen2.5-3B: 0.6%→76.3% on direct), but delivers minimal multi-agent gains (implicit collaboration: 1.5%→5.5%)
- Failure Modes: Common deficits include tool selection (31.2% exploration failures in tool use), compound-plan retention (28.7% planning degradation), and coordination timing (35.8% failures in implicit collaboration).
A strong implication is that current architectures remain brittle in settings requiring emergent collaboration, abstract constraint filtering, and flexible tool use, with substantial performance gaps relative to human-level embodied reasoning (Wang et al., 7 Aug 2025).
5. Distinctions from Traditional LLM Benchmarks
ERIQ diverges fundamentally from text-only LLM benchmarks (e.g., GLUE, SuperGLUE, BIG-bench) in several ways:
- Evaluation is grounded in authentic physical settings, either through real-sensor data (robot egocentric views) or rich world-graph simulation, rather than static text.
- Reasoning demands include spatial perception, causal logic, intent inference, dynamic tool acquisition, and emergent collaboration, not mere linguistic inference or single-step logical entailment.
- Final scoring is predicated on environment state transitions and goal predicate satisfaction, not only prediction accuracy or F1.
- Auxiliary metrics (step count, token usage) directly quantify planning efficiency and process complexity.
- ERIQ’s aggregate and per-dimension scoring reveal model sensitivity to emergent reasoning demands (tool inference, constraint filtering, dynamic teaming) not captured by prior benchmarks (Wang et al., 7 Aug 2025).
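The auxiliary efficiency metrics reduce to simple ratios. In this sketch, the optimal-step baseline is treated as a given input; how the benchmark derives it is not specified here:

```python
def relative_trajectory_length(agent_steps, optimal_steps):
    """Relative trajectory length: 1.0 means the executed plan
    was step-optimal; larger values indicate wasted actions."""
    return agent_steps / optimal_steps

def tokens_per_step(total_tokens, agent_steps):
    """Reasoning-process overhead: tokens consumed per executed step."""
    return total_tokens / agent_steps

print(relative_trajectory_length(12, 10))  # 20% longer than optimal
print(tokens_per_step(3000, 12))           # average token cost per action
```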
6. Broader Significance and Future Directions
ERIQ provides an extensible and reproducible foundation for embodied vision-language reasoning benchmarks. Its core contributions are:
- Formal quantification of embodied reasoning via average deterministic accuracy.
- Comprehensive coverage of spatial, causal, error-recovery, and collaborative reasoning dimensions.
- Decoupling of cognitive evaluation from low-level execution bottlenecks.
- Empirical demonstration that reasoning capability, as isolated by ERIQ, is a primary driver of open-world robotic generalization.
These properties make ERIQ a critical tool for tracking progress in generalist robotic AI and identifying the limitations of vision-language-action models. A plausible implication is that further progress in embodied AI will require addressing the reasoning bottlenecks and failure modes that ERIQ exposes, particularly with respect to emergent collaboration, abstract constraint satisfaction, and tool-centric adaptation (Liu et al., 30 Dec 2025, Wang et al., 7 Aug 2025).