OpenEQA: Embodied Video QA Benchmark
- OpenEQA is a benchmark for embodied episodic-memory video question answering that evaluates agents' spatial reasoning, object affordance reasoning, and multimodal grounding from egocentric video streams.
- It compiles data from diverse sources like ScanNet and HM3D, covering seven semantic categories and integrating human-annotated questions for robust evaluation.
- Recent methods like Chain-of-View prompting and GraphPad structured scene memory showcase significant performance gains by enabling active viewpoint selection and inference-time memory updates without extra training.
OpenEQA is a benchmark and protocol designed for embodied, episodic-memory question answering in indoor environments, targeting the evaluation of agents’ spatial reasoning, multimodal grounding, and task-aligned perception from egocentric video streams. The framework has driven significant progress in both vision-language modeling and structured scene memory for Embodied AI. Below, key aspects are summarized, referencing technical results and methodologies from recent research (Suglia et al., 2024, Zhao et al., 8 Jan 2026, Ali et al., 1 Jun 2025).
1. Formal Definition and Scope
OpenEQA is an embodied video question answering benchmark focusing on episodic-memory tasks. Models are required to observe short egocentric clips (typically 2–60 s) captured indoors and to answer free-form natural-language questions about the depicted scenes. The benchmark stresses an agent’s ability to perform fine-grained spatial grounding, object affordance reasoning, and the integration of commonsense and world-knowledge, using only the visual context available from a user-centric video stream (Suglia et al., 2024).
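The setup above can be sketched as a simple data record. This is an illustrative schema only, with hypothetical field names; the benchmark's actual file layout is not specified here.

```python
from dataclasses import dataclass
from typing import List

# The seven semantic question categories listed in the benchmark.
CATEGORIES = [
    "Object Recognition", "Attribute Recognition", "Object State Recognition",
    "Object Localization", "Spatial Reasoning", "Functional Reasoning",
    "World Knowledge",
]

@dataclass
class EpisodeMemoryQuestion:
    """One OpenEQA-style example: an egocentric clip plus a free-form QA pair.
    Field names are illustrative, not the benchmark's actual schema."""
    episode_id: str
    frame_paths: List[str]   # frames sampled from a short (2-60 s) indoor clip
    question: str            # free-form natural-language question
    answer: str              # human-annotated reference answer
    category: str            # one of the seven semantic categories

ex = EpisodeMemoryQuestion(
    episode_id="scannet-0001",
    frame_paths=["frames/0001/000.jpg", "frames/0001/030.jpg"],
    question="What color is the mug on the kitchen counter?",
    answer="white",
    category="Attribute Recognition",
)
assert ex.category in CATEGORIES
```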
Dataset composition includes two primary sources:
- ScanNet: Photorealistic real-world scene scans.
- HM3D: Synthetic scenes with realistic artifacts.
Questions and answers are human-annotated, distributed evenly across seven semantic categories: Object Recognition, Attribute Recognition, Object State Recognition, Object Localization, Spatial Reasoning, Functional Reasoning, and World Knowledge.
2. Evaluation Metrics and Protocol
Unlike traditional VQA, which scores accuracy or F1 on string matching, OpenEQA relies on LLM-based appropriateness scoring. For each question, a large LLM judge (e.g., Llama 3 70B) is prompted to rate the predicted answer's relevance and correctness on a 1–5 scale. The primary metric is the Mean Normalized Appropriateness Score (MNAS), which maps each rating σᵢ ∈ {1, …, 5} to [0, 1] and averages over the N questions:

MNAS = (100 / N) · Σᵢ (σᵢ − 1) / 4

Traditional metrics (Accuracy, F1) are rarely used but may be reported for baseline comparison.
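The scoring protocol can be sketched in a few lines, assuming each judge rating σ lies on a 1–5 scale and is normalized as (σ − 1) / 4 before averaging; the judge prompt itself is omitted.

```python
def mnas(scores):
    """Mean Normalized Appropriateness Score.
    Each LLM-judge rating sigma is on a 1-5 scale; it is normalized to
    [0, 1] via (sigma - 1) / 4, averaged, and reported as a percentage."""
    if not scores:
        raise ValueError("need at least one rating")
    return 100.0 * sum((s - 1) / 4 for s in scores) / len(scores)

# A rating of 5 on every question yields 100; all-1 ratings yield 0.
print(mnas([5, 5, 5]))  # 100.0
print(mnas([1, 3, 5]))  # 50.0
```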
3. Recent Architectures and Methodologies
3.1 Chain-of-View Prompting (CoV)
CoV (Zhao et al., 8 Jan 2026) is a training-free, test-time reasoning framework that wraps around frozen vision-LLMs (VLMs), enabling agents to perform active viewpoint selection and iterative camera movements in 3D scenes. CoV proceeds in two sequential stages:
- Coarse-Grained View Selection: Filters the sampled frames to select the anchor views most relevant to the question, using a relevance scoring function.
- Fine-Grained View Adjustment: Interleaves reasoning with discrete camera actions (translations, rotations, view-switches), updating the context until enough information is gathered or a step budget is reached.
Performance gains are realized via focused context, active information acquisition, and test-time scaling: an enforced minimum action budget encourages the VLM to integrate more evidence across views (+11.56% average improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash). CoV is model-agnostic and requires no extra training.
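The two-stage procedure above can be sketched as a control loop around a frozen VLM. The scoring, reasoning, and rendering calls below are placeholder stubs (the real system wraps an actual vision-LLM and a 3D scene renderer); parameter names like `k_anchor` and `min_steps` are illustrative, not the paper's.

```python
import random

# Hypothetical stand-ins for a frozen VLM's calls; CoV wraps a real vision-LLM here.
def relevance_score(frame, question):
    return random.random()  # placeholder for the VLM's frame-question relevance score

def vlm_step(context, question):
    # Placeholder: returns (answer_or_None, proposed_camera_action).
    return (None, "rotate_left") if len(context) < 4 else ("an answer", None)

def render_after(action, frame):
    return f"{frame}+{action}"  # placeholder for rendering the new viewpoint

def chain_of_view(frames, question, k_anchor=3, budget=8, min_steps=1):
    # Stage 1: coarse-grained view selection - keep the top-k anchor views.
    anchors = sorted(frames, key=lambda f: relevance_score(f, question),
                     reverse=True)[:k_anchor]
    context = list(anchors)
    # Stage 2: fine-grained adjustment - interleave reasoning with discrete
    # camera actions until an answer is produced or the step budget runs out.
    # min_steps enforces the minimum action budget used for test-time scaling.
    for step in range(budget):
        answer, action = vlm_step(context, question)
        if answer is not None and step >= min_steps:
            return answer
        if action is not None:
            context.append(render_after(action, context[-1]))
    answer, _ = vlm_step(context, question)
    return answer
```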
3.2 Structured Scene Memory and GraphPad
GraphPad (Ali et al., 1 Jun 2025) equips the agent with a modifiable structured memory consisting of a scene graph, a navigation log, a frame memory, and node-level scratch-pads for semantic notes. This memory is updated via inference-time, language-callable APIs:
- find_objects: Detects new object instances in the current frame that are relevant to the question.
- analyze_objects: Annotates specified scene-graph nodes with respect to the question.
- analyze_frame: Jointly discovers and annotates objects in a frame for the question.
The agent’s VLM orchestrates reasoning and API calls in an update loop until it decides it can answer. Online refinement allows targeted extraction of perceptually relevant evidence, yielding high accuracy (55.3 MNAS with Gemini 2.0 Flash, a +3.0 pp gain over image-only baselines, using five times fewer frames).
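The update loop can be sketched as follows. The three API names follow the paper, but their signatures and internals here are illustrative stubs, and the planner is a trivial stand-in for the VLM's decision step.

```python
def find_objects(scene_graph, frame, question):
    # Stub: register a newly detected, question-relevant object node.
    scene_graph.setdefault("nodes", {})[f"obj@{frame}"] = {"notes": []}

def analyze_objects(scene_graph, node_ids, question):
    # Stub: attach question-conditioned notes to the given nodes.
    for nid in node_ids:
        scene_graph["nodes"][nid]["notes"].append(f"re: {question}")

def analyze_frame(scene_graph, frame, question):
    # Jointly discover new objects and annotate all known nodes.
    find_objects(scene_graph, frame, question)
    analyze_objects(scene_graph, list(scene_graph["nodes"]), question)

def planner(memory, question):
    # Placeholder for the VLM deciding the next API call or the final answer.
    n = len(memory.get("nodes", {}))
    if n < 2:
        return ("analyze_frame", f"frame_{n}")
    return ("answer", "a grounded answer")

def graphpad_answer(question, max_calls=10):
    # Structured memory: scene graph nodes, navigation log, frame memory.
    memory = {"nodes": {}, "navigation_log": [], "frame_memory": []}
    for _ in range(max_calls):
        op, arg = planner(memory, question)
        if op == "answer":
            return arg
        if op == "analyze_frame":
            analyze_frame(memory, arg, question)
    return None  # call budget exhausted without an answer
```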
3.3 Parameter-Efficient VLMs
AlanaVLM (Suglia et al., 2024), a 7B-parameter video-LLM instruction-tuned on the Egocentric Video Understanding Dataset (EVUD), demonstrates robust performance on OpenEQA, setting the open-source state of the art (MNAS = 46.7, +4.4 pp over Chat-UniVi, and outperforming Gemini 1.0 Pro and strong Socratic GPT-4-based models by +3.6 pp). The largest improvements occur in the Spatial Reasoning and Object Localization categories.
4. Comparative Benchmark Results
Summary of key results across models, as reported for 50-frame settings unless stated otherwise:
| Method | MNAS | Key Feature |
|---|---|---|
| GPT-4 (text-only) | 33.5 | No vision input |
| GPT-4V | 55.3 | Vision-language only |
| Claude 3 (20 f) | 36.3 | Strong LLM; limited vision |
| Gemini 1.0 Pro Vision | 44.9 | Multimodal vision |
| Gemini 1.5 Flash | 72.5 | State-of-the-art, large context |
| AlanaVLM (proposed) | 46.7 | Instruction-tuned on EVUD |
| Chat-UniVi (50 f) | 42.3 | Baseline video VLM |
| GraphPad (Gemini 2.0 Flash) | 55.3 | Inference-time 3D graph updates |
| CoV (Qwen3-VL-Flash, 1 step) | 58.75 | Chain-of-View active reasoning |
A plausible implication is that advanced scene representation (GraphPad), active viewpoint reasoning (CoV), and egocentric instruction tuning (AlanaVLM) each drive measurable improvements on OpenEQA, with CoV and GraphPad matching static image-based VLMs using fewer, but more question-aligned, views.
5. Analysis, Limitations, and Category Performance
Fine-grained spatial and attribute reasoning are consistently improved by methods that incorporate either action-based exploration (CoV) or structured, editable memory (GraphPad). In-depth category analysis shows:
- GraphPad: +20.3 pp in Attribute Recognition, +5.7 pp in Functional Reasoning, +3.1 pp in Object State Recognition over image-only baseline.
- CoV: Ablations indicate a ~4.6% relative drop if coarse-grained selection is removed, confirming the importance of focused context.
Limitations include potential error propagation in the scene memory, finite API sets for memory modification, latency from repeated VLM passes, and scaling costs as the scene graph grows. Neither GraphPad nor CoV requires extra training or data collection, indicating high test-time flexibility.
6. Future Directions and Research Significance
Extensions to robotic manipulation, dynamic scene understanding, hierarchical memory structures, and automated semantic verification are identified as promising directions. OpenEQA’s challenging scenario and LLM-based scoring protocol catalyze advances in egocentric Embodied AI. The dataset’s balance across question categories and realistic, human-driven annotation protocols furnish robust settings for evaluating foundational embodied agents.
This suggests that OpenEQA remains a central benchmark for assessing embodied understanding in AI systems, and continues to bridge the gap between raw perceptual intake and high-level semantic reasoning in multimodal agents.