
OpenEQA: Embodied Video QA Benchmark

Updated 15 January 2026
  • OpenEQA is a benchmark for embodied episodic-memory video question answering that evaluates agents' spatial reasoning, object affordance reasoning, and multimodal grounding from egocentric video streams.
  • It compiles data from diverse sources like ScanNet and HM3D, covering seven semantic categories and integrating human-annotated questions for robust evaluation.
  • Recent methods like Chain-of-View prompting and GraphPad structured scene memory showcase significant performance gains by enabling active viewpoint selection and inference-time memory updates without extra training.

OpenEQA is a benchmark and protocol designed for embodied, episodic-memory question answering in indoor environments, targeting the evaluation of agents’ spatial reasoning, multimodal grounding, and task-aligned perception from egocentric video streams. The framework has driven significant progress in both vision-language modeling and structured scene memory for Embodied AI. Below, key aspects are summarized, referencing technical results and methodologies from recent research (Suglia et al., 2024, Zhao et al., 8 Jan 2026, Ali et al., 1 Jun 2025).

1. Formal Definition and Scope

OpenEQA is an embodied video question answering benchmark focusing on episodic-memory tasks. Models are required to observe short egocentric clips (typically 2–60 s) captured indoors and to answer free-form natural-language questions about the depicted scenes. The benchmark stresses an agent’s ability to perform fine-grained spatial grounding, object affordance reasoning, and the integration of commonsense and world-knowledge, using only the visual context available from a user-centric video stream (Suglia et al., 2024).

Dataset composition includes two primary sources:

  • ScanNet: Photorealistic real-world scene scans.
  • HM3D: Synthetic scenes with realistic artifacts.

Questions and answers are human-annotated, distributed evenly across seven semantic categories: Object Recognition, Attribute Recognition, Object State Recognition, Object Localization, Spatial Reasoning, Functional Reasoning, and World Knowledge.
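A single benchmark item can be pictured as a record pairing an egocentric episode with one annotated question. The sketch below is a hypothetical schema for illustration only; the field names and category strings are assumptions, not the official release format.

```python
from dataclasses import dataclass

# Hypothetical category labels mirroring the seven semantic categories above.
CATEGORIES = {
    "object_recognition", "attribute_recognition", "object_state_recognition",
    "object_localization", "spatial_reasoning", "functional_reasoning",
    "world_knowledge",
}

@dataclass
class EQAItem:
    episode_id: str   # identifier of the egocentric video clip
    source: str       # "scannet" or "hm3d"
    question: str     # free-form natural-language question
    answer: str       # human-annotated reference answer
    category: str     # one of the seven semantic categories

    def __post_init__(self):
        assert self.source in {"scannet", "hm3d"}, f"unknown source: {self.source}"
        assert self.category in CATEGORIES, f"unknown category: {self.category}"

item = EQAItem("ep_0001", "scannet", "What color is the sofa?", "gray",
               "attribute_recognition")
```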

2. Evaluation Metrics and Protocol

Unlike traditional VQA, which scores answers by string matching (accuracy or F1), OpenEQA relies on LLM-based appropriateness scoring. For each question, a large LLM (e.g., Llama 3 70B) is prompted to assign a rating $r_i \in \{1,2,3,4,5\}$ to the predicted answer’s relevance and correctness. The primary metric is the Mean Normalized Appropriateness Score (MNAS):

$$s_i = \frac{r_i - 1}{4} \in [0,1]$$

$$\mathrm{MNAS} = \frac{1}{N}\sum_{i=1}^N s_i = \frac{1}{N}\sum_{i=1}^N \frac{r_i - 1}{4}$$

Uncertainty is reported as a bootstrap standard error over $B$ resampled means $\overline{s}^{(b)}$:

$$\mathrm{SE} = \sqrt{\frac{1}{B-1}\sum_{b=1}^B \left(\overline{s}^{(b)} - \overline{s}\right)^2}$$
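The metric is straightforward to compute. Below is a minimal sketch of MNAS and its bootstrap standard error, following the formulas above; the function names are illustrative, not an official implementation.

```python
import random

def normalized_scores(ratings):
    # Map each 1-5 rating r_i to s_i = (r_i - 1) / 4 in [0, 1].
    return [(r - 1) / 4 for r in ratings]

def mnas(ratings):
    # Mean Normalized Appropriateness Score over N ratings.
    s = normalized_scores(ratings)
    return sum(s) / len(s)

def bootstrap_se(ratings, B=1000, seed=0):
    # Standard error of MNAS estimated from B bootstrap resamples.
    rng = random.Random(seed)
    s = normalized_scores(ratings)
    means = [sum(rng.choices(s, k=len(s))) / len(s) for _ in range(B)]
    grand = sum(means) / B
    return (sum((m - grand) ** 2 for m in means) / (B - 1)) ** 0.5

ratings = [5, 4, 3, 5, 2, 4]
print(round(mnas(ratings), 3))  # 0.708
```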

Traditional metrics (accuracy, F1) are reported only for baseline comparison.

3. Recent Architectures and Methodologies

3.1 Chain-of-View Prompting (CoV)

CoV (Zhao et al., 8 Jan 2026) is a training-free, test-time reasoning framework that wraps around frozen vision-language models (VLMs), enabling agents to perform active viewpoint selection and iterative camera movements in 3D scenes. CoV proceeds in two sequential stages:

  • Coarse-Grained View Selection: Filters $T$ sampled frames to select $K \ll T$ anchor views most relevant to question $Q$, using a scoring function $s(v_i, Q) = \mathrm{VLM\_score}([v_i, Q])$.
  • Fine-Grained View Adjustment: Interleaves reasoning with discrete camera actions (translations, rotations, view switches), updating the context until enough information is gathered or a step budget $L$ is reached.

Performance gains are realized via focused context, active information acquisition, and test-time scaling—an enforced minimum action budget encourages the VLM to integrate more evidence across views (+11.56% average improvement in LLM-Match, with maximum gain of +13.62% on Qwen3-VL-Flash). CoV is model-agnostic and does not require extra training.
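The two-stage loop can be sketched as follows. Note this is an illustrative skeleton under stated assumptions: `vlm_score` and `vlm_act` are stand-ins for prompts to a frozen VLM, not the authors' API.

```python
def vlm_score(view, question):
    # Stub for the relevance score s(v_i, Q); a real system prompts the VLM.
    return view.get("relevance", 0.0)

def coarse_view_selection(views, question, k):
    # Stage 1: keep the K << T anchor views most relevant to the question.
    ranked = sorted(views, key=lambda v: vlm_score(v, question), reverse=True)
    return ranked[:k]

def fine_view_adjustment(anchors, question, budget, vlm_act):
    # Stage 2: interleave reasoning with discrete camera actions until the
    # VLM declares it has enough evidence or the step budget L is exhausted.
    context = list(anchors)
    for _ in range(budget):
        new_view, done = vlm_act(context, question)
        if done:
            break
        context.append(new_view)  # e.g. the view after a translate/rotate
    return context

views = [{"id": i, "relevance": r} for i, r in enumerate([0.1, 0.9, 0.4, 0.7])]
anchors = coarse_view_selection(views, "Where is the mug?", k=2)
```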

3.2 Structured Scene Memory and GraphPad

GraphPad (Ali et al., 1 Jun 2025) equips the agent with a modifiable structured memory consisting of a scene graph $G = (\mathcal{N}, \mathcal{E})$, a navigation log, a frame memory, and node-level scratch-pads for semantic notes. This memory is updated via inference-time, language-callable APIs:

  • find_objects(f, q): Detects new instances in frame $f$ relevant to question $q$.
  • analyze_objects(f, {n}, q): Annotates the specified nodes $\{n\}$ in frame $f$ with respect to $q$.
  • analyze_frame(f, q): Jointly discovers and annotates objects for $q$.

The agent’s VLM orchestrates reasoning and API calls in an update loop until it decides it can answer. Online refinement allows targeted extraction of perceptually relevant evidence, yielding high accuracy (55.3% with Gemini 2.0 Flash, a +3.0 pp gain over image-only baselines, while using five times fewer frames).
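An illustrative sketch of such a structured memory with the three language-callable APIs is shown below. All internals are assumptions: the paper's memory additionally tracks edges, a navigation log, and frame caches driven by real detectors, whereas here the detection step is stubbed.

```python
class SceneMemory:
    """Toy GraphPad-style scene memory with language-callable update APIs."""

    def __init__(self):
        self.nodes = {}      # node_id -> {"label": ..., "notes": [...]}
        self.edges = []      # (node_a, node_b, relation) spatial relations
        self.nav_log = []    # frames visited so far

    def find_objects(self, frame, query):
        # Detect new instances in `frame` relevant to `query`. Stubbed:
        # a real system would run an open-vocabulary detector via the VLM.
        self.nav_log.append(frame)
        node_id = f"{frame}:{query}"
        self.nodes.setdefault(node_id, {"label": query, "notes": []})
        return [node_id]

    def analyze_objects(self, frame, node_ids, query):
        # Annotate the given nodes with question-directed semantic notes.
        for n in node_ids:
            self.nodes[n]["notes"].append(f"{query} @ {frame}")

    def analyze_frame(self, frame, query):
        # Jointly discover and annotate objects for the query.
        found = self.find_objects(frame, query)
        self.analyze_objects(frame, found, query)
        return found

mem = SceneMemory()
mem.analyze_frame("frame_12", "mug")
```

The agent's outer loop would alternate between issuing these calls and re-reading the memory until it judges the graph sufficient to answer.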

3.3 Parameter-Efficient VLMs

AlanaVLM (Suglia et al., 2024), a 7B parameter video-LLM instruction-tuned on the Egocentric Video Understanding Dataset (EVUD), demonstrates robust performance on OpenEQA, setting open-source state of the art (MNAS = 46.7, +4.4 pp over Chat-UniVi, and outperforming Gemini 1.0 Pro and strong Socratic GPT-4-based models by +3.6 pp). Largest improvements occur in Spatial Reasoning and Object Localization categories.

4. Comparative Benchmark Results

Summary of key results across models, as reported for 50-frame settings unless stated:

| Method | MNAS | Key Feature |
|---|---|---|
| GPT-4 (text-only) | 33.5 | No vision input |
| GPT-4V | 55.3 | Vision-language only |
| Claude 3 (20 f) | 36.3 | Strong LLM; limited vision |
| Gemini 1.0 Pro Vision | 44.9 | Multimodal vision |
| Gemini 1.5 Flash | 72.5 | State-of-the-art, large context |
| AlanaVLM (proposed) | 46.7 | Instruction-tuned on EVUD |
| Chat-UniVi (50 f) | 42.3 | Baseline video VLM |
| GraphPad (Gemini 2.0 Flash) | 55.3 | Inference-time 3D graph updates |
| CoV (Qwen3-VL-Flash, 1 step) | 58.75 | Chain-of-View active reasoning |

A plausible implication is that advanced scene representation (GraphPad), active viewpoint reasoning (CoV), and egocentric instruction tuning (AlanaVLM) each drive measurable improvements on OpenEQA, with CoV and GraphPad matching static image-based VLMs using fewer, but more question-aligned, views.

5. Analysis, Limitations, and Category Performance

Fine-grained spatial and attribute reasoning are consistently improved by methods that incorporate either action-based exploration (CoV) or structured, editable memory (GraphPad). In-depth category analysis shows:

  • GraphPad: +20.3 pp in Attribute Recognition, +5.7 pp in Functional Reasoning, +3.1 pp in Object State Recognition over image-only baseline.
  • CoV: Ablations indicate a ~4.6% relative drop if coarse-grained selection is removed, confirming the importance of focused context.

Limitations include potential error propagation in the scene memory, the finite set of modification APIs, latency from repeated VLM passes, and scaling costs as the scene graph grows. Neither GraphPad nor CoV requires extra training or data collection, indicating high test-time flexibility.

6. Future Directions and Research Significance

Extensions to robotic manipulation, dynamic scene understanding, hierarchical memory structures, and automated semantic verification are identified as promising directions. OpenEQA’s challenging scenario and LLM-based scoring protocol catalyze advances in egocentric Embodied AI. The dataset’s balance across question categories and realistic, human-driven annotation protocols furnish robust settings for evaluating foundational embodied agents.

This suggests that OpenEQA remains a central benchmark for assessing embodied understanding in AI systems, and continues to bridge the gap between raw perceptual intake and high-level semantic reasoning in multimodal agents.
