- The paper introduces MyEgo, a novel large-scale benchmark evaluating the ability of MLLMs to perform personalized, egocentric VideoQA.
- It details rigorous diagnostic protocols testing first-person reference resolution, temporal memory, and identity tracking in dynamic environments.
- Empirical results reveal a significant gap between human and model performance, highlighting challenges in long-term ego-grounding for real-world AI assistants.
Ego-Grounding for Personalized Question-Answering in Egocentric Videos: An Expert Analysis
Introduction and Problem Motivation
The increasing ubiquity of first-person video capture via wearable devices motivates the exploration of genuinely personalized AI assistance grounded in egocentric video understanding. "Ego-Grounding for Personalized Question-Answering in Egocentric Videos" (2604.01966) presents a methodical investigation into the capabilities and limitations of existing multimodal LLMs (MLLMs) when tasked with personalized question answering (QA) that requires robust ego-grounding: the discernment, tracking, and temporal association of "me," "my things," "my actions," and "my history" from the camera wearer's (CW) perspective in complex, multi-agent environments.
Conventional VideoQA benchmarks and prior work on personalized VLMs lack rigorous evaluation of models' ability to resolve first-person references over long spatio-temporal contexts with frequent distractors. The paper addresses this gap by introducing a novel dataset and a comprehensive diagnostic protocol targeting personalized, temporally extended reasoning.
Figure 1: Example of egocentric personalized QA about the camera wearer's demands; models must resolve highly contextual queries about the camera wearer's objects and actions.
The MyEgo Benchmark
MyEgo is proposed as the first large-scale, diagnostic dataset for egocentric personalized VideoQA with explicit emphasis on ego-grounding. It contains 541 long-form videos (mean 9.2 minutes) collected from Ego4D, EgoLife, and CASTEL2024, curated to ensure diverse scenarios, multi-person interactions, and numerous visually confounding distractors.
The core QA tasks are constructed by posing over 5,000 manually annotated, first-person-perspective questions at various timestamps, each requiring the model to resolve the referent of “I,” “my object,” or “my past,” often in the presence of similar actors/objects and significant history/context dependencies. Approximately 60% of the questions focus on object-level references, 30% on actions, and 10% on other aspects, with an average temporal offset of ~20 seconds between question and answer moments.
Figure 2: Model must determine which objects or hands belong to the camera wearer and track their identity among visually similar distractors across time.
Figure 3: Statistical analysis of the MyEgo benchmark, showing the distribution of question types, answer locality, and the prevalence of ego-grounding demands.
MyEgo supports both open-ended QA and challenging multiple-choice QA (including MC-2 and MC-5 variants), with well-crafted distractors that preclude answer shortcuts and force models to faithfully perform ego-grounding.
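To make the task format concrete, below is a minimal sketch of how a MyEgo-style QA item and a multiple-choice scoring loop might be represented; the field and function names are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EgoQAItem:
    """Hypothetical container for one MyEgo-style question (fields are illustrative)."""
    video_id: str                 # source clip, e.g. drawn from Ego4D / EgoLife / CASTEL2024
    question: str                 # first-person query, e.g. "Where did I put my cup?"
    question_time_s: float        # timestamp at which the question is posed
    answer_time_s: float          # timestamp of the grounding evidence (~20 s offset on average)
    answer: str                   # ground-truth open-ended answer
    choices: List[str] = field(default_factory=list)  # MC-2 or MC-5 options incl. distractors
    correct_index: Optional[int] = None               # index of the correct choice, if MC

def multiple_choice_accuracy(items: List[EgoQAItem], predictions: List[int]) -> float:
    """Fraction of items where the predicted choice index matches the ground truth."""
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if item.correct_index is not None and pred == item.correct_index
    )
    return correct / max(len(items), 1)
```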
Comparison to Prior Datasets
Unlike datasets such as EgoSchema, EgoThink, or EgoMemoria, which primarily emphasize generic video understanding in the first-person, MyEgo uniquely benchmarks models on core personalized reasoning abilities—specifically, discriminating the CW and their belongings/activities in the presence of confounding evidence from other people and objects in the environment.
Systematic Evaluation of MLLMs
Experimental Protocol
A comprehensive evaluation is conducted across a diverse spectrum of open-source and proprietary MLLMs, including state-of-the-art systems such as GPT-5, Qwen3-VL, InternVL3.5, Gemini-2.5 Pro, and memory-augmented streaming models (e.g., LongVA, LongVU, Flash-VStream, Dispider). The evaluation procedure involves both automatic and human grading, with careful attention to first-person prompting, input sampling (uniform and key-moment extraction), and assessments under varying context window sizes and parameter counts.
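As a rough illustration of the two input-sampling regimes mentioned above, the sketch below contrasts uniform frame sampling with key-moment sampling centered on the annotated question and answer timestamps; the function names, frame budget, and window size are assumptions rather than the paper's implementation.

```python
from typing import List

def uniform_sample(num_video_frames: int, budget: int) -> List[int]:
    """Pick `budget` frame indices spread evenly across the whole video."""
    if budget >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / budget
    return [int(i * step) for i in range(budget)]

def key_moment_sample(num_video_frames: int, fps: float,
                      question_time_s: float, answer_time_s: float,
                      budget: int, window_s: float = 2.0) -> List[int]:
    """Split the frame budget between short windows around the question and answer moments."""
    def window(center_s: float, n: int) -> List[int]:
        start = max(0, int((center_s - window_s) * fps))
        end = min(num_video_frames - 1, int((center_s + window_s) * fps))
        if n >= end - start + 1:
            return list(range(start, end + 1))
        step = (end - start) / max(n - 1, 1)
        return [int(start + i * step) for i in range(n)]

    half = budget // 2
    return sorted(set(window(answer_time_s, half) + window(question_time_s, budget - half)))
```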
Key Findings
The central empirical finding is a substantial gap between human and model performance across all evaluated systems, highlighting persistent challenges in long-term ego-grounding.
Failure Modes and Analysis
In-depth error analysis uncovers systematic deficiencies:
- Accuracy decay: With uniform frame sampling, accuracy declines markedly as the temporal span between question and answer increases (see the binning sketch after this list).
Figure 6: When key ground-truth (GT) moments are not provided, accuracy decays as temporal distance increases. Inputting GT evidence helps stabilize performance.
- Moment-aware sampling: Selectively inputting frames at the annotated question and answer moments (Q&A sampling) robustly increases accuracy (gains up to 11% on "Previous" questions), but does not close the human-model gap.
- Frame number: Increasing the number of input frames does not systematically help; more frames may introduce noise, diminishing the signal from sparse egocentric cues.
Figure 7: Increasing the number of frames yields diminishing returns—performance plateaus or decreases after a small context window.
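The accuracy-decay analysis can be approximated by binning per-question correctness according to the temporal offset between the question and its grounding evidence; the sketch below assumes simple (offset, correct) records and arbitrarily chosen bin edges.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def accuracy_by_offset(records: Iterable[Tuple[float, bool]],
                       bin_edges_s: Tuple[float, ...] = (10, 30, 60, 120)) -> Dict[str, float]:
    """records: (offset_seconds, is_correct) pairs; returns accuracy per offset bucket."""
    buckets: Dict[str, List[bool]] = defaultdict(list)
    for offset, correct in records:
        label = next((f"<= {edge}s" for edge in bin_edges_s if offset <= edge),
                     f"> {bin_edges_s[-1]}s")
        buckets[label].append(correct)
    return {label: sum(vals) / len(vals) for label, vals in buckets.items()}

# Toy usage: under uniform sampling, accuracy is expected to be lower in larger-offset buckets.
print(accuracy_by_offset([(5, True), (25, True), (90, False), (150, False)]))
```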
Prompt Sensitivity
Agreement with Human Evaluation
Theoretical and Practical Implications
The empirical findings demonstrate that, despite advances in video-level vision-language modeling and scaling, current MLLMs lack (i) robust long-term memory and (ii) coherent, persistent personal identity tracking over time. Improvements from recent architectural modifications (e.g., memory augmentation, temporal attention) are secondary compared to the need for fundamentally richer egocentric representations.
The inability of MLLMs to resolve "I" and "my things" reliably, particularly in long-term, multi-agent, or visually ambiguous contexts, undermines their applicability to genuinely personalized assistants for lifelogging, daily activity QA, or in-situ embodied reasoning. Achieving human-like performance will necessitate breakthroughs in egocentric association, persistent object/action tracking, and long-range, dynamic context modeling.
Directions for Future Research
The analysis signals immediate priorities for future work:
- Dedicated ego-grounding modules: Identity and object anchoring architectures resilient to distractors and minimal ego-appearance cues.
- Efficient long-term memory: Mechanisms (retrieval-augmented, cache-based, externalized memory) for scalable storage and lookup of temporally distant, personalized events (see the retrieval sketch after this list).
- Integrated contextual world models: Hybrid symbolic-neural methods for explicit modeling of user trajectory, intent, and object interactions.
- Enhanced data and negative sampling: Further curation of adversarial context, ambiguous referents, and fine-grained activity-entity annotation to push model generalization beyond direct evidence retrieval.
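As one deliberately simplified reading of the retrieval-augmented direction, the sketch below maintains an external store of embedded egocentric events and retrieves the most similar ones for a new query via cosine similarity; the class and its interface are hypothetical and not proposed by the paper.

```python
import numpy as np
from typing import List, Tuple

class ExternalEventMemory:
    """Hypothetical externalized memory: store (embedding, description) pairs, query by cosine similarity."""

    def __init__(self, dim: int):
        self.dim = dim
        self.embeddings: List[np.ndarray] = []
        self.descriptions: List[str] = []

    def add(self, embedding: np.ndarray, description: str) -> None:
        # Normalize once at insertion so retrieval reduces to a dot product.
        norm = np.linalg.norm(embedding)
        self.embeddings.append(embedding / (norm + 1e-8))
        self.descriptions.append(description)

    def retrieve(self, query: np.ndarray, k: int = 3) -> List[Tuple[float, str]]:
        if not self.embeddings:
            return []
        query = query / (np.linalg.norm(query) + 1e-8)
        sims = np.stack(self.embeddings) @ query
        top = np.argsort(-sims)[:k]
        return [(float(sims[i]), self.descriptions[i]) for i in top]

# Usage idea: embed past egocentric events (e.g. "I placed my cup on the kitchen counter")
# with any encoder, store them, and retrieve candidates when a personalized question arrives.
```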
Conclusion
This work constitutes the first rigorous, quantitative, and diagnostic study of ego-grounding in personalized VideoQA. The MyEgo benchmark, task formulation, and in-depth analysis expose the critical limitations of state-of-the-art MLLMs in personalized, egocentric reasoning and highlight open challenges central to the development of practical AI assistants. Addressing these gaps will be necessary for the deployment of robust, context-aware systems capable of real-world, on-body multimodal reasoning and memory over extended personal experiences.