Reliability of ego-grounding in current multimodal LLMs

Determine whether contemporary multimodal large language models can perform ego-grounding reliably when answering personalized questions about egocentric videos, where the model must correctly resolve first-person references (e.g., "I," "my") and associate them with the camera wearer even though the wearer is only partially visible and the relevant evidence may span long temporal ranges.

Background

Ego-grounding is defined as grounding first-person references—such as “I,” “my things,” and “my activities”—to the camera wearer in egocentric videos. This is challenging because the wearer is often only partially visible and relevant evidence may be temporally distant.
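A minimal way to see what ego-grounding entails is to make the first-person references explicit before a model reasons over the video. The sketch below is purely illustrative and is not the paper's method: it rewrites a personalized question so that "I"/"my" explicitly denote the camera wearer, which is the association a model must internalize to answer correctly.

```python
import re

def resolve_ego_references(question: str, wearer: str = "the camera wearer") -> str:
    """Hypothetical helper: rewrite first-person references in a
    personalized question to explicitly denote the camera wearer.
    Illustrative only; real ego-grounding must be done by the model
    against visual evidence, not by string substitution."""
    # Replace possessives before the bare pronoun so "my" is not
    # left dangling after "I" has been rewritten.
    patterns = [
        (r"\bmy\b", f"{wearer}'s"),
        (r"\bmine\b", f"{wearer}'s"),
        (r"\bme\b", wearer),
        (r"\bI\b", wearer),
    ]
    resolved = question
    for pat, repl in patterns:
        resolved = re.sub(pat, repl, resolved)
    return resolved

print(resolve_ego_references("Where did I leave my keys?"))
# "Where did the camera wearer leave the camera wearer's keys?"
```

Even with such explicit rewriting, the model still has to link "the camera wearer" to partially visible hands, objects, and actions across long time spans, which is precisely the capability under evaluation here.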

Although multimodal LLMs exhibit strong visual reasoning and long-context abilities, whether they can achieve reliable ego-grounding under these conditions has not been established. The paper motivates a systematic evaluation to resolve this uncertainty.

References

However, it remains unclear whether current models can actually perform ego-grounding reliably.

Ego-Grounding for Personalized Question-Answering in Egocentric Videos (2604.01966, Xiao et al., 2 Apr 2026), Section 1, Introduction.