Reliability of ego-grounding in current multimodal LLMs
Determine whether contemporary multimodal large language models can reliably perform ego-grounding when answering personalized questions about egocentric videos. Here the model must correctly resolve first-person references (e.g., “I,” “my”) and associate them with the camera wearer, despite partial visibility and long-range temporal context.
References
However, it remains unclear whether current models can actually perform ego-grounding reliably.
— Ego-Grounding for Personalized Question-Answering in Egocentric Videos
(2604.01966 - Xiao et al., 2 Apr 2026) in Section 1, Introduction