Persistent object tracking across frames in egocentric videos
Develop methods that let multimodal foundation models track objects persistently across frames in egocentric videos, so that they maintain a stable world-state memory instead of relying solely on view-dependent evidence. The Spatial Memory tasks in the SAW (Situated Awareness in the Real World) benchmark require this capability.
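To make the distinction between view-dependent evidence and a world-state memory concrete, here is a minimal Python sketch. It is an illustration only, not a method from the cited paper or the SAW benchmark. The class name `WorldStateMemory`, the object ids, and the per-frame detections are all hypothetical, and the sketch assumes that object identities and positions in a fixed world coordinate frame are already given. That assumption sidesteps the hard part of the problem, which is associating and re-identifying objects across frames.

```python
from dataclasses import dataclass, field


@dataclass
class WorldStateMemory:
    """Toy memory mapping object id -> (last known position, frame index)."""
    last_seen: dict = field(default_factory=dict)

    def update(self, frame_idx, detections):
        # detections: {object_id: (x, y, z)} for objects visible in this frame.
        # Assumption: ids are consistent across frames and positions are in
        # a fixed world coordinate frame.
        for obj_id, position in detections.items():
            self.last_seen[obj_id] = (position, frame_idx)

    def query(self, obj_id):
        # Return (position, frame_idx) of the last sighting, or None if never seen.
        return self.last_seen.get(obj_id)


# Hypothetical per-frame detections from an egocentric video.
frames = [
    {"mug": (1.0, 0.0, 2.0), "laptop": (0.5, 0.0, 1.5)},
    {"laptop": (0.5, 0.0, 1.5)},  # the camera turns and the mug leaves the view
    {},                           # nothing is visible
]

memory = WorldStateMemory()
for idx, detections in enumerate(frames):
    memory.update(idx, detections)

# View-dependent evidence: only the current frame is consulted.
view_dependent = frames[-1].get("mug")

# World-state memory: the last sighting is retained after the object leaves the view.
persistent = memory.query("mug")
```

In this sketch, `view_dependent` is `None` because the mug is absent from the final frame, while `persistent` still holds the mug's position and the index of the frame in which it was last observed.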
References
Persistent tracking of objects across frames remains an open challenge across models.
— Learning Situated Awareness in the Real World
(2602.16682 - Li et al., 18 Feb 2026) in Finding 3, Subsection "Failure to maintain persistent object memory", Section 5 (Analysis)