Persistent object tracking across frames in egocentric videos

Establish methods that enable multimodal foundation models to maintain persistent tracking of objects across frames in egocentric videos, thereby supporting a stable world-state memory rather than relying purely on view-dependent evidence, as required by the Spatial Memory tasks in the SAW (Situated Awareness in the Real World) benchmark.

Background

Within the SAW benchmark’s Spatial Memory tasks, the authors observe that models frequently infer that objects no longer exist when they exit the camera’s field of view, indicating reliance on view-dependent evidence rather than a persistent world-state representation.

This behavior leads to incorrect conclusions about object presence over time during egocentric motion, motivating models that can maintain object persistence across frames, a capability the paper explicitly identifies as an open challenge.
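The distinction between view-dependent evidence and a persistent world-state memory can be made concrete with a small sketch. The class and field names below (`WorldStateMemory`, `TrackedObject`, `in_view`) are hypothetical illustrations, not an API from the SAW paper: the key design choice is that an object leaving the field of view is marked out of view but never deleted, so queries about its existence and last known position still succeed.

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    object_id: str
    last_position: tuple   # (x, y, z) in some world frame; hypothetical convention
    last_seen_frame: int
    in_view: bool

class WorldStateMemory:
    """Minimal persistent object memory (illustrative sketch only).

    Objects that exit the camera's field of view are retained with
    their last observed state, rather than being dropped, which is
    the persistence behavior the SAW Spatial Memory tasks probe.
    """

    def __init__(self):
        self.objects = {}  # object_id -> TrackedObject

    def update(self, frame_idx, detections):
        """detections: {object_id: (x, y, z)} visible in this frame."""
        for obj_id, pos in detections.items():
            self.objects[obj_id] = TrackedObject(obj_id, pos, frame_idx, True)
        # Objects not detected this frame go out of view,
        # but their last known state persists in memory.
        for obj_id, obj in self.objects.items():
            if obj_id not in detections:
                obj.in_view = False

    def exists(self, obj_id):
        # Persistence assumption: an object remains part of the world
        # state unless explicitly removed, even when not visible.
        return obj_id in self.objects

    def last_known_position(self, obj_id):
        obj = self.objects.get(obj_id)
        return obj.last_position if obj else None

mem = WorldStateMemory()
mem.update(0, {"mug": (1.0, 0.2, 0.0)})  # mug visible in frame 0
mem.update(1, {})                        # camera pans away; mug leaves the frame
print(mem.exists("mug"), mem.last_known_position("mug"))
```

A purely view-dependent model corresponds to rebuilding this memory from the current frame alone, in which case `exists("mug")` would return `False` after frame 1; the benchmark's failure analysis suggests current models behave closer to that variant.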

References

Persistent object tracking across frames remains an open challenge for current models.

Learning Situated Awareness in the Real World (2602.16682 - Li et al., 18 Feb 2026), Finding 3, Subsection "Failure to maintain persistent object memory", Section 5 (Analysis)