Simultaneously learning dense spatio-temporal and global semantic video representations
Determine training objectives and architectures for video representation learning that simultaneously preserve the dense spatio-temporal structure necessary for localization, geometry, and tracking and capture the motion dynamics that support global scene understanding and high-level recognition.
References
Despite rapid progress, learning representations that simultaneously preserve dense spatio-temporal structure (needed for localization, geometry, and tracking) while also capturing dynamics and supporting global understanding (needed for high-level recognition) remains an open challenge.
— V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
(2603.14482 - Mur-Labadia et al., 15 Mar 2026) in Introduction (Section 1)