Simultaneously learning dense spatio-temporal and global semantic video representations

Determine training objectives and architectures for video representation learning that simultaneously preserve the dense spatio-temporal structure needed for localization, geometry, and tracking, while also capturing the motion dynamics that support global scene understanding and high-level recognition.

Background

The paper focuses on self-supervised learning from videos and highlights a trade-off observed in prior work: video JEPA models excel at global understanding and dynamics but struggle with fine-grained local structure, whereas image-focused SSL methods yield strong dense features but do not directly learn temporal dynamics.

V-JEPA 2.1 is proposed to address this gap by applying a dense predictive loss to both masked and visible tokens, combined with deep self-supervision, multi-modal tokenizers, and data/model scaling. The authors frame the broader challenge of unifying dense and global capabilities as an open problem in representation learning.
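
To make the dense-prediction idea concrete, below is a minimal sketch, assuming a standard PyTorch setup, of a JEPA-style objective in which a predictor regresses target-encoder features at every token position and the loss is averaged densely over masked and visible tokens alike. The module names, the zero-masking shortcut, and the L1 objective are illustrative assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dense_jepa_loss(context_encoder, target_encoder, predictor, tokens, mask):
    """Dense predictive loss applied at masked AND visible token positions.

    tokens: (B, N, D) patchified spatio-temporal video tokens.
    mask:   (B, N) boolean, True where a token is hidden from the context encoder.
    """
    with torch.no_grad():
        targets = target_encoder(tokens)                 # (B, N, D) target features

    # The context encoder only sees visible content; zeroing masked tokens is a
    # crude stand-in for dropping them from the input sequence.
    context = context_encoder(tokens * (~mask).unsqueeze(-1).float())
    preds = predictor(context)                           # (B, N, D) predicted features

    # The loss is computed densely at every position, not only at masked ones.
    return F.l1_loss(preds, targets)

# Toy usage with linear stand-ins for the encoders and predictor.
B, N, D = 2, 196, 64
ctx_enc, tgt_enc, pred = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
tokens = torch.randn(B, N, D)
mask = torch.rand(B, N) < 0.75                           # mask ~75% of the tokens
loss = dense_jepa_loss(ctx_enc, tgt_enc, pred, tokens, mask)
loss.backward()
```

The intuition, as framed in the background above, is that supervising visible positions as well as masked ones pushes the encoder to retain locally discriminative spatio-temporal features rather than only a global summary of the clip.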

References

Despite rapid progress, learning representations that simultaneously preserve dense spatio-temporal structure (needed for localization, geometry, and tracking) while also capturing dynamics and supporting global understanding (needed for high-level recognition) remains an open challenge.

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning (2603.14482 - Mur-Labadia et al., 15 Mar 2026) in Introduction (Section 1)