V-JEPA 2 Embeddings

Updated 17 January 2026
  • V-JEPA 2 embeddings are high-capacity, self-supervised video representations learned via masked latent prediction using a Vision Transformer and teacher–student framework.
  • They achieve state-of-the-art performance in motion understanding, action anticipation, and video QA by leveraging EMA or static frozen teachers in their training strategy.
  • Extensive ablations highlight the importance of large model scales, multi-block spatial masking, and decoupled teacher–student optimization for robust and efficient video representations.

Video Joint-Embedding Predictive Architecture 2 (V-JEPA 2) embeddings are high-capacity, self-supervised video representations learned through masked-latent prediction objectives applied to large-scale video and image corpora. These embeddings are produced by a Vision Transformer (ViT)-based encoder and specialized predictor, trained using a teacher–student framework that incorporates exponential moving average (EMA) or, in newer alternatives, static frozen teachers. V-JEPA 2 embeddings achieve state-of-the-art performance on motion understanding, action anticipation, video question answering, and zero-shot robotic planning, and serve as the foundation for subsequent architectural refinements such as compute-efficient pipelines employing frozen teachers (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).

1. Architecture of V-JEPA 2 Embedding Models

V-JEPA 2 uses a spatio-temporal Vision Transformer backbone as its encoder (designated $E_\theta$), with model size variants including ViT-L/16 ($1024$ hidden units, $24$ layers), ViT-H/16 ($1280$ hidden units, $32$ layers), and ViT-g/16 ($1408$ hidden units, $40$ layers). The encoder processes tubelets of size $2 \times 16 \times 16$, produced by strided patchification. No [CLS] token is used; all spatio-temporal positions are treated uniformly.
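
The strided patchification above can be sketched as a pure reshape: each non-overlapping $2 \times 16 \times 16$ tubelet is flattened into one token. The clip shape below is illustrative, not a setting from the paper.

```python
import numpy as np

def patchify_tubelets(video, t=2, p=16):
    """Split a video into non-overlapping t x p x p tubelets and
    flatten each tubelet into one token vector (patchification sketch)."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    v = video.reshape(T // t, t, H // p, p, W // p, p, C)
    # Group the tubelet-index axes together, then flatten the pixel axes.
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = v.reshape((T // t) * (H // p) * (W // p), t * p * p * C)
    return tokens  # one row per spatio-temporal position; no [CLS] token

video = np.random.rand(16, 224, 224, 3).astype(np.float32)
tokens = patchify_tubelets(video)
print(tokens.shape)  # (8 * 14 * 14, 2*16*16*3) = (1568, 1536)
```

The token dimension here ($1536$) is the raw pixel count per tubelet; in the actual model a linear projection maps it to the encoder's hidden width.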

A lightweight ViT-based predictor network $P_\phi$ (also denoted $g_\phi$; typically $12$ layers, $384$ hidden width) takes the masked encoder outputs, injects learned mask tokens at the masked positions, and predicts the associated withheld latents. Positional information is injected using 3D rotary positional embeddings (RoPE), splitting the hidden dimension into separate temporal, height, and width components.

A teacher network is maintained as an EMA copy of the student ($\psi \leftarrow \beta\,\psi + (1-\beta)\,\theta$, with $\beta \approx 0.999$), whose weights provide the prediction targets during training and prevent collapse of representation learning (Assran et al., 11 Jun 2025, Bardes et al., 2024).
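
The EMA update is a per-parameter interpolation; a minimal sketch over a toy parameter dictionary:

```python
def ema_update(teacher, student, beta=0.999):
    """Exponential-moving-average teacher update:
    psi <- beta * psi + (1 - beta) * theta, applied per parameter."""
    return {name: beta * teacher[name] + (1.0 - beta) * student[name]
            for name in teacher}

# Toy illustration: with a constant student, the teacher decays toward it.
teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(1000):
    teacher = ema_update(teacher, student)
print(teacher["w"])  # ~0.999**1000 ~ 0.368
```

With $\beta \approx 0.999$, the teacher lags the student by roughly a thousand optimization steps, which is what stabilizes the prediction targets.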

2. Training Objectives and Masking Strategies

V-JEPA 2 employs a joint-embedding, masked-latent prediction objective. Given a video, a high-ratio (≈90%) “multi-block” spatial mask is applied (combining short-range and long-range 3D blocks), partitioning the video into context $x$ and masked $y$ regions. The student encoder $f_\theta$ (equivalently $E_\theta$) processes $x$ and is required, via the predictor $g_\phi$ (equivalently $P_\phi$), to predict the teacher’s (EMA) embeddings of the masked regions:

$$\mathcal{L}_{\mathrm{latent}} = \mathbb{E}_{x,\,y}\,\big\| g_\phi\big(f_\theta(x), \delta_y\big) - \mathrm{stop\_grad}\big(\bar{f}_\psi(y)\big) \big\|_1$$

where $\delta_y$ denotes the mask tokens placed at the withheld positions.

The “stop-gradient” operation blocks gradients flowing into the teacher to prevent representational collapse, analogously to BYOL and Data2Vec.
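The loss itself is a plain $\ell_1$ regression in latent space; a minimal numpy sketch (shapes are illustrative; in an autodiff framework the teacher targets would be wrapped in `stop_gradient`/`.detach()`):

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_latent_loss(student_pred, teacher_latents):
    """L1 masked-latent loss. The teacher targets stand in for
    stop_grad(f_psi(y)): no gradient flows into the teacher,
    which prevents collapse, as in BYOL and data2vec."""
    teacher_targets = teacher_latents  # stand-in for stop_grad(...)
    return np.abs(student_pred - teacher_targets).mean()

# Hypothetical shapes: ~90% of 1568 tokens masked, 1024-dim (ViT-L width).
pred = rng.standard_normal((1411, 1024))
target = rng.standard_normal((1411, 1024))
print(l1_latent_loss(pred, target))
```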

Alternative training strategies, such as SALT (Static-teacher Asymmetric Latent Training), replace the EMA teacher with a static one: the teacher is first trained in an initial stage with a masked pixel-reconstruction objective (“VideoMAE-style pixel MAE”), then its weights are frozen and used as unchanging prediction targets in a second stage. This decouples teacher and student optimization and eliminates the need for EMA or explicit collapse-prevention regularizers (Li et al., 29 Sep 2025).
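
The two-stage decoupling can be caricatured numerically: the "teacher" is fit once and frozen, after which the student regresses its fixed outputs. All functions and shapes here are toy stand-ins, not the SALT implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.standard_normal((256, 8))

# Stage 1: train the teacher alone (caricatured as fitting the data mean,
# standing in for a VideoMAE-style pixel-reconstruction fit), then freeze.
teacher_mean = data.mean(axis=0)
def frozen_teacher(x):
    return x - teacher_mean  # fixed latent map; never updated again

# Stage 2: only the student updates, regressing the frozen teacher's
# latents. Targets never move, so no EMA copy and no explicit
# collapse-prevention regularizer are needed.
student_shift = np.zeros(8)
lr = 0.5
for _ in range(50):
    grad = student_shift - teacher_mean  # gradient of the latent MSE
    student_shift = student_shift - lr * grad

print(np.allclose(student_shift, teacher_mean))  # prints True
```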

3. Embedding Extraction and Properties

After pretraining, inference proceeds by freezing the encoder weights. For downstream tasks, per-tubelet embeddings (with dimension equal to the encoder’s hidden width) can be aggregated via an attentive probe transformer (e.g., 4 blocks with cross-attention and learned query token) and fed to a linear classifier, or projected through an MLP head for alignment with LLMs in video QA workflows. In robotic world models, embeddings are stacked with action tokens for use in autoregressive latent-space dynamics models.
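
The attentive-probe aggregation described above boils down to cross-attention from a learned query over the frozen per-tubelet embeddings. A single-head, single-block sketch (dimensions and initialization are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pool(tokens, query, Wk, Wv):
    """Single-head cross-attention pooling (attentive-probe sketch):
    a learned query attends over frozen per-tubelet embeddings and
    returns one pooled vector for a downstream linear classifier."""
    K = tokens @ Wk                            # (N, d) keys
    V = tokens @ Wv                            # (N, d) values
    scores = query @ K.T / np.sqrt(K.shape[1]) # (N,) attention logits
    return softmax(scores) @ V                 # (d,) pooled embedding

rng = np.random.default_rng(0)
d = 64
tokens = rng.standard_normal((196, d))  # hypothetical frozen embeddings
query = rng.standard_normal(d)          # learned query token
Wk, Wv = rng.standard_normal((2, d, d)) * 0.1
pooled = attentive_pool(tokens, query, Wk, Wv)
print(pooled.shape)  # (64,)
```

The actual probe stacks several such blocks with multi-head attention, but the pooled-query mechanism is the same.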

The embeddings:

  • Retain spatio-temporal structure (no explicit global pooling during pretraining),
  • Carry no $\ell_2$ normalization or additional projection heads (beyond those for specific integrations),
  • Are robust to small or sub-optimal teacher networks in the SALT approach, with large student models outperforming their frozen teachers by $2$–$4\%$ in frozen-probing accuracy (Li et al., 29 Sep 2025).

4. Empirical Performance and Analysis

V-JEPA 2 embeddings deliver strong results across video and image benchmarks:

| Model          | SSv2 Top-1 | K400 Top-1 | Pretrain FLOPs        |
|----------------|------------|------------|-----------------------|
| V-JEPA 2 ViT-L | 73.7%      | 85.1%      | $1.9 \times 10^{21}$  |
| SALT ViT-L     | 74.9%      | 85.4%      | $1.2 \times 10^{21}$  |

SALT-trained students are consistently more accurate at matched compute, and dominate V-JEPA 2’s accuracy–FLOPs Pareto frontier, requiring $20$–$40\%$ fewer FLOPs for any given accuracy (Li et al., 29 Sep 2025).

The embeddings are highly effective beyond classification:

  • Action anticipation: State-of-the-art recall@5 on Epic-Kitchens-100.
  • Video QA (after MLLM alignment): SOTA results at the 8B-parameter scale, e.g., PerceptionTest ($84.0$), TempCompass ($76.9$).
  • Robotic world models: V-JEPA 2-AC enables zero-shot object manipulation across unseen environments, outperforming specialized baselines (Assran et al., 11 Jun 2025).

Downstream linear-probing performance also demonstrates substantial robustness to teacher quality (in SALT), and student model performance tracks cross-entropy loss with $R^2 > 0.95$, simplifying model selection and early stopping (Li et al., 29 Sep 2025).

5. Ablation Studies and Design Choices

Extensive ablations establish the importance of:

  • Large model and data scale: Improvements in accuracy and generalization,
  • Multi-block spatial masking: Enables high-ratio token masking without collapse,
  • Progressive input resolution: Increasing clip length and spatial size at later stages provides additive gains,
  • Deep attentive probes for evaluation: 4-layer probes outperform simpler pooling,
  • Decoupling teacher and student optimization (SALT): Higher efficiency and stability, even for sub-optimal teachers (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).
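
The multi-block masking listed above can be sketched as the union of random 3D blocks over the tubelet grid; block count and sizes below are illustrative, not the paper's settings, and the resulting ratio only approximates the ≈90% used in practice.

```python
import numpy as np

def multiblock_mask(t=8, h=14, w=14, n_blocks=8, block=(8, 6, 6), seed=0):
    """Multi-block 3D masking sketch: union of randomly placed
    spatio-temporal blocks over the (time, height, width) token grid,
    yielding a high overall masking ratio without collapse."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((t, h, w), dtype=bool)
    bt, bh, bw = block
    for _ in range(n_blocks):
        t0 = rng.integers(0, t - bt + 1)
        h0 = rng.integers(0, h - bh + 1)
        w0 = rng.integers(0, w - bw + 1)
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask

mask = multiblock_mask()
print(mask.mean())  # fraction of masked tokens
```

Masked positions become prediction targets; the unmasked remainder is the context fed to the student encoder.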

Notably, pixel-reconstruction objectives (VideoMAE) underperform feature-prediction (V-JEPA) methods under frozen encoder evaluation on most video tasks. The crucial bias favoring general video representations appears to be the masked-latent objective itself, not the EMA machinery.

6. Extensions and Applications

Beyond standard representation learning, V-JEPA 2 embeddings have been adapted for value-guided planning in world models (Destrade et al., 28 Dec 2025). By shaping the embedding space to approximate negative goal-conditioned value functions, they enable model predictive control with improved planning success. The learned Euclidean or quasi-metric distances between embeddings serve as proxies for expected reaching costs, supporting control in complex settings.
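
Using embedding distance as a negative value function reduces planning to distance minimization in latent space. A toy sketch, where `dynamics` is a hypothetical latent world model $z' = f(z, a)$ and the additive action effect is purely illustrative:

```python
import numpy as np

def plan_one_step(state_emb, goal_emb, candidate_actions, dynamics):
    """Value-guided planning sketch: the negative Euclidean distance
    between the predicted next embedding and the goal embedding stands
    in for a goal-conditioned value; pick the action maximizing it."""
    def value(a):
        return -np.linalg.norm(dynamics(state_emb, a) - goal_emb)
    return max(candidate_actions, key=value)

# Toy latent dynamics: the action displaces the embedding directly.
dynamics = lambda z, a: z + a
state = np.zeros(4)
goal = np.array([1.0, 0.0, 0.0, 0.0])
actions = [np.array([1.0, 0.0, 0.0, 0.0]),
           np.array([0.0, 1.0, 0.0, 0.0]),
           -np.ones(4)]
best = plan_one_step(state, goal, actions, dynamics)
print(best)  # the action stepping straight toward the goal
```

Real systems score sampled action sequences over a horizon (model predictive control) rather than a single step, but the distance-as-cost principle is the same.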

Additionally, the architecture’s modularity facilitates integration with LLMs for multimodal video QA, and with action-conditioned latent predictors for manipulation tasks, requiring only adaptations at the projection and aggregation stage—not to the backbone embedding itself (Assran et al., 11 Jun 2025, Li et al., 29 Sep 2025).

7. Limitations and Open Directions

V-JEPA 2 still relies on extensive data curation and scale for best performance, with masking strategies tuned to motion/appearance granularity. EMA-based training introduces decoupling and model selection challenges, addressed but not entirely eliminated by frozen-teacher approaches. Furthermore, while embedding space shaping directly for planning provides utility in simple tasks, robustness and generality in more complex robotic and interactive environments remain active research directions (Li et al., 29 Sep 2025, Destrade et al., 28 Dec 2025).

A plausible implication is that future efforts will focus on further decoupling large-scale representation pretraining from specialized downstream objectives, enabling more interpretable, efficient, and general video world models for planning and multimodal reasoning.
