
Volumetric Joint-Embedding Predictive Architecture

Updated 15 January 2026
  • VJEPA is a self-supervised learning architecture that predicts high-level feature embeddings from masked volumetric and spatiotemporal data.
  • It uses volumetric patchification, dual encoders, and a predictive head to focus on semantically meaningful temporal and spatial dependencies.
  • Applications include video representation learning, facial expression recognition, and EEG analysis, demonstrating robust, state-of-the-art performance.

The Volumetric Joint-Embedding Predictive Architecture (VJEPA) is a class of self-supervised learning architectures for volumetric and spatiotemporal data which learns representations by predicting high-level feature embeddings of masked regions from unmasked context. Unlike pixel-level reconstruction methods, VJEPA operates entirely in the embedding space, enabling the encoder to discard irrelevant details and focus its capacity on semantically meaningful temporal and spatial dependencies. VJEPA and its variants have been successfully applied to video representation learning, facial expression recognition, and spatiotemporal EEG analysis, demonstrating strong generalization and robustness across diverse domains (Eing et al., 14 Jan 2026, Drozdov et al., 2024, Hojjati et al., 4 Jul 2025, Bardes et al., 2024).

1. Architectural Foundations

VJEPA consists of the following principal components:

  • Volumetric Patchification: Raw data (e.g., video clips or EEG windows) are partitioned into non-overlapping 3D patches or "tubelets." For video, an input tensor $V \in \mathbb{R}^{T \times H \times W \times 3}$ is split into tubes of size $\Delta_t \times \Delta_h \times \Delta_w$. For EEG data, comparable volumetric partitions across channels and time are used (Eing et al., 14 Jan 2026, Hojjati et al., 4 Jul 2025).
  • Dual Encoder Structure: Two parallel Vision-Transformer (ViT)-style encoders of identical architecture:
    • The online encoder $E_\theta$ only sees the unmasked (visible) tubes.
    • The target (EMA) encoder $\hat{E}_{\hat{\theta}}$ receives the full, unmasked patch sequence and is updated via exponential moving average of the online encoder weights.
  • Predictive Head (Projection): A lightweight transformer or MLP receives the visible embeddings and a set of learnable mask tokens (with positional encodings), outputting predicted embeddings for all patch positions.
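
The patchification step in the first component can be sketched as a plain tensor reshape. This is a minimal illustration, assuming a $(T, H, W, C)$ channel-last layout; the function name is not from the papers:

```python
import torch

def patchify_tubelets(video, dt=2, dh=16, dw=16):
    """Split a video tensor (T, H, W, C) into non-overlapping tubelets.

    Returns an (N, dt*dh*dw*C) matrix of flattened tubelets, where
    N = (T/dt) * (H/dh) * (W/dw).
    """
    T, H, W, C = video.shape
    assert T % dt == 0 and H % dh == 0 and W % dw == 0
    x = video.reshape(T // dt, dt, H // dh, dh, W // dw, dw, C)
    # Bring the tubelet-grid axes to the front, tubelet-content axes to the back.
    x = x.permute(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, dt * dh * dw * C)

# A 16x224x224 RGB clip with 2x16x16 tubelets yields 8*14*14 = 1568 tokens.
tokens = patchify_tubelets(torch.randn(16, 224, 224, 3))
```

In practice each flattened tubelet is then linearly projected to the token embedding dimension and given a positional encoding before entering the encoders.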

The core pipeline can be summarized as:

  1. Compute context embeddings $s_x = E_\theta(x)$ for visible tokens ($x$ with masked positions zeroed or replaced by mask tokens).
  2. Compute target embeddings $s_y = \hat{E}_{\hat{\theta}}(y)$ for the full input.
  3. The predictive head $P_\phi$ estimates the masked embeddings: $\hat{s}_y = P_\phi(s_x, z)$.
  4. Compute a loss only over masked positions using the target embeddings, with stop-gradient on the teacher branch.
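
The four steps above can be sketched end to end as follows. This is a toy-scale illustration, not the published configuration: the linear "encoders," the single transformer predictor layer, the deterministic mask, and the omitted positional encodings are all simplifying assumptions.

```python
import torch
import torch.nn as nn

D, N = 64, 32                             # embedding dim / token count (toy sizes)
encoder = nn.Linear(D, D)                 # stand-in for the online encoder E_theta
target = nn.Linear(D, D)                  # stand-in for the EMA target encoder
target.load_state_dict(encoder.state_dict())
predictor = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
mask_token = nn.Parameter(torch.zeros(D))

tokens = torch.randn(N, D)                # embedded tubelets for one clip
m = torch.arange(N) % 10 != 0             # deterministic ~90% mask for illustration

s_x = encoder(tokens[~m])                 # context embeddings (visible tokens only)
with torch.no_grad():                     # stop-gradient on the teacher branch
    s_y = target(tokens)                  # target embeddings (full input)

# Reassemble the full sequence: visible embeddings plus learnable mask tokens.
z = mask_token.expand(N, D).clone()
z[~m] = s_x
s_hat = predictor(z.unsqueeze(0)).squeeze(0)

# L1 loss averaged over masked positions only; gradients reach only the
# online encoder, predictor, and mask token, never the EMA target.
loss = (s_hat[m] - s_y[m]).abs().mean()
loss.backward()
```

The predictor must mix information across token positions (here via self-attention); a purely per-token head would let no gradient flow from masked-position losses back to the visible-context embeddings.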

2. Masking Strategies and Objective Functions

VJEPA's training objective centers on predicting hidden feature embeddings for masked regions, not pixels:

  • Masking: A large proportion of patches (typically $\sim$90% in video applications) are masked per sample, either via random selection or blockwise contiguous masking. The binary mask $m \in \{0, 1\}^N$ identifies which tokens are masked ($m_i=1$) or visible ($m_i=0$) (Eing et al., 14 Jan 2026, Bardes et al., 2024).
  • Loss Function: The predictive loss is computed as an average $\ell_1$ or $\ell_2$ norm over the predicted and teacher embeddings on masked positions:

$$\mathcal{L}_{\text{VJEPA}} = \frac{1}{|M|}\sum_{i : m_i=1} \left\|\hat{s}_{y,i} - s_{y,i}\right\|_1,$$

enforcing accurate feature-level prediction for non-visible regions (Eing et al., 14 Jan 2026). For additional collapse prevention and robustness, some variants introduce variance and covariance regularization penalties in the spirit of VICReg:

$$L_{\text{vcr}}(\mathbf{H}) = \alpha\,L_{\text{var}}(\mathbf{H}) + \beta\,L_{\text{cov}}(\mathbf{H})$$

where $\mathbf{H}$ stacks the batch embeddings, and $L_{\text{var}}$, $L_{\text{cov}}$ penalize low variance and high feature correlation, respectively (Drozdov et al., 2024).

  • Predictive Paradigm: By regressing to target embeddings rather than pixels, VJEPA emphasizes high-level semantics such as motion and structure, rather than reconstructing low-level, often irrelevant details.
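
The blockwise contiguous masking mentioned above can be sketched by unioning random 3D blocks over the tubelet grid until the target ratio is reached; the grid and block sizes here are illustrative, not values from the papers:

```python
import numpy as np

def blockwise_mask(grid=(8, 14, 14), ratio=0.9, block=(2, 4, 4), rng=None):
    """Sample a blockwise-contiguous binary mask over a 3D tubelet grid.

    Contiguous blocks are masked until at least `ratio` of tokens are
    hidden, leaving the predictor less trivially interpolable context
    than i.i.d. random masking. Returns a flat boolean array
    (True = masked, matching m_i = 1 in the text).
    """
    rng = rng or np.random.default_rng()
    t, h, w = grid
    bt, bh, bw = block
    m = np.zeros(grid, dtype=bool)
    while m.mean() < ratio:
        i = rng.integers(0, t - bt + 1)
        j = rng.integers(0, h - bh + 1)
        k = rng.integers(0, w - bw + 1)
        m[i:i + bt, j:j + bh, k:k + bw] = True
    return m.reshape(-1)

m = blockwise_mask()
```

Because blocks overlap, the realized mask ratio lands slightly above the requested one; the flat output aligns with a flattened token sequence of the same grid.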

3. Design Choices, Training Protocols, and Variants

VJEPA has been instantiated with architectural and hyperparameter choices including:

  • Patch Sizes and Token Embedding: For video, tubes of $2 \times 16 \times 16$ (frames × pixels × pixels) are common, producing $N=1568$ tokens per $16 \times 224 \times 224$ input. For EEG, tubelets are adapted to temporal and channel structure, e.g., $4 \times 30 \times 4$ (frames × channels × samples) (Eing et al., 14 Jan 2026, Hojjati et al., 4 Jul 2025).
  • Backbones: ViT-base ($D=768$) or larger (ViT-large, ViT-H/16 with $D=1024$) models are used. The predictor is a small transformer or a 2-layer MLP with skip connections.
  • Mask Ratios: Masking $\sim$90% of tokens is typical for challenging context in vision; EEG applications employ intermediate mask ratios (e.g., $\alpha \approx 0.6$).
  • Pre-training Regimens: Training leverages large-scale unlabeled corpora (e.g., HowTo100M, Kinetics, Something-Something v2), the AdamW optimizer, large batch sizes ($\sim$8k), and EMA momentum approaching $0.999$. Clips shorter than the sampling window are handled by frame repetition (Eing et al., 14 Jan 2026, Bardes et al., 2024, Hojjati et al., 4 Jul 2025).
  • Data Augmentation: Random cropping, channel/time perturbations, and spatial flips are integrated depending on modality.
  • Collapse Prevention: EMA target encoders and, in some cases, variance-covariance regularization (VCR), ensure high-rank, diverse embeddings through stop-gradients and explicit regularization (Drozdov et al., 2024).
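
The two collapse-prevention mechanisms can be sketched as follows: an EMA update of the target encoder, and a VICReg-style variance-covariance penalty on batch embeddings. The margin `gamma` and the exact normalization are illustrative assumptions, not the papers' precise formulation.

```python
import torch

@torch.no_grad()
def ema_update(target, online, momentum=0.999):
    """Update the target (teacher) encoder as an EMA of the online encoder."""
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)

def vcr_penalty(H, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-4):
    """Variance-covariance regularizer on batch embeddings H of shape (B, D).

    The variance term penalizes per-dimension std below the margin `gamma`;
    the covariance term penalizes off-diagonal feature correlation,
    matching alpha * L_var + beta * L_cov in the text.
    """
    B, D = H.shape
    std = torch.sqrt(H.var(dim=0) + eps)
    l_var = torch.relu(gamma - std).mean()
    Hc = H - H.mean(dim=0)
    cov = (Hc.T @ Hc) / (B - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    l_cov = (off_diag ** 2).sum() / D
    return alpha * l_var + beta * l_cov
```

The EMA update runs once per optimizer step; the penalty, when used, is simply added to the masked prediction loss.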

4. Applications and Empirical Performance

VJEPA and its adaptations exhibit state-of-the-art or highly competitive results across several application domains:

  • Facial Expression Recognition (FER): Using pre-trained VJEPA video encoders and shallow attentive probes, models match or outperform all other purely vision-based, self-supervised FER methods on datasets such as RAVDESS (WAR=72.93, UAR=76.40) and CREMA-D (WAR=78.86, UAR=79.39), surpassing prior pixel-reconstruction approaches such as VideoMAE variants (Eing et al., 14 Jan 2026).
  • General Video Representation Learning: VJEPA achieves strong performance when frozen and evaluated via attentive probes or with end-to-end fine-tuning across Kinetics-400 (81.9%–86.6%), Something-Something-v2 (72.2%–77.0%), and ImageNet-1K (77.4%) (Bardes et al., 2024).
  • Cross-Modal Transfer: Adaptations of VJEPA to EEG data (EEG-VJEPA) enable high-accuracy abnormal brain signal classification and physiologically interpretable embeddings, supporting clinical decision-making workflows (Hojjati et al., 4 Jul 2025).
  • Temporal Dynamics and Action Recognition: Variations introducing latent variables into the predictive head allow representing uncertainty and stochasticity in future predictions, improving probing accuracy for action dynamics and capturing diverse semantic aspects (Drozdov et al., 2024).

5. Theoretical Implications and Design Rationale

The feature-prediction paradigm in VJEPA generalizes principles from contrastive, masked autoencoding, and BYOL-style architectures by:

  • Operating in Representation Space: Focusing the predictive loss on feature embeddings rather than raw pixels allows the encoder to extract high-level, task-relevant structure and discard nuisance variables (e.g., background color), yielding representations that align better with downstream semantics and generalize robustly, particularly in cross-dataset and cross-modality settings (Eing et al., 14 Jan 2026, Bardes et al., 2024).
  • Joint-Embedding Principle: The architecture’s dual-encoder, stop-gradient, and predictor configuration ensures non-collapsed, diverse representation learning without requiring negative samples or contrastive pairs.
  • Attentive Probing: For downstream tasks, attentive probe heads leveraging all spatial and temporal positions (rather than global pooling) substantially boost performance (e.g., +17% Kinetics-400 over naive pooling), evidencing the rich, location-specific information retained in the learned features (Bardes et al., 2024).
  • Optimality under $\ell_1$ Loss: The conditional median embedding is the optimum under $\ell_1$ prediction, which incentivizes the encoder to maximize information capture while minimizing the absolute deviation from target features.
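
The attentive probing described above can be sketched as a single learnable query cross-attending over all frozen token features; the class name, dimensions, and head count here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Attentive probe: a learnable query cross-attends over all frozen
    spatiotemporal token embeddings instead of average-pooling them."""

    def __init__(self, dim, num_classes, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):            # tokens: (B, N, dim) frozen features
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, dim)
        return self.head(pooled.squeeze(1))

probe = AttentiveProbe(dim=64, num_classes=10)
logits = probe(torch.randn(2, 1568, 64))
```

With a zero-initialized query, attention starts out uniform (equivalent to average pooling) and learns to weight informative spatiotemporal positions during probe training, while the backbone stays frozen.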

A plausible implication is that removing pixel-level objectives in favor of embedding-based regression fundamentally alters what aspects of the input are preserved or discarded during pre-training, favoring features that transfer across domains and tasks.

6. Extensions, Ablations, and Limitations

Ablations on mask ratios, predictor depth, embedding sizes, and regularization are reported primarily in the foundational works (Bardes et al., 2024). Major findings include:

  • Mask Ratio and Attentive Pooling: High mask ratios introduce more challenging prediction tasks, improving robustness; attentive pooling outperforms average pooling by a significant margin.
  • Variance–Covariance Regularization: Explicit regularization (as in VJ-VCR) prevents feature collapse and increases embedding dimensionality, which in turn boosts downstream probe performance (e.g., mAP=67.4% in CATER action recognition, outperforming generative baselines (Drozdov et al., 2024)).
  • Task Adaptation and Transfer: VJEPA’s architecture accommodates adaptation to new modalities (e.g., EEG) with specialized tubelet design and masking policies; the adaptive masking strategy enables the model to generalize across spatial and temporal scales.

Some limitations arise in the re-use of pre-trained weights across domains, the dependency on large-scale unlabeled data, and the need for careful tuning of tubelet geometry and masking strategy for each modality.


In summary, Volumetric Joint-Embedding Predictive Architectures are a scalable and flexible class of feature prediction models for spatiotemporal self-supervision. By eschewing pixel-level prediction in favor of embedding regression, VJEPA frameworks concentrate modeling capacity on semantically relevant, transferable features, providing state-of-the-art performance across vision and neurophysiological analysis tasks (Eing et al., 14 Jan 2026, Drozdov et al., 2024, Hojjati et al., 4 Jul 2025, Bardes et al., 2024).
