MC-JEPA: Joint Motion & Content Learning
- The paper demonstrates that unifying optical flow estimation and content discrimination using a shared encoder leads to mutual improvements across motion and segmentation benchmarks.
- MC-JEPA integrates PWC-Net–style flow estimators with VICReg-based embedding to co-optimize multi-scale features from both video and still images.
- Experiments reveal competitive performance with state-of-the-art unsupervised methods, validating the effectiveness of joint multitask self-supervised learning.
Motion-Content JEPA (MC-JEPA) is a self-supervised joint-embedding predictive architecture designed to unify the learning of optical flow (motion) and semantic content features within a single model. Unlike previous approaches that independently addressed either motion estimation or content discrimination, MC-JEPA employs a shared encoder to enable direct co-optimization, resulting in mutual improvements for both tasks. The architecture achieves performance on par with established unsupervised optical flow methods and leading self-supervised learning (SSL) techniques on major benchmarks for semantic segmentation and video analysis (Bardes et al., 2023).
1. Architectural Design and Components
MC-JEPA utilizes a single ConvNeXt-T–based encoder, denoted $f_\theta$, which integrates the processing of both video and image data streams. Given input in the form of either consecutive video frames $(I_t, I_{t+1})$ or two randomly augmented crops from a still image, $f_\theta$ computes a pyramid of multi-scale feature maps
$$\{F^{(l)}\}_{l=1}^{L} = f_\theta(I),$$
where levels $l = 1, \dots, L$ span from coarse to near-pixel resolution.
The joint embedding space branches into:
- Motion (Flow) Branch: At each scale $l$, PWC-Net–style flow estimators compute residual optical flow updates $\Delta f^{(l)}$. The complete flow is progressively computed coarse-to-fine, $f^{(l)} = \mathrm{up}\big(f^{(l+1)}\big) + \Delta f^{(l)}$, where $\mathrm{up}(\cdot)$ upsamples to the finer resolution. Feature maps at level $l$ are reconstructed by warping the second frame's features $F^{(l)}_{t+1}$ with the estimated flow via bilinear sampling.
- Content (Self-Supervised) Branch: Two augmented versions of a still image yield feature maps $F_1, F_2$, which are global-pooled and passed to a three-layer expander network to produce 8192-dimensional embeddings $z_1, z_2$. These are aligned using the VICReg variance–invariance–covariance objective.
The architectural coupling, with all heads sharing the encoder $f_\theta$, fosters precise alignment and information exchange between motion and content feature learning (Bardes et al., 2023).
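The warping step in the flow branch can be made concrete with a minimal pure-Python sketch of backward warping via bilinear sampling. The function name, the toy 4×4 grid, and the border-clamping choice are illustrative assumptions, not details from the paper:

```python
def bilinear_warp(feat, flow):
    """Backward-warp a 2-D single-channel feature map `feat` (H x W)
    with a per-pixel flow field `flow[y][x] = (dx, dy)` using bilinear
    sampling; out-of-bounds source coordinates are clamped to the border."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0.0), W - 1.0)  # clamped source x
            sy = min(max(y + dy, 0.0), H - 1.0)  # clamped source y
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            wx, wy = sx - x0, sy - y0
            out[y][x] = ((1 - wx) * (1 - wy) * feat[y0][x0]
                         + wx * (1 - wy) * feat[y0][x1]
                         + (1 - wx) * wy * feat[y1][x0]
                         + wx * wy * feat[y1][x1])
    return out

# Identity flow leaves the map unchanged; a uniform (+1, 0) flow
# samples each pixel from its right-hand neighbour.
feat = [[float(x + 4 * y) for x in range(4)] for y in range(4)]
ident = [[(0.0, 0.0)] * 4 for _ in range(4)]
assert bilinear_warp(feat, ident) == feat
```

In the actual architecture this operation is applied at every pyramid level, warping the second frame's features toward the first before the regression loss is computed.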
2. Objective Functions and Optimization Strategies
MC-JEPA’s training leverages multi-term loss functions across the two branches and integrates them within a unified multitask loss:
- Optical Flow Losses:
- Feature Regression: Multi-scale regression between warped and target feature maps.
- Photometric Reconstruction: Mixes $\ell_1$, $\ell_2$, and SSIM criteria.
- Edge-Aware Smoothness: Penalizes inconsistent local flow, regularized by image gradients.
- Cycle Consistency: Encourages forward–backward flow consistency.
- Variance–Covariance Regularization: Per-layer stabilization term.
- Content (VICReg) Loss:
- Invariance: Alignment of paired view embeddings, $s(z_1, z_2) = \|z_1 - z_2\|_2^2$.
- Variance: Ensures embedding spread via a hinge on each dimension's standard deviation, $v(z)$.
- Covariance: De-correlates output dimensions by penalizing off-diagonal covariance, $c(z)$.
- Combined content loss: $\mathcal{L}_{\mathrm{ssl}} = \lambda_{\mathrm{inv}}\, s(z_1, z_2) + \lambda_{\mathrm{var}}\, [v(z_1) + v(z_2)] + \lambda_{\mathrm{cov}}\, [c(z_1) + c(z_2)]$.
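The three VICReg terms can be sketched in a few lines of plain Python; the function name and the toy 2-D embeddings are illustrative (the real model uses 8192-dimensional expander outputs):

```python
import math

def vicreg_terms(Z1, Z2, gamma=1.0, eps=1e-4):
    """Compute the invariance, variance, and covariance terms of VICReg
    for two batches of embeddings (lists of equal-length vectors)."""
    n, d = len(Z1), len(Z1[0])
    # Invariance: mean squared distance between paired embeddings.
    inv = sum((a - b) ** 2 for z1, z2 in zip(Z1, Z2)
              for a, b in zip(z1, z2)) / (n * d)

    def var_term(Z):
        # Hinge loss keeping each dimension's std above gamma.
        total = 0.0
        for j in range(d):
            col = [z[j] for z in Z]
            mu = sum(col) / n
            std = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1) + eps)
            total += max(0.0, gamma - std)
        return total / d

    def cov_term(Z):
        # Sum of squared off-diagonal covariance entries, scaled by 1/d.
        means = [sum(z[j] for z in Z) / n for j in range(d)]
        total = 0.0
        for j in range(d):
            for k in range(d):
                if j == k:
                    continue
                c = sum((z[j] - means[j]) * (z[k] - means[k]) for z in Z) / (n - 1)
                total += c * c
        return total / d

    return inv, var_term(Z1) + var_term(Z2), cov_term(Z1) + cov_term(Z2)

# Identical views give zero invariance loss; uncorrelated dimensions
# give zero covariance loss.
Z = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
inv, var, cov = vicreg_terms(Z, Z)
assert inv == 0.0 and cov == 0.0
```

In training, these terms are computed on the expander outputs of the two augmented views and weighted as in the combined content loss above.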
Total Joint Objective: For each iteration, batches are drawn from video (for flow) and ImageNet (for SSL), and the two losses are combined as
$$\mathcal{L} = \mathcal{L}_{\mathrm{ssl}} + \lambda\, \mathcal{L}_{\mathrm{flow}},$$
with $\lambda = 0.1$ yielding the best trade-off (Bardes et al., 2023).
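A trivial sketch of the per-iteration combination (names ours), using the 0.1 trade-off weight reported in the ablations:

```python
def joint_loss(loss_ssl, loss_flow, lam=0.1):
    """Batchwise combined multitask objective: each iteration sums the
    SSL loss on an ImageNet batch with the weighted flow loss on a video
    batch, so both gradients reach the shared encoder together."""
    return loss_ssl + lam * loss_flow

assert joint_loss(2.0, 5.0) == 2.5
```

This batchwise combined-loss sampling, rather than alternating batches or epochs between tasks, is the scheme the ablations identify as optimal.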
3. Feature Interaction and Mutual Benefits
Backpropagation through the shared encoder ensures strong interaction between the flow and content objectives:
- Flow-based gradients promote the preservation of high-frequency and spatially local features, benefiting pixel-level localization within semantic embeddings.
- VICReg invariance discourages overfitting to motion artifacts, driving the encoder toward semantic attributes while maintaining compatibility with accurate flow estimation.
- The variance–covariance regularizer stabilizes representation learning, mitigating conflicts between motion and semantic gradients.
- The architecture is optimized via joint multitask objectives rather than sequential training, enhancing co-adaptation.
Empirical analysis demonstrates that introducing the VICReg loss into pure motion learning improves final flow metrics, while supplementing standard VICReg with a flow branch elevates segmentation accuracy (e.g., VOC mIoU from 60.1 to 67.1) (Bardes et al., 2023).
4. Experimental Protocols and Evaluation Benchmarks
MC-JEPA employs comprehensive pretraining and evaluation strategies, with datasets and metrics spanning both motion and content domains:
Pretraining Datasets:
- Flow: FlyingChairs, FlyingThings, KITTI raw/multiview (2012, 2015), MPI Sintel (raw, clean, final), HD1K.
- Content: ImageNet-1k.
- Flow Evaluation: MPI Sintel (clean/final, average EPE), KITTI 2015 (EPE, F1 >3px error).
- Segmentation Benchmarks: Pascal VOC, Cityscapes, ADE20K—linear probe and fine-tuned, measured by mIoU.
- Video Analysis: DAVIS 2017—mean region similarity $\mathcal{J}_m$, contour accuracy $\mathcal{F}_m$, and their average $(\mathcal{J}\&\mathcal{F})_m$.
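The two flow metrics can be made concrete with a short sketch (function name ours). The outlier criterion follows the standard KITTI 2015 definition: a pixel counts as an outlier when its endpoint error exceeds both 3 px and 5% of the ground-truth flow magnitude:

```python
import math

def epe_and_f1(pred, gt, px_thresh=3.0, rel_thresh=0.05):
    """Average endpoint error (EPE) and KITTI-style F1 outlier rate (%).
    `pred` and `gt` are lists of (u, v) flow vectors over valid pixels."""
    errs, outliers = [], 0
    for (pu, pv), (gu, gv) in zip(pred, gt):
        e = math.hypot(pu - gu, pv - gv)  # per-pixel endpoint error
        errs.append(e)
        mag = math.hypot(gu, gv)
        if e > px_thresh and e > rel_thresh * mag:
            outliers += 1
    return sum(errs) / len(errs), 100.0 * outliers / len(errs)

pred = [(0.0, 0.0), (10.0, 0.0)]
gt   = [(0.0, 0.0), (14.0, 0.0)]  # second pixel is 4 px off -> outlier
epe, f1 = epe_and_f1(pred, gt)
assert epe == 2.0 and f1 == 50.0
```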
See Table 1 for comparative results:
| Method | KITTI 2015 F1 (%) ↓ | Sintel EPE ↓ | VOC mIoU ↑ | Cityscapes mIoU ↑ | ADE20K mIoU ↑ | DAVIS (J&F)ₘ ↑ |
|---|---|---|---|---|---|---|
| UFlow (PWC) | 11.13 | 6.50 | – | – | – | – |
| UPFlow | 9.38 | 5.32 | – | – | – | – |
| SMURF (RAFT) | 6.83 | 4.18 | – | – | – | – |
| VICRegL | – | – | 79.7 | 78.3 | 44.1 | 66.7 |
| DINO | – | – | 79.5 | 78.1 | 43.5 | 69.9 |
| MC-JEPA | 11.33 | 6.12 | 79.9 | 78.4 | 44.2 | 70.5 |
MC-JEPA demonstrates flow performance comparable to dedicated unsupervised methods and matches or outperforms leading SSL models in segmentation and DAVIS video analysis (Bardes et al., 2023).
5. Ablation Analyses and Design Sensitivities
Extensive ablation studies highlight the sensitivity and interdependence of losses, architectures, and data strategies:
- Incorporating VICReg alongside flow improves flow EPE, while adding the flow head to VICReg baselines increases segmentation mIoU.
- LayerNorm in the PWC head is essential to avoid gradient/weight instabilities; $\ell_2$-normalization of the flow output degrades optical flow quality.
- Alternate batch/epoch sampling harms joint training performance; batchwise combined-loss sampling is optimal (e.g., flow EPE 2.67, VOC 67.1, DAVIS 70.5).
- Flow training is most effective when introduced after 10 SSL epochs.
- Optimal cycle-consistency and multi-task trade-off weights (0.2 and 0.1, respectively) are critical; misconfiguration degrades both flow and segmentation results.
- Pretraining the variance–covariance regularizer for 1 epoch prior to full multitask optimization slightly improves stability and accuracy.
6. Training Regimen and Augmentation Policies
The model is trained on 8× Tesla V100 GPUs for 100 epochs (∼3–4 days) with the AdamW optimizer and weight decay. Separate learning rates are used for the encoder and VICReg head and for the flow head (which starts training after 10 epochs), both with cosine decay and a 10-epoch warm-up. Batch sizes are 384 for SSL and 8 for flow tasks. Data augmentations for ImageNet comprise random crops/scales ([0.08, 1.0]) resized to a fixed resolution, color jitter, and Gaussian blur; flow datasets use dataset-specific fixed resolutions and standard geometric/color augmentations.
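The learning-rate schedule can be sketched as follows; `base_lr` is a placeholder, since the exact rates are not preserved in this summary, and the linear shape of the warm-up is an assumption:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=100, warmup_epochs=10):
    """Cosine-decay learning-rate schedule with a linear warm-up, as
    described for both the encoder/VICReg head and the flow head."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear warm-up
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

assert lr_at_epoch(9, 1.0) == 1.0   # warm-up reaches base LR at epoch 9
assert lr_at_epoch(99, 1.0) < 0.01  # decayed to near zero by the last epoch
```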
7. Conclusion and Significance
MC-JEPA establishes that a single joint-embedding predictive architecture can co-learn high-quality motion (optical flow) and semantic content features from both video and image data. Inter-task synergy arises from loss co-optimization and architectural coupling, improving localization within content features and sharpening flow predictions. The architecture achieves state-of-the-art or competitive results across optical flow and segmentation benchmarks and highlights the benefits of holistic, multi-task self-supervised representation learning (Bardes et al., 2023).