MC-JEPA: Joint Motion & Content Learning
- The paper demonstrates that unifying optical flow estimation and content discrimination using a shared encoder leads to mutual improvements across motion and segmentation benchmarks.
- MC-JEPA integrates PWC-Net–style flow estimators with VICReg-based embedding to co-optimize multi-scale features from both video and still images.
- Experiments reveal competitive performance with state-of-the-art unsupervised methods, validating the effectiveness of joint multitask self-supervised learning.
Motion-Content JEPA (MC-JEPA) is a self-supervised joint-embedding predictive architecture designed to unify the learning of optical flow (motion) and semantic content features within a single model. Unlike previous approaches that independently addressed either motion estimation or content discrimination, MC-JEPA employs a shared encoder to enable direct co-optimization, resulting in mutual improvements for both tasks. The architecture achieves performance on par with established unsupervised optical flow methods and leading self-supervised learning (SSL) techniques on major benchmarks for semantic segmentation and video analysis (Bardes et al., 2023).
1. Architectural Design and Components
MC-JEPA utilizes a single ConvNeXt-T–based encoder, denoted $f_\theta$, which integrates the processing of both video and image data streams. Given input in the form of either consecutive video frames $(I_t, I_{t+1})$ or two randomly augmented crops from a still image, $f_\theta$ computes a pyramid of multi-scale feature maps
$$\{F^{(l)}\}_{l=1}^{L} = f_\theta(I),$$
where levels $l = 1, \dots, L$ span from coarse to near-pixel resolution.
The joint embedding space branches into:
- Motion (Flow) Branch: At each scale $l$, PWC-Net–style flow estimators compute residual optical flow updates $\Delta f^{(l)}$. The complete flow is progressively computed coarse-to-fine, $f^{(l)} = \mathrm{up}\big(f^{(l+1)}\big) + \Delta f^{(l)}$, where $\mathrm{up}(\cdot)$ upsamples to the finer resolution. Feature maps at level $l$ are reconstructed by warping the second frame's features $F^{(l)}_{t+1}$ with the estimated flow via bilinear sampling.
- Content (Self-Supervised) Branch: Two augmented versions of a still image yield feature maps $F_1, F_2$, which are global-pooled and passed to a three-layer expander network to produce 8192-dimensional embeddings $z_1, z_2$. These are aligned using the VICReg variance–invariance–covariance objective.
The architectural coupling, with all heads sharing the encoder $f_\theta$, fosters precise alignment and information exchange between motion and content feature learning (Bardes et al., 2023).
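The warping step in the flow branch can be made concrete with a minimal pure-Python sketch of backward warping via bilinear sampling. The function name, the toy 4×4 grid, and the border-clamping choice are illustrative assumptions, not details from the paper:

```python
def bilinear_warp(feat, flow):
    """Backward-warp a 2-D single-channel feature map `feat` (H x W)
    with a per-pixel flow field `flow[y][x] = (dx, dy)` using bilinear
    sampling; out-of-bounds source coordinates are clamped to the border."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0.0), W - 1.0)  # clamped source x
            sy = min(max(y + dy, 0.0), H - 1.0)  # clamped source y
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            wx, wy = sx - x0, sy - y0
            out[y][x] = ((1 - wx) * (1 - wy) * feat[y0][x0]
                         + wx * (1 - wy) * feat[y0][x1]
                         + (1 - wx) * wy * feat[y1][x0]
                         + wx * wy * feat[y1][x1])
    return out

# Identity flow leaves the map unchanged; a uniform (+1, 0) flow
# samples each pixel from its right-hand neighbour.
feat = [[float(x + 4 * y) for x in range(4)] for y in range(4)]
ident = [[(0.0, 0.0)] * 4 for _ in range(4)]
assert bilinear_warp(feat, ident) == feat
```

In the actual architecture this operation is applied at every pyramid level, warping the second frame's features toward the first before the regression loss is computed.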
2. Objective Functions and Optimization Strategies
MC-JEPA’s training leverages multi-term loss functions across the two branches and integrates them within a unified multitask loss:
- Optical Flow Losses:
- Feature Regression: Multi-scale regression between warped and target feature maps.
- Photometric Reconstruction: Mixes $\ell_1$, $\ell_2$, and SSIM criteria.
- Edge-Aware Smoothness: Penalizes inconsistent local flow, regularized by image gradients.
- Cycle Consistency: Encourages forward–backward flow consistency.
- Variance–Covariance Regularization: Per-layer stabilization term.
- Content (VICReg) Loss:
- Invariance: Alignment of paired view embeddings, $s(z_1, z_2) = \|z_1 - z_2\|_2^2$.
- Variance: Ensures embedding spread via a hinge on each dimension's standard deviation, $v(z)$.
- Covariance: De-correlates output dimensions by penalizing off-diagonal covariance, $c(z)$.
- Combined content loss: $\mathcal{L}_{\mathrm{ssl}} = \lambda_{\mathrm{inv}}\, s(z_1, z_2) + \lambda_{\mathrm{var}}\, [v(z_1) + v(z_2)] + \lambda_{\mathrm{cov}}\, [c(z_1) + c(z_2)]$.
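The three VICReg terms can be sketched in a few lines of plain Python; the function name and the toy 2-D embeddings are illustrative (the real model uses 8192-dimensional expander outputs):

```python
import math

def vicreg_terms(Z1, Z2, gamma=1.0, eps=1e-4):
    """Compute the invariance, variance, and covariance terms of VICReg
    for two batches of embeddings (lists of equal-length vectors)."""
    n, d = len(Z1), len(Z1[0])
    # Invariance: mean squared distance between paired embeddings.
    inv = sum((a - b) ** 2 for z1, z2 in zip(Z1, Z2)
              for a, b in zip(z1, z2)) / (n * d)

    def var_term(Z):
        # Hinge loss keeping each dimension's std above gamma.
        total = 0.0
        for j in range(d):
            col = [z[j] for z in Z]
            mu = sum(col) / n
            std = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1) + eps)
            total += max(0.0, gamma - std)
        return total / d

    def cov_term(Z):
        # Sum of squared off-diagonal covariance entries, scaled by 1/d.
        means = [sum(z[j] for z in Z) / n for j in range(d)]
        total = 0.0
        for j in range(d):
            for k in range(d):
                if j == k:
                    continue
                c = sum((z[j] - means[j]) * (z[k] - means[k]) for z in Z) / (n - 1)
                total += c * c
        return total / d

    return inv, var_term(Z1) + var_term(Z2), cov_term(Z1) + cov_term(Z2)

# Identical views give zero invariance loss; uncorrelated dimensions
# give zero covariance loss.
Z = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
inv, var, cov = vicreg_terms(Z, Z)
assert inv == 0.0 and cov == 0.0
```

In training, these terms are computed on the expander outputs of the two augmented views and weighted as in the combined content loss above.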
Total Joint Objective: For each iteration, batches are drawn from video (for flow) and ImageNet (for SSL), and the two losses are combined as
$$\mathcal{L} = \mathcal{L}_{\mathrm{ssl}} + \lambda\, \mathcal{L}_{\mathrm{flow}},$$
with $\lambda = 0.1$ yielding the best trade-off (Bardes et al., 2023).
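A trivial sketch of the per-iteration combination (names ours), using the 0.1 trade-off weight reported in the ablations:

```python
def joint_loss(loss_ssl, loss_flow, lam=0.1):
    """Batchwise combined multitask objective: each iteration sums the
    SSL loss on an ImageNet batch with the weighted flow loss on a video
    batch, so both gradients reach the shared encoder together."""
    return loss_ssl + lam * loss_flow

assert joint_loss(2.0, 5.0) == 2.5
```

This batchwise combined-loss sampling, rather than alternating batches or epochs between tasks, is the scheme the ablations identify as optimal.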
3. Feature Interaction and Mutual Benefits
Backpropagation through the shared encoder ensures strong interaction between the flow and content objectives:
- Flow-based gradients promote the preservation of high-frequency and spatially local features, benefiting pixel-level localization within semantic embeddings.
- VICReg invariance discourages overfitting to motion artifacts, driving the encoder toward semantic attributes while maintaining compatibility with accurate flow estimation.
- The variance–covariance regularizer stabilizes representation learning, mitigating conflicts between motion and semantic gradients.
- The architecture is optimized via joint multitask objectives rather than sequential training, enhancing co-adaptation.
Empirical analysis demonstrates that introducing the VICReg loss into pure motion learning improves final flow metrics, while supplementing standard VICReg with a flow branch elevates segmentation accuracy (e.g., VOC mIoU from 60.1 to 67.1) (Bardes et al., 2023).
4. Experimental Protocols and Evaluation Benchmarks
MC-JEPA employs comprehensive pretraining and evaluation strategies, with datasets and metrics spanning both motion and content domains:
Pretraining Datasets:
- Flow: FlyingChairs, FlyingThings, KITTI raw/multiview (2012, 2015), MPI Sintel (raw, clean, final), HD1K.
- Content: ImageNet-1k.
- Flow Evaluation: MPI Sintel (clean/final, average EPE), KITTI 2015 (EPE, F1 >3px error).
- Segmentation Benchmarks: Pascal VOC, Cityscapes, ADE20K—linear probe and fine-tuned, measured by mIoU.
- Video Analysis: DAVIS 2017—mean region similarity $\mathcal{J}_m$, contour accuracy $\mathcal{F}_m$, and their average $(\mathcal{J}\&\mathcal{F})_m$.
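The two flow metrics can be made concrete with a short sketch (function name ours). The outlier criterion follows the standard KITTI 2015 definition: a pixel counts as an outlier when its endpoint error exceeds both 3 px and 5% of the ground-truth flow magnitude:

```python
import math

def epe_and_f1(pred, gt, px_thresh=3.0, rel_thresh=0.05):
    """Average endpoint error (EPE) and KITTI-style F1 outlier rate (%).
    `pred` and `gt` are lists of (u, v) flow vectors over valid pixels."""
    errs, outliers = [], 0
    for (pu, pv), (gu, gv) in zip(pred, gt):
        e = math.hypot(pu - gu, pv - gv)  # per-pixel endpoint error
        errs.append(e)
        mag = math.hypot(gu, gv)
        if e > px_thresh and e > rel_thresh * mag:
            outliers += 1
    return sum(errs) / len(errs), 100.0 * outliers / len(errs)

pred = [(0.0, 0.0), (10.0, 0.0)]
gt   = [(0.0, 0.0), (14.0, 0.0)]  # second pixel is 4 px off -> outlier
epe, f1 = epe_and_f1(pred, gt)
assert epe == 2.0 and f1 == 50.0
```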
See Table 1 for comparative results:
| Method | KITTI 2015 F1 (%) ↓ | Sintel EPE ↓ | VOC mIoU ↑ | Cityscapes mIoU ↑ | ADE20K mIoU ↑ | DAVIS (J&F)ₘ ↑ |
|---|---|---|---|---|---|---|
| UFlow (PWC) | 11.13 | 6.50 | – | – | – | – |
| UPFlow | 9.38 | 5.32 | – | – | – | – |
| SMURF (RAFT) | 6.83 | 4.18 | – | – | – | – |
| VICRegL | – | – | 79.7 | 78.3 | 44.1 | 66.7 |
| DINO | – | – | 79.5 | 78.1 | 43.5 | 69.9 |
| MC-JEPA | 11.33 | 6.12 | 79.9 | 78.4 | 44.2 | 70.5 |
MC-JEPA demonstrates flow performance comparable to dedicated unsupervised methods and matches or outperforms leading SSL models in segmentation and DAVIS video analysis (Bardes et al., 2023).
5. Ablation Analyses and Design Sensitivities
Extensive ablation studies highlight the sensitivity and interdependence of losses, architectures, and data strategies:
- Incorporating VICReg alongside flow improves flow EPE, while adding the flow head to VICReg baselines increases segmentation mIoU.
- LayerNorm in the PWC head is essential to avoid gradient/weight instabilities; $\ell_2$-normalization of the flow output degrades optical flow quality.
- Alternate batch/epoch sampling harms joint training performance; batchwise combined-loss sampling is optimal (e.g., flow EPE 2.67, VOC 67.1, DAVIS 70.5).
- Flow training is most effective when introduced after 10 SSL epochs.
- Optimal cycle-consistency and multi-task trade-off weights (0.2 and 0.1, respectively) are critical; misconfiguration degrades both flow and segmentation results.
- Pretraining the variance–covariance regularizer for 1 epoch prior to full multitask optimization slightly improves stability and accuracy.
6. Training Regimen and Augmentation Policies
The model is trained on 8× Tesla V100 GPUs for 100 epochs (∼3–4 days) with the AdamW optimizer and weight decay. Separate learning rates are used for the encoder and VICReg head and for the flow head (which starts training after 10 epochs), both with cosine decay and a 10-epoch warm-up. Batch sizes are 384 for SSL and 8 for flow tasks. Data augmentations for ImageNet comprise random crops/scales ([0.08, 1.0]) resized to a fixed resolution, color jitter, and Gaussian blur; flow datasets use dataset-specific fixed resolutions and standard geometric/color augmentations.
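The learning-rate schedule can be sketched as follows; `base_lr` is a placeholder, since the exact rates are not preserved in this summary, and the linear shape of the warm-up is an assumption:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=100, warmup_epochs=10):
    """Cosine-decay learning-rate schedule with a linear warm-up, as
    described for both the encoder/VICReg head and the flow head."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs  # linear warm-up
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

assert lr_at_epoch(9, 1.0) == 1.0   # warm-up reaches base LR at epoch 9
assert lr_at_epoch(99, 1.0) < 0.01  # decayed to near zero by the last epoch
```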
7. Conclusion and Significance
MC-JEPA establishes that a single joint-embedding predictive architecture can co-learn high-quality motion (optical flow) and semantic content features from both video and image data. Inter-task synergy arises from loss co-optimization and architectural coupling, improving localization within content features and sharpening flow predictions. The architecture achieves state-of-the-art or competitive results across optical flow and segmentation benchmarks and highlights the benefits of holistic, multi-task self-supervised representation learning (Bardes et al., 2023).