
MC-JEPA: Joint Motion & Content Learning

Updated 27 January 2026
  • The paper demonstrates that unifying optical flow estimation and content discrimination using a shared encoder leads to mutual improvements across motion and segmentation benchmarks.
  • MC-JEPA integrates PWC-Net–style flow estimators with VICReg-based embedding to co-optimize multi-scale features from both video and still images.
  • Experiments reveal competitive performance with state-of-the-art unsupervised methods, validating the effectiveness of joint multitask self-supervised learning.

Motion-Content JEPA (MC-JEPA) is a self-supervised joint-embedding predictive architecture designed to unify the learning of optical flow (motion) and semantic content features within a single model. Unlike previous approaches that independently addressed either motion estimation or content discrimination, MC-JEPA employs a shared encoder to enable direct co-optimization, resulting in mutual improvements for both tasks. The architecture achieves performance on par with established unsupervised optical flow methods and leading self-supervised learning (SSL) techniques on major benchmarks for semantic segmentation and video analysis (Bardes et al., 2023).

1. Architectural Design and Components

MC-JEPA uses a single ConvNeXt-T–based encoder, denoted $E_\theta$, which processes both video and image data streams. Given input in the form of either consecutive video frames $(I_t, I_{t+1})$ or two randomly augmented crops from a still image, $E_\theta$ computes a pyramid of multi-scale feature maps:

X^{(l)} := E_\theta(I_t) \in \mathbb{R}^{d^{(l)} \times h^{(l)} \times w^{(l)}}, \quad l = 1, \ldots, L,

where $L = 6$ levels span from coarse to near-pixel resolution.
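As a rough illustration of the pyramid's geometry, the sketch below lists the spatial sizes for a $224 \times 224$ input if each level halves resolution; the exact ConvNeXt-T stage strides and channel widths are an assumption here, not taken from the paper:

```python
def pyramid_shapes(h, w, levels=6):
    """Spatial sizes of a feature pyramid that halves resolution at
    every level, listed fine-to-coarse (hypothetical strides; the
    actual ConvNeXt-T stage layout may differ)."""
    shapes = []
    for _ in range(levels):
        h, w = (h + 1) // 2, (w + 1) // 2  # halve, rounding up
        shapes.append((h, w))
    return shapes

print(pyramid_shapes(224, 224))
# [(112, 112), (56, 56), (28, 28), (14, 14), (7, 7), (4, 4)]
```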

The joint embedding space branches into:

  • Motion (Flow) Branch: At each scale, PWC-Net–style flow estimators $F_\phi$ compute residual optical flow updates. The complete flow is computed progressively, coarse to fine:

f_{t\to t+1}^{(1)} = F_\phi(X_t^{(1)}, X_{t+1}^{(1)}, 0), \quad f_{t\to t+1}^{(l+1)} = F_\phi(X_t^{(l+1)}, X_{t+1}^{(l+1)}, f_{t\to t+1}^{(l)}).

Feature maps at $t+1$ are reconstructed by warping $X_t^{(l)}$ with the estimated flow via bilinear sampling.

  • Content (Self-Supervised) Branch: Two augmented views $v_1, v_2$ of a still image yield $E_\theta(v_1), E_\theta(v_2)$, which are globally pooled and passed to a three-layer expander network $g_\psi$ to produce 8192-dimensional embeddings $z_1, z_2$. These are aligned using the VICReg variance–invariance–covariance objective.

The architectural coupling, with all heads sharing $E_\theta$, fosters precise alignment and information exchange between motion and content feature learning (Bardes et al., 2023).
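The warping operator used to reconstruct features at $t+1$ amounts to bilinear sampling of the source map at flow-displaced coordinates. A minimal NumPy sketch follows; border clamping is simplified and the occlusion handling of the actual implementation is not reproduced:

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Warp a feature map feat (C, H, W) by flow (2, H, W), so that
    out(p) = feat(p + flow(p)), with bilinear interpolation and
    coordinates clamped to the image border."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[0], 0, W - 1)   # displaced x coordinates
    y = np.clip(ys + flow[1], 0, H - 1)   # displaced y coordinates
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0               # interpolation weights
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0]
            + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0]
            + wy * wx * feat[:, y1, x1])
```

With zero flow the warp is the identity; a unit horizontal flow shifts the map by one pixel (up to border clamping), which is the behavior the feature-regression loss relies on.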

2. Objective Functions and Optimization Strategies

MC-JEPA’s training leverages multi-term loss functions across the two branches and integrates them within a unified multitask loss:

  • Optical Flow Losses:
    • Feature Regression: Multi-scale feature warping and regression,

    \mathcal{L}_{\mathrm{reg}} = \sum_{l=1}^L \| X_{t+1}^{(l)} - W(X_t^{(l)}, f_{t\to t+1}^{(l)}) \|_2^2

    • Photometric Reconstruction: Mixing $\ell_1$, $\ell_2$, and SSIM criteria,

    \mathcal{L}_{\mathrm{rec}} = d\bigl(I_{t+1},\, W(I_t,\, f_{t\to t+1})\bigr), \quad d = \lambda_1 \|\cdot\|_1 + \lambda_2 \|\cdot\|_2^2 + \lambda_3\, \mathrm{SSIM}(\cdot)

    • Edge-Aware Smoothness: Penalizes locally inconsistent flow, downweighted at image gradients,

    \mathcal{L}_{\mathrm{smooth}} = \sum_p \sum_{d \in \{x, y\}} e^{-\lambda |\nabla_d I(p)|}\, \| \nabla_d f_{t\to t+1}(p) \|_1

    • Cycle Consistency: Encourages forward–backward flow consistency,

    \mathcal{L}_{\mathrm{cycle}} = \sum_{l=1}^L \| X_t^{(l)} - W(W(X_t^{(l)}, f_{t\to t+1}), f_{t+1\to t}) \|_2^2

    • Variance–Covariance Regularization: Per-layer stabilization term,

    \mathcal{L}_{vc} = \sum_{l=1}^L \Bigg\{ \frac{1}{d^{(l)}} \sum_{j=1}^{d^{(l)}} \max\Big(0,\, \gamma - \sqrt{\mathrm{Var}(X_{t,j}^{(l)})}\Big) + \frac{1}{d^{(l)}} \sum_{i \neq j} \big[C(X_t^{(l)})\big]_{i,j}^2 \Bigg\}

  • Content (VICReg) Loss:

    • Invariance: Alignment of view embeddings,

    \mathcal{L}_{\mathrm{inv}} = \| z_1 - z_2 \|_2^2

    • Variance: Ensures embedding spread,

    \mathcal{L}_{\mathrm{var}} = \frac{1}{D} \sum_j \max\big(0,\, \gamma - \sqrt{\mathrm{Var}(z_{:,j})} + \varepsilon\big)

    • Covariance: De-correlates output dimensions,

    \mathcal{L}_{\mathrm{cov}} = \frac{1}{D} \sum_{i \neq j} [C(z)]_{i,j}^2

    • Combined content loss:

    \mathcal{L}_{\mathrm{ssl}} = \mathcal{L}_{\mathrm{inv}} + \lambda_{\mathrm{var}} \mathcal{L}_{\mathrm{var}} + \lambda_{\mathrm{cov}} \mathcal{L}_{\mathrm{cov}}

  • Total Joint Objective: For each iteration, batches are drawn from video (for flow) and ImageNet (for SSL), and

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{reg}} + \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{smooth}} + \mathcal{L}_{\mathrm{cycle}} + \mathcal{L}_{vc,\,\mathrm{flow}} + \alpha\, \mathcal{L}_{\mathrm{ssl}}

with $\alpha \approx 0.1$ yielding the best trade-off (Bardes et al., 2023).
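The VICReg content loss is compact enough to sketch directly. The NumPy version below renders the invariance, variance, and covariance terms; the coefficient values and the placement of the small constant inside the square root follow the common VICReg convention and are assumptions, not values stated here:

```python
import numpy as np

def vicreg_loss(z1, z2, lam_var=25.0, lam_cov=1.0, gamma=1.0, eps=1e-4):
    """Variance-invariance-covariance loss on two batches of embeddings
    z1, z2 of shape (N, D). Coefficients are illustrative defaults."""
    n, d = z1.shape
    # Invariance: mean squared distance between paired view embeddings.
    inv = ((z1 - z2) ** 2).sum(axis=1).mean()

    def var_term(z):
        # Hinge on the per-dimension standard deviation, target gamma.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.maximum(0.0, gamma - std).mean()

    def cov_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by 1/D.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return (off ** 2).sum() / d

    return (inv
            + lam_var * (var_term(z1) + var_term(z2))
            + lam_cov * (cov_term(z1) + cov_term(z2)))
```

Perfectly aligned views zero out the invariance term, and the loss then reduces to the variance/covariance regularizers, which is what prevents representational collapse.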

3. Feature Interaction and Mutual Benefits

Backpropagation through the shared encoder $E_\theta$ ensures strong interaction between the flow and content objectives:

  • Flow-based gradients promote the preservation of high-frequency and spatially local features, benefiting pixel-level localization within semantic embeddings.

  • VICReg invariance discourages overfitting to motion artifacts, driving the encoder toward semantic attributes while maintaining compatibility with accurate flow estimation.

  • The variance–covariance regularizer stabilizes representation learning, mitigating conflicts between motion and semantic gradients.

  • The architecture is optimized via joint multitask objectives rather than sequential training, enhancing co-adaptation.

Empirical analysis demonstrates that introducing the VICReg loss into pure motion learning improves final flow metrics, while supplementing standard VICReg with a flow branch elevates segmentation accuracy (e.g., VOC mIoU from 60.1 to 67.1) (Bardes et al., 2023).

4. Experimental Protocols and Evaluation Benchmarks

MC-JEPA employs comprehensive pretraining and evaluation strategies, with datasets and metrics spanning both motion and content domains:

  • Pretraining Datasets:

    • Flow: FlyingChairs, FlyingThings, KITTI raw/multiview (2012, 2015), MPI Sintel (raw, clean, final), HD1K.
    • Content: ImageNet-1k.
  • Flow Evaluation: MPI Sintel (clean/final, average EPE), KITTI 2015 (EPE; F1, the percentage of pixels with >3 px error).
  • Segmentation Benchmarks: Pascal VOC, Cityscapes, ADE20K—linear probe and fine-tuned, measured by mIoU.
  • Video Analysis: DAVIS 2017—mean region similarity $J_m$, contour accuracy $F_m$, and their average.
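The two flow metrics are straightforward to state precisely. A minimal NumPy sketch (function names are ours; the KITTI outlier rule of exceeding both 3 px and 5% of the ground-truth magnitude follows the standard benchmark definition):

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """Average end-point error: mean Euclidean distance between
    predicted and ground-truth flow vectors. Inputs: (2, H, W)."""
    return np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=0)).mean()

def fl_outlier_rate(flow_pred, flow_gt, thresh_px=3.0, thresh_rel=0.05):
    """KITTI-style F1/Fl: fraction of pixels whose error exceeds both
    3 px and 5% of the ground-truth flow magnitude."""
    err = np.sqrt(((flow_pred - flow_gt) ** 2).sum(axis=0))
    mag = np.sqrt((flow_gt ** 2).sum(axis=0))
    return float(((err > thresh_px) & (err > thresh_rel * mag)).mean())
```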

See Table 1 for comparative results:

| Method | KITTI EPE ↓ | Sintel EPE ↓ | VOC mIoU ↑ | Cityscapes mIoU ↑ | ADE20K mIoU ↑ | DAVIS (J&F) ↑ |
|---|---|---|---|---|---|---|
| UFlow (PWC) | 11.13 | 6.50 | — | — | — | — |
| UPFlow | 9.38 | 5.32 | — | — | — | — |
| SMURF (RAFT) | 6.83 | 4.18 | — | — | — | — |
| VICRegL | — | — | 79.7 | 78.3 | 44.1 | 66.7 |
| DINO | — | — | 79.5 | 78.1 | 43.5 | 69.9 |
| MC-JEPA | 11.33% F₁ | 6.12 | 79.9 | 78.4 | 44.2 | 70.5 |

MC-JEPA demonstrates flow performance comparable to dedicated unsupervised methods and matches or outperforms leading SSL models in segmentation and DAVIS video analysis (Bardes et al., 2023).

5. Ablation Analyses and Design Sensitivities

Extensive ablation studies highlight the sensitivity and interdependence of losses, architectures, and data strategies:

  • Incorporating VICReg alongside flow improves flow EPE, while adding the flow head to VICReg baselines increases segmentation mIoU.
  • LayerNorm in the PWC head is essential to avoid gradient/weight instabilities; $\ell_2$-normalization of the flow output degrades optical flow quality.
  • Alternate batch/epoch sampling harms joint training performance; batchwise combined-loss sampling is optimal (e.g., flow EPE 2.67, VOC 67.1, DAVIS 70.5).
  • Flow training is most effective when introduced after ~10 SSL epochs.
  • Optimal cycle-consistency and multitask trade-off weights (~0.2 and ~0.1, respectively) are critical; misconfiguration degrades both flow and segmentation results.
  • Pretraining the variance–covariance regularizer for 1 epoch prior to full multitask optimization slightly improves stability and accuracy.

6. Training Regimen and Augmentation Policies

The model is trained on 8× Tesla V100 GPUs for 100 epochs (∼3–4 days). The optimizer is AdamW, with $(\beta_1, \beta_2) = (0.9, 0.999)$ and weight decay $10^{-6}$. Learning rates are set to $3 \times 10^{-4}$ for the encoder and VICReg head and $10^{-4}$ for the flow head (which starts training after 10 epochs), both with cosine decay and a 10-epoch warm-up. Batch sizes are 384 for SSL and 8 for flow tasks. Data augmentations for ImageNet comprise random crops/scales ($[0.08, 1.0]$), resizing to $224 \times 224$, color jitter, and Gaussian blur; flow datasets use dataset-specific fixed resolutions and standard geometric/color augmentations.
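The schedule described above (10-epoch warm-up followed by cosine decay) can be sketched as follows; the linear warm-up shape and per-epoch granularity are assumptions, since the paper's exact interpolation is not spelled out here:

```python
import math

def lr_at_epoch(epoch, base_lr, total_epochs=100, warmup_epochs=10):
    """Linear warm-up for the first `warmup_epochs`, then cosine decay
    from `base_lr` down to zero at `total_epochs`."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

# Encoder/VICReg-head peak learning rate from the text: 3e-4.
for e in (0, 9, 50, 100):
    print(e, lr_at_epoch(e, 3e-4))
```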

7. Conclusion and Significance

MC-JEPA establishes that a single joint-embedding predictive architecture can co-learn high-quality motion (optical flow) and semantic content features from both video and image data. Inter-task synergy arises from loss co-optimization and architectural coupling, improving localization within content features and sharpening flow predictions. The architecture achieves state-of-the-art or competitive results across optical flow and segmentation benchmarks and highlights the benefits of holistic, multi-task self-supervised representation learning (Bardes et al., 2023).
