
Masked Video Prediction (MVP)

Updated 14 January 2026
  • Masked Video Prediction (MVP) is a self-supervised learning approach aimed at reconstructing occluded spatio-temporal regions using diverse masking strategies and tokenization.
  • It leverages transformer architectures, convolutional backbones, and recurrent modules to capture appearance details, temporal dynamics, and motion priors effectively.
  • Recent developments in MVP demonstrate improved action recognition, video generation, and dense prediction while ensuring efficient compute, scalability, and robust temporal reasoning.

Masked Video Prediction (MVP) denotes a broad paradigm in which a model is tasked with reconstructing masked spatio-temporal regions (patches, frames, or segments) of videos given partial or contextually visible information. MVP unifies objectives in self-supervised representation learning, generative modeling, and post-training for foundation models, with masking strategies and reconstruction targets serving as the critical axis for capturing temporal structure, appearance, and motion priors.

1. Foundational Formulations and Model Architectures

MVP targets the recovery of occluded regions in input video data, typically via masked-token or masked-pixel reconstruction over patches, frames, or contiguous segments, conditioned on the visible context.

Model architectures span transformer encoder-decoders, convolutional backbones, and recurrent modules, often combined so as to capture appearance detail, temporal dynamics, and motion priors.

Key developments have focused on the interface between intra-frame reconstruction (appearance, texture) and inter-frame modeling (temporal causality, dynamics), as realized in hybrid approaches such as MAGI’s masked-autoregressive generation (Zhou et al., 21 Jan 2025).
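The patch/frame/segment granularities above are typically realized by tokenizing a clip into spatio-temporal "tubelets" before masking, as in ViT-style MVP encoders. A minimal NumPy sketch (patch sizes here are illustrative defaults, not taken from any one cited paper):

```python
import numpy as np

def tubelet_tokenize(video, t_patch=2, h_patch=16, w_patch=16):
    """Split a video of shape (T, H, W, C) into flattened spatio-temporal
    tubelet tokens of shape (num_tokens, t_patch*h_patch*w_patch*C)."""
    T, H, W, C = video.shape
    assert T % t_patch == 0 and H % h_patch == 0 and W % w_patch == 0
    v = video.reshape(T // t_patch, t_patch,
                      H // h_patch, h_patch,
                      W // w_patch, w_patch, C)
    # Reorder axes so each token gathers one contiguous (t, h, w) cube.
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    return v.reshape(-1, t_patch * h_patch * w_patch * C)

video = np.random.rand(16, 224, 224, 3)   # toy clip
tokens = tubelet_tokenize(video)          # (8*14*14, 2*16*16*3) = (1568, 1536)
```

A masking strategy then selects a subset of these tokens to hide, and the model reconstructs them from the remainder.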

2. Masking Strategies and Conditioned Reconstruction

The masking schema determines the degree and nature of temporal reasoning required. Notable configurations include:

  • Random masking (independent per patch): simple but often too easy a pretext task, since masked patches can be recovered by copying nearby visible information. Block-wise or tube masking is favored for temporally correlated contexts (Tan et al., 2021, Wei et al., 2021, Sun et al., 2022).
  • Block-wise spatio-temporal masking: VIMPAC and MaskFeat introduce masking contiguous spatio-temporal cubes, forcing models to recover broader dependencies and discouraging trivial per-frame reconstructions (Tan et al., 2021, Wei et al., 2021).
  • Complete Teacher Forcing (CTF) vs. Masked Teacher Forcing (MTF): MAGI demonstrates that conditioning masked frames on fully observed contexts during training (CTF) yields +23% FVD improvement in UCF-101 experiments over masking context frames (MTF), aligning training and inference distributions (Zhou et al., 21 Jan 2025).

Some advanced strategies use reinforcement learning to learn adaptive masking policies, focusing masking on motion-centric (high-dynamics) tokens through Proximal Policy Optimization (PPO) (Rai et al., 13 May 2025), or employ motion priors to guide token selection (Sun et al., 2022).
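The contrast between independent per-patch masking and tube masking can be sketched as follows; the token grid and masking ratio are illustrative assumptions, not the exact settings of the cited papers:

```python
import numpy as np

def random_mask(num_frames, tokens_per_frame, ratio, rng):
    """Independent per-token masking: each token is hidden with prob `ratio`."""
    return rng.random((num_frames, tokens_per_frame)) < ratio

def tube_mask(num_frames, tokens_per_frame, ratio, rng):
    """Tube masking: the same spatial positions are masked in every frame,
    so content cannot be recovered by copying from neighbouring frames."""
    n_masked = int(round(tokens_per_frame * ratio))
    idx = rng.choice(tokens_per_frame, size=n_masked, replace=False)
    mask = np.zeros((num_frames, tokens_per_frame), dtype=bool)
    mask[:, idx] = True
    return mask

rng = np.random.default_rng(0)
tube = tube_mask(8, 196, 0.9, rng)   # 14x14 tokens/frame, 90% masked
rand = random_mask(8, 196, 0.9, rng)
```

Under the tube variant every frame shares one spatial mask, which is what forces the model to reason over time rather than interpolate spatially.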

3. Training Objectives, Losses, and Temporal Reasoning

Reconstruction targets define the MVP prediction task: raw pixels (per-token MSE), discrete codebook tokens (cross-entropy), motion trajectories, and diffusion score-matching objectives, with the choice of target steering whether appearance or temporal dynamics dominate the learned representation.

Reinforcement learning objectives with custom rewards (sequence ordering, partial correctness bonuses) are crucial for LLM-based video reasoning tasks (Sun et al., 7 Jan 2026). Curriculum strategies (dynamic interval or noise injection) reduce exposure bias, improving long-horizon coherence (Zhou et al., 21 Jan 2025).
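For the common pixel-regression target, the loss is computed only over masked tokens, often against per-token normalized targets (a design choice popularized by masked autoencoders). A minimal sketch with illustrative shapes:

```python
import numpy as np

def masked_recon_loss(pred, target, mask, eps=1e-6):
    """MSE over masked tokens only, against per-token normalized targets.
    pred, target: (num_tokens, dim); mask: (num_tokens,) boolean."""
    mu = target.mean(axis=-1, keepdims=True)
    sd = target.std(axis=-1, keepdims=True)
    norm_target = (target - mu) / (sd + eps)
    per_token = ((pred - norm_target) ** 2).mean(axis=-1)
    return per_token[mask].mean()

rng = np.random.default_rng(1)
target = rng.normal(size=(10, 32))
mask = np.zeros(10, dtype=bool)
mask[:5] = True   # only the 5 masked tokens contribute to the loss
perfect = (target - target.mean(-1, keepdims=True)) / (target.std(-1, keepdims=True) + 1e-6)
loss = masked_recon_loss(perfect, target, mask)   # ~0 for a perfect prediction
```

Restricting the loss to masked positions keeps visible tokens as pure context, which is what makes the objective predictive rather than autoencoding.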

4. Empirical Performance and Benchmark Results

Models are benchmarked across standard datasets and tasks:

  • Action recognition: VIMPAC, MaskFeat, VideoMAC, MME, and adaptive masking approaches show top-1 accuracy improvements over previous methods on SSV2, Kinetics-400, UCF101, HMDB51; e.g., MME achieves 81.8% top-1 on Kinetics-400, +2.3pp on SSV2 over VideoMAE (Sun et al., 2022), while PPO-guided token selection yields up to +15% over VideoMAE under 95% masking (Rai et al., 13 May 2025).
  • Video generation and prediction (FVD, SSIM, PSNR): MAGI’s CTF strategy delivers FVD=11.5 on Kinetics-600 (five-frame conditional) (Zhou et al., 21 Jan 2025), MCVD sets SOTA (FVD=23.9–25.6 on SMMNIST, 98.8–95.6 on BAIR) under unified block-wise autoregressive sampling (Voleti et al., 2022).
  • VideoLLM reasoning tasks: MVP fine-tuning yields notable gains (+5–8pp) in temporal reasoning and causal understanding benchmarks (LongVideoBench, MLVU, Video-Holmes) for QwenVL and InternVL LLM backbones (Sun et al., 7 Jan 2026).
  • Dense tasks (segmentation, propagation, tracking): VideoMAC demonstrates ConvNet MVP can outperform ViT-based MAEs by +5–6pp on DAVIS (J&F), +6–11pp on VIP (mIoU) and JHMDB (PCK) (Pei et al., 2024); PLA-SM yields SSIM/PSNR improvements up to +1–2dB over strong baselines across diverse datasets (Li et al., 2023).

5. Efficiency, Scalability, and Practical Considerations

Efficient MVP design is critical for practical deployment:

  • Inference speed: MaskViT leverages iterative mask scheduling and windowed attention for up to a 512× decoding speedup relative to autoregressive models (BAIR: T=3840 passes → 24 passes) (Gupta et al., 2022).
  • Memory and compute: RVM achieves comparable or stronger video understanding with up to 30× smaller model sizes (RVM-S = 34M parameters), maintaining stable feature propagation over long horizons (Zoran et al., 15 Dec 2025).
  • KV-caching: MAGI exploits frame-level autoregression and caching, enabling nearly linear scaling of inference time (Zhou et al., 21 Jan 2025).
  • Data synthesis and curriculum: VideoLLM MVP utilizes scalable distractor generation and policy optimization to generate vast, diverse self-supervised samples (Sun et al., 7 Jan 2026).

Diffusion models (MCVD) highlight that block-wise, conditional generation with flexible masking can enable unified modeling for prediction, generation, and interpolation using simple 2D conv architectures at low compute cost (≤4 GPUs, <200 GPU-h), without 3D convs or recurrence (Voleti et al., 2022). Sparse convolutional encoding (VideoMAC, PLA-SM) is essential for preserving mask integrity—a failure mode of dense convolution (Pei et al., 2024, Li et al., 2023).
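Iterative decoding of the kind MaskViT uses relies on a mask schedule that reveals tokens over a small number of parallel passes. A cosine schedule in the MaskGIT style is one common choice; the function below is an illustrative sketch, not MaskViT's exact schedule:

```python
import math

def cosine_unmask_schedule(num_tokens, num_steps):
    """Tokens still masked after each of `num_steps` parallel decoding
    passes, under a cosine schedule. Ends at 0: everything revealed."""
    return [int(math.floor(num_tokens * math.cos(0.5 * math.pi * s / num_steps)))
            for s in range(1, num_steps + 1)]

sched = cosine_unmask_schedule(256, 8)   # monotone decreasing, last entry 0
```

Revealing many tokens per pass is what collapses thousands of autoregressive steps into a handful of forward passes.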

6. Extensions, Limitations, and Research Directions

Recent MVP work emphasizes:

  • Motion-centric supervision: Motion trajectory regression (MME) forces temporal reasoning, outperforming appearance-only objectives and yielding superior generalization (Sun et al., 2022).
  • Adaptive masking via RL: RL-based policies (TATS) steer masking toward high-dynamics regions, permitting 85–95% masking ratios without accuracy loss (Rai et al., 13 May 2025).
  • Generalist video encoders: RVM demonstrates parameter efficiency and domain-agnostic robustness without distillation, matching both image and video models for dense understanding and long-term tracking (Zoran et al., 15 Dec 2025).
  • Text-to-video and cross-modal MVP: MAGI and MaskViT can readily extend to text-conditioned video modeling by augmenting token embeddings with cross-attention (Zhou et al., 21 Jan 2025, Gupta et al., 2022).
  • Exposure bias: Dynamic interval and noise-based curricula, as in MAGI, partly mitigate exposure bias but drift remains in highly non-periodic sequences (Zhou et al., 21 Jan 2025).
  • Limitations: RL-based MVP (TATS) introduces training complexity (buffering, two-phase PPO), and current explorations on LLMs mostly address reasoning rather than perceptual quality (Rai et al., 13 May 2025, Sun et al., 7 Jan 2026). Long-horizon coherence in non-periodic content remains challenging for all frameworks.
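As a toy illustration of motion-centric token selection, spatial patches can be scored by frame differencing and the highest-motion patches masked. The RL policies discussed above learn this selection rather than using a fixed heuristic, so this is only a hand-crafted stand-in:

```python
import numpy as np

def motion_scores(video, h_patch=16, w_patch=16):
    """Mean absolute frame difference per spatial patch — a crude motion
    proxy. video: (T, H, W) grayscale. Returns (H//h_patch, W//w_patch)."""
    diff = np.abs(np.diff(video, axis=0)).mean(axis=0)   # (H, W)
    H, W = diff.shape
    patches = diff.reshape(H // h_patch, h_patch, W // w_patch, w_patch)
    return patches.mean(axis=(1, 3))

def top_motion_mask(scores, ratio):
    """Mask the highest-motion patches (fixed heuristic, not a learned policy)."""
    flat = scores.ravel()
    k = int(round(flat.size * ratio))
    idx = np.argsort(flat)[::-1][:k]
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(scores.shape)

# Toy clip: motion only in the top-left 16x16 patch.
rng = np.random.default_rng(0)
video = np.zeros((8, 64, 64))
video[:, :16, :16] = rng.random((8, 16, 16))
scores = motion_scores(video)
mask = top_motion_mask(scores, ratio=0.25)
```

A learned policy replaces the `top_motion_mask` heuristic with a reward-optimized selector, which is where the PPO machinery and its training complexity come in.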

Future work is poised to include multi-modal token selection, curriculum learning, adaptive scheduling, advanced reward design, and more explicit causal modeling for both generative and reasoning foundation models.

7. Comparative Summary Table of Representative MVP Frameworks

| Framework | Masking Strategy | Core Objective | Architecture |
|---|---|---|---|
| MAGI (Zhou et al., 21 Jan 2025) | Frame-level (CTF) | Cross-entropy (VAE + diffusion head); interval + noise curriculum | Hybrid Transformer |
| MME (Sun et al., 2022) | Tube/block-wise | Motion trajectory regression | ViT encoder-decoder |
| VIMPAC (Tan et al., 2021) | Block-wise (VQ-VAE tokens) | Masked token prediction + contrastive InfoNCE | ViT transformer |
| VideoMAC (Pei et al., 2024) | Symmetric frame-pair patches | Dual reconstruction + consistency loss | Sparse ConvNets |
| MaskViT (Gupta et al., 2022) | Variable mask ratio, spatial/ST windows | Codebook token cross-entropy, iterative decoding | Windowed Transformer |
| PLA-SM (Li et al., 2023) | Pixel-level input/feature masking | MSE; PLA for texture | U-shaped ConvNeXt + attention |
| MCVD (Voleti et al., 2022) | Block-wise, random past/future frames | Score matching (conditional diffusion) | 2D Conv U-Net/DDPM |
| RVM (Zoran et al., 15 Dec 2025) | Asymmetric future frame masking | Pixel-level MSE; recurrent aggregation | ViT + gated Transformer RNN |
| TATS (Rai et al., 13 May 2025) | RL-learned motion-centric tokens | MSE, PPO policy for masking | ViT MAE + trajectory attention |
| MVP-VideoLLM (Sun et al., 7 Jan 2026) | Masked continuous segment + distractors | RL group-wise ordering reward (GRPO) | VideoLLM backbone, CLIP encoder |

MVP thus encompasses a spectrum of strategies for video prediction, ranging from masked autoencoding of spatio-temporal patches to the explicit reconstruction of ordered frame segments under RL. The direction of recent work points toward greater token-adaptive masking, stronger temporal curricula, and unified generative/causal frameworks connecting perception and reasoning in high-level video models.
