MAE-ViT: Masked Autoencoding Video Transformer

Updated 21 January 2026

The paper introduces an MAE-based transformer that reconstructs masked video patches, achieving state-of-the-art accuracy on diverse video benchmarks.
It employs an encoder-decoder structure where the encoder processes only visible tokens, drastically reducing computation via extreme masking ratios.
The approach demonstrates robust self-supervised learning across large unlabeled datasets, enabling efficient and accurate video representation learning.

A Masked Autoencoding Video Vision Transformer (MAE-ViT) refers to a self-supervised learning framework that extends masked autoencoding principles—established in the context of natural language and image modeling—to dense spatiotemporal visual signals using transformer architectures. This paradigm centers on randomly masking large portions of input video patches (or tubelets), training a transformer-based encoder-decoder architecture to reconstruct the missing content from the visible context. The approach is characterized by extremely high masking ratios, parameter-efficient architectures, and an emphasis on domain-agnostic learning across diverse video datasets, yielding state-of-the-art performance across video understanding tasks without the need for explicit supervision (Tong et al., 2022, Feichtenhofer et al., 2022, Girdhar et al., 2022).

1. Core Architectural Principles

The canonical MAE-ViT framework consists of the following components:

Patchification (Tubelet Embedding): An input video of shape $T \times H \times W \times 3$ is partitioned into spatiotemporal tubelets of size $t \times h \times w \times 3$ (e.g., $2 \times 16 \times 16 \times 3$, covering 2 consecutive frames at $16 \times 16$ spatial resolution), yielding $N = (T/t) \cdot (H/h) \cdot (W/w)$ tokens per clip. For images, $T=1$ is treated as a special case (Girdhar et al., 2022, Tong et al., 2022).
Encoder: Only the unmasked subset of $N-M$ visible tokens is processed by a plain Vision Transformer (ViT-B/L/H). No explicit temporal or cross-frame constraints are imposed: standard ViT attention is applied across the sequential spatiotemporal tokens, with positional embeddings encoding temporal and spatial order (Girdhar et al., 2022, Feichtenhofer et al., 2022).
Decoder: A lightweight, often shallow transformer reconstructs the masked tokens. It operates on the concatenation of encoder outputs and $M$ learned mask tokens, adding positional encodings for all $N$ slots before predicting target RGB values for each tubelet (Girdhar et al., 2022, Tong et al., 2022).
Temporal Handling: Time is integrated as an additional patch dimension; the base architecture has no specialized cross-frame or motion-specific inductive bias (Girdhar et al., 2022, Feichtenhofer et al., 2022). This design reflects the minimalism of the self-supervised framework, in which the only inductive biases come from the masking configuration and the transformer’s global attention.
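As a rough illustration of the token bookkeeping described above, the following Python sketch computes $N$ and the per-token dimension for a clip, and shows one way the decoder input could be laid out. The names `TubeletConfig` and `assemble_decoder_input` are hypothetical, and plain strings stand in for embedding vectors; it is a sketch under those assumptions, not the implementation from the cited papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TubeletConfig:
    """Clip and tubelet sizes (defaults follow the example values in the text)."""
    frames: int = 16   # T
    height: int = 224  # H
    width: int = 224   # W
    t: int = 2         # tubelet length in frames
    h: int = 16        # tubelet height in pixels
    w: int = 16        # tubelet width in pixels

    def num_tokens(self) -> int:
        """N = (T/t) * (H/h) * (W/w); requires exact divisibility."""
        pairs = ((self.frames, self.t), (self.height, self.h), (self.width, self.w))
        for size, step in pairs:
            if size % step != 0:
                raise ValueError(f"{size} is not divisible by tubelet size {step}")
        return (self.frames // self.t) * (self.height // self.h) * (self.width // self.w)

    def token_dim(self) -> int:
        """Number of RGB values covered by one tubelet: t * h * w * 3."""
        return self.t * self.h * self.w * 3


def assemble_decoder_input(encoded, visible_idx, num_tokens, mask_token):
    """Put encoder outputs back at their original positions.

    Every other slot receives the shared mask token. Positional encodings
    for all N slots would be added afterwards (not shown here).
    """
    sequence = [mask_token] * num_tokens
    for vector, index in zip(encoded, visible_idx):
        sequence[index] = vector
    return sequence


cfg = TubeletConfig()
print(cfg.num_tokens(), cfg.token_dim())
print(assemble_decoder_input(["enc_a", "enc_b"], [3, 0], 5, "[MASK]"))
```

With the default values this gives 1568 tokens of 1536 values each, so at 90% masking the encoder would see only about 157 of them.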

2. Masking Paradigms, Pretext Task, and Reconstruction Objective

Extreme Spatiotemporal Masking: MAE-ViT employs unusually high spatiotemporal masking ratios—for video, typically 90%–95% of all tubelets are masked, with only a sparse set of visible tokens processed. For images, 75%–90% masking is commonly used. Masking is applied randomly by default, sampling each tubelet independently, although alternatives (tube, frame, or causal masking) have been explored: random masking consistently outperforms structured schemes at these high ratios (Girdhar et al., 2022, Tong et al., 2022, Feichtenhofer et al., 2022).
Efficient Pretraining: Because mask tokens are introduced only at the decoder’s input, the compute-intensive encoder processes just the $N-M$ visible tokens, yielding a $\sim4\times$ – $12\times$ reduction in FLOPs and wall-clock time (Tong et al., 2022, Feichtenhofer et al., 2022, Girdhar et al., 2022).
Pixel Reconstruction Loss: The pretext task is pure pixel-space masked tubelet reconstruction. Let $\mathcal{M}$ denote the set of masked tubelet indices, with $|\mathcal{M}| = M$. For each masked tubelet $i \in \mathcal{M}$, with ground truth $x_i \in \mathbb{R}^{t\cdot h\cdot w\cdot 3}$ and model prediction $\hat{x}_i$, a mean-squared error (MSE) loss is applied:

$\mathcal{L}_{\mathrm{rec}} = \frac{1}{|\mathcal{M}|}\sum_{i\in \mathcal{M}} \|x_i - \hat{x}_i\|^2$

Optionally, targets are normalized to zero mean and unit variance per patch (Girdhar et al., 2022, Tong et al., 2022).
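The loss above, together with the optional per-patch target normalization, can be sketched in plain Python as follows. The function names and the `eps` stabilizer are assumptions made for illustration, and nested lists stand in for tensors; many implementations also average over the pixel dimension, which only rescales the loss.

```python
import math


def normalize_patch(patch, eps=1e-6):
    """Shift and scale one flattened patch to zero mean and unit variance."""
    mean = sum(patch) / len(patch)
    var = sum((v - mean) ** 2 for v in patch) / len(patch)
    std = math.sqrt(var + eps)  # eps avoids division by zero on flat patches
    return [(v - mean) / std for v in patch]


def masked_mse(targets, preds, masked_idx, norm_targets=False):
    """Mean over masked tubelets of the squared L2 error per tubelet.

    targets, preds: lists of flattened tubelets (lists of floats).
    masked_idx: indices of the masked tubelets; visible ones are ignored.
    """
    if not masked_idx:
        raise ValueError("masked_idx must not be empty")
    total = 0.0
    for i in masked_idx:
        target = normalize_patch(targets[i]) if norm_targets else targets[i]
        total += sum((a - b) ** 2 for a, b in zip(target, preds[i]))
    return total / len(masked_idx)


targets = [[0.0, 0.0], [1.0, 3.0], [2.0, 2.0]]
preds = [[9.0, 9.0], [1.0, 1.0], [0.0, 2.0]]
# Token 0 is visible, so its large error does not contribute to the loss.
print(masked_mse(targets, preds, masked_idx=[1, 2]))
```

Restricting the sum to masked indices is what makes the task reconstruction rather than plain autoencoding: errors on visible tubelets never enter the objective.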

No Contrastive or Classification Losses: The objective consists entirely of local pixel reconstruction; no contrastive, ranking, or classification losses are used during pretraining (Girdhar et al., 2022, Feichtenhofer et al., 2022).
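To make the masking step and its effect on encoder cost concrete, here is a small Python sketch. It draws a fixed-size random subset of tokens, which is one common reading of random masking, and uses a standard back-of-the-envelope cost model for a transformer layer. The function names, the width of 768 (ViT-B), and the cost model are assumptions for illustration; the estimate covers the encoder only and ignores the decoder and embedding layers, so it is not directly comparable to the reported end-to-end speedups.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) using a random subset."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_masked = int(num_tokens * mask_ratio)
    perm = list(range(num_tokens))
    rng.shuffle(perm)  # uniform random permutation of token positions
    masked = sorted(perm[:num_masked])
    visible = sorted(perm[num_masked:])
    return visible, masked


def encoder_layer_cost(num_tokens, dim):
    """Approximate multiply-accumulates for one transformer layer.

    12 * n * d^2 covers the QKV/output projections and a 4x-wide MLP;
    2 * n^2 * d covers the attention scores and the weighted sum of values.
    """
    return 12 * num_tokens * dim**2 + 2 * num_tokens**2 * dim


visible, masked = random_mask(num_tokens=1568, mask_ratio=0.9, rng=random.Random(0))
full = encoder_layer_cost(1568, 768)
sparse = encoder_layer_cost(len(visible), 768)
print(len(visible), len(masked), round(full / sparse, 1))
```

Under these assumptions, 157 of 1568 tokens remain visible at a 90% ratio and the per-layer encoder cost drops by roughly 13×. The quadratic attention term shrinks faster than the linear terms, which is why the saving slightly exceeds the 10× reduction in token count.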

3. Training Protocols and Implementation Details

Self-Supervised Datasets: Pretraining is performed on large-scale, unlabeled video datasets such as Kinetics-400, Something-Something v2, and (occasionally) real-world uncurated corpora (Feichtenhofer et al., 2022, Girdhar et al., 2022).
Hyperparameters:
- Optimizer: AdamW (weight decay $0.05$, $\beta_1 = 0.9$ , $\beta_2 = 0.95$ )
- LR and Schedule: Typical peak learning rates: $1.5 \times 10^{-4}$ – $3 \times 10^{-4}$ , cosine decay, linear warmup for initial epochs
- Batch Size: $1024$–$2048$, with video sample replication (often $4\times$ ) to alleviate data loading bottlenecks (Girdhar et al., 2022).
- Epochs: Pretraining generally runs for $800$–$2400$ epochs
Data Handling: For video, $T=16$ frames are sampled per clip with a moderate temporal stride (e.g., an effective $6$ FPS) and spatially resized to $224 \times 224$; images are center-cropped to $224 \times 224$ (Girdhar et al., 2022).
Decoder Dimensions: The decoder is usually much smaller than the encoder (depth $4$–$8$ layers, dimension $384$–$512$), limiting parameter overhead and focusing model capacity on the encoder’s representations (Girdhar et al., 2022, Tong et al., 2022).
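The optimizer settings and learning-rate schedule listed above could be captured along the following lines. This is a minimal sketch: the dictionary keys and the function name are hypothetical, the specific values are picked from the ranges quoted in the text, and the warmup length of 40 epochs is an assumption, since the text only says "initial epochs".

```python
import math

# Values taken from the ranges in the text, except where marked as assumed.
PRETRAIN_CONFIG = {
    "optimizer": "AdamW",
    "weight_decay": 0.05,
    "betas": (0.9, 0.95),
    "peak_lr": 1.5e-4,
    "batch_size": 1024,
    "epochs": 800,
    "warmup_epochs": 40,  # assumption: not specified in the text
    "decoder_depth": 4,
    "decoder_dim": 384,
}


def lr_at_epoch(epoch, total_epochs, warmup_epochs, peak_lr, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if total_epochs <= warmup_epochs:
        raise ValueError("total_epochs must exceed warmup_epochs")
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


cfg = PRETRAIN_CONFIG
for epoch in (0, 20, 40, 420, 800):
    lr = lr_at_epoch(epoch, cfg["epochs"], cfg["warmup_epochs"], cfg["peak_lr"])
    print(epoch, f"{lr:.2e}")
```

The schedule reaches the peak rate exactly at the end of warmup and falls to half the peak at the midpoint of the cosine phase (epoch 420 with these settings).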

4. Empirical Results and Ablations

A unified MAE-ViT achieves state-of-the-art or competitive performance across a spectrum of video benchmarks:

Model	SSv2 Top-1	K400 Top-1	UCF101	HMDB51
VideoMAE-B	70.8%	80.0%	91.3%	62.6%
VideoMAE-L	74.0–75.4%	82.8–87.4%	—	—
OmniMAE-B	69.0%	80.8%	—	—
OmniMAE-L	74.2%	84.0%	—	—
OmniMAE-H	75.5%	85.4%	—	—

Mask Ratio Ablations: Performance peaks at 90% (videos) and remains robust up to 95%, but degrades above that threshold (Girdhar et al., 2022, Tong et al., 2022). For images, best results are at 75%–90%.
Mask Type Ablations: Random mask sampling across both space and time yields the best accuracy, outperforming tube, frame, and causal mask types, especially at high mask rates (Girdhar et al., 2022, Feichtenhofer et al., 2022).
Decoder Size: Empirically, $4$ layers and $384$–$512$ hidden dimensions are optimal; expanding depth or size does not yield further gains (Tong et al., 2022, Feichtenhofer et al., 2022).
Data Efficiency: MAE-ViT demonstrates remarkable data efficiency, attaining strong transfer even on small datasets (UCF101, HMDB51) and showing robustness under domain shift (Tong et al., 2022, Feichtenhofer et al., 2022).
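For readers unfamiliar with the structured mask types named in the ablations, the sketch below builds tube and frame masks as boolean grids (time steps by spatial positions, `True` meaning masked). The function names and grid representation are assumptions for illustration and are not taken from the cited papers; random masking, by contrast, would sample positions over the whole grid without any shared structure.

```python
import random


def tube_mask(time_steps, spatial, ratio, rng):
    """Mask the same spatial positions at every time step."""
    chosen = set(rng.sample(range(spatial), int(spatial * ratio)))
    row = [s in chosen for s in range(spatial)]
    return [list(row) for _ in range(time_steps)]


def frame_mask(time_steps, spatial, ratio, rng):
    """Mask entire time steps; the remaining ones stay fully visible."""
    chosen = set(rng.sample(range(time_steps), int(time_steps * ratio)))
    return [[t in chosen] * spatial for t in range(time_steps)]


def masked_fraction(grid):
    """Fraction of grid entries that are masked."""
    total = sum(len(row) for row in grid)
    return sum(sum(row) for row in grid) / total


rng = random.Random(0)
# 8 temporal slots and 14 * 14 = 196 spatial positions, as for a
# 16 x 224 x 224 clip with 2 x 16 x 16 tubelets.
tube = tube_mask(8, 196, 0.9, rng)
frame = frame_mask(8, 196, 0.75, rng)
print(round(masked_fraction(tube), 3), round(masked_fraction(frame), 3))
```

A tube mask hides the same locations throughout the clip, while a frame mask leaves a few time steps fully visible, so the two schemes expose very different context to the encoder at the same nominal ratio.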

5. Extensions, Variants, and Generalizations

Several developments have built upon the MAE-ViT paradigm:

Unified Modal Training: OmniMAE demonstrates that a single backbone can be pretrained jointly on images and videos using identical transformers and objectives, treating images as $T=1$ videos. This enables shared training and downstream adaptation across modalities with a single set of parameters (Girdhar et al., 2022).
Motion-Aware and Multi-Target Variants: Related architectures incorporate explicit motion prediction targets (e.g., as in MotionMAE) or leverage auxiliary modalities such as optical flow, but canonical MAE-ViT relies solely on raw RGB reconstruction (see Yang et al., 2022, for motion-aware variants).
Efficient and Dynamic Masking: Techniques such as EVEREST use information-centric token selection to retain only the most informative (e.g., high-motion) tokens and discard redundant ones, further reducing computational cost while retaining accuracy (Hwang et al., 2022).
Blockwise Training: Recent work explores blockwise, locally supervised training (BWSSL) in place of global end-to-end backpropagation, revealing that local masked-reconstruction supervision can yield comparable final representations with unique depth-wise feature emergence (Römer et al., 14 Jan 2026).

6. Significance and Broader Impact

The masked autoencoding video transformer framework demonstrates that generic transformers, when equipped with spatiotemporal patchification and extreme random masking, can self-supervise video representation learning at scale without video-specific architectural biases or explicit temporal supervision. This approach matches or outperforms supervised and handcrafted self-supervised pipelines across multiple semantic and scene-understanding tasks, reduces pretraining compute several-fold (roughly $4\times$ to $12\times$ fewer FLOPs, since the encoder processes only visible tokens), and is robust to noisy or uncurated data sources (Tong et al., 2022, Feichtenhofer et al., 2022, Girdhar et al., 2022). A key lesson is that the redundancy of video data enables aggressive masking, making self-supervised pixel reconstruction both feasible and highly effective for spatiotemporal modeling.

Ongoing work explores motion-specific targets, multi-modal and cross-view autoencoding, efficient masking via token scoring, and alternative optimization schemes, suggesting that MAE-ViT will remain a central backbone architecture for unified video representation learning.