Masked Autoregressive Pretraining (MAP)

Updated 2 February 2026
  • Masked Autoregressive Pretraining (MAP) is a self-supervised paradigm that combines random masking with sequential prediction to capture complex intra-token dependencies.
  • MAP integrates techniques from Masked Image Modeling and autoregressive sequence modeling to overcome distribution gaps, enabling improved fine-tuning outcomes.
  • By leveraging architectural innovations in both Transformer and state-space networks, MAP achieves robust performance across diverse modalities including images, 3D, and video.

Masked Autoregressive Pretraining (MAP) defines a family of self-supervised representation learning paradigms for vision models, generalizing Masked Image Modeling (MIM) by integrating masking, autoregressive sequence modeling, and architectural innovations tailored to both Transformer and state-space (Mamba) networks. MAP objectives explicitly leverage random masking and sequential (autoregressive) prediction to capture complex intra-token dependencies and close the statistical gap between pretraining and fine-tuning or downstream transfer.

1. Foundations: From MIM and Permuted Prediction to MAP

Traditional Masked Image Modeling (MIM) pretraining, as typified by methods such as BEiT and SimMIM, involves randomly masking a subset of image patches and training a Vision Transformer (ViT) encoder to reconstruct the original discretized tokens of the masked regions. Formally, given an image $x \in \mathbb{R}^{H \times W \times C}$, tokenized into $N$ patches $\{v_1, \dots, v_N\} = \mathcal{T}(x)$ via a tokenizer $\mathcal{T}$, and a mask set $M \subset \{1, \dots, N\}$, the objective is

$$L_\text{MIM}(\theta) = - \sum_{x \in \mathcal{D}} \sum_{i \in M} \log p_\theta(v_i \mid x_\text{masked}),$$

where masked positions in $x_\text{masked}$ are replaced with a learnable [MASK] embedding (Baraldi et al., 2023).
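The shape of the MIM objective can be made concrete with a minimal numpy toy: random logits stand in for $p_\theta$, and the names (`tokens`, `mask`, `loss_mim`) are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

N, V = 16, 8                                      # patches per image, toy vocabulary size
tokens = rng.integers(0, V, size=N)               # discrete targets v_1..v_N from the tokenizer
mask = rng.choice(N, size=N // 2, replace=False)  # mask set M (here 50% of patches)

# Toy stand-in for p_theta: a log-softmax over the vocabulary from random logits.
logits = rng.normal(size=(N, V))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# L_MIM: negative log-likelihood of the true tokens, summed over masked positions only.
loss_mim = -sum(log_probs[i, tokens[i]] for i in mask)
print(float(loss_mim))
```

Note that each masked token contributes an independent term, which is exactly the conditional-independence limitation discussed next.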

While MIM is effective, it introduces two primary issues: masked tokens do not occur during fine-tuning (causing a distribution shift), and predictions for individual tokens are conditionally independent, resulting in under-modeled correlations between masked patches.

Permutation-based image modeling (PIM), inspired by the autoregressive XLNet approach, replaces mask tokens by randomly permuting the patch order and forcing the model to predict tokens sequentially, but at the cost of losing consistent positional context and introducing a new form of distribution shift (Baraldi et al., 2023).

MAP, in both image and hybrid backbone settings, resolves these limitations by (1) applying autoregressive prediction over randomly masked token sequences and (2) injecting absolute position cues through explicit positional embeddings, thus eliminating input distribution discrepancies and enabling modeling of inter-patch dependencies (Baraldi et al., 2023, Liu et al., 2024).

2. Formal MAP Learning Objectives

The MAP training objective unifies the local reconstructive capabilities of MAE with the global sequential-order modeling characteristic of AR approaches. For a generic hybrid Mamba–Transformer network, let $\mathcal{M} \subset \{1, \dots, N\}$ denote the mask set of proportion $p$ (typically $p = 0.5$), and let $z = f_\text{enc}(\{x_i : i \notin \mathcal{M}\})$ represent the encoder output over visible patches. The decoder reconstructs masked patches autoregressively in a row-wise ordering given by row sets $\mathcal{R}_1, \dots, \mathcal{R}_R$, predicting each $\hat{x}_i^{(r)}$ for $i \in \mathcal{R}_r \cap \mathcal{M}$ as

$$\hat{x}_i^{(r)} = f_\text{dec}\bigl(z,\ \{\hat{x}_j^{(s)} : s < r,\, j \in \mathcal{R}_s\} \cup \{x_j : j \notin \mathcal{M}\}\bigr).$$

The loss is the normalized mean squared error (MSE):

$$\mathcal{L}_\text{MAP} = \frac{1}{|\mathcal{M}|} \sum_{r=1}^{R} \sum_{i \in \mathcal{R}_r \cap \mathcal{M}} \| x_i - \hat{x}_i^{(r)} \|_2^2.$$

This formulation supports both image and 3D input modalities and allows continuous or discrete prediction targets (Liu et al., 2024).
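The row-wise autoregressive reconstruction can be sketched as follows. Here `toy_decoder` is a hypothetical stand-in for $f_\text{dec}$ (it just averages its context); the point is only to make the conditioning structure and the $1/|\mathcal{M}|$-normalized MSE concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

R, C, D = 4, 4, 8                      # rows, columns, patch dimension
N = R * C
patches = rng.normal(size=(N, D))      # x_1..x_N
masked = rng.random(N) < 0.5           # mask M, proportion p ≈ 0.5
rows = [list(range(r * C, (r + 1) * C)) for r in range(R)]  # row sets R_1..R_R

def toy_decoder(context):
    """Hypothetical stand-in for f_dec: predicts a patch as the mean of its context."""
    return np.mean(list(context.values()), axis=0)

# Row-wise AR reconstruction: row r conditions on the visible patches plus the
# predictions already made for rows s < r.
context = {i: patches[i] for i in range(N) if not masked[i]}
sq_err = 0.0
for row in rows:
    preds = {i: toy_decoder(context) for i in row if masked[i]}
    sq_err += sum(np.sum((patches[i] - p) ** 2) for i, p in preds.items())
    context.update(preds)              # predictions feed the next row's context

loss_map = sq_err / masked.sum()       # normalize by |M|
print(float(loss_map))
```

Within a row, all masked patches share the same context, matching the formula; only across rows does the conditioning grow.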

The alternative permutation-based masked autoregressive objective for ViT backbones, as in MaPeT, is

$$L_\text{MaPeT}(\theta) = - \sum_{x \in \mathcal{D}} \mathbb{E}_{z \sim \text{Perm}(N)} \left[ \sum_{t = c+1}^{N} \log p_\theta(v_{z_t} \mid x_{z_{<t}}, M_{z_{\geq t}}) \right],$$

where $M_{z_{\geq t}}$ denotes positional mask tokens for future slots, ensuring the model retains absolute spatial information throughout (Baraldi et al., 2023).

For video, the frame-wise masked autoregressive objective (as in VideoMAP) generalizes MAP to temporal context:

$$L(\theta, \phi) = \sum_{t=1}^{T-1} \| \hat{Z}_{t+1} - Z_{t+1} \|_2^2,$$

where the model autoregressively predicts the CLIP feature embedding $Z_{t+1}$ of frame $F_{t+1}$ from the previous masked frames $[F_1, \dots, F_t]$ (Liu et al., 16 Mar 2025).
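The frame-wise objective reduces to a sum of next-frame squared errors; a minimal sketch, with `toy_predictor` a hypothetical stand-in for the AR decoder and random vectors in place of real CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 32                           # frames, embedding dimension
targets = rng.normal(size=(T, D))      # Z_1..Z_T: per-frame CLIP-style embeddings

def toy_predictor(history):
    """Hypothetical stand-in for the AR decoder: predicts Z_{t+1} from Z_1..Z_t."""
    return history.mean(axis=0)

# Frame-wise AR loss: squared error between predicted and true next-frame
# embeddings, summed over t = 1..T-1.
loss = sum(
    np.sum((toy_predictor(targets[:t]) - targets[t]) ** 2)
    for t in range(1, T)
)
print(float(loss))
```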

3. Architectural Innovations and Adaptations

MAP is not tied to any single network family; its developments span pure ViT, hybrid Mamba–Transformer, and video-specific backbones.

Image Domains

  • Hybrid Mamba–Transformer Architecture: Interleaving State-Space Mamba blocks and Transformer blocks (e.g., a "MMMT" pattern) enables bidirectional long-context modeling (Mamba) and local attention (Transformer), maintaining full patch resolution throughout all layers (Liu et al., 2024).
  • Two-Stream Self-Attention and Positional Side-Paths: As in MaPeT, a two-stream structure per layer maintains a content and a query stream, the latter being attention-masked to only see prior tokens and positional placeholders, thus forbidding information leakage from unpredicted target patches (Baraldi et al., 2023).
  • Autoregressive Row-Wise Decoder: The Transformer decoder enforces sequential generation on a row-wise basis, consistent with the preferred scanning order of state-space blocks and empirical ablation results (Liu et al., 2024).

Video MAP

  • Hybrid Video Encoders: VideoMAP employs a 4:1 interleaving of Mamba and Transformer blocks, balancing computational efficiency (O(N) in Mamba) with global modeling (O(N²) via periodic Transformer layers). Different encoder widths—VideoMAP-M/B/L—are obtained by scaling both depth and embedding dimension (Liu et al., 16 Mar 2025).
  • Frame-wise AR Loss: An autoregressive decoder observes all representations up to frame tt to predict the CLIP embedding of frame t+1t+1 (Liu et al., 16 Mar 2025).
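The interleaving patterns above (the 3:1 "MMMT" image pattern and VideoMAP's 4:1 video pattern) amount to a simple repeated layer specification. The following `hybrid_pattern` helper is purely illustrative, not an API from the cited works:

```python
def hybrid_pattern(num_groups, mamba_per_group=3):
    """Build a layer spec of repeated groups: `mamba_per_group` Mamba ('M')
    blocks followed by one Transformer ('T') block per group."""
    return (["M"] * mamba_per_group + ["T"]) * num_groups

image_spec = hybrid_pattern(8)                      # "MMMT" x 8 image pattern
video_spec = hybrid_pattern(6, mamba_per_group=4)   # 4:1 VideoMAP interleaving
print("".join(image_spec[:8]))                      # prints "MMMTMMMT"
```

The 4:1 ratio keeps most layers at O(N) Mamba cost while periodic Transformer layers supply global O(N²) attention.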

4. Masking Strategies and Target Tokenization

Effective masking and prediction targets are crucial for the success of MAP.

  • Random Masking: Randomly masking 50% of patches is empirically optimal over diagonal or blockwise masks for both images and point clouds, improving context diversity and reconstruction challenge (Liu et al., 2024).
  • Row-Wise AR Mask: The decoder attention pattern enforces that all tokens in row $r$ are predicted only after all rows $s < r$ have been predicted. Ablations confirm this order (rather than purely sequential or blockwise AR) yields superior results, improving accuracy by up to +2.9% in some hybrid settings (Liu et al., 2024).
  • Target Construction
    • Pixel Patches: For general hybrid and Mamba-based models, normalized original pixel patches are used as reconstruction targets.
    • Discrete k-CLIP Tokens: In MaPeT, CLIP visual features are discretized via k-means (k=8192), bypassing the need for an auxiliary VAE (as used for DALL·E tokens) while retaining language-aligned semantics. The resulting discrete indices form the prediction targets, supporting dataset-agnostic and transferable training (Baraldi et al., 2023, Liu et al., 16 Mar 2025).
    • CLIP Embeddings (VideoMAP): For video, the prediction target is the global CLIP embedding of the next frame, emphasizing both spatial and semantic continuity (Liu et al., 16 Mar 2025).
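The random-masking and row-wise scheduling strategies above can be sketched together; function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_patches, ratio=0.5):
    """Uniformly mask `ratio` of the patch indices (50% is the empirically optimal default)."""
    num_masked = int(num_patches * ratio)
    return set(rng.permutation(num_patches)[:num_masked].tolist())

def row_wise_schedule(rows, cols, mask):
    """Group masked indices by row: row r is predicted only after all rows < r."""
    return [[i for i in range(r * cols, (r + 1) * cols) if i in mask]
            for r in range(rows)]

mask = random_mask(16)                 # 8 of 16 patches masked
schedule = row_wise_schedule(4, 4, mask)
print(schedule)
```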

5. Experimental Protocols and Results

MAP has been validated extensively in both 2D and 3D vision and video representation learning.

ImageNet-1K and 3D ShapeNet

  • Training Details: MAP pretraining is performed on random crops of ImageNet-1K (or ShapeNet for 3D), mask ratio 0.5, AdamW optimizer (weight decay 0.05), cosine learning rate schedule. Fine-tuning uses standard augmentations and longer schedules (e.g., 1600 epochs pretraining, 400 for fine-tuning) (Liu et al., 2024).
  • Benchmark Results:
    • On ImageNet-1K (224x224), a HybridMH-B+MAP model achieves 84.9% top-1 accuracy, surpassing both MAE- and AR-pretrained baselines of comparable size (by +1.8%) (Liu et al., 2024).
    • In point cloud classification, HybridMT3D+MAP reaches 95.4% on ModelNet40 and 92.95% on ScanObjectNN, outperforming hybrid and Mamba3D-only variants without MAP (Liu et al., 2024).
    • MaPeT yields 84.4% top-1 (ViT-Base, VQ-KD tokenizer), exceeding CAE and BEiT and matching models trained for 2–5× longer (Baraldi et al., 2023).
  • Video Understanding:
    • VideoMAP-L (301M params) achieves 88.3% on Kinetics-400, scales stably to high capacity, and demonstrates strong sample efficiency (100 epochs of MAP match 800 epochs of non-hybrid pretraining). MAP-integrated hybrid encoders also reduce the memory footprint of deep video models by up to 40% (Liu et al., 16 Mar 2025).

6. Ablation Studies and Analysis

MAP’s design choices have been dissected via systematic ablations:

  • Masking Pattern: Random, 50% masking is optimal over diagonal or sequential variants.
  • Decoder AR Order: Row-wise AR decoding matches the scanning preference of Mamba and delivers up to +2.9% accuracy gains for hybrid networks (Liu et al., 2024).
  • Hybrid Block Placement: Patterns such as "MMMT ×8" maximize accuracy with minimal parameter growth.
  • Reconstruction Target: Pure MSE loss outperforms alternatives (e.g., diffusion or adversarial losses) by a notable margin.
  • Application Breadth: MAP offers tangible gains over MAE and AR regardless of backbone (pure ViT, pure Mamba, hybrid), with additional cross-domain robustness (cars, food, aircraft datasets), and can extend naturally to 3D modalities without special adaptations (Baraldi et al., 2023, Liu et al., 2024).

7. Significance and Impact

MAP establishes a versatile pretraining paradigm, unifying local reconstruction and sequential prediction with masking for both standard vision transformers and emerging hybrid backbones. By bridging masking and autoregressive learning while retaining explicit positional context, MAP resolves both key MIM limitations—independent predictions and input distribution shift at fine-tuning—and supports sample- and computationally efficient scalability to large models, multi-modalities (2D/3D/video), and diverse downstream tasks. Experimental results confirm that MAP-based models set new state-of-the-art performance under comparable model and data regimes, and architectural innovations such as the Mamba–Transformer hybrid with MAP enable robust and memory-efficient vision encoders for modern large-scale applications (Liu et al., 2024, Liu et al., 16 Mar 2025, Baraldi et al., 2023).


Table: Performance Summary on Major Benchmarks

| Model/Setting | Top-1 Acc. (ImageNet) | ModelNet40 | ScanObjNN | Params |
|---|---|---|---|---|
| HybridMH-B+MAP (224 px) | 84.9% | – | – | 128M |
| ViT-B+MAE | 83.6% | – | – | 86M |
| Mamba3D+MAP | – | 95.1% | 92.65% | – |
| HybridMT3D+MAP | – | 95.4% | 92.95% | – |
| MaPeT (ViT-B, VQ-KD) | 84.4% | – | – | – |
| VideoMAP-L (Kinetics-400) | 88.3% (video) | – | – | 301M |

MAP’s influence spans both theoretical advancement—closing the train/test gap and enhancing token dependency modeling—and practical utility—enabling robust, resource-efficient, and high-performing pretraining for diverse vision architectures (Liu et al., 2024, Baraldi et al., 2023, Liu et al., 16 Mar 2025).
