Siamese & Crop-Based MAEs
- SiamMAE and CropMAE are two-branch masked autoencoders that learn object-centric features from paired video frames (SiamMAE) or paired random crops of still images (CropMAE).
- Both rely on a strict reconstruction objective with very high masking ratios (95%–98.5%) and cross-attention within a Vision Transformer framework.
- Empirical results show CropMAE offers improved scalability and faster pre-training on still images, matching or surpassing SiamMAE in key segmentation and propagation tasks.
Siamese and crop-based Masked Autoencoders (MAEs) are recent advances in self-supervised image encoder pre-training, building on the masked autoencoding paradigm and leveraging novel strategies for view generation, masking, and reconstruction. Notably, the Siamese Masked Autoencoder (SiamMAE) introduced object-centric pre-training using pairs of video frames, while Crop-based MAE (CropMAE) extends this approach using paired random crops from still images, removing the dependency on video data and increasing masking ratios. These methods target object-centric feature learning, efficient representation, and practical scalability within the context of Vision Transformer (ViT) architectures (Eymaël et al., 2024).
1. Architectural Principles: Two-Branch Masked Autoencoding
Both SiamMAE and CropMAE feature a two-branch (Siamese) design based on the Vision Transformer. The workflow involves:
- Generation of two separate “views” of the data—a pair of temporally-separated frames for SiamMAE, or two differently cropped images from the same base image for CropMAE.
- Each view is patchified (converted into a sequence of patches) and input to a shared-weight ViT encoder.
- The first view remains largely unmasked (reference view); the second view is heavily masked (reconstruction target).
- The shared transformer decoder processes (i) the visible plus masked tokens of the second view and (ii) the full-token sequence of the first view.
- The decoder alternates self-attention on the second view’s tokens with cross-attention into the reference tokens, facilitating the reconstruction of masked regions.
Let $x_1$ and $x_2$ denote the patch-token sequences for the two views, and let $\tilde{x}_2$ be the masked version of $x_2$. The shared encoder $f_\theta$ computes $z_1 = f_\theta(x_1)$ and $z_2 = f_\theta(\tilde{x}_2)$. The decoder receives both $z_1$ and $z_2$, using cross-attention between these streams to reconstruct the masked patches, which are linearly mapped to pixel values (Eymaël et al., 2024).
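The two-branch flow above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: `encode` stands in for the shared ViT encoder, `cross_attend` for a single decoder cross-attention step, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 32                     # patches per view, token dimension (toy values)

def encode(x, W):                  # stand-in for the shared-weight ViT encoder
    return x @ W

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_tokens, kv_tokens):
    # single-head attention: masked-view queries attend to reference tokens
    attn = softmax(q_tokens @ kv_tokens.T / np.sqrt(D))
    return attn @ kv_tokens

W = rng.normal(size=(D, D))        # shared weights used by both branches
x1 = rng.normal(size=(N, D))       # reference view: all tokens kept
x2 = rng.normal(size=(N, D))       # target view: heavily masked

keep = rng.choice(N, size=2, replace=False)   # 98.5% masking: 2/196 visible
x2_masked = np.zeros_like(x2)                 # zero vector as a toy mask token
x2_masked[keep] = x2[keep]

z1 = encode(x1, W)                 # full reference token sequence
z2 = encode(x2_masked, W)          # visible + mask tokens of the second view
recon = cross_attend(z2, z1)       # decoder step: cross-attention into reference
print(recon.shape)                 # one reconstructed token per patch
```

In the actual architecture this cross-attention alternates with self-attention over the second view's tokens, and a final linear head maps tokens back to pixel values.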
2. Data View Generation: Frames vs. Crops
The key methodological distinction between SiamMAE and CropMAE is in view generation:
- SiamMAE: Samples two frames from a video, with the temporal gap drawn from a given range (e.g., 4–48 frames). The first frame ($f_1$) is the reference; the second ($f_2$) receives 95% masking.
- CropMAE: Operates on individual still images $I$, producing two views:
- $V_1$ (global): a large crop from $I$
- $V_2$ (local): a smaller crop fully inside $V_1$
- This process is formalized using prescribed crop area proportions and aspect ratios. An optional random horizontal flip is applied to both.
- 98.5% of $V_2$'s patches are masked, leaving only 2 out of 196 visible for ViT/16.
Pseudocode excerpt for CropMAE view generation:
```
function make_crops(I):
    # Global-to-Local strategy
    V1 = RandomResizedCrop(I,  area ∈ [a, c], aspect ∈ [3/4, 4/3])
    V2 = RandomResizedCrop(V1, area ∈ [b, d], aspect ∈ [3/4, 4/3])
    if rand() < 0.5:
        V1, V2 = flip_horizontal(V1), flip_horizontal(V2)
    return V1, V2
```
The Global-to-Local crop strategy consistently yields the best performance (Eymaël et al., 2024).
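As a runnable, stdlib-only illustration of the crop geometry (not the paper's code), the sketch below samples RandomResizedCrop-style boxes; the numeric area ranges are placeholders standing in for the paper's $[a, c]$ and $[b, d]$:

```python
import math
import random

def sample_crop(width, height, area_range, ratio_range=(3/4, 4/3), rng=random):
    """Sample a crop box (x, y, w, h) with RandomResizedCrop-style geometry."""
    for _ in range(10):                         # rejection sampling on valid boxes
        area = rng.uniform(*area_range) * width * height
        log_r = rng.uniform(math.log(ratio_range[0]), math.log(ratio_range[1]))
        ratio = math.exp(log_r)                 # aspect ratio sampled log-uniformly
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            x = rng.randint(0, width - w)
            y = rng.randint(0, height - h)
            return x, y, w, h
    s = min(width, height)                      # fallback: centered square crop
    return (width - s) // 2, (height - s) // 2, s, s

def make_crops(width, height):
    # Global-to-Local: V2 is sampled inside V1, so it is fully contained in V1.
    # Area ranges below are illustrative placeholders, not the paper's values.
    x1, y1, w1, h1 = sample_crop(width, height, area_range=(0.3, 1.0))
    x2, y2, w2, h2 = sample_crop(w1, h1, area_range=(0.2, 1.0))
    v1 = (x1, y1, w1, h1)
    v2 = (x1 + x2, y1 + y2, w2, h2)             # V2 in base-image coordinates
    return v1, v2

v1, v2 = make_crops(224, 224)
print("V1 box:", v1, "V2 box:", v2)
```

Because the second box is sampled in the first box's coordinate frame, containment holds by construction, which is what lets reconstruction proceed by spatial propagation from the reference view.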
3. Masking Ratios and Reconstruction Objective
Masking is central in both strategies. The masking ratio is defined as $r = |M| / N$, with $N$ the total number of patches and $M$ the set of masked patch indices.
| Model | Mask Ratio | # Visible Patches (ViT/16) |
|---|---|---|
| SiamMAE | 0.95 | 9/196 |
| CropMAE | 0.985 | 2/196 |
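The arithmetic behind the table is straightforward: a ViT/16 on a 224×224 input yields 14×14 = 196 patches, and the number of visible patches is what remains after masking $\lceil rN \rceil$ of them.

```python
import math

n_patches = (224 // 16) ** 2                      # 14 * 14 = 196 patches for ViT/16
for name, ratio in [("SiamMAE", 0.95), ("CropMAE", 0.985)]:
    visible = n_patches - math.ceil(n_patches * ratio)
    print(f"{name}: {visible}/{n_patches} patches visible")
```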
Reconstruction trains a pure denoising autoencoder objective using the mean squared ($L_2$) error over the masked patches: $\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$, where $\hat{x}_i$ is the decoder's prediction for patch $i$.
No contrastive components or momentum encoders are included; the loss remains a strict reconstruction term (Eymaël et al., 2024).
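The loss can be written as a short numpy sketch: mean squared error computed only over the masked patch indices $M$, with per-patch targets treated as flattened pixel vectors.

```python
import numpy as np

def masked_mse(pred, target, masked_idx):
    # L2 reconstruction error averaged over masked patches only
    diff = pred[masked_idx] - target[masked_idx]
    return float((diff ** 2).mean())

rng = np.random.default_rng(0)
target = rng.normal(size=(196, 768))       # 196 patches, 16*16*3 pixels each
pred = target.copy()
masked_idx = np.arange(2, 196)             # all but the 2 visible patches
pred[masked_idx] += 0.1                    # constant error on masked patches only
print(masked_mse(pred, target, masked_idx))
```

Visible patches contribute nothing to the loss, consistent with the strict-reconstruction setup: no contrastive term, no momentum encoder.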
4. Training Regimen and Hyperparameters
Both architectures share a comparable configuration for fair performance analysis:
- Encoder: ViT-S/16 (384-dim, 12 blocks, shared weights)
- Decoder: 4 layers (depth=4), model-dim=256, FFN=2048, alternates self- and cross-attention
- Mask ratios: 0.95 (SiamMAE), 0.985 (CropMAE)
- Optimizer: AdamW, weight decay=0.05
- Learning rate: cosine decay schedule with a 10-epoch warmup
- Batch size: 2048 (effective)
- Epochs: 400 for direct comparison (SiamMAE originally used 2000 on K400)
- Data: Repeated sampling of video clips (SiamMAE) or single images (CropMAE)
CropMAE does not require video decoding, leading to substantial reductions in wall-clock pre-training time (Eymaël et al., 2024).
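The stated schedule (linear warmup for 10 epochs, then cosine decay over the remaining epochs) can be sketched as below; `base_lr` is a placeholder, since the paper's base rate is not reproduced here.

```python
import math

def lr_at(epoch, base_lr=1.5e-4, warmup=10, total=400):
    """Per-epoch learning rate: linear warmup, then cosine decay to zero."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup          # linear ramp to base_lr
    progress = (epoch - warmup) / (total - warmup)     # in [0, 1) after warmup
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

for e in (0, 9, 10, 200, 399):
    print(e, lr_at(e))
```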
5. Empirical Results Across Benchmarks
Benchmarks include DAVIS '17 video object segmentation ($\mathcal{J}\&\mathcal{F}$), VIP semantic part propagation (mIoU), and JHMDB human-pose propagation ([email protected]):
| Pre-training (400 ep) | DAVIS ’17 ($\mathcal{J}\&\mathcal{F}$) | VIP (mIoU) | JHMDB ([email protected]) |
|---|---|---|---|
| SiamMAE on K400 | 57.9 | 33.2 | 46.1 |
| CropMAE on K400 | 58.6 | 33.7 | 42.9 |
| CropMAE on IN Sub (images) | 60.4 | 33.3 | 43.6 |
Pre-training efficiency: CropMAE is 1.3× faster than SiamMAE on K400, and faster still when trained on still images (IN Sub), due to the avoidance of video decoding and the increased masking (Eymaël et al., 2024).
6. Ablation Studies and Hyperparameter Effects
CropMAE ablation studies, all on DAVIS '17 after 400 image-based pre-training epochs (IN Sub), reveal:
- Cropping strategy: Random (60.0), Global-to-Local (60.4), Same-view (36.6), Local-to-Global (55.9). Global-to-Local yields best results.
- Mask ratio: Peak performance at 98.5% (60.4); suboptimal at lower (e.g., 75%, 45.3) and higher (99%, 58.6) ratios.
- Decoder depth/dim: Shallow (depth=2, 59.1), default (depth=4, 60.4), deep (depth=8, 57.0); low dimension sufficient (256, 60.4).
- Data augmentation: Horizontal flip only is optimal (60.4); color jitter decreases performance (~56.2).
This suggests that nearly full masking and spatially-informative view generation are critical to the observed performance of CropMAE (Eymaël et al., 2024).
7. Insights, Comparative Analysis, and Practical Implications
A central finding is that explicit object motion is not necessary for object-centric feature emergence: the Siamese reconstruction task with spatially related crops suffices for models to learn “where” a cropped region is located within context and “how” to use reference information for reconstruction. Consequently, CropMAE matches or surpasses SiamMAE without reliance on temporal cues or video data.
Further, the local crop’s strict containment within the global view ensures that reconstruction can proceed via spatial propagation alone, with no need for high-level generative hallucination. The 98.5% masking ratio—the highest demonstrated for effective image reconstruction—forces utilization of long-range image dependencies and results in strong object-centric representations.
Plausible implications include more scalable pre-training pipelines using vast still-image archives and adoption of crop-based MAEs in lieu of video-based methods when efficiency and accessibility are priorities. This paradigm also demonstrates that implicit image transformations (from cropping) can encode the same inductive biases as temporal motion for object-centric self-supervised learning (Eymaël et al., 2024).