Vision MambaMixer (ViM²) Neural Architectures
- Vision MambaMixer (ViM²) is a family of state space model (SSM)-based neural architectures that use selective token and channel mixing for efficient visual data processing.
- It employs the novel HSM-SSD module to compress global context into a latent space, achieving up to 8× throughput improvements with competitive accuracy.
- ViM² models generalize across dimensions, enabling robust performance in 2D image tasks and 3D volumetric analyses for applications such as medical segmentation.
Vision MambaMixer (ViM²) refers to a family of state space model (SSM)-based neural network architectures for visual data, most notably instantiated in EfficientViM-M2, V2M, and MobileViM. These designs systematically exploit the efficiency and content-adaptive expressiveness of SSMs, developing hierarchical, highly hardware-friendly models that scale to large vision tasks with favorable trade-offs of speed, memory, and accuracy. ViM² models consistently demonstrate advances over both attention-based Transformers and prior SSM architectures in linear complexity, selective channel and token mixing, and multi-dimensional generalization, with compelling results on classification, detection, segmentation, and 3D volumetric analysis.
1. Motivation and High-Level Design Principles
ViM² models are motivated by the limitations of quadratic-complexity attention and by the desire to extend recent SSM breakthroughs, especially Mamba, to the multi-dimensional, hierarchical nature of visual data. Early Vision Mamba (ViM) (Zhu et al., 2024) replaced attention with bidirectional SSM blocks operating over flattened 1D image sequences, delivering linear cost and memory savings but incurring large projection costs per layer (scaling with both token count and channel width). ViM² architectures address the following key targets:
- Reduced computational footprint: EfficientViM-M2 (ViM²) restructures channel-mixing operations into a compressed latent hidden-state space whose size is much smaller than the sequence length, shifting the dominant cost from the token dimension to the latent dimension and leading to up to 8× throughput improvements at comparable accuracy (Lee et al., 2024).
- Hierarchical context fusion and selective mixing: ViM² blocks include modules for dual token and channel mixing with data-dependent SSM recurrence, bidirectional or multi-directional scans to respect 2D/3D structure, and multi-stage feature aggregation for both low- and high-level signal preservation (Behrouz et al., 2024, Lee et al., 2024).
- Dimension independence for 3D vision: MobileViM generalizes these principles to 3D spatial layouts with dimension-agnostic and cross-scale MambaMix operations, supporting segmentation of volumetric medical imagery at real-time speeds (Dai et al., 19 Feb 2025).
2. Core Module: Hidden State Mixer-based State Space Duality (HSM-SSD)
The EfficientViM-M2 HSM-SSD layer is the linchpin of ViM²'s approach to token and channel mixing in vision. In contrast to the original NC-SSD formulation, which applies costly projections over the full token sequence, HSM-SSD compresses global context into a small latent hidden state with far fewer states than tokens, on which all subsequent gating and projections are performed:
Architecture-level data flow for HSM-SSD:
- Preprocessing: The input sequence undergoes linear projections producing the SSM parameters B, C, Δ, and the gate z, followed by depthwise convolutions for local mixing.
- SSM Recursion: The discretized recurrence h_t = Ā_t h_{t-1} + B̄_t x_t is applied over the sequence, yielding a compressed hidden state with N states over D channels (N ≪ L).
- Channel Mixing in Hidden State: Replace the standard gating and output projection over the length-L token sequence with their counterparts applied directly to the N-state hidden representation, deferring the expensive output projection until after compression and expanding back to tokens via C.
- Cost Comparison:

| Layer | Original (NC-SSD) | HSM-SSD |
|---|---|---|
| Projections | scale with sequence length L | scale with state count N (N ≪ L) |
| SSM Mix | linear in L | linear in L |
In typical ViM² configurations the number of states N is far smaller than the sequence length L, giving substantial empirical runtime savings (Lee et al., 2024). Proposition 1 guarantees that HSM-SSD recovers NC-SSD under appropriate settings.
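The compression idea above can be illustrated with a minimal NumPy sketch. This is not the EfficientViM implementation; the projection matrices (`w_b`, `w_c`, `w_out`), the tanh gating, and the specific aggregation are illustrative assumptions. The point it demonstrates is structural: the expensive D×D output projection is applied to the compressed (N, D) hidden state rather than the (L, D) token sequence.

```python
import numpy as np

def hsm_ssd_sketch(x, w_b, w_c, w_out):
    """Hedged sketch of HSM-SSD-style mixing.

    x: (L, D) token sequence; N hidden states with N << L.
    Gating and the output projection run in the small latent space.
    """
    B = x @ w_b            # (L, N): per-token state-selection weights
    C = x @ w_c            # (L, N): per-token output-mixing weights
    H = B.T @ x            # (N, D): compress global context into N states
    H = np.tanh(H) * H     # illustrative gating in the latent space
    H = H @ w_out          # (N, D) @ (D, D): deferred output projection
    return C @ H           # (L, D): expand back to the token sequence

L, D, N = 196, 64, 16      # N << L is what drives the savings
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
y = hsm_ssd_sketch(x,
                   rng.standard_normal((D, N)) / D**0.5,
                   rng.standard_normal((D, N)) / D**0.5,
                   rng.standard_normal((D, D)) / D**0.5)
print(y.shape)  # (196, 64)
```

With these shapes, the output projection costs N·D² multiply-adds instead of L·D², a roughly L/N = 12× reduction for that term in this toy setting.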
3. Multi-Stage Hidden State Fusion and Feature Aggregation
To maximally exploit intermediate hierarchical representations, ViM² aggregates class logits from every block/stage's final hidden state:
- For each stage, form a summary of its final hidden state and pass it through a normalized classification head to yield stage-wise logits.
- Learnable scalar weights (softmax-normalized) determine the fusion; the final prediction is the weighted sum of the stage-wise logits.
- Empirically, this "multi-stage fusion" increases ImageNet-1K top-1 by +0.3% with negligible compute (<1%) (Lee et al., 2024).
This approach improves both gradient flow and the expressiveness of learned features, as compared to using only the final stage.
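The fusion rule reduces to a softmax-weighted sum of per-stage logits. The following sketch assumes three stages and hypothetical logit values; in the real model the scalars are learned parameters and the logits come from normalized classification heads.

```python
import numpy as np

def multi_stage_fusion(stage_logits, alphas):
    """Softmax-normalize learnable scalars, then fuse per-stage logits."""
    w = np.exp(alphas - alphas.max())   # stable softmax over stage weights
    w = w / w.sum()
    return sum(wi * li for wi, li in zip(w, stage_logits))

# Hypothetical logits from three stages for a 2-class toy problem.
logits = [np.array([0.2, 0.8]), np.array([1.0, -1.0]), np.array([0.5, 0.5])]
alphas = np.array([0.0, 1.0, 0.5])      # learnable scalars in the real model
pred = multi_stage_fusion(logits, alphas)
print(pred.shape)  # (2,)
```

Because every stage head receives gradient directly from the loss, this also shortens the gradient path to early blocks, which is the mechanism behind the improved gradient flow noted above.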
4. Selective Token and Channel Mixing in ViM² Blocks
In MambaMixer-ViM² (Behrouz et al., 2024), each block includes:
- Selective Token Mixer (STM): Applies S6-based SSM scanning across image tokens in four diagonal directions (TL→BR, TR→BL, BL→TR, BR→TL). The SSM recurrence is performed after a depthwise 2D convolution and gating. The outputs are summed across all directions.
- Selective Channel Mixer (SCM): Implements a bidirectional SSM over the channel dimension, employing separate forward and backward passes with input-dependent weights, then recombining outcomes.
- Weighted Averaging of Earlier Features (WAEF): Each layer's token/channel mixer receives a learned weighted average of outputs from all previous mixers. This extension, reminiscent of DenseNet, improves deep gradient propagation and output calibration.
The complexity per block is linear in both sequence length and channel count, a significant reduction from quadratic-complexity standard self-attention.
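A minimal sketch of the multi-directional token scan: the grid is traversed from each of the four corners and the per-direction causal recurrences are summed. This simplifies the real STM considerably, using a fixed scalar decay instead of input-dependent S6 parameters and row-major corner-to-corner orders instead of true diagonal traversals; only the scan-and-sum structure is faithful.

```python
import numpy as np

def directional_scan(x, order, decay=0.9):
    """Causal linear recurrence h_t = decay*h_{t-1} + x_t over a token order."""
    h = np.zeros_like(x[0])
    out = np.zeros_like(x)
    for t in order:
        h = decay * h + x[t]
        out[t] = h
    return out

def stm_sketch(grid):
    """Sum of scans over four corner-to-corner traversals of an HxWxD grid."""
    H, W, D = grid.shape
    x = grid.reshape(H * W, D)
    idx = np.arange(H * W).reshape(H, W)
    orders = [idx.ravel(),               # TL -> BR
              idx[:, ::-1].ravel(),      # TR -> BL
              idx[::-1, :].ravel(),      # BL -> TR
              idx[::-1, ::-1].ravel()]   # BR -> TL
    y = sum(directional_scan(x, o) for o in orders)
    return y.reshape(H, W, D)

g = np.random.default_rng(1).standard_normal((4, 4, 8))
out = stm_sketch(g)
print(out.shape)  # (4, 4, 8)
```

Each scan is a single pass over the H·W tokens, so the four directions together remain linear in sequence length, matching the complexity claim above.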
5. Dimensional Generalization and Directional Scanning
Recent ViM² variants emphasize native multi-dimensional handling:
- Visual 2-Dimensional Mamba (V2M) (Wang et al., 2024): Constructs a 2D SSM (generalizing Roesser's model) where each token maintains sub-states for both horizontal and vertical dependencies. Efficiently implemented as pairs of 1D SSM scans (row- and column-wise), with four directional sweeps (corresponding to the main axes and their rotations), these blocks maintain linear complexity in sequence length and preserve explicit 2D locality.
- MobileViM for 3D (Dai et al., 19 Feb 2025): Introduces a dimension-independent mixing mechanism for 3D arrays, performing Mamba scanning along depth, height, and width axes separately. Dual-directional scans further improve global context. Cross-scale skip connections ("Scale Bridger") aggregate multi-resolution features, critical for accurate volumetric segmentation.
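The axis-separable, dual-directional scan pattern can be sketched as follows. This is an illustrative simplification, not the MobileViM code: a fixed scalar decay stands in for the input-dependent Mamba parameters, and the three axis scans are simply summed.

```python
import numpy as np

def axis_scan(vol, axis, decay=0.9):
    """Dual-directional cumulative recurrence along one axis of a 4D volume."""
    def one_way(v):
        out = np.zeros_like(v)
        h = np.zeros_like(np.take(v, 0, axis=axis))
        for i in range(v.shape[axis]):
            h = decay * h + np.take(v, i, axis=axis)
            sl = [slice(None)] * v.ndim
            sl[axis] = i
            out[tuple(sl)] = h
        return out
    # Forward pass plus a reversed pass flipped back into place.
    return one_way(vol) + np.flip(one_way(np.flip(vol, axis=axis)), axis=axis)

def mobilevim_mix_sketch(vol):
    """Sum of dual-directional scans along depth, height, and width."""
    return sum(axis_scan(vol, a) for a in range(3))

v = np.random.default_rng(2).standard_normal((3, 4, 5, 2))  # (D, H, W, C)
mixed = mobilevim_mix_sketch(v)
print(mixed.shape)  # (3, 4, 5, 2)
```

Because each axis is scanned independently, the cost grows linearly with the number of voxels regardless of how they are distributed across the three spatial dimensions, which is what makes the mechanism dimension-agnostic.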
6. Empirical Results: Performance on ImageNet-1K, COCO, and Medical Volumes
ViM² models achieve state-of-the-art or highly competitive results in both efficiency and accuracy.
EfficientViM-M2 ("ViM²", Lee et al., 2024) on ImageNet-1K:
- Top-1 (224² input): 75.8% (450 ep), throughput 17,005 img/s on RTX3090 (batch=256), 13.9M params, 355M FLOPs.
- Outperforms SHViT-S2 (75.2% @ 15,899 img/s) and is faster than MobileViTV2-0.75 and FastViT-T8 at equivalent or higher accuracy.
- High-resolution (384², 512²): EfficientViM-M4 attains 80.9% @3724 img/s (384²), 81.9% @2452 img/s (512²), scaling substantially better with input size than SHViT, EMO, or MobileOne.
ViM² in MambaMixer and V2M architectures (Behrouz et al., 2024, Wang et al., 2024):
- On ImageNet-1K (224²): ViM²-T reaches 82.7% with 20M params, surpassing VMamba-T, Swin-T, and MLP-Mixer-B/16.
- On ADE20K (UPerNet head): ViM²-T achieves mIoU 48.6% vs. 47.3% for VMamba-T, 44.4% for Swin-T.
- On COCO (Mask R-CNN 1×): ViM²-T yields box AP 47.1, mask AP 42.4, outperforming VMamba-T and Swin-T.
MobileViM for 3D medical segmentation (Dai et al., 19 Feb 2025):
- Dice scores: 92.7% (PENGWIN CT), 86.7% (BraTS2024 MRI), 80.5% (ATLAS liver), 77.4% (ToothFairy2 dental).
- Model size: 2.9–6.3M params, with real-time inference speeds on an RTX 4090, significantly more efficient than nnUNet and SwinUNETR-V2.
7. Comparative Analysis and Design Trade-Offs
Ablation studies across the ViM² literature characterize key design parameters:
- Token mixer choice: Replacing HSM-SSD with NC-SSD markedly increases compute; at matched budgets this corresponds to a drop of up to 1.4 pp top-1.
- # States per stage: A stage-wise schedule of [49, 25, 9] hidden states consistently improves performance over keeping the count fixed across stages.
- Normalization: Partial LayerNorm prior to HSM-SSD, with BatchNorm elsewhere, strikes the best stability/performance balance.
- Multi-head vs single-head: Single-head HSM-SSD with per-state weights matches accuracy but further increases throughput by 8%.
- Selective channel mixing: Ablations replacing SCM with an MLP lose up to 3.6 pp top-1 or mIoU, emphasizing the necessity of data-dependent channel mixing.
- Multi-stage fusion: Provides a +0.3% top-1 gain at negligible cost, validating the hypothesis that lower-stage representations substantively aid global classification.
A plausible implication is that these architectural motifs (SSM recurrences, selective, input-adaptive mixing, hierarchical fusion, and dimension-agnostic design) generalize beyond vision, offering templates for efficient sequence processing in other domains.
References:
- EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality (Lee et al., 2024)
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection (Behrouz et al., 2024)
- V2M: Visual 2-Dimensional Mamba for Image Representation Learning (Wang et al., 2024)
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model (Zhu et al., 2024)
- MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis (Dai et al., 19 Feb 2025)