Vision MambaMixer (ViM²) Neural Architectures
- Vision MambaMixer (ViM²) is a family of state space model (SSM)-based neural architectures that use selective token and channel mixing for efficient visual data processing.
- It employs the novel HSM-SSD module to compress global context into a latent space, achieving up to 8× throughput improvements with competitive accuracy.
- ViM² models generalize across dimensions, enabling robust performance in 2D image tasks and 3D volumetric analyses for applications such as medical segmentation.
Vision MambaMixer (ViM²) refers to a family of state space model (SSM)-based neural network architectures for visual data, most notably instantiated in EfficientViM-M2, V2M, and MobileViM. These designs systematically exploit the efficiency and content-adaptive expressiveness of SSMs, developing hierarchical, highly hardware-friendly models that scale to large vision tasks with favorable trade-offs of speed, memory, and accuracy. ViM² models consistently demonstrate advances over both attention-based Transformers and prior SSM architectures in linear complexity, selective channel and token mixing, and multi-dimensional generalization, with compelling results on classification, detection, segmentation, and 3D volumetric analysis.
1. Motivation and High-Level Design Principles
ViM² models are motivated by the limitations of quadratic-complexity attention and by the desire to extend recent SSM breakthroughs, especially Mamba, to the multi-dimensional, hierarchical nature of visual data. Early Vision Mamba (ViM) (Zhu et al., 2024) replaced attention with bidirectional SSM blocks operating over flattened 1D image sequences, delivering linear cost and memory savings but incurring large projection costs per layer (scaling with both token count and channel width). ViM² architectures address the following key targets:
- Reduced computational footprint: EfficientViM-M2 (ViM²) restructures channel-mixing operations into a compressed latent hidden-state space whose size is much smaller than the sequence length, shifting the dominant cost from the token dimension to the latent dimension and leading to up to 8× throughput improvements at comparable accuracy (Lee et al., 2024).
- Hierarchical context fusion and selective mixing: ViM² blocks include modules for dual token and channel mixing with data-dependent SSM recurrence, bidirectional or multi-directional scans to respect 2D/3D structure, and multi-stage feature aggregation for both low- and high-level signal preservation (Behrouz et al., 2024, Lee et al., 2024).
- Dimension independence for 3D vision: MobileViM generalizes these principles to 3D spatial layouts with dimension-agnostic and cross-scale MambaMix operations, supporting segmentation of volumetric medical imagery at real-time speeds (Dai et al., 19 Feb 2025).
2. Core Module: Hidden State Mixer-based State Space Duality (HSM-SSD)
The EfficientViM-M2 HSM-SSD layer is the linchpin of ViM²'s approach to token and channel mixing in vision. In contrast to the original NC-SSD formulation, which applies costly projections over the full token sequence, HSM-SSD compresses global context into a small latent hidden state with far fewer states than tokens, on which all subsequent gating and projections are performed:
Architecture-level data flow for HSM-SSD:
- Preprocessing: The input sequence undergoes linear projections producing the SSM parameters B, C, Δ, and the gate z, followed by depthwise convolutions for local mixing.
- SSM Recursion: The discretized recurrence h_t = Ā_t h_{t-1} + B̄_t x_t is applied over the sequence, yielding a compressed hidden state with N states over D channels (N ≪ L).
- Channel Mixing in Hidden State: Replace the standard gating and output projection over the length-L token sequence with their counterparts applied directly to the N-state hidden representation, deferring the expensive output projection until after compression and expanding back to tokens via C.
- Cost Comparison:

| Layer | Original (NC-SSD) | HSM-SSD |
|---|---|---|
| Projections | scale with sequence length L | scale with state count N (N ≪ L) |
| SSM Mix | linear in L | linear in L |
In typical ViM² configurations the number of states N is far smaller than the sequence length L, giving substantial empirical runtime savings (Lee et al., 2024). Proposition 1 guarantees that HSM-SSD recovers NC-SSD under appropriate settings.
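The compression idea above can be illustrated with a minimal NumPy sketch. This is not the EfficientViM implementation; the projection matrices (`w_b`, `w_c`, `w_out`), the tanh gating, and the specific aggregation are illustrative assumptions. The point it demonstrates is structural: the expensive D×D output projection is applied to the compressed (N, D) hidden state rather than the (L, D) token sequence.

```python
import numpy as np

def hsm_ssd_sketch(x, w_b, w_c, w_out):
    """Hedged sketch of HSM-SSD-style mixing.

    x: (L, D) token sequence; N hidden states with N << L.
    Gating and the output projection run in the small latent space.
    """
    B = x @ w_b            # (L, N): per-token state-selection weights
    C = x @ w_c            # (L, N): per-token output-mixing weights
    H = B.T @ x            # (N, D): compress global context into N states
    H = np.tanh(H) * H     # illustrative gating in the latent space
    H = H @ w_out          # (N, D) @ (D, D): deferred output projection
    return C @ H           # (L, D): expand back to the token sequence

L, D, N = 196, 64, 16      # N << L is what drives the savings
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
y = hsm_ssd_sketch(x,
                   rng.standard_normal((D, N)) / D**0.5,
                   rng.standard_normal((D, N)) / D**0.5,
                   rng.standard_normal((D, D)) / D**0.5)
print(y.shape)  # (196, 64)
```

With these shapes, the output projection costs N·D² multiply-adds instead of L·D², a roughly L/N = 12× reduction for that term in this toy setting.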
3. Multi-Stage Hidden State Fusion and Feature Aggregation
To maximally exploit intermediate hierarchical representations, ViM² aggregates class logits from every block/stage's final hidden state:
- For each stage, form a summary of its final hidden state and pass it through a normalized classification head to yield stage-wise logits.
- Learnable scalar weights (softmax-normalized) determine the fusion; the final prediction is the weighted sum of the stage-wise logits.
- Empirically, this "multi-stage fusion" increases ImageNet-1K top-1 by +0.3% with negligible compute (<1%) (Lee et al., 2024).
This approach improves both gradient flow and the expressiveness of learned features, as compared to using only the final stage.
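The fusion rule reduces to a softmax-weighted sum of per-stage logits. The following sketch assumes three stages and hypothetical logit values; in the real model the scalars are learned parameters and the logits come from normalized classification heads.

```python
import numpy as np

def multi_stage_fusion(stage_logits, alphas):
    """Softmax-normalize learnable scalars, then fuse per-stage logits."""
    w = np.exp(alphas - alphas.max())   # stable softmax over stage weights
    w = w / w.sum()
    return sum(wi * li for wi, li in zip(w, stage_logits))

# Hypothetical logits from three stages for a 2-class toy problem.
logits = [np.array([0.2, 0.8]), np.array([1.0, -1.0]), np.array([0.5, 0.5])]
alphas = np.array([0.0, 1.0, 0.5])      # learnable scalars in the real model
pred = multi_stage_fusion(logits, alphas)
print(pred.shape)  # (2,)
```

Because every stage head receives gradient directly from the loss, this also shortens the gradient path to early blocks, which is the mechanism behind the improved gradient flow noted above.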
4. Selective Token and Channel Mixing in ViM² Blocks
In MambaMixer-ViM² (Behrouz et al., 2024), each block includes:
- Selective Token Mixer (STM): Applies S6-based SSM scanning across image tokens in four diagonal directions (TL→BR, TR→BL, BL→TR, BR→TL). The SSM recurrence is performed after a depthwise 2D convolution and gating. The outputs are summed across all directions.
- Selective Channel Mixer (SCM): Implements a bidirectional SSM over the channel dimension, employing separate forward and backward passes with input-dependent weights, then recombining outcomes.
- Weighted Averaging of Earlier Features (WAEF): Each layer's token/channel mixer receives a learned weighted average of outputs from all previous mixers. This extension, reminiscent of DenseNet, improves deep gradient propagation and output calibration.
The complexity per block is linear in both sequence length and channel count, a significant reduction from quadratic-complexity standard self-attention.
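A minimal sketch of the multi-directional token scan: the grid is traversed from each of the four corners and the per-direction causal recurrences are summed. This simplifies the real STM considerably, using a fixed scalar decay instead of input-dependent S6 parameters and row-major corner-to-corner orders instead of true diagonal traversals; only the scan-and-sum structure is faithful.

```python
import numpy as np

def directional_scan(x, order, decay=0.9):
    """Causal linear recurrence h_t = decay*h_{t-1} + x_t over a token order."""
    h = np.zeros_like(x[0])
    out = np.zeros_like(x)
    for t in order:
        h = decay * h + x[t]
        out[t] = h
    return out

def stm_sketch(grid):
    """Sum of scans over four corner-to-corner traversals of an HxWxD grid."""
    H, W, D = grid.shape
    x = grid.reshape(H * W, D)
    idx = np.arange(H * W).reshape(H, W)
    orders = [idx.ravel(),               # TL -> BR
              idx[:, ::-1].ravel(),      # TR -> BL
              idx[::-1, :].ravel(),      # BL -> TR
              idx[::-1, ::-1].ravel()]   # BR -> TL
    y = sum(directional_scan(x, o) for o in orders)
    return y.reshape(H, W, D)

g = np.random.default_rng(1).standard_normal((4, 4, 8))
out = stm_sketch(g)
print(out.shape)  # (4, 4, 8)
```

Each scan is a single pass over the H·W tokens, so the four directions together remain linear in sequence length, matching the complexity claim above.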
5. Dimensional Generalization and Directional Scanning
Recent ViM² variants emphasize native multi-dimensional handling:
- Visual 2-Dimensional Mamba (V2M) (Wang et al., 2024): Constructs a 2D SSM (generalizing Roesser's model) where each token maintains sub-states for both horizontal and vertical dependencies. Efficiently implemented as pairs of 1D SSM scans (row- and column-wise), with four directional sweeps (corresponding to the main axes and their rotations), these blocks maintain linear complexity in sequence length and preserve explicit 2D locality.
- MobileViM for 3D (Dai et al., 19 Feb 2025): Introduces a dimension-independent mixing mechanism for 3D arrays, performing Mamba scanning along depth, height, and width axes separately. Dual-directional scans further improve global context. Cross-scale skip connections ("Scale Bridger") aggregate multi-resolution features, critical for accurate volumetric segmentation.
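The axis-separable, dual-directional scan pattern can be sketched as follows. This is an illustrative simplification, not the MobileViM code: a fixed scalar decay stands in for the input-dependent Mamba parameters, and the three axis scans are simply summed.

```python
import numpy as np

def axis_scan(vol, axis, decay=0.9):
    """Dual-directional cumulative recurrence along one axis of a 4D volume."""
    def one_way(v):
        out = np.zeros_like(v)
        h = np.zeros_like(np.take(v, 0, axis=axis))
        for i in range(v.shape[axis]):
            h = decay * h + np.take(v, i, axis=axis)
            sl = [slice(None)] * v.ndim
            sl[axis] = i
            out[tuple(sl)] = h
        return out
    # Forward pass plus a reversed pass flipped back into place.
    return one_way(vol) + np.flip(one_way(np.flip(vol, axis=axis)), axis=axis)

def mobilevim_mix_sketch(vol):
    """Sum of dual-directional scans along depth, height, and width."""
    return sum(axis_scan(vol, a) for a in range(3))

v = np.random.default_rng(2).standard_normal((3, 4, 5, 2))  # (D, H, W, C)
mixed = mobilevim_mix_sketch(v)
print(mixed.shape)  # (3, 4, 5, 2)
```

Because each axis is scanned independently, the cost grows linearly with the number of voxels regardless of how they are distributed across the three spatial dimensions, which is what makes the mechanism dimension-agnostic.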
6. Empirical Results: Performance on ImageNet-1K, COCO, and Medical Volumes
ViM² models achieve state-of-the-art or highly competitive results in both efficiency and accuracy.
EfficientViM-M2 ("ViM²", Lee et al., 2024) on ImageNet-1K:
- Top-1 (224² input): 75.8% (450 ep), throughput 17,005 img/s on RTX3090 (batch=256), 13.9M params, 355M FLOPs.
- Outperforms SHViT-S2 (75.2% @ 15,899 img/s) and is faster than MobileViTV2-0.75 and FastViT-T8 at equivalent or higher accuracy.
- High-resolution (384², 512²): EfficientViM-M4 attains 80.9% @3724 img/s (384²), 81.9% @2452 img/s (512²), scaling substantially better with input size than SHViT, EMO, or MobileOne.
ViM² in MambaMixer and V2M architectures (Behrouz et al., 2024, Wang et al., 2024):
- On ImageNet-1K (224²): ViM²-T reaches 82.7% with 20M params, surpassing VMamba-T, Swin-T, and MLP-Mixer-B/16.
- On ADE20K (UPerNet head): ViM²-T achieves mIoU 48.6% vs. 47.3% for VMamba-T, 44.4% for Swin-T.
- On COCO (Mask R-CNN 1×): ViM²-T yields box AP 47.1, mask AP 42.4, outperforming VMamba-T and Swin-T.
MobileViM for 3D medical segmentation (Dai et al., 19 Feb 2025):
- Dice scores: 92.7% (PENGWIN CT), 86.7% (BraTS2024 MRI), 80.5% (ATLAS liver), 77.4% (ToothFairy2 dental).
- Model size: 2.9–6.3M params, with real-time inference speeds on an RTX 4090, significantly more efficient than nnUNet and SwinUNETR-V2.
7. Comparative Analysis and Design Trade-Offs
Ablation studies across the ViM² literature characterize key design parameters:
- Token mixer choice: Replacing HSM-SSD with NC-SSD markedly increases compute; at matched budgets this corresponds to a drop of up to 1.4 pp top-1.
- # States per stage: A stage-wise schedule of [49, 25, 9] hidden states consistently improves performance over keeping the count fixed across stages.
- Normalization: Partial LayerNorm prior to HSM-SSD, with BatchNorm elsewhere, strikes the best stability/performance balance.
- Multi-head vs single-head: Single-head HSM-SSD with per-state weights matches accuracy but further increases throughput by 8%.
- Selective channel mixing: Ablations replacing SCM with an MLP lose up to 3.6 pp top-1 or mIoU, emphasizing the necessity of data-dependent channel mixing.
- Multi-stage fusion: Provides a +0.3% top-1 gain at negligible cost, validating the hypothesis that lower-stage representations substantively aid global classification.
A plausible implication is that these architectural motifs (SSM recurrences, selective, input-adaptive mixing, hierarchical fusion, and dimension-agnostic design) generalize beyond vision, offering templates for efficient sequence processing in other domains.
References:
- EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality (Lee et al., 2024)
- MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection (Behrouz et al., 2024)
- V2M: Visual 2-Dimensional Mamba for Image Representation Learning (Wang et al., 2024)
- Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model (Zhu et al., 2024)
- MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis (Dai et al., 19 Feb 2025)