
Visual State Space Model (VMamba)

Updated 22 February 2026
  • VMamba is a vision backbone architecture that applies state-space models with adaptive scanning routes to maintain global receptive fields and linear computational complexity.
  • It utilizes a 2D-to-1D Selective Scan (SS2D) module and extends to MSVMamba with multi-scale scanning to boost accuracy in classification, detection, and segmentation while reducing computational load.
  • The design integrates layer normalization, lightweight ConvFFN, and residual connections, providing a hardware-efficient, high-performing alternative to ViTs and CNNs.

A Visual State Space Model (VMamba) is a vision backbone architecture that adapts the selective state space models (SSMs)—notably the S6/Mamba block—to hierarchical visual processing, preserving global receptive fields, linear computational complexity, and input-adaptive recurrence. VMamba and its multi-scale extension, MSVMamba, provide hardware-efficient, high-accuracy alternatives to Vision Transformers (ViTs) and convolutional networks across classification, detection, and segmentation (Shi et al., 2024).

1. Mathematical and Architectural Foundations

At the core of VMamba lies the continuous-time state-space system:

$h'(t) = A h(t) + B x(t), \quad y(t) = C h(t)$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$. Discretization (step size $\Delta$) yields

$\overline{A} = \exp(\Delta A), \quad \overline{B} \approx \Delta B$

and the recurrence

$h[n] = \overline{A} h[n-1] + \overline{B} x[n], \quad y[n] = C h[n]$

which is equivalent to a 1D convolution over the sequence $x$ (Eq. 3 in (Shi et al., 2024)). Critically, the S6 selective SSM variant parameterizes $\overline{B}$, $C$, and $\Delta$ as input-dependent quantities via token-wise MLPs, yielding content-adaptive kernels (Eq. 4).
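The recurrence–convolution equivalence can be checked numerically. The sketch below uses a diagonal $A$ (as in S4/S6-style SSMs) and toy sizes, both of which are illustrative assumptions, and verifies that unrolling the recurrence matches a causal convolution with kernel $K[k] = C\,\overline{A}^{\,k}\,\overline{B}$:

```python
import numpy as np

# Toy sizes (assumptions for illustration): state size N, sequence length L.
N, L = 4, 8
rng = np.random.default_rng(0)
A = -np.abs(rng.normal(size=N))  # stable diagonal state matrix, stored as a vector
B = rng.normal(size=N)
C = rng.normal(size=N)
dt = 0.1
A_bar = np.exp(dt * A)           # exact zero-order-hold discretization for diagonal A
B_bar = dt * B                   # Euler approximation, as in the text

x = rng.normal(size=L)

# 1) Recurrence: h[n] = A_bar * h[n-1] + B_bar * x[n],  y[n] = C . h[n]
h = np.zeros(N)
y_rec = np.empty(L)
for n in range(L):
    h = A_bar * h + B_bar * x[n]
    y_rec[n] = C @ h

# 2) Equivalent causal convolution with kernel K[k] = C . (A_bar^k) . B_bar
K = np.array([C @ (A_bar ** k * B_bar) for k in range(L)])
y_conv = np.array([sum(K[k] * x[n - k] for k in range(n + 1)) for n in range(L)])

assert np.allclose(y_rec, y_conv)
```

In S6, $\overline{B}$, $C$, and $\Delta$ additionally vary per token, so the effective kernel changes with the input rather than being fixed as here.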

VMamba replaces quadratic-complexity attention in vision backbones with a stack of Visual State Space (VSS) blocks. Each VSS block operates on a feature map $X \in \mathbb{R}^{H \times W \times D}$ using a 2D-to-1D Selective Scan (SS2D) module: the feature map is flattened along multiple scanning routes $\sigma_k$ (typically four raster routes: left→right, right→left, top→bottom, bottom→top), each route is processed by an S6 block, and the outputs are summed at each position.
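The four raster routes are just four orderings of the same $H \times W$ grid. A minimal single-channel sketch (toy sizes assumed):

```python
import numpy as np

# Four raster scanning routes assumed in SS2D: each flattens the H×W grid
# into a length-L sequence in a different order.
H, W = 2, 3
X = np.arange(H * W).reshape(H, W)     # toy single-channel feature map

routes = {
    "lr": X.reshape(-1),               # row-major, left→right
    "rl": X.reshape(-1)[::-1],         # reverse of the left→right route
    "tb": X.T.reshape(-1),             # column-major, top→bottom
    "bt": X.T.reshape(-1)[::-1],       # reverse of the top→bottom route
}

# Each route would feed its own S6 block; the outputs are mapped back to the
# grid by inverting each route's permutation and summed per position.
for seq in routes.values():
    assert seq.shape == (H * W,)
```

Because every route is a permutation of the same $L = HW$ tokens, naive SS2D performs $4L$ token-scans per block, which is the cost MS2D later reduces.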

The canonical VMamba block includes:

  • Layer normalization preceding the token mixer
  • The SS2D selective-scan sublayer
  • A residual connection around the sublayer

This block structure is arranged hierarchically (stem → four decreasing-resolution stages → global pool → classifier), yielding variants with varying depth and channel widths (Nano, Micro, Tiny, etc., as in Table 1 of (Shi et al., 2024)).
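The hierarchical layout implies a standard resolution schedule. The sketch below assumes a stride-4 stem and ×2 downsampling between stages (common in hierarchical backbones; the exact strides are an assumption, not stated above):

```python
# Per-stage feature-map resolutions for a 224×224 input, assuming a stride-4
# stem and ×2 downsampling before stages 2-4 (standard hierarchical design).
H = W = 224
stem_stride, stage_strides = 4, [1, 2, 2, 2]
res = []
s = stem_stride
for ds in stage_strides:
    s *= ds
    res.append((H // s, W // s))
print(res)  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```

Variants (Nano, Micro, Tiny, etc.) then differ in depth and channel width per stage, not in this resolution schedule.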

2. Multi-Scale 2D Scanning and MSVMamba

The principal extension in Multi-Scale VMamba (MSVMamba) introduces a Multi-Scale 2D Scanning (MS2D) module to further enhance efficiency and long-range dependency modeling (Shi et al., 2024). In MS2D,

  • The full-resolution branch performs one SS2D scan on $Z_1 \in \mathbb{R}^{H \times W \times D}$ (DWConv with stride 1).
  • The half-resolution branch applies three SS2D scans on $Z_2 \in \mathbb{R}^{H/2 \times W/2 \times D}$ (DWConv with stride 2).
  • Outputs from the downsampled scans are interpolated back to the original resolution and summed with the full-resolution output.

Algorithmically,

  1. $Z_1 = \mathrm{DWConv}_1(Z)$
  2. $Z_2 = \mathrm{DWConv}_2(Z)$
  3. $Y_1 = \mathrm{S6}(\sigma(Z_1))$
  4. $\{Y_2, Y_3, Y_4\} = \mathrm{S6}(\sigma(Z_2))$
  5. Reshape and interpolate the outputs, then $Z' = Z'_1 + (Z'_2 + Z'_3 + Z'_4)$

This design reduces token work per block from $4L$ to $L + 3L/4$ (for $s = 2$), i.e., to roughly 44% of the naive four-route cost, at similar accuracy. Total per-block scan cost is $O((1 + 3/s^2) L D N)$.
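The steps above can be sketched end to end. This is a shape-level simplification under stated assumptions: the S6 scan is replaced by an identity stand-in that only counts tokens, the depthwise convolutions by stride slicing, and the interpolation by nearest-neighbour upsampling:

```python
import numpy as np

def s6_scan(seq, counter):
    counter[0] += seq.shape[0]   # count tokens processed (work accounting)
    return seq                   # identity placeholder for the selective scan

H, W = 8, 8
Z = np.random.default_rng(0).normal(size=(H, W))
work = [0]

Z1 = Z                           # full-resolution branch (stride-1 DWConv stand-in)
Z2 = Z[::2, ::2]                 # half-resolution branch (stride-2 DWConv stand-in)

Y1 = s6_scan(Z1.reshape(-1), work).reshape(H, W)
Ys = [s6_scan(Z2.reshape(-1), work).reshape(H // 2, W // 2) for _ in range(3)]
up = [np.repeat(np.repeat(y, 2, 0), 2, 1) for y in Ys]   # interpolate back to H×W
out = Y1 + sum(up)               # sum full-res and upsampled outputs

L = H * W
assert work[0] == L + 3 * L // 4   # 1.75L token-scans vs 4L for four full scans
```

The final assertion makes the $L + 3L/4$ accounting concrete: one full-length scan plus three quarter-length scans.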

3. Channel Mixing and Block Composition

To compensate for limited channel mixing in SSM architectures, MSVMamba integrates a lightweight Convolutional Feed-Forward Network (ConvFFN) after MS2D. ConvFFN comprises:

  • 1×1 linear expansion ($D \rightarrow 2D$)
  • 3×3 depthwise convolution (padding = 1)
  • Activation (GELU or ReLU)
  • 1×1 linear projection ($2D \rightarrow D$)

This structure is directly analogous to Transformer FFNs but incorporates local spatial mixing.
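A minimal numpy sketch of the ConvFFN shape flow (weights are random placeholders; the 3×3 depthwise convolution is written as a sum over kernel offsets, and ReLU stands in for the activation):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
X = rng.normal(size=(H, W, D))

W1 = rng.normal(size=(D, 2 * D)) * 0.1      # 1×1 expansion: D → 2D
Wd = rng.normal(size=(3, 3, 2 * D)) * 0.1   # depthwise 3×3 kernel, one filter per channel
W2 = rng.normal(size=(2 * D, D)) * 0.1      # 1×1 projection: 2D → D

U = X @ W1                                  # pointwise expansion
P = np.pad(U, ((1, 1), (1, 1), (0, 0)))     # padding=1 preserves H×W
V = sum(Wd[i, j] * P[i:i + H, j:j + W]      # depthwise conv as shifted sums
        for i in range(3) for j in range(3))
V = np.maximum(V, 0)                        # ReLU (the text also mentions GELU)
Y = V @ W2                                  # pointwise projection back to D

assert Y.shape == X.shape
```

The 1×1 convolutions reduce to per-position matrix multiplies, so only the depthwise step mixes spatially, which is exactly the local inductive bias the ConvFFN adds over a plain Transformer FFN.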

Each Multi-Scale State Space (MS3) block in MSVMamba thus includes:

  • MS2D state-space sublayer
  • Squeeze-Excitation
  • ConvFFN (channel mixer)
  • Residual and normalization layers

Backbone organization remains four-stage, with resolution decreasing per stage.
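The sublayer composition above can be wired up as follows. The pre-norm residual ordering is an assumption (a common convention in such backbones), and `ms2d`, `se`, and `conv_ffn` are identity stand-ins for the real sublayers:

```python
import numpy as np

def layer_norm(x):
    # Normalize over the channel axis (eps avoids division by zero)
    mu, sd = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sd + 1e-6)

ms2d = se = conv_ffn = lambda x: x           # identity placeholders

def ms3_block(x):
    x = x + se(ms2d(layer_norm(x)))          # token mixer + squeeze-excitation
    x = x + conv_ffn(layer_norm(x))          # channel mixer
    return x

x = np.random.default_rng(0).normal(size=(4, 4, 8))
y = ms3_block(x)
assert y.shape == x.shape
```

Both residual branches preserve the feature-map shape, so MS3 blocks stack freely within a stage.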

4. Computational Complexity and Scaling

A core motivation for VMamba and MSVMamba is maintenance of linear computational complexity:

  • Vision Transformer (ViT) attention: $O((HW)^2 D)$
  • Vanilla SSM (S6): $O(LDN)$ time, $O(LN)$ memory
  • VMamba SS2D: $O(kLDN)$ (typically $k = 4$)
  • MSVMamba MS2D: $O((1 + 3/s^2) LDN)$; with $s = 2$, this amounts to $1.75L$ token work versus $4L$ for naive multi-route scans

All additional operations (SE, ConvFFN) cost $O(LD^2)$ or less. This enables scaling to high-resolution images without prohibitive resource demands, a fundamental advantage over the quadratically scaling attention of ViTs (Shi et al., 2024).
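The scaling gap can be made concrete with back-of-envelope arithmetic (constant factors ignored; $D$ and $N$ are illustrative values, not the paper's):

```python
# Attention cost grows as L^2 while SS2D grows as L, so their ratio grows
# linearly with L — i.e. 4x each time the image side doubles.
D, N = 96, 16
ratios = {}
for side in (56, 112, 224):
    L = side * side
    attn = L * L * D               # ViT attention: O((HW)^2 D)
    ss2d = 4 * L * D * N           # VMamba SS2D: O(4 L D N)
    ratios[side] = attn / ss2d     # = L / (4N)
print(ratios)                      # {56: 49.0, 112: 196.0, 224: 784.0}

ms2d_frac = 1.75 / 4               # MS2D does ~44% of the four-route scan work
```

At a 224-token side the toy ratio is already 784×, and it quadruples with every further doubling of resolution, which is the quadratic-versus-linear gap the section describes.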

5. Experimental Results and Comparative Assessment

Extensive benchmarks demonstrate the efficacy of VMamba and MSVMamba. On ImageNet-1K (300-epoch training, standard settings):

  • MSVMamba-Tiny achieves 82.8% top-1 accuracy (+0.6% over VMamba-T, –17% GFLOPs)
  • COCO + Mask R-CNN (1×): box AP 46.9, mask AP 42.2 (+4.2/+2.9 vs. Swin-T)
  • ADE20K (UperNet, single-scale): mIoU 47.6 (+2.2 vs. Swin-T)

Ablation studies on Nano-scale models (100 epochs, VMamba-Nano baseline 69.6%):

  • +MS2D: 71.9% (+2.3)
  • +SE: 72.4% (+0.5)
  • +ConvFFN: 74.4% (+2.0)
  • Increased SSM state dimension N=1, +1 block: 75.1% (+0.7)

MSVMamba thus matches or outperforms comparable ViT and CNN backbones at reduced FLOPs, while maintaining linear scaling properties (Shi et al., 2024).

6. Robustness, Limitations, and Future Directions

MSVMamba's multi-scale scanning alleviates long-range forgetting, a phenomenon in SSMs where distant token interactions become attenuated or lost, especially in deep or large-scale models. By mixing full-resolution and downsampled scans, MSVMamba preserves both local and global receptive fields efficiently.

ConvFFN modules reintroduce strong channel mixing absent in pure SSM pipelines. This hybrid design restores much of the inductive bias historically supplied by convolutional modules.

Limitations arise as overall model size increases: long-range forgetting worsens despite the sheer parameter count, suggesting diminishing returns from multi-scale routing in large models. Open research questions remain in hardware-aware 2D SSM implementations, adaptive or learned multi-scale routings, and explicit integration of temporal modeling (for video and dynamic visual tasks).

Key directions include:

  • Adaptive multi-scale scan path learning
  • Optimized kernel fusion for 2D SSMs on modern hardware
  • Cross-domain extensions (video, multimodal fusion)
  • Better channel–spatial–scale coupling in future architectures

7. Broader Impact and Context

The VMamba/MSVMamba line of work demonstrates that state-space models, with input-adaptive structured scanning and principled architectural enhancements, can match or surpass Transformer-based models in both accuracy and scaling efficiency on core visual recognition tasks.

MSVMamba, with its hierarchical multi-scale scanning and convolutional FFN modules, stands as a state-of-the-art, theoretically and empirically grounded vision backbone, offering a global effective receptive field, efficient computation, and broad applicability across dense and fine-grained perception tasks (Shi et al., 2024).
