
A Survey on Visual Mamba

Published 24 Apr 2024 in cs.CV (arXiv:2404.15956v2)

Abstract: State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Because the self-attention mechanism in transformers scales quadratically with image size, with correspondingly growing computational demands, researchers are now exploring how to adapt Mamba to computer vision tasks. This paper is the first comprehensive survey to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts behind Mamba's success: the state space model framework, selection mechanisms, and hardware-aware design. We then review vision Mamba models, categorizing them into foundational architectures and those enhanced with techniques such as convolution, recurrence, and attention. We further delve into Mamba's widespread applications in vision tasks, where it serves as a backbone at various levels of vision processing: general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. We introduce general visual tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope this endeavor sparks additional interest within the community to address current challenges and further apply Mamba models in computer vision.


Summary

  • The paper demonstrates that Visual Mamba, a selective state space model, achieves efficient long-sequence vision modeling with linear complexity.
  • It details innovative scanning mechanisms and hybrid architectures that integrate Mamba with convolution, recurrence, and attention for diverse imaging tasks.
  • The survey highlights strong performance in general, medical, and remote sensing imaging, reducing computational costs while maintaining competitive accuracy.

Survey of Visual Mamba: State Space Models for Efficient Vision

Introduction and Motivation

The surveyed paper provides a comprehensive analysis of the adaptation and application of Mamba, a selective state space model (SSM), to computer vision tasks. Mamba was originally introduced for efficient long-sequence modeling in NLP, offering linear computational complexity and hardware-aware design. The quadratic complexity of self-attention in Transformers, especially for high-resolution images, motivates the exploration of SSMs as alternatives for vision. The survey systematically categorizes foundational and enhanced Mamba architectures, details their integration with convolution, recurrence, and attention, and reviews their deployment across general, medical, and remote sensing vision tasks.

Mathematical Foundations and Architecture

Mamba builds on the SSM framework, which models sequences via hidden states evolving under linear ODEs, discretized for deep learning. The key innovation is the selective scan mechanism, where SSM parameters become input-dependent, enabling dynamic information filtering and improved context modeling. The Mamba block (Figure 1) integrates gated MLPs, SSMs, and local convolutions, with normalization and residual connections for stability and expressivity.

Figure 1: Mamba Block architecture, illustrating the integration of gated MLPs, SSMs, and local convolution for efficient sequence modeling.
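The dataflow just described can be sketched in a few lines of NumPy. This is a toy, unbatched forward pass; the weight names (`W_in`, `W_gate`, `conv_k`, `W_out`) and the placeholder `ssm` callable are our own illustration, not the reference implementation:

```python
import numpy as np

def silu(z):
    # SiLU/Swish activation used throughout Mamba blocks
    return z / (1.0 + np.exp(-z))

def mamba_block(x, W_in, W_gate, conv_k, W_out, ssm):
    """Toy Mamba block: expand, causal depthwise conv, SiLU, SSM mixing,
    multiplicative gate, project back, residual add.
    x: (L, D); W_in, W_gate: (D, E); conv_k: (K, E); W_out: (E, D)."""
    u = x @ W_in                        # expand to inner width E
    g = silu(x @ W_gate)                # gating branch
    K, L = conv_k.shape[0], len(u)
    padded = np.vstack([np.zeros((K - 1, u.shape[1])), u])
    conv = sum(conv_k[k] * padded[k:k + L] for k in range(K))  # causal depthwise conv
    y = ssm(silu(conv))                 # SSM mixes information along the sequence
    return (y * g) @ W_out + x          # gate, project back, residual connection

rng = np.random.default_rng(0)
L, D, E, K = 8, 4, 8, 3
x = rng.standard_normal((L, D))
out = mamba_block(
    x,
    rng.standard_normal((D, E)) * 0.1,
    rng.standard_normal((D, E)) * 0.1,
    rng.standard_normal((K, E)) * 0.1,
    rng.standard_normal((E, D)) * 0.1,
    # stand-in for the selective SSM: a running mean along the sequence
    ssm=lambda z: np.cumsum(z, axis=0) / (1 + np.arange(L))[:, None],
)
print(out.shape)  # (8, 4)
```

The residual connection and the SiLU-gated branch are what the survey refers to as the "gated MLP" structure wrapped around the sequence-mixing SSM.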

The discretized SSM is implemented as a global convolution, with the kernel computed from the evolution and projection parameters. Selective SSMs further generalize this by making B, C, and Δ functions of the input, allowing for context-dependent state evolution. The scan operation is hardware-optimized, leveraging parallelization and kernel fusion for efficient GPU utilization.
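A minimal, unbatched sketch of the selective recurrence makes the input dependence concrete. The projection names (`W_b`, `W_c`, `W_dt`) and the per-channel parameterization are our simplification, not the fused CUDA kernel described in the Mamba paper:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_b, W_c, W_dt):
    """Toy selective scan: the step size Delta and the matrices B and C
    depend on the current token, so the state can selectively remember
    or forget. x: (L, D); A: (D, N) with negative entries for stability;
    W_b, W_c: (D, N); W_dt: (D,). Returns y: (L, D)."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # per-channel hidden state
    y = np.empty((L, D))
    for t in range(L):
        u = x[t]                              # current token, (D,)
        dt = softplus(u * W_dt)[:, None]      # input-dependent step size Delta
        B = u[:, None] * W_b                  # input-dependent input matrix
        C = u[:, None] * W_c                  # input-dependent output matrix
        A_bar = np.exp(dt * A)                # zero-order-hold discretization
        h = A_bar * h + dt * B * u[:, None]   # selective state update
        y[t] = (h * C).sum(axis=1)            # project state back to channels
    return y

rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))      # negative real parts -> stable decay
y = selective_ssm(x, A,
                  rng.standard_normal((D, N)),
                  rng.standard_normal((D, N)),
                  rng.standard_normal(D))
print(y.shape)  # (16, 4)
```

When B, C, and Δ are constants, this loop collapses into a single global convolution; making them token-dependent is precisely what forces the hardware-aware scan instead.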

Adaptation to Vision: Scanning Mechanisms and Blocks

Vision tasks require processing multi-dimensional data. The survey details the adaptation of Mamba to 2D and 3D inputs via specialized scanning mechanisms. The ViM block and VSS block (Figure 2) are foundational, enabling bidirectional and cross-scan operations over image patches.

Figure 2: ViM Block and VSS Block, foundational components for adapting Mamba to visual data.

A taxonomy of scanning strategies is presented (Figure 3), including bidirectional, cross-scan, continuous 2D, local, efficient (atrous), zigzag, omnidirectional, hierarchical, spatiotemporal, and multi-path scans. These mechanisms are critical for balancing local and global context modeling, computational efficiency, and spatial continuity.

Figure 3: Comparison of 2D scanning and selective scan orders across various Mamba-based vision architectures.
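The core idea behind most of these strategies is to serialize a 2D patch grid along several complementary paths. A simplified version of the four-path cross-scan (our own sketch, not VMamba's exact implementation):

```python
import numpy as np

def cross_scan(patches):
    """Flatten an (H, W, C) patch grid along four paths: row-major,
    column-major, and their reverses. Each path is then fed to its own
    1D selective scan, and the outputs are merged."""
    H, W, C = patches.shape
    rowwise = patches.reshape(H * W, C)                      # left-right, top-down
    colwise = patches.transpose(1, 0, 2).reshape(H * W, C)   # top-down, left-right
    return np.stack([rowwise, rowwise[::-1], colwise, colwise[::-1]])

grid = np.arange(6).reshape(2, 3, 1)    # a tiny 2x3 grid of 1-channel "patches"
paths = cross_scan(grid)
print(paths[0, :, 0])   # [0 1 2 3 4 5]  row-major path
print(paths[2, :, 0])   # [0 3 1 4 2 5]  column-major path
```

Scanning the same grid in multiple orders is how a strictly causal 1D model recovers approximate 2D context: every patch appears early in at least one path.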

Backbone and Enhanced Architectures

The survey reviews pure Mamba backbones (ViM-based, VSS-based, Mamba-ND, SiMBA, EfficientVMamba) and their hierarchical, windowed, and multi-dimensional variants. EfficientVMamba introduces atrous scanning for lightweight models, while MambaMixer and Mamba-ND generalize selection across tokens and channels, and to higher dimensions.

Integration with other architectures is explored:

  • Convolution: Res-VMamba incorporates residual connections for local-global feature fusion.
  • Recurrence: VMRNN combines VSS blocks with LSTM for spatiotemporal modeling.
  • Attention: SSM-ViT and MMA blocks fuse SSMs with self-attention and channel attention for enhanced representation.

Applications in General Vision Tasks

Mamba-based models are evaluated across high/mid-level (classification, detection, segmentation, video understanding, multimodal fusion) and low-level (restoration, super-resolution, generation) vision tasks. Notable findings include:

  • Linear complexity enables efficient processing of long sequences and high-resolution inputs.
  • ViM, VMamba, PlainMamba, and LocalMamba achieve competitive accuracy with reduced FLOPs and parameter counts.
  • VideoMamba and the Video Mamba Suite demonstrate scalability and efficiency for video understanding.
  • MambaIR, Serpent, and VMambaIR outperform transformer-based baselines in image restoration with lower memory and computation.
  • Point cloud models (SSPointMamba, 3DMambaComplete) leverage Mamba for efficient global modeling and geometric reasoning.
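The quadratic-versus-linear scaling behind the first point can be made concrete with a back-of-envelope FLOP count. The two cost functions below are our own rough estimates, keeping only the dominant terms:

```python
def attention_flops(seq_len, dim):
    """Rough self-attention cost: the QK^T and AV matmuls dominate, O(L^2 d)."""
    return 2 * seq_len * seq_len * dim

def ssm_flops(seq_len, dim, state_size):
    """Rough selective-SSM cost: one state update per token, O(L d N)."""
    return seq_len * dim * state_size

dim, state_size = 768, 16
for seq_len in (196, 1024, 4096):       # e.g. 14x14, 32x32, 64x64 patch grids
    ratio = attention_flops(seq_len, dim) / ssm_flops(seq_len, dim, state_size)
    print(f"L={seq_len}: attention/SSM ratio {ratio:g}x")
```

Under these assumptions the advantage grows linearly with sequence length (the ratio is roughly 2L/N), which is why the gap is largest for high-resolution images and long videos.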

Medical Imaging: 2D and 3D Segmentation, Classification, Registration

Mamba architectures have rapidly proliferated in medical imaging, particularly for segmentation (Figure 4). U-Mamba, H-vmunet, UltraLight VM-UNet, VM-UNet, and VM-UNET-V2 adapt Mamba blocks to U-Net and hierarchical designs, achieving strong results in 2D segmentation with significant parameter and FLOP reductions.

Figure 4: Overview of Mamba models for segmentation in 2D medical images, highlighting architectural diversity and efficiency.

3D segmentation models (SegMamba, LightM-UNet, LMa-UNet, T-Mamba, Vivim) extend scanning to volumetric data, with tri-orientated and frequency-enhanced blocks. MambaMorph and VMambaMorph address deformable registration, while MedMamba and MambaMIL target classification and long-sequence modeling in pathology.
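Extending 2D scans to volumes amounts to flattening along additional axis orders. A minimal sketch of tri-orientated flattening, our simplification of the SegMamba-style scheme rather than its actual kernels:

```python
import numpy as np

def tri_oriented_scan(vol):
    """Flatten a (D, H, W) volume along three axis orders so a 1D SSM can
    aggregate context depth-first, height-first, and width-first."""
    return np.stack([vol.reshape(-1),                      # depth -> height -> width
                     vol.transpose(1, 2, 0).reshape(-1),   # height -> width -> depth
                     vol.transpose(2, 0, 1).reshape(-1)])  # width -> depth -> height

vol = np.arange(2 * 2 * 2).reshape(2, 2, 2)   # tiny 2x2x2 volume of voxel ids
paths = tri_oriented_scan(vol)
print(paths.shape)  # (3, 8)
```

Each orientation gives every voxel short-range neighbors along a different anatomical axis, which matters for volumetric structures that are elongated in one direction.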

Challenges remain in pre-training, interpretability, robustness, and real-time deployment, especially for distributed and edge medical applications.

Remote Sensing: Dense Prediction, Change Detection, Pan-sharpening

Mamba's linear complexity and hardware-aware design are particularly advantageous for remote sensing, where image sizes are large and context modeling is critical. MiM-ISTD, RSMamba, RS-Mamba, HSIMamba, Pan-Mamba, ChangeMamba, RS3Mamba, and Samba demonstrate the versatility of Mamba in pan-sharpening, small target detection, hyperspectral classification, dense prediction, and semantic segmentation. Omnidirectional and multi-path scanning mechanisms are frequently employed to capture spatial dependencies efficiently.

Performance, Resource Requirements, and Scaling

Across surveyed works, Mamba-based models consistently achieve competitive or superior accuracy with reduced computational and memory footprints compared to transformer baselines. FLOPs and parameter counts are often halved or better, and inference speed is improved, especially for long sequences and high-resolution data. Hardware-aware scan implementations further enhance throughput on modern GPUs.

Trade-offs include the need for careful scan mechanism selection to balance local and global context, and potential limitations in modeling highly non-local dependencies without attention. Hybrid architectures (Mamba + attention/convolution) can mitigate these issues.

Implications and Future Directions

The surveyed research demonstrates that Mamba and selective SSMs are viable alternatives to transformers for vision, offering substantial efficiency gains and competitive accuracy. Theoretical implications include the generalization of sequence modeling to multi-dimensional, input-dependent state evolution, and the bridging of RNN, CNN, and transformer paradigms.

Practically, Mamba enables deployment of high-capacity models on edge devices and real-time systems, and facilitates scaling to large images, videos, and 3D data. Future developments may include:

  • Improved pre-training strategies and transfer learning for Mamba-based vision models.
  • Enhanced interpretability and robustness, especially in medical and safety-critical domains.
  • Further integration with attention and convolution for hybrid architectures.
  • Distributed and federated deployment for large-scale remote sensing and medical imaging.

Conclusion

Visual Mamba and selective state space models represent a significant evolution in efficient vision modeling, addressing the computational bottlenecks of transformers while maintaining or improving accuracy. The surveyed architectures and applications highlight the flexibility, scalability, and practical utility of Mamba in diverse vision domains. Continued research into scan mechanisms, hybrid designs, and deployment strategies will further advance the state of the art in efficient, high-capacity vision models.
