
A Survey on Visual Mamba

Published 24 Apr 2024 in cs.CV (arXiv:2404.15956v2)

Abstract: State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Because the self-attention mechanism in transformers scales quadratically with image size, with correspondingly growing computational demands, researchers are now exploring how to adapt Mamba to computer vision tasks. This paper is the first comprehensive survey to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts behind Mamba's success: the state space model framework, selection mechanisms, and hardware-aware design. We then review vision Mamba models, categorizing them into foundational architectures and those enhanced with techniques such as convolution, recurrence, and attention. We further delve into Mamba's widespread applications in vision tasks, where it serves as a backbone at various levels of vision processing: general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. We introduce general visual tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope this endeavor sparks additional interest within the community to address current challenges and further apply Mamba models in computer vision.


Summary

  • The paper demonstrates that Visual Mamba, a selective state space model, achieves efficient long-sequence vision modeling with linear complexity.
  • It details innovative scanning mechanisms and hybrid architectures that integrate Mamba with convolution, recurrence, and attention for diverse imaging tasks.
  • The survey highlights strong performance in general, medical, and remote sensing imaging, reducing computational costs while maintaining competitive accuracy.

Survey of Visual Mamba: State Space Models for Efficient Vision

Introduction and Motivation

The surveyed paper provides a comprehensive analysis of the adaptation and application of Mamba, a selective state space model (SSM), to computer vision tasks. Mamba was originally introduced for efficient long-sequence modeling in NLP, offering linear computational complexity and hardware-aware design. The quadratic complexity of self-attention in Transformers, especially for high-resolution images, motivates the exploration of SSMs as alternatives for vision. The survey systematically categorizes foundational and enhanced Mamba architectures, details their integration with convolution, recurrence, and attention, and reviews their deployment across general, medical, and remote sensing vision tasks.

Mathematical Foundations and Architecture

Mamba builds on the SSM framework, which models sequences via hidden states evolving under linear ODEs, discretized for deep learning. The key innovation is the selective scan mechanism, where SSM parameters become input-dependent, enabling dynamic information filtering and improved context modeling. The Mamba block (Figure 1) integrates gated MLPs, SSMs, and local convolutions, with normalization and residual connections for stability and expressivity.

Figure 1: Mamba Block architecture, illustrating the integration of gated MLPs, SSMs, and local convolution for efficient sequence modeling.
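The dataflow just described can be sketched in a few lines of NumPy. This is a toy, unbatched forward pass; the weight names (`W_in`, `W_gate`, `conv_k`, `W_out`) and the placeholder `ssm` callable are our own illustration, not the reference implementation:

```python
import numpy as np

def silu(z):
    # SiLU/Swish activation used throughout Mamba blocks
    return z / (1.0 + np.exp(-z))

def mamba_block(x, W_in, W_gate, conv_k, W_out, ssm):
    """Toy Mamba block: expand, causal depthwise conv, SiLU, SSM mixing,
    multiplicative gate, project back, residual add.
    x: (L, D); W_in, W_gate: (D, E); conv_k: (K, E); W_out: (E, D)."""
    u = x @ W_in                        # expand to inner width E
    g = silu(x @ W_gate)                # gating branch
    K, L = conv_k.shape[0], len(u)
    padded = np.vstack([np.zeros((K - 1, u.shape[1])), u])
    conv = sum(conv_k[k] * padded[k:k + L] for k in range(K))  # causal depthwise conv
    y = ssm(silu(conv))                 # SSM mixes information along the sequence
    return (y * g) @ W_out + x          # gate, project back, residual connection

rng = np.random.default_rng(0)
L, D, E, K = 8, 4, 8, 3
x = rng.standard_normal((L, D))
out = mamba_block(
    x,
    rng.standard_normal((D, E)) * 0.1,
    rng.standard_normal((D, E)) * 0.1,
    rng.standard_normal((K, E)) * 0.1,
    rng.standard_normal((E, D)) * 0.1,
    # stand-in for the selective SSM: a running mean along the sequence
    ssm=lambda z: np.cumsum(z, axis=0) / (1 + np.arange(L))[:, None],
)
print(out.shape)  # (8, 4)
```

The residual connection and the SiLU-gated branch are what the survey refers to as the "gated MLP" structure wrapped around the sequence-mixing SSM.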

The discretized SSM is implemented as a global convolution, with the kernel computed from the evolution and projection parameters. Selective SSMs further generalize this by making B, C, and Δ functions of the input, allowing for context-dependent state evolution. The scan operation is hardware-optimized, leveraging parallelization and kernel fusion for efficient GPU utilization.
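A minimal, unbatched sketch of the selective recurrence makes the input dependence concrete. The projection names (`W_b`, `W_c`, `W_dt`) and the per-channel parameterization are our simplification, not the fused CUDA kernel described in the Mamba paper:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_b, W_c, W_dt):
    """Toy selective scan: the step size Delta and the matrices B and C
    depend on the current token, so the state can selectively remember
    or forget. x: (L, D); A: (D, N) with negative entries for stability;
    W_b, W_c: (D, N); W_dt: (D,). Returns y: (L, D)."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # per-channel hidden state
    y = np.empty((L, D))
    for t in range(L):
        u = x[t]                              # current token, (D,)
        dt = softplus(u * W_dt)[:, None]      # input-dependent step size Delta
        B = u[:, None] * W_b                  # input-dependent input matrix
        C = u[:, None] * W_c                  # input-dependent output matrix
        A_bar = np.exp(dt * A)                # zero-order-hold discretization
        h = A_bar * h + dt * B * u[:, None]   # selective state update
        y[t] = (h * C).sum(axis=1)            # project state back to channels
    return y

rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))      # negative real parts -> stable decay
y = selective_ssm(x, A,
                  rng.standard_normal((D, N)),
                  rng.standard_normal((D, N)),
                  rng.standard_normal(D))
print(y.shape)  # (16, 4)
```

When B, C, and Δ are constants, this loop collapses into a single global convolution; making them token-dependent is precisely what forces the hardware-aware scan instead.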

Adaptation to Vision: Scanning Mechanisms and Blocks

Vision tasks require processing multi-dimensional data. The survey details the adaptation of Mamba to 2D and 3D inputs via specialized scanning mechanisms. The ViM block and VSS block (Figure 2) are foundational, enabling bidirectional and cross-scan operations over image patches.

Figure 2: ViM Block and VSS Block, foundational components for adapting Mamba to visual data.

A taxonomy of scanning strategies is presented (Figure 3), including bidirectional, cross-scan, continuous 2D, local, efficient (atrous), zigzag, omnidirectional, hierarchical, spatiotemporal, and multi-path scans. These mechanisms are critical for balancing local and global context modeling, computational efficiency, and spatial continuity.

Figure 3: Comparison of 2D scanning and selective scan orders across various Mamba-based vision architectures.
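The core idea behind most of these strategies is to serialize a 2D patch grid along several complementary paths. A simplified version of the four-path cross-scan (our own sketch, not VMamba's exact implementation):

```python
import numpy as np

def cross_scan(patches):
    """Flatten an (H, W, C) patch grid along four paths: row-major,
    column-major, and their reverses. Each path is then fed to its own
    1D selective scan, and the outputs are merged."""
    H, W, C = patches.shape
    rowwise = patches.reshape(H * W, C)                      # left-right, top-down
    colwise = patches.transpose(1, 0, 2).reshape(H * W, C)   # top-down, left-right
    return np.stack([rowwise, rowwise[::-1], colwise, colwise[::-1]])

grid = np.arange(6).reshape(2, 3, 1)    # a tiny 2x3 grid of 1-channel "patches"
paths = cross_scan(grid)
print(paths[0, :, 0])   # [0 1 2 3 4 5]  row-major path
print(paths[2, :, 0])   # [0 3 1 4 2 5]  column-major path
```

Scanning the same grid in multiple orders is how a strictly causal 1D model recovers approximate 2D context: every patch appears early in at least one path.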

Backbone and Enhanced Architectures

The survey reviews pure Mamba backbones (ViM-based, VSS-based, Mamba-ND, SiMBA, EfficientVMamba) and their hierarchical, windowed, and multi-dimensional variants. EfficientVMamba introduces atrous scanning for lightweight models, while MambaMixer and Mamba-ND generalize selection across tokens and channels, and to higher dimensions.

Integration with other architectures is explored:

  • Convolution: Res-VMamba incorporates residual connections for local-global feature fusion.
  • Recurrence: VMRNN combines VSS blocks with LSTM for spatiotemporal modeling.
  • Attention: SSM-ViT and MMA blocks fuse SSMs with self-attention and channel attention for enhanced representation.

Applications in General Vision Tasks

Mamba-based models are evaluated across high/mid-level (classification, detection, segmentation, video understanding, multimodal fusion) and low-level (restoration, super-resolution, generation) vision tasks. Notable findings include:

  • Linear complexity enables efficient processing of long sequences and high-resolution inputs.
  • ViM, VMamba, PlainMamba, and LocalMamba achieve competitive accuracy with reduced FLOPs and parameter counts.
  • VideoMamba and the Video Mamba Suite demonstrate scalability and efficiency for video understanding.
  • MambaIR, Serpent, and VMambaIR outperform transformer-based baselines in image restoration with lower memory and computation.
  • Point cloud models (SSPointMamba, 3DMambaComplete) leverage Mamba for efficient global modeling and geometric reasoning.
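The quadratic-versus-linear scaling behind the first point can be made concrete with a back-of-envelope FLOP count. The two cost functions below are our own rough estimates, keeping only the dominant terms:

```python
def attention_flops(seq_len, dim):
    """Rough self-attention cost: the QK^T and AV matmuls dominate, O(L^2 d)."""
    return 2 * seq_len * seq_len * dim

def ssm_flops(seq_len, dim, state_size):
    """Rough selective-SSM cost: one state update per token, O(L d N)."""
    return seq_len * dim * state_size

dim, state_size = 768, 16
for seq_len in (196, 1024, 4096):       # e.g. 14x14, 32x32, 64x64 patch grids
    ratio = attention_flops(seq_len, dim) / ssm_flops(seq_len, dim, state_size)
    print(f"L={seq_len}: attention/SSM ratio {ratio:g}x")
```

Under these assumptions the advantage grows linearly with sequence length (the ratio is roughly 2L/N), which is why the gap is largest for high-resolution images and long videos.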

Medical Imaging: 2D and 3D Segmentation, Classification, Registration

Mamba architectures have rapidly proliferated in medical imaging, particularly for segmentation (Figure 4). U-Mamba, H-vmunet, UltraLight VM-UNet, VM-UNet, and VM-UNET-V2 adapt Mamba blocks to U-Net and hierarchical designs, achieving strong results in 2D segmentation with significant parameter and FLOP reductions.

Figure 4: Overview of Mamba models for segmentation in 2D medical images, highlighting architectural diversity and efficiency.

3D segmentation models (SegMamba, LightM-UNet, LMa-UNet, T-Mamba, Vivim) extend scanning to volumetric data, with tri-orientated and frequency-enhanced blocks. MambaMorph and VMambaMorph address deformable registration, while MedMamba and MambaMIL target classification and long-sequence modeling in pathology.
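Extending 2D scans to volumes amounts to flattening along additional axis orders. A minimal sketch of tri-orientated flattening, our simplification of the SegMamba-style scheme rather than its actual kernels:

```python
import numpy as np

def tri_oriented_scan(vol):
    """Flatten a (D, H, W) volume along three axis orders so a 1D SSM can
    aggregate context depth-first, height-first, and width-first."""
    return np.stack([vol.reshape(-1),                      # depth -> height -> width
                     vol.transpose(1, 2, 0).reshape(-1),   # height -> width -> depth
                     vol.transpose(2, 0, 1).reshape(-1)])  # width -> depth -> height

vol = np.arange(2 * 2 * 2).reshape(2, 2, 2)   # tiny 2x2x2 volume of voxel ids
paths = tri_oriented_scan(vol)
print(paths.shape)  # (3, 8)
```

Each orientation gives every voxel short-range neighbors along a different anatomical axis, which matters for volumetric structures that are elongated in one direction.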

Challenges remain in pre-training, interpretability, robustness, and real-time deployment, especially for distributed and edge medical applications.

Remote Sensing: Dense Prediction, Change Detection, Pan-sharpening

Mamba's linear complexity and hardware-aware design are particularly advantageous for remote sensing, where image sizes are large and context modeling is critical. MiM-ISTD, RSMamba, RS-Mamba, HSIMamba, Pan-Mamba, ChangeMamba, RS3Mamba, and Samba demonstrate the versatility of Mamba in pan-sharpening, small target detection, hyperspectral classification, dense prediction, and semantic segmentation. Omnidirectional and multi-path scanning mechanisms are frequently employed to capture spatial dependencies efficiently.

Performance, Resource Requirements, and Scaling

Across surveyed works, Mamba-based models consistently achieve competitive or superior accuracy with reduced computational and memory footprints compared to transformer baselines. FLOPs and parameter counts are often halved or better, and inference speed is improved, especially for long sequences and high-resolution data. Hardware-aware scan implementations further enhance throughput on modern GPUs.

Trade-offs include the need for careful scan mechanism selection to balance local and global context, and potential limitations in modeling highly non-local dependencies without attention. Hybrid architectures (Mamba + attention/convolution) can mitigate these issues.

Implications and Future Directions

The surveyed research demonstrates that Mamba and selective SSMs are viable alternatives to transformers for vision, offering substantial efficiency gains and competitive accuracy. Theoretical implications include the generalization of sequence modeling to multi-dimensional, input-dependent state evolution, and the bridging of RNN, CNN, and transformer paradigms.

Practically, Mamba enables deployment of high-capacity models on edge devices and real-time systems, and facilitates scaling to large images, videos, and 3D data. Future developments may include:

  • Improved pre-training strategies and transfer learning for Mamba-based vision models.
  • Enhanced interpretability and robustness, especially in medical and safety-critical domains.
  • Further integration with attention and convolution for hybrid architectures.
  • Distributed and federated deployment for large-scale remote sensing and medical imaging.

Conclusion

Visual Mamba and selective state space models represent a significant evolution in efficient vision modeling, addressing the computational bottlenecks of transformers while maintaining or improving accuracy. The surveyed architectures and applications highlight the flexibility, scalability, and practical utility of Mamba in diverse vision domains. Continued research into scan mechanisms, hybrid designs, and deployment strategies will further advance the state of the art in efficient, high-capacity vision models.
