Mamba-SAM: Hybrid SSM for Efficient Segmentation
- Mamba-SAM is a hybrid segmentation framework that combines a frozen SAM Vision Transformer with efficient Mamba state-space modules for robust global and local context integration.
- It employs adapter-based augmentation and dual-branch fusion strategies to achieve high parameter efficiency and rapid convergence across diverse applications.
- Empirical results demonstrate superior performance in 3D medical imaging, polyp segmentation, and light field salient object detection, outperforming traditional CNN and transformer baselines.
Mamba-SAM refers to a family of hybrid architectures and adaptation strategies that synergistically combine the Segment Anything Model (SAM)—typically a frozen, large Vision Transformer (ViT) pretrained on natural images as a universal segmentation backbone—with Mamba, a class of selective state space models (SSMs) that achieve linear-time global context propagation. Mamba-SAM variants have been developed independently across multiple domains, including 2D/3D medical image segmentation, polyp segmentation, light field salient object detection, and general semantic segmentation. The unifying principle is to combine SAM’s promptability and pretraining scale with Mamba’s computational efficiency for long-range modeling, often yielding highly parameter- and data-efficient solutions with robust generalization.
1. Hybrid Architectures and Core Principles
At the architectural level, Mamba-SAM encompasses several integration strategies:
- Adapter-based augmentation: Lightweight Mamba or Tri-Plane Mamba (TP-Mamba) state-space modules are injected into each block of the frozen SAM ViT, often in parallel or sequence with LoRA adapters and/or multi-scale convolutional adapters. These modules are responsible for encoding local and global volumetric or spatial correlations, especially in data regimes where 3D context is paramount but full retraining or quadratic self-attention is infeasible (Wang et al., 2024, Shahraki et al., 31 Jan 2026).
- Dual-branch or cross-attentional fusion: A separate, trainable Mamba or VMamba encoder specializes in domain adaptation while the SAM branch remains frozen. The feature spaces from the Mamba and SAM modules are fused, typically via cross-branch attention, with the result decoded by a lightweight or learnable decoder (Shahraki et al., 31 Jan 2026, Huo et al., 22 May 2025).
- Domain-specific priors and prompt adaptation: In certain applications (e.g., zero-shot polyp segmentation), a “Mamba-Prior” module injects domain-specific cues into the frozen SAM encoder via multi-scale spatial decomposition and Mamba channel interaction blocks, enhancing sensitivity to target structures otherwise missed by generalist pretraining (Dutta et al., 2024, Dutta et al., 2 Jul 2025).
A shared emphasis is placed on parameter efficiency: the majority of the model (the SAM backbone) is kept frozen, and adaptation is restricted to a small number of newly introduced parameters (adapters or specialist encoders), typically under 10% of total model capacity (Wang et al., 2024, Dutta et al., 2024, Huo et al., 22 May 2025).
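The adapter-injection pattern above can be sketched in a few lines. The block structure, dimensions, and the `Adapter` class below are illustrative assumptions (a plain low-rank bottleneck standing in for a Mamba/TP-Mamba module between the projections), not the published implementation:

```python
import numpy as np

class FrozenViTBlock:
    """Stand-in for one frozen SAM ViT block; its weights are never trained."""
    def __init__(self, dim, rng):
        self.w = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        return x + np.tanh(x @ self.w)  # toy residual mixing

class Adapter:
    """Illustrative low-rank adapter inserted after each frozen block.
    A real (TP-)Mamba adapter would run a state-space scan between the
    projections; this sketch keeps only the down/up-projection skeleton."""
    def __init__(self, dim, rank, rng):
        self.down = rng.standard_normal((dim, rank)) / np.sqrt(dim)
        self.up = np.zeros((rank, dim))  # zero-init: adapter starts as identity

    def __call__(self, x):
        return x + (x @ self.down) @ self.up

    def num_params(self):
        return self.down.size + self.up.size

rng = np.random.default_rng(0)
dim, rank, depth = 768, 16, 12   # assumed ViT-B-like width and depth
blocks = [FrozenViTBlock(dim, rng) for _ in range(depth)]
adapters = [Adapter(dim, rank, rng) for _ in range(depth)]

x = rng.standard_normal((4, dim))  # 4 tokens
for blk, ad in zip(blocks, adapters):
    x = ad(blk(x))

frozen = sum(b.w.size for b in blocks)
trainable = sum(a.num_params() for a in adapters)
print(f"trainable fraction: {trainable / (frozen + trainable):.2%}")
```

Zero-initializing the up-projection means each adapter starts as an identity map, so training begins from the frozen backbone's pretrained behavior; only the adapter parameters (here well under 10% of the total) receive gradients.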
2. State-Space Models and the Mamba Block
Mamba blocks leverage selective SSMs to achieve linear scaling with respect to sequence length. In contrast to standard ViT attention (O(N²) complexity), the Mamba block maintains (for each token) hidden states that are recursively updated, enabling information integration over entire spatial or volumetric extents without explicit pairwise attention (Wang et al., 2024, Yuan et al., 2024, Dutta et al., 2 Jul 2025). In Tri-Plane Mamba, state-space scanning is performed along HW, DH, and DW planes, fusing outputs for global context at O(L·r) cost per plane (Wang et al., 2024).
The discrete SSM updates take the form h_t = Ā h_{t−1} + B̄ x_t and y_t = C h_t, operating per block or per plane, where (Ā, B̄) are discretizations of continuous-time state matrices via a step size Δ; in the selective (Mamba) variant, Δ, B, and C are input-dependent functions of x_t, and the recurrence is combined with nonlinear gating.
Notably, 1D-2D and tri-plane variants of the Mamba block are commonly used depending on the dimensionality of the target problem and the type of context (volumetric, boundary, or spatial) required (Wang et al., 2024, Dutta et al., 2024, Dutta et al., 2 Jul 2025).
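The linear-time scan underlying these blocks can be sketched as follows. The specific input-dependent parameterizations here (a softplus step size and linear projections for B and C) follow the general selective-SSM recipe but are simplified assumptions, not the optimized Mamba kernel:

```python
import numpy as np

def selective_ssm_scan(x, A, B_proj, C_proj, dt_proj):
    """Linear-time selective scan over a token sequence:
        h_t = exp(dt_t * A) * h_{t-1} + dt_t * B_t * x_t,   y_t = h_t . C_t
    with dt_t, B_t, C_t derived from the input token x_t (the 'selective' part).
    Cost is O(L * d * n) for L tokens, d channels, n states per channel --
    no L x L attention map is ever formed."""
    L, d = x.shape
    n = A.shape[1]
    h = np.zeros((d, n))
    ys = np.empty((L, d))
    for t in range(L):                                  # single forward pass
        dt = np.log1p(np.exp(x[t] @ dt_proj))[:, None]  # softplus step, (d, 1)
        Bt = (x[t] @ B_proj)[None, :]                   # input-dep. B, (1, n)
        Ct = x[t] @ C_proj                              # input-dep. C, (n,)
        h = np.exp(dt * A) * h + dt * Bt * x[t][:, None]  # ZOH-style update
        ys[t] = h @ Ct
    return ys

rng = np.random.default_rng(0)
L, d, n = 64, 8, 4
x = rng.standard_normal((L, d))
A = -np.abs(rng.standard_normal((d, n)))  # negative poles -> stable decay
y = selective_ssm_scan(x, A,
                       rng.standard_normal((d, n)) * 0.1,
                       rng.standard_normal((d, n)) * 0.1,
                       rng.standard_normal((d, d)) * 0.1)
print(y.shape)
```

The loop touches each token exactly once, so cost grows as O(L) in sequence length rather than the O(L²) of pairwise attention; production Mamba kernels replace the Python loop with a hardware-aware parallel scan. A tri-plane variant simply runs this scan along the flattened HW, DH, and DW orderings of a volume and fuses the three outputs.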
3. Parameter and Computational Efficiency
Mamba-SAM explicitly addresses the prohibitive cost of naive 3D self-attention or large ViT finetuning, especially on modest GPU hardware. Key design features include:
- Per-ViT-block adapter overhead is kept under 1M parameters even with TP-Mamba, contrasting with >1.2M for a full 3D self-attention implementation (Wang et al., 2024).
- Floating point operation (FLOP) count for TP-Mamba adapters is typically two orders of magnitude lower per block than quadratic attention (e.g., +0.45 GFlops/block vs. +18.86 GFlops/block for quadratic attention on 96³ input) (Wang et al., 2024).
- Adapter-based variants with Mamba or TP-Mamba achieve full 3D segmentation with <25% of total parameters needing to be trained, providing practical fine-tuning even with scarce medical datasets (Shahraki et al., 31 Jan 2026).
Empirically, Mamba-SAM matches or exceeds the segmentation accuracy of strong baselines—UNet++, nnUNet, Swin-Unet—while converging much faster (e.g., 200 epochs to reach full-data performance vs. 1000 epochs for legacy methods) (Wang et al., 2024).
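The order-of-magnitude gap can be sanity-checked with back-of-the-envelope arithmetic. The patch size, channel width, and state size below are illustrative assumptions, not the exact configuration behind the GFLOP figures quoted above:

```python
# Back-of-the-envelope cost comparison: quadratic self-attention vs. a
# linear state-space scan over the tokens of a 96^3 volume.
patch = 8             # assumed voxel-grouping size per axis (hypothetical)
d = 768               # assumed channel width
n_state = 16          # assumed SSM state size per channel

tokens = (96 // patch) ** 3             # 12^3 = 1728 tokens
attn_flops = 2 * tokens**2 * d          # ~Q K^T plus A V products, O(L^2 d)
ssm_flops = 2 * tokens * d * n_state    # one linear scan, O(L d n)

print(f"tokens={tokens}")
print(f"attention ~{attn_flops:,} FLOPs, ssm ~{ssm_flops:,} FLOPs")
print(f"ratio ~ {attn_flops / ssm_flops:.0f}x")
```

The ratio reduces to tokens/n_state, so the advantage of the scan grows linearly with sequence length; at this (assumed) resolution it already exceeds two orders of magnitude, consistent with the per-block comparison above.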
4. Application Domains and Empirical Results
3D Medical Image Segmentation
Mamba-SAM, particularly in the TP-Mamba formulation, demonstrates state-of-the-art performance on volumetric segmentation tasks with minimal annotated data. When trained on only 12% of the BTCV dataset (3 CT volumes), TP-Mamba adapters on a frozen SAM backbone outperform strong 3D CNN and transformer baselines by up to 12.3 Dice points (Wang et al., 2024). Similarly, in cardiac MRI, hybrid dual-branch architectures achieve mean Dice scores of 0.906, matching or exceeding UNet++ and outperforming MambaUNet (Shahraki et al., 31 Jan 2026).
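The Dice scores quoted throughout are the standard region-overlap metric; for reference, a minimal implementation on binary masks (the toy masks below are illustrative):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks:
        DSC = 2 |P ∩ T| / (|P| + |T|),  in [0, 1].
    A 'Dice point' in the text corresponds to 0.01 of this quantity."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

a = np.zeros((8, 8), dtype=int); a[2:6, 2:6] = 1   # 16-pixel square
b = np.zeros((8, 8), dtype=int); b[3:7, 3:7] = 1   # diagonally shifted copy
print(round(dice_coefficient(a, b), 3))  # overlap 9 -> 2*9/(16+16) = 0.562
```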
2D Medical and Histopathology Segmentation
Mamba-SAM variants incorporating uncertainty-aware losses and Sharpness-Aware Minimization (SAM; an acronym collision with the Segment Anything Model) optimizers set new benchmarks for cardiac (ACDC) and cardiovascular (Aorta, ImageCAS) segmentation. The combination of Mamba SSMs, robust loss functions, and flat-minima optimization yields consistent improvements in Dice Similarity Coefficient, stability, and generalization even as data availability decreases (Tsai et al., 2024, Tsai et al., 2024).
Polyp Segmentation and Real-Time Inference
In generalized polyp segmentation, boundary-guided Mamba-SAM configurations such as SAM-Mamba and SAM-MaGuP utilize Mamba-guided adapters and boundary distillation modules for unmatched performance across both seen and zero-shot unseen domains (Kvasir-SEG, CVC, ETIS). These designs exhibit high speed (25–30 FPS on A100), small incremental parameter counts (<10M), and improved mDice/mIoU over adapter-only and transformer-only rivals (Dutta et al., 2 Jul 2025, Dutta et al., 2024).
Light Field and Weakly Supervised Tasks
LFSamba demonstrates that the Mamba-SAM framework is extensible to multi-focus light field salient object detection, leveraging Mamba modules to model inter-slice (focal stack) and inter-modal relations atop SAM-based feature extraction. Even under scribble supervision, LFSamba outperforms all contemporary methods by a significant margin (Liu et al., 2024). Similarly, SparseMamba-PCL incorporates Mamba with Med-SAM for scribble-supervised medical segmentation, showing state-of-the-art results on ACDC, CHAOS, and MSCMRSeg with innovative skip-sampling for linear inference (Qiu et al., 3 Mar 2025).
5. Loss Functions, Training Strategies, and Robustness
Advanced loss strategies are central to Mamba-SAM success:
- Uncertainty-aware composite loss: Jointly optimizes region-based (Dice), distribution-based (Cross-Entropy), and pixel-level (Focal) losses, with learnable uncertainty weights per component. This adaptively down-weights noisy or ambiguous components during training, improving both segmentation accuracy and MSE robustness (Tsai et al., 2024, Tsai et al., 2024).
- Sharpness-Aware Minimization (SAM): Minimizing the worst-case loss over a small ball in parameter space (min_w max_{‖ε‖≤ρ} L(w+ε)) steers optimization toward flat minima, enhancing out-of-distribution generalization and preventing overfitting, especially in regimes with limited annotations (Tsai et al., 2024, Tsai et al., 2024).
- Boundary and distillation objectives: In boundary-challenged modalities (colonoscopy), Mamba-SAM applies distillation from boundary-enhanced representations and boundary losses for sharper edge localization (Dutta et al., 2 Jul 2025, Dutta et al., 2024).
- Prompt-based and zero-shot/few-shot paradigm: Where practical, Mamba-SAM leverages the original prompt encoder/decoder organization of SAM, extending to pseudo-prompts (predicted masks as prompts) or, in 3D, hypothesized extensions to prompt encoders (Wang et al., 2024, Dutta et al., 2024).
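The uncertainty-aware composite loss can be sketched with Kendall-style learnable log-variance weights. The three component losses below are toy stand-ins operating on probability maps, an assumed formulation rather than the cited papers' exact one:

```python
import numpy as np

def dice_loss(p, t, eps=1e-7):
    """Soft region-overlap loss: 1 - Dice."""
    return 1 - (2 * (p * t).sum() + eps) / (p.sum() + t.sum() + eps)

def ce_loss(p, t, eps=1e-7):
    """Binary cross-entropy, averaged over pixels."""
    return -(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)).mean()

def focal_loss(p, t, gamma=2.0, eps=1e-7):
    """Focal loss: down-weights easy pixels by (1 - p_t)^gamma."""
    pt = np.where(t == 1, p, 1 - p)
    return -(((1 - pt) ** gamma) * np.log(pt + eps)).mean()

def uncertainty_weighted_loss(p, t, log_vars):
    """Combine region (Dice), distribution (CE), and pixel (focal) losses
    with learnable per-component log-variances s_i:
        L = sum_i exp(-s_i) * L_i + s_i
    High-uncertainty (noisy) components are automatically down-weighted."""
    components = [dice_loss(p, t), ce_loss(p, t), focal_loss(p, t)]
    return sum(np.exp(-s) * L for s, L in zip(log_vars, components)) \
        + float(np.sum(log_vars))

rng = np.random.default_rng(0)
t = (rng.random((16, 16)) > 0.7).astype(float)            # toy binary target
p = np.clip(t * 0.8 + rng.random((16, 16)) * 0.2, 1e-4, 1 - 1e-4)
log_vars = np.zeros(3)  # learnable parameters in practice; zeros here
print(uncertainty_weighted_loss(p, t, log_vars))
```

In training, the `log_vars` would be optimized jointly with the network, so a component whose loss stays noisy drives its log-variance up and its effective weight exp(-s_i) down, while the additive s_i term prevents the trivial solution of infinite variance.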
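The Sharpness-Aware Minimization step itself is a two-stage update; a minimal sketch on a toy ill-conditioned quadratic, with `rho` (neighborhood radius) and `lr` as assumed hyperparameters:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step:
    1) climb to the (first-order) worst case in a rho-ball: e = rho * g/||g||
    2) descend using the gradient evaluated at the perturbed point w + e."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # inner maximizer
    g_sharp = grad_fn(w + eps)                   # gradient at perturbed weights
    return w - lr * g_sharp

# Toy ill-conditioned quadratic: L(w) = 0.5 * (w1^2 + 100 * w2^2),
# i.e. one flat direction and one sharp direction.
def grad(w):
    return np.array([1.0, 100.0]) * w

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w, grad, lr=0.009, rho=0.05)
print(np.round(w, 3))  # settles near the minimum at the origin
```

Each step costs two gradient evaluations instead of one; the payoff, per the papers cited above, is convergence into flatter basins that generalize better when annotations are scarce.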
6. Limitations and Prospects
Despite notable advances, Mamba-SAM methods face several open technical challenges:
- Prompt adaptation for 3D and interactivity: While 2D prompt encoders are fixed, adaptation to volumetric or interactive 3D cues remains underexplored (Wang et al., 2024).
- Boundary accuracy and refinement: Certain adapter-based variants that achieve high Dice may still underperform on fine-grained boundary metrics (HD95, E_φ^max), motivating research into contour-aware losses and decoder design (Shahraki et al., 31 Jan 2026, Dutta et al., 2024).
- Data and modality generalization: Direct extension to non-CT/MRI modalities, as well as scaling to multi-modal segmentation, requires careful calibration and further innovation (Wang et al., 2024, Huo et al., 22 May 2025).
- Adaptive aggregation and weighting: Simple summation or static gating in multi-plane or multi-branch fusion may be suboptimal; learned aggregation strategies or attention-weighted fusions are plausible improvements (Wang et al., 2024, Huo et al., 22 May 2025).
In sum, Mamba-SAM constitutes a generalizable, computationally efficient blueprint for leveraging frozen foundation models in data-scarce or high-dimensional segmentation, augmented by linear-time state-space modeling via Mamba at both global and local levels. Its efficacy is empirically supported across diverse tasks, but methodological progress in prompt handling, boundary delineation, and aggregation dynamics will define the next stage of research (Wang et al., 2024, Dutta et al., 2024, Dutta et al., 2 Jul 2025, Shahraki et al., 31 Jan 2026, Tsai et al., 2024, Huo et al., 22 May 2025).