
Mamba-UNet: SSM-Enhanced U-Net

Updated 22 February 2026
  • Mamba-UNet is a U-Net variant that deploys state-space model blocks to achieve efficient, linear-time global context modeling.
  • Its architecture combines multi-scale fusion, bidirectional skip connections, and optimized SSMs for superior performance across medical imaging, speech, and super-resolution tasks.
  • Empirical results indicate that Mamba-UNet outperforms conventional transformer-based methods with significantly lower FLOPs and parameter counts.

Mamba-UNet refers to a class of U-Net–structured architectures in which state-space sequence models, specifically the Mamba model, replace quadratic self-attention or standard convolutional blocks for efficient, long-range sequence or spatial modeling. The explicit goal is to combine the multi-scale fusion and skip-connection benefits of U-Net with the linear-time, global context-capturing capacity of Mamba state-space models. Mamba-UNet has found direct application in medical image segmentation (2D/3D, supervised/weakly-supervised/semi-supervised), image super-resolution, and monaural speech enhancement. This entry synthesizes the salient technical developments, architectural patterns, mathematical underpinnings, empirical findings, and open directions for research on Mamba-UNets.

1. Key Architectural Components and Computation

All Mamba-UNet variants share a U-Net–like encoder–decoder skeleton with multi-resolution skip connections. The principal innovation is the systematic replacement of self-attention or convolution with structured state-space model (SSM) blocks that execute global, linear-complexity mixing at each scale.

Canonical Mamba SSM Core

At every core block, Mamba solves a linear, time-invariant state-space ODE discretized via zero-order hold:

$$\dot x(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t)$$

with learned matrices $A, B, C, D$, input $u(t)$, state $x(t)$, and output $y(t)$. After discretization,

$$x_k = \overline{A}\,x_{k-1} + \overline{B}\,u_k, \qquad y_k = C\,x_k + D\,u_k$$

with $\overline{A} = e^{\Delta A}$ and $\overline{B} = A^{-1}(e^{\Delta A} - I)B$. This recurrence is efficiently implemented as a global 1D convolution over length $L$:

$$y = u * \overline{K}, \qquad \overline{K} = \left[C\overline{B},\; C\overline{A}\,\overline{B},\; \ldots,\; C\overline{A}^{\,L-1}\overline{B}\right]$$

The computational complexity is $O(LN^2)$ per channel (for state size $N$), versus $O(L^2)$ for self-attention.
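The equivalence between the recurrence and the convolutional form can be checked numerically. The sketch below uses a diagonal state matrix (as in S4/Mamba-style SSMs) and omits the direct feedthrough term $D\,u_k$ for brevity; real Mamba additionally makes $\Delta, B, C$ input-dependent, which this toy version does not.

```python
import numpy as np

def ssm_scan(a, b, c, delta, u):
    """Recurrent form: x_k = Ā x_{k-1} + B̄ u_k, y_k = c·x_k.
    a, b, c: (N,) vectors for a diagonal state matrix; u: (L,) input."""
    a_bar = np.exp(delta * a)          # Ā = e^{ΔA}, elementwise for diagonal A
    b_bar = (a_bar - 1.0) / a * b      # B̄ = A^{-1}(e^{ΔA} - I) B
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a_bar * x + b_bar * u_k    # state update
        y[k] = c @ x                   # readout (D term omitted)
    return y

def ssm_conv(a, b, c, delta, u):
    """Equivalent convolutional form: y = u * K̄ with K̄_k = C Ā^k B̄."""
    L = len(u)
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    K = np.array([c @ (a_bar ** k * b_bar) for k in range(L)])
    # causal 1D convolution of u with kernel K
    return np.array([sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(L)])
```

Both forms produce the same output; the convolutional view is what enables parallel training, while the recurrence gives constant-memory inference.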

U-Net Integration

In Mamba-UNet, these SSM blocks appear:

  • At each encoder/decoder stage, operating on sequence tokens (flattened patches, time-frames, or pixels).
  • Bidirectionally: both forward and reverse recurrences are applied and fused, enhancing information integration much like BLSTM in sequence tasks.
  • Along multiple axes: TS-Mamba and FS-Mamba alternately model dependencies along the time and frequency axes (for audio), or along row/column/diagonal axes (for images).

A representative U-Net pass alternates:

  • Downsampling: patch embedding or strided convolutions.
  • Core block: stack(s) of bidirectional Mamba SSM modules (with or without adaptive gating, local conv pre-processing, or hybrid attention).
  • Upsampling: transpose-conv or channel → space shuffling.
  • Skip-connections: concatenation, addition, or attention-based fusion between corresponding encoder and decoder levels.
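The alternation above can be sketched at the shape level. The toy pass below stands in a no-op for the Mamba core and uses reshape-based down/upsampling and concatenation skips; all names and the channel-reduction step are illustrative, not from any specific paper.

```python
import numpy as np

def down(x):
    # Patch-merging-style downsample: halve length, double channels (toy)
    B, L, C = x.shape
    return x.reshape(B, L // 2, 2 * C)

def up(x):
    # Channel -> space shuffle: double length, halve channels
    B, L, C = x.shape
    return x.reshape(B, 2 * L, C // 2)

def core(x):
    # Stand-in for a stack of bidirectional Mamba SSM blocks (identity here)
    return x

def unet_pass(x, depth=2):
    """Encoder-decoder pass over x of shape (B, L, C) with concat skips."""
    skips = []
    for _ in range(depth):
        x = core(x)
        skips.append(x)            # save pre-downsample features for the skip
        x = down(x)
    x = core(x)                    # bottleneck
    for _ in range(depth):
        x = up(x)
        x = np.concatenate([x, skips.pop()], axis=-1)  # skip fusion by concat
        x = x[..., : x.shape[-1] // 2]                 # toy channel reduction
    return x
```

The output shape matches the input, as required of a segmentation-style U-Net; in a real model the channel reduction would be a learned projection.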

Distinct Mamba-UNet variants add local detail via convolution (e.g., depthwise-separable convs) in parallel branches or extended multi-branch modules to balance local and global feature extraction (Fan et al., 25 Dec 2025).

2. Mathematical and Implementation Details

The SSM block operates on a 1D (or windowed/flattened 2D/3D) sequence $u_1, \ldots, u_L$. Bidirectional contexts are computed as:

$$x_f = \mathrm{RMSNorm}(\mathrm{FMamba}(x)) + x, \qquad x_b = \mathrm{RMSNorm}(\mathrm{BMamba}(\mathrm{Flip}(x))) + x,$$

$$x' = \mathrm{Linear}([x_f,\; x_b]).$$
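The bidirectional fusion step can be sketched as follows. The forward/backward mixers are passed in as callables, and flipping the backward branch's output back to the original order is an assumption (the equations leave it implicit) so the residual adds at consistent positions.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm over the channel dimension
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def bidirectional_block(x, mamba_f, mamba_b, W):
    """One bidirectional fusion step for x of shape (L, C).

    mamba_f / mamba_b stand in for the forward and backward Mamba mixers;
    W is a (2C, C) fusion matrix (the Linear in the equations)."""
    x_f = rms_norm(mamba_f(x)) + x                   # forward context + residual
    x_b = rms_norm(mamba_b(x[::-1]))[::-1] + x       # backward context, flipped back
    return np.concatenate([x_f, x_b], axis=-1) @ W   # linear fusion of both branches
```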

For visual tasks, windowing, patchifying, or directionally selective scanning (SS2D, ISS2D) addresses spatial anisotropy and enhances 2D/3D context (Zhang et al., 2024, Ji et al., 2024, Wang et al., 2024). Hybrid modules may combine bidirectional Mamba with residual CNN layers, channel-split parallelism, local feature perception branches, or attention-augmented fusion for efficient local-global integration (Fan et al., 25 Dec 2025, Wu et al., 2024, Xie et al., 21 Mar 2025).

In speech enhancement applications, STFT is performed to obtain time–frequency inputs; magnitude and phase are handled by separate output branches, each reconstructed with a learnable masking (magnitude) or arctan2 output (phase) before ISTFT (Wang et al., 2024).
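The magnitude/phase recombination described above can be sketched as below. The STFT here is a minimal Hann-window implementation, and the branch names (`mask`, `phase_r`, `phase_i`) are illustrative; the real networks predict these maps with learned decoders.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Minimal Hann-window STFT -> complex (frames, bins) array
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=-1)

def combine_branches(noisy_spec, mask, phase_r, phase_i):
    """Recombine the two decoder branches into a complex spectrum for ISTFT.

    mask: multiplicative mask from the magnitude branch;
    phase_r / phase_i: two real maps from the phase branch, converted
    to an angle via arctan2, as described in the text."""
    mag = np.abs(noisy_spec) * mask
    phase = np.arctan2(phase_i, phase_r)
    return mag * np.exp(1j * phase)
```

With a unit mask and phase maps equal to the cosine/sine of the noisy phase, the noisy spectrum is recovered exactly, which is a useful sanity check on the parameterization.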

3. Efficiency, Scalability, and Empirical Performance

Mamba-UNet’s central advantage is its linear-complexity global modeling. Empirical benchmarks across modalities and datasets demonstrate:

| Model | Params | FLOPs | Task/Domain | SOTA Metric / Score | Reference |
|---|---|---|---|---|---|
| Mamba-SEUNet (L) | 6.3M | 18.2G | Speech enhancement | PESQ = 3.73 (with PCS) | (Wang et al., 2024) |
| LightM-UNet | 1.1–1.9M | 268–458M | Medical image segmentation | DSC = 84.6–96.2% (2D/3D) | (Liao et al., 2024) |
| SAMba-UNet | – | – | Cardiac MRI segmentation | Dice = 0.9103, HD95 = 1.086 mm | (Huo et al., 22 May 2025) |
| UltraLight VM-UNet | 0.049M | 0.060G | Skin lesion segmentation | Dice = 0.91 (ISIC17), ACC = 0.965 | (Wu et al., 2024) |
| UltraLBM-UNet | 0.034M | 0.060G | Skin lesion segmentation | Dice = 88.8% (ISIC17) | (Fan et al., 25 Dec 2025) |
| MM-UNet (Meta) | – | – | 3D medical image segmentation | Dice = 91.0% (AMOS2022) | (Xie et al., 21 Mar 2025) |
| Swin-UMamba† (w/ pretrain) | 28M | 19G | Abdomen MRI | DSC = 0.7705 | (Liu et al., 2024) |
| MSVM-UNet | 35.9M | 15.5G | Synapse (multi-organ) | Dice = 85.0%, HD95 = 14.75 mm | (Chen et al., 2024) |

Compared to transformer-based or vanilla UNet baselines, Mamba-UNet achieves state-of-the-art accuracy with 3–100x lower FLOPs and parameter counts, supporting deployment in real-time, mobile, or low-power environments.

4. Application Domains and Data Regimes

The Mamba-UNet family is deployed in several domains:

  • Medical Image Segmentation: 2D and 3D tasks, including cardiac MRI, abdominal CT, skin lesion segmentation, vessel and polyp segmentation. Architectures range from pure Mamba-based backbones (Wang et al., 2024, Zhang et al., 2024) to hybrid modules integrating CNN branches, Kolmogorov–Arnold networks, or attention-style fusion. Weakly-supervised, semi-supervised, and knowledge-distilled variants are demonstrated to outperform purely CNN and ViT U-Nets in low-annotation and dense-labeling regimes (Ma et al., 2024, Wang et al., 2024, Fan et al., 25 Dec 2025, Wu et al., 2024).
  • Speech Enhancement: Mamba-SEUNet leverages bidirectional SSMs along both time and frequency axes for waveform-level enhancement, matching or exceeding transformer and conformer U-Net baselines on PESQ and subjective MOS metrics at vastly lower FLOPs (Wang et al., 2024).
  • Super-Resolution: Mamba-UNet variants incorporating adaptive directional 2D scans and self-prior learning (via brightness masking) demonstrate state-of-the-art PSNR/SSIM on MRI and fastMRI datasets, exceeding transformer and CNN baselines (Ji et al., 2024).
  • Wireless Communication and Other Spatiotemporal Fields: In RadioMamba, U-Net blocks hybridize Mamba SSMs and convolution for radio map construction, delivering the best-known accuracy–efficiency balance relative to both convolutional and diffusion-based generative methods (Jia et al., 28 Jul 2025).

5. Empirical Insights, Ablations, and Limitations

Extensive ablation studies provide design guidance:

  • Block Count: Increasing the number of bidirectional Mamba blocks yields monotonically improved task scores up to a plateau (e.g., 1 → 4 blocks: PESQ rises from 3.46 → 3.57) (Wang et al., 2024).
  • Hybrid Integration: Embedding Mamba SSMs within residual blocks after CNN layers mitigates the high-variance modeling issues seen in high-frequency medical images (Xie et al., 21 Mar 2025).
  • Bidirectionality: Bidirectional (BLSTM-style) Mamba fusion systematically outperforms unidirectional SSMs, especially in spatially-structured data (Wang et al., 2024, Fan et al., 25 Dec 2025).
  • Parameter Reduction: Channel-wise splitting and parallel VMamba branches dramatically reduce model size (0.049M params in UltraLight VM-UNet, 0.034M in UltraLBM-UNet) with minor or no accuracy loss (Wu et al., 2024, Fan et al., 25 Dec 2025).
  • Local Context Branches: Combining SSMs with convolution branches for local information enhances performance, especially at skip-fusion bottlenecks (Jia et al., 28 Jul 2025, Fan et al., 25 Dec 2025).
  • Window/Scan Direction: Large receptive fields are best obtained via large Mamba kernels, multi-directional (cardinal & diagonal) scanning, and hybrid positional encodings (Wang et al., 2024, Chen et al., 2024, Ji et al., 2024).
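The multi-directional scanning mentioned in the last point amounts to flattening the same 2D grid along different traversal orders before feeding it to the 1D SSM. A minimal sketch of such index orderings (the function name and the particular set of directions are illustrative):

```python
import numpy as np

def scan_orders(H, W):
    """Index orderings for multi-directional 2D scanning of an H x W grid.

    Each ordering is a permutation of the flattened pixel indices: the SSM
    sees the same image as a different 1D sequence per direction. Reversed
    and remaining diagonal variants are constructed analogously."""
    idx = np.arange(H * W).reshape(H, W)
    return {
        "row": idx.reshape(-1),                  # row-major (cardinal)
        "col": idx.T.reshape(-1),                # column-major (cardinal)
        "row_rev": idx.reshape(-1)[::-1],        # reversed row-major
        "diag": np.concatenate(                  # anti-diagonal sweep
            [np.diagonal(idx[::-1], k).copy() for k in range(-H + 1, W)]
        ),
    }
```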

Performance is robust across architectures and scales, though some limitations remain; see Section 7 for open directions.

6. Enhancement Strategies and Training Protocols

Variants introduce post-processing or auxiliary modules for further improvement:

  • Perceptual Contrast Stretching: Nonlinear rescaling of magnitude spectra for enhanced speech harmonic contrast, yielding a PESQ gain of 0.14 (Wang et al., 2024).
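A toy sketch of nonlinear magnitude rescaling in the spirit of the first bullet: a power law on normalized magnitudes expands spectral peaks (harmonics) relative to the noise floor. The published PCS method uses band-dependent factors; the single global exponent here is a simplification for illustration.

```python
import numpy as np

def contrast_stretch(mag, gamma=1.3):
    """Expand peak-to-floor contrast of a magnitude spectrum.

    mag: nonnegative magnitude values; gamma > 1 compresses small values
    more than large ones, so the ratio of peaks to the floor grows."""
    peak = mag.max() + 1e-12
    return peak * (mag / peak) ** gamma   # normalized power law, peak preserved
```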
  • Self-supervised Priors: Brightness inpainting and contrastive pixel-level penalties for data-efficient training (Ji et al., 2024, Ma et al., 2024).
  • Hybrid Knowledge Distillation: For models with severe resource constraints, hybrid KD (segmentation, decoupled logits, attention transfer, gradient loss) trains ultra-compact students (UltraLBM-UNet-T: 0.011M/0.019 GFLOPs) with negligible accuracy loss (Fan et al., 25 Dec 2025).
  • Ablation-Driven Training Schedules: Empirical optimization of depth, block placement, local branch count, window/patch size, and skip-fusion for each domain and modality (Chen et al., 2024, Fan et al., 25 Dec 2025).

Typical optimizers include AdamW or SGD, with Dice + cross-entropy losses for segmentation and standard learning-rate schedules for each task family (PolyLR, CosineAnnealingLR, ReduceLROnPlateau) (Liao et al., 2024, Wang et al., 2024, Bao et al., 25 Mar 2025). In speech tasks, perceptual and mask-reconstruction losses are used.
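The Dice + cross-entropy combination used for segmentation can be sketched as below; the equal weighting of the two terms and the smoothing constant are common defaults, not prescribed by the cited papers.

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    """Combined Dice + cross-entropy loss.

    probs: (N, K) softmax class probabilities over N pixels/voxels;
    target: (N,) integer class indices."""
    N, K = probs.shape
    onehot = np.eye(K)[target]
    # cross-entropy: negative log-probability of the true class
    ce = -np.log(probs[np.arange(N), target] + eps).mean()
    # soft Dice: per-class overlap vs. total mass, averaged over classes
    inter = (probs * onehot).sum(axis=0)
    dice = 1.0 - ((2 * inter + eps)
                  / (probs.sum(axis=0) + onehot.sum(axis=0) + eps)).mean()
    return ce + dice
```

Confident, correct predictions drive both terms toward zero; the Dice term keeps small foreground structures from being swamped by the background-dominated cross-entropy.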

7. Outlook and Future Research Directions

Mamba-UNet architectures have redefined the design space for U-Net–style modeling, bridging the gap between expensive self-attention and locality-restricted convolutions.

Active research challenges and prospects include:

  • Extending to Volumetric and Multi-modal Data: Volumetric SSM modules for 3D segmentation with minimal spatial flattening (Fan et al., 25 Dec 2025, Xie et al., 21 Mar 2025).
  • Efficient Multi-modal Fusion: Combining Mamba branches with large foundation vision models (e.g., SAM) via dynamic fusers and attention blocks (Huo et al., 22 May 2025).
  • Multi-scale and Small-structure Optimization: Improved diagonal/local modeling, scale-aware upsampling (e.g., LKPE), and finer anatomical boundary recovery (Chen et al., 2024, Wang et al., 2024).
  • Self-supervised and Weakly-supervised Regimes: Masked-image modeling, cross-pseudo supervision, and contrastive regularizers for annotation-scarce settings (Ma et al., 2024, Wang et al., 2024).
  • Ultra-lightweight and Edge Deployment: Aggressive channel-parallel splitting and hybrid KD for real-time, point-of-care, and mobile inference scenarios (Wu et al., 2024, Fan et al., 25 Dec 2025).
  • Adaptive Attention/SSM Fusion: Integration of selective attention blocks, learnable skip-scaling, and dynamic module placement (Zhang et al., 2024, Bao et al., 25 Mar 2025).
  • Exploring SSM–Transformer Hybrids: Hybrid U-Nets leveraging the locality of convolution, global context capture of SSMs (Mamba), and adaptive locality of windowed Transformers, balancing complexity and accuracy (Zhang et al., 2024, Xie et al., 21 Mar 2025).

The Mamba-UNet design space continues to catalyze research in high-accuracy, computationally efficient neural architectures for a broad range of scientific, biomedical, and engineering applications. The paradigm demonstrates that state-space approaches, when judiciously integrated into multi-scale encoder–decoder designs, set a new benchmark for the fusion of global context and local detail at scale.
