UltraLBM-UNet: Ultralight Lesion Segmentation
- The paper introduces UltraLBM-UNet, an ultralight U-Net architecture that integrates bidirectional Mamba-based global context and multi-branch local feature perception to achieve high segmentation fidelity.
- Its six-stage encoder–decoder design, featuring variable kernel sizes and shared Mamba weights, delivers robust performance with over 79% IoU and 88% DSC on benchmarks like ISIC17 and PH².
- The architecture supports real-time deployment in constrained environments with sub-0.06 GFLOPs and a memory footprint under 0.14 MB, validated through extensive ablation studies and hybrid knowledge distillation.
UltraLBM-UNet is an ultralight variant of the U-Net architecture designed for high-performance and resource-efficient skin lesion segmentation. It incorporates a bidirectional Mamba-based global modeling mechanism with multi-branch local feature perception, resulting in a model that delivers robust segmentation accuracy with extremely low computational complexity. The architecture supports deployment in point-of-care scenarios, where memory and inference latency are critical constraints, without sacrificing segmentation fidelity (Fan et al., 25 Dec 2025).
1. Architectural Principles and Configuration
UltraLBM-UNet utilizes a six-stage encoder–decoder configuration. The encoder channels are [8, 16, 24, 32, 48, 64], mirrored in the decoder, enabling hierarchical feature extraction. Early encoder stages (I–III) employ conventional Conv–ReLU–Conv–ReLU blocks with max-pooling for local representation learning. The shallow encoder additionally integrates a Local Multi-Branch Perception (LMBP) module at stage III, comprising three DwConv branches (kernel sizes 3, 5, and 7, varied with depth) and an identity branch, emphasizing local feature detail.
Deep encoder stages (IV–VI) and the first three decoder stages feature Global–Local Multi-Branch Perception (GLMBP) modules. These modules merge bidirectional Mamba-based global context with depthwise-separable convolution branches, balancing spatial reach and edge preservation. Skip-connections are implemented via element-wise addition with a learnable scalar scale factor α: given an encoder feature e and a decoder feature d, the fused output is y = d + α·e. Bilinear interpolation is used for upsampling in the decoder.
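The scaled additive skip can be sketched in a few lines (a hypothetical helper; NumPy arrays stand in for framework tensors):

```python
import numpy as np

def scaled_skip(decoder_feat, encoder_feat, alpha):
    """Element-wise skip fusion y = d + alpha * e, with a learnable scalar alpha."""
    return decoder_feat + alpha * encoder_feat

d = np.ones((1, 24, 8, 8))          # decoder feature map (N, C, H, W)
e = np.full((1, 24, 8, 8), 2.0)     # matching encoder feature map
y = scaled_skip(d, e, alpha=0.5)    # every element becomes 1 + 0.5 * 2 = 2.0
```

In training, α would be a learnable parameter per skip connection; a scalar (rather than a gating map) keeps the fusion essentially free in parameters and FLOPs.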
The distilled model, UltraLBM-UNet-T, retains an identical topology but halves all channel widths ([4, 8, 12, 16, 24, 32]), lowering both parameter count and FLOPs for further resource efficiency.
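The halved-width student follows mechanically from the teacher's stage widths; since a convolution's parameter count scales with C_in·C_out, halving every width roughly quarters per-layer parameters:

```python
# Stage channel widths for the two variants (from the paper);
# the distilled student halves every stage width.
teacher_channels = [8, 16, 24, 32, 48, 64]
student_channels = [c // 2 for c in teacher_channels]

def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

t = conv_params(3, teacher_channels[0], teacher_channels[1])  # 3x3 conv, stage I -> II
s = conv_params(3, student_channels[0], student_channels[1])
print(student_channels)  # [4, 8, 12, 16, 24, 32]
print(t // s)            # 4: halving widths quarters conv parameters
```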
2. Bidirectional Mamba-Based Global Modeling
UltraLBM-UNet exploits the Mamba state-space model (SSM) for linear-time, long-range dependency modeling within feature maps. Given a feature map X ∈ ℝ^(C×H×W), the tensor is flattened to a sequence of length L = H×W. LayerNorm is applied, and the channels are split into four equal parts, X₁, X₂, X₃, X₄.
Branches 1 and 2 serve global context extraction via Mamba modules in a bidirectional configuration: one branch scans the flattened sequence in the forward direction, the other in reverse. Feature fusion sums the two scan outputs with a learnable scalar weight, e.g. F = F_fwd + γ·F_bwd. Shared Mamba weights ensure parameter efficiency while doubling contextual information. This strategy enables non-causal global modeling over the spatial domain of the image.
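The bidirectional scan with shared weights can be illustrated with a toy linear recurrence standing in for the Mamba SSM (all names and coefficients here are illustrative, not the paper's parameterization):

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.1):
    """Toy linear state-space recurrence h_t = a*h_{t-1} + b*x_t (stand-in for Mamba)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_global(x_seq, gamma=1.0):
    """Forward and reversed scans share the same (a, b) weights; fuse with scalar gamma."""
    fwd = ssm_scan(x_seq)
    bwd = ssm_scan(x_seq[::-1])[::-1]   # scan the reversed sequence, then flip back
    return fwd + gamma * bwd

H, W, C = 4, 4, 8
feat = np.random.randn(H * W, C)        # flatten the H x W grid to a length-L sequence
fused = bidirectional_global(feat)
```

Because a single causal scan only propagates information forward, the reversed pass is what lets every spatial position attend to positions later in the flattening order, giving the non-causal global receptive field at linear cost.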
3. Multi-Branch Local Feature Perception and Multi-Receptive-Field Design
Branch 3 processes local features via DwConv: X₃ is reshaped back to 2D and convolved depthwise, X₃′ = DwConv(X₃). Branch 4 employs an identity shortcut, X₄′ = X₄. Final multi-branch fusion concatenates the four branch outputs along the channel dimension: Y = Concat(X₁′, X₂′, X₃′, X₄′).
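A minimal sketch of the four-branch split-and-fuse pattern, with a uniform-kernel depthwise filter standing in for the learned DwConv (names and kernel are illustrative):

```python
import numpy as np

def four_way_split(x):
    """Split the channels of an (L, C) sequence into four equal parts."""
    return np.split(x, 4, axis=1)

def dwconv1d(x, k=3):
    """Toy depthwise conv along the sequence axis: each channel filtered independently."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(k) / k            # uniform kernel as placeholder for learned weights
    return np.stack(
        [np.convolve(xp[:, c], kernel, mode="valid") for c in range(x.shape[1])],
        axis=1,
    )

x = np.random.randn(16, 8)             # L=16 tokens, C=8 channels
x1, x2, x3, x4 = four_way_split(x)
x3_local = dwconv1d(x3)                # Branch 3: depthwise convolution
x4_id = x4                             # Branch 4: identity shortcut
y = np.concatenate([x1, x2, x3_local, x4_id], axis=1)  # channel-wise fusion
```

The identity branch passes its channel slice through untouched, which is what stabilizes gradients and preserves fine detail in the fused output.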
Multi-receptive-field design is enforced via variable kernel sizes in GLMBP modules:
- Encoder stages IV, V, VI: DwConv kernels drawn from the {3, 5, 7} set, varied with stage depth
- Decoder stages I, II, III: DwConv kernels likewise varied per stage

This composition maximizes localization detail while maintaining parameter compactness.
4. Hybrid Knowledge Distillation to UltraLBM-UNet-T
UltraLBM-UNet-T is trained using a hybrid knowledge distillation regime from the full model. The total loss combines four weighted terms:

L_total = L_seg + λ₁·L_DKD + λ₂·L_AT + λ₃·L_edge

where:
- L_seg is the standard BCE + Dice loss on segmentation outputs,
- L_DKD (Decoupled Knowledge Distillation) aligns pixel probability distributions via KL divergence,
- L_AT (Attention Transfer) aligns spatial attention maps,
- L_edge matches gradient boundaries via a Sobel edge operator between student and teacher predictions.
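The composition of these terms can be sketched as follows; the helper implementations and λ weights are illustrative assumptions (NumPy stands in for a deep-learning framework, and DKD is reduced to a plain pixel-wise KL term):

```python
import numpy as np

def bce_dice(pred, target, eps=1e-7):
    """L_seg: binary cross-entropy plus Dice on the segmentation output."""
    bce = -np.mean(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))
    dice = 1 - (2 * np.sum(pred * target) + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + dice

def dkd_kl(student, teacher, eps=1e-7):
    """L_DKD (simplified here): pixel-wise KL divergence between probability maps."""
    return np.mean(
        teacher * np.log((teacher + eps) / (student + eps))
        + (1 - teacher) * np.log((1 - teacher + eps) / (1 - student + eps))
    )

def attention_map(feat, eps=1e-7):
    """Channel-pooled spatial attention map, L2-normalized (for L_AT)."""
    a = np.sum(feat ** 2, axis=0)
    return a / (np.linalg.norm(a) + eps)

def sobel_mag(x):
    """Gradient magnitude via Sobel filters (for the boundary term L_edge)."""
    kx = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
    ky = kx.T
    H, W = x.shape
    gx = np.zeros((H - 2, W - 2)); gy = np.zeros_like(gx)
    for i in range(H - 2):
        for j in range(W - 2):
            win = x[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * kx); gy[i, j] = np.sum(win * ky)
    return np.sqrt(gx ** 2 + gy ** 2)

def hybrid_loss(s_prob, t_prob, s_feat, t_feat, gt, lam=(1.0, 1.0, 1.0)):
    """L_total = L_seg + l1*L_DKD + l2*L_AT + l3*L_edge (lambda values assumed)."""
    l_seg = bce_dice(s_prob, gt)
    l_dkd = dkd_kl(s_prob, t_prob)
    l_at = np.mean((attention_map(s_feat) - attention_map(t_feat)) ** 2)
    l_edge = np.mean(np.abs(sobel_mag(s_prob) - sobel_mag(t_prob)))
    return l_seg + lam[0] * l_dkd + lam[1] * l_at + lam[2] * l_edge

rng = np.random.default_rng(0)
s_prob = np.clip(rng.random((8, 8)), 0.01, 0.99)   # student probability map
t_prob = np.clip(rng.random((8, 8)), 0.01, 0.99)   # teacher probability map
gt = (rng.random((8, 8)) > 0.5).astype(float)      # ground-truth mask
s_feat = rng.standard_normal((4, 8, 8))            # student intermediate features
t_feat = rng.standard_normal((4, 8, 8))            # teacher intermediate features
loss = hybrid_loss(s_prob, t_prob, s_feat, t_feat, gt)
```

Note that when student and teacher agree exactly, the three distillation terms vanish and only L_seg remains, so the regime reduces gracefully to plain supervised training.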
Ablations on the ISIC2017 dataset demonstrate incremental improvements: the base student (without distillation) achieves 77.30% IoU / 87.20% DSC, adding DKD alone raises this to 78.19% / 87.76%, and the full loss yields 78.57% IoU / 88.00% DSC.
5. Computational Complexity and Resource Profiles
Model complexity is strictly constrained:
- UltraLBM-UNet: 0.034M parameters (≈0.14 MB), 0.060 GFLOPs per forward pass.
- UltraLBM-UNet-T: 0.011M parameters (≈0.044 MB), 0.019 GFLOPs.
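The memory figures follow directly from FP32 storage (4 bytes per parameter), as a quick sanity check:

```python
def fp32_mb(params_millions):
    """FP32 weight footprint in MB: params (millions) x 4 bytes each."""
    return params_millions * 1e6 * 4 / 1e6

print(fp32_mb(0.034))  # 0.136 MB for UltraLBM-UNet
print(fp32_mb(0.011))  # 0.044 MB for UltraLBM-UNet-T
```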
Module breakdown:

| Module                             | Params (M) | FLOPs (G) |
|------------------------------------|:----------:|:---------:|
| Stem & shallow encoder (I–III)     | 0.012      | 0.020     |
| Encoder/decoder GLMBP (6 total)    | 0.016      | 0.030     |
| Skip-scale, upsampling, LayerNorm  | 0.006      | 0.010     |
In terms of resource profile, UltraLBM-UNet and its student variant require exceedingly small memory and compute, making them suitable for real-time operation on embedded GPUs (e.g., NVIDIA Jetson, mobile NPUs) with sub-10 ms inference and a <1 MB footprint.
6. Segmentation Performance Benchmarks
Extensive evaluation was conducted on ISIC2017, ISIC2018, and PH² datasets. The following table presents segmentation accuracy and resource usage for existing lightweight models and UltraLBM-UNet variants:
| Model | ISIC17 IoU/DSC | ISIC18 IoU/DSC | PH² IoU/DSC | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| MAL-UNet | 78.71/88.09 | 79.42/88.53 | 83.83/91.20 | 0.178 | 0.083 |
| EGE-UNet | 78.32/87.84 | 79.45/88.55 | 83.36/90.93 | 0.053 | 0.072 |
| UltraLight VM-UNet | 77.93/87.59 | 78.93/88.23 | 83.31/90.89 | 0.045 | 0.069 |
| UltraLBM-UNet-T | 78.57/88.00 | 78.82/88.15 | 84.92/91.85 | 0.011 | 0.019 |
| UltraLBM-UNet | 79.82/88.78 | 79.94/88.85 | 84.41/91.54 | 0.034 | 0.060 |
On ISIC2017, UltraLBM-UNet exhibits the highest IoU (79.82%) and DSC (88.78%) among all sub-1M parameter models. On PH², the distilled student model achieves the top average IoU/DSC (84.92%/91.85%) (Fan et al., 25 Dec 2025). This suggests that ultra-compact architectures can match or surpass larger models in segmentation fidelity when equipped with efficient context modeling and knowledge transfer strategies.
7. Deployment Advantage and Design Rationale
UltraLBM-UNet is explicitly designed for point-of-care scenarios requiring high throughput, low latency, and minimal compute resources. Model weights under 0.14 MB and ≤0.06 GFLOPs enable stable operation on low-power devices, including embedded platforms and mobile processors. The absence of matrix-heavy self-attention ensures deterministic and predictable resource usage. Multi-branch identity shortcuts stabilize gradients and preserve detail, while learnable skip-scales and fixed branching favor fixed-point inference (e.g., INT8 quantization).
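The fixed-branching, scalar-scaled design maps naturally onto INT8 deployment; a minimal sketch of symmetric per-tensor quantization (a generic scheme, not the paper's specific pipeline):

```python
import numpy as np

def int8_quantize(w):
    """Symmetric per-tensor INT8 quantization: one fixed scale, no runtime branching."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    """Recover approximate FP32 weights from INT8 codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(64).astype(np.float32) * 0.1  # a small weight tensor
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)                      # round-trip error is at most scale/2
```

Because the architecture's data flow is static (no input-dependent attention patterns), a single calibrated scale per tensor suffices, which is what makes such fixed-point deployment predictable.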
Balanced fusion between bidirectional Mamba-based global modeling and multi-kernel depthwise convolution supports robust edge detection and contextual reasoning. Shared Mamba weights maintain model simplicity despite bidirectional traversal. Hybrid knowledge distillation further compresses the student model without runtime trade-off, transferring structural, spatial, and boundary cues from the teacher.
UltraLBM-UNet thus establishes a new Pareto frontier for model accuracy versus resource cost in skin lesion segmentation, specifically tailored for real-time clinical and embedded deployment contexts (Fan et al., 25 Dec 2025).