DB-MSMUNet: Pancreatic CT Segmentation
- The paper reports state-of-the-art pancreas segmentation metrics by integrating deformable convolutions, dual decoders, and auxiliary deep supervision.
- DB-MSMUNet is designed with a U-shaped encoder–decoder and Multi-scale Mamba Modules, enabling robust feature extraction and handling low tissue contrast.
- Empirical evaluations across multiple datasets show improved Dice, precision, and recall scores over conventional UNet and transformer-based methods.
DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet) is a neural network architecture developed for robust and precise segmentation of the pancreas and its lesions in CT images, addressing key challenges such as low tissue contrast, blurry anatomical boundaries, irregular organ shapes, and the small size of pathological regions. This model integrates deformable convolutions, structured state-space modeling, dual decoder branches, and auxiliary deep supervision to enhance segmentation accuracy, edge preservation, and robustness across multiple datasets (Guan et al., 8 Jan 2026).
1. Architectural Overview
The core of DB-MSMUNet is a U-shaped encoder–decoder structure featuring two parallel decoders, termed the Edge Enhancement Path (EEP) and the Multi-layer Decoder (MLD). The pipeline consists of:
- Stem block: 3×3 convolution (stride 1), batch normalization, ReLU activation, followed by 2×2 max pooling, extracting the initial feature representation from each input slice.
- Encoder Backbone: Four sequential Multi-scale Mamba Modules (MSMM₁ to MSMM₄), each producing progressively deeper feature maps at successively reduced spatial resolutions.
- Dual Decoders:
- EEP: Specializes in extracting and refining organ boundary cues.
- MLD: Focuses on reconstructing high-resolution area maps, recovering detailed small lesion structures via multi-scale fusion.
- Auxiliary Deep Supervision (ADS): Prediction heads at multiple scales in both decoders inject early gradient feedback to stabilize and improve training, especially relevant for small and hard-to-segment regions.
This dual-path approach explicitly decouples contour localization from region segmentation, efficiently leveraging both global anatomical context and fine local cues.
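The pipeline above can be illustrated with a minimal shape-flow sketch. The channel widths, the resolution-halving-per-stage assumption, and the 512×512 input size are illustrative placeholders, not values from the paper:

```python
# Sketch of the DB-MSMUNet data flow: a shared encoder feeds two decoders.
# Channel widths and the halving-per-stage assumption are hypothetical.

def encoder_shapes(h, w, stages=4, base_ch=32):
    """Stem (conv + 2x2 max pool) halves resolution once;
    each MSMM stage halves it again and doubles the channel count."""
    h, w = h // 2, w // 2          # stem block
    ch, shapes = base_ch, []
    for _ in range(stages):        # MSMM_1 .. MSMM_4
        h, w = h // 2, w // 2
        shapes.append((ch, h, w))
        ch *= 2
    return shapes

def dual_decoder_outputs(h, w):
    """Both decoders upsample back to the input resolution:
    EEP emits a 1-channel edge map, MLD a 1-channel region mask."""
    return {"edge": (1, h, w), "region": (1, h, w)}

feats = encoder_shapes(512, 512)
print(feats)                        # per-stage (channels, height, width)
print(dual_decoder_outputs(512, 512))
```

The point of the sketch is the decoupling: both decoders consume the same encoder pyramid but produce separate full-resolution outputs, one for contours and one for regions.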
2. Multi-scale Mamba Module (MSMM)
The MSMM is the fundamental component of the encoder, employing deformable convolutions in tandem with multi-scale structured state-space (Mamba) layers and residual connections.
- Deformable Convolutions: Three parallel deformable convolutions with distinct kernel sizes process the input feature map, producing multi-scale global context streams. Each output adapts its receptive field dynamically to anatomical variations.
- Structured State-Space (Mamba) Layers: Each context stream is processed by two Mamba-SSM layers to capture long-range spatial dependencies. The discrete SSM dynamics follow the standard zero-order-hold form
  $h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$
  with $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$.
- Local Residual Stream: A parallel residual block extracts localized detail features from the same input.
- Aggregation: The output of each MSMM is obtained by channel-wise concatenation of the Mamba-layer outputs from all three scales together with the local residual stream.
This hybrid design enables the encoder to simultaneously attend to both localized deformations and global contextual structure.
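The Mamba recurrence above can be demonstrated with a scalar state-space scan. The ZOH discretization shown is the standard Mamba/S4 form; the parameter values are arbitrary, chosen only to make the fixed-point behavior visible:

```python
import numpy as np

# Zero-order-hold discretization of a scalar SSM, as used in Mamba layers:
#   A_bar = exp(dt * A),  B_bar = (dt*A)^{-1} (exp(dt*A) - 1) * dt * B
# followed by the recurrence h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t.

def ssm_scan(x, A=-1.0, B=1.0, C=1.0, dt=0.1):
    A_bar = np.exp(dt * A)
    # For scalars, (dt*A)^{-1} (exp(dt*A) - 1) * dt * B simplifies to:
    B_bar = (A_bar - 1.0) / A * B
    h, ys = 0.0, []
    for xt in x:                     # sequential scan over the sequence
        h = A_bar * h + B_bar * xt
        ys.append(C * h)
    return np.array(ys)

y = ssm_scan(np.ones(50))
# With constant input, h converges to B_bar/(1 - A_bar) = -B/A = 1.
print(y[-1])
```

In the actual module this recurrence runs per channel over flattened spatial positions, which is how the Mamba layers propagate long-range context at linear cost.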
3. Dual Decoder Composition
3.1 Edge Enhancement Path (EEP)
The EEP is tailored to address imprecise or ambiguous anatomical boundaries typical of pancreatic CT scans:
- Attention Gating: An attention mechanism selects edge-relevant activations by passing features through a series of convolutions and a sigmoid mapping; the resulting mask is applied multiplicatively and the masked output is forwarded.
- Boundary Refinement: A sequence of two residual blocks, with inter-scale fusion, sharpens boundaries and improves continuity.
- Edge Prediction: The final edge map is supervised by an edge-aware binary cross-entropy loss against Canny-detected targets,
  $\mathcal{L}_{\text{edge}} = -\tfrac{1}{N}\sum_i \left[ g_i \log p_i + (1 - g_i) \log(1 - p_i) \right],$
  where $p_i$ is the predicted edge probability and $g_i$ the Canny-derived label at pixel $i$.
Auxiliary edge losses at multiple scales facilitate sharper and more consistent contour learning.
3.2 Multi-layer Decoder (MLD)
MLD targets precise area segmentation, especially for small lesions and irregular morphologies:
- Dual-Attention Processing: Each decoder feature map is processed by a dual-attention module, followed by a Multi-scale Dilated Re-parameterization Block (MSDRB) built from cascaded dilated convolutions.
- Upsampling and Fusion: Each scale's output is bilinearly upsampled, then all four are concatenated and fused by convolution to generate the final region segmentation mask.
- Loss Function: The segmentation mask is supervised with a Dice loss,
  $\mathcal{L}_{\text{Dice}} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i + \sum_i g_i},$
  where $p_i$ and $g_i$ denote the predicted and ground-truth values at pixel $i$.
Auxiliary losses at each decoder level promote effective gradient flow.
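The soft Dice loss used for region supervision is straightforward to write down; this is a generic implementation, with a small epsilon for numerical stability that the paper may or may not use:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P∩G| / (|P| + |G|)."""
    inter = np.sum(pred * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps))

g = np.array([[0, 1], [1, 0]], dtype=float)
print(dice_loss(g, g))        # 0.0 for a perfect prediction
print(dice_loss(1 - g, g))    # 1.0 for a fully disjoint prediction
```

Dice loss is overlap-based rather than pixel-count-based, which is why it suits small targets such as pancreatic lesions: a miss on a tiny structure costs as much, proportionally, as a miss on a large one.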
4. Multi-scale Deep Supervision and Optimization
Auxiliary Deep Supervision (ADS) attaches prediction heads after scales 2, 3, and 4 on both decoders, providing explicit losses for both edge and area outputs. The total training loss is a weighted sum of the main segmentation and edge losses plus the auxiliary edge and area losses at each supervised scale.
The model is trained using the AdamW optimizer under a cosine-annealed learning-rate schedule for 300 epochs (batch size 14). Data augmentations include spatial flips, rotations, Gaussian noise, contrast jitter, and histogram shift. Hounsfield Unit values are clipped and normalized per dataset specification.
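The cosine-annealed schedule has a simple closed form; the peak and floor learning rates below are placeholders, since the paper's initial value is not reproduced here:

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at the last epoch.
    lr_max/lr_min are hypothetical values, not the paper's."""
    t = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

print(cosine_lr(0))     # lr_max at the start
print(cosine_lr(299))   # lr_min at the end
print(cosine_lr(150))   # roughly halfway between
```

The slow decay near the start and end, with the steepest drop mid-training, is the usual motivation for pairing cosine annealing with AdamW.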
5. Empirical Evaluation and Comparison
Experiments span three datasets:
- NIH Pancreas: 82 CTs, 7,309 slices
- MSD Pancreas: 281 CTs, 9,073 slices
- Clinical Tumor: 89 CTs, 1,476 slices
Results are reported as mean Dice, Precision, and Recall across four-fold cross-validation.
| Model | NIH (D/P/R) | MSD (D/P/R) | Clinical (D/P/R) |
|---|---|---|---|
| UNet | 80.14/83.64/78.64 | 81.46/83.64/78.64 | 77.03/83.33/82.24 |
| nnU-Net | 85.34/85.68/88.32 | 85.38/87.12/88.07 | 85.91/87.06/89.11 |
| TransUNet | 83.18/84.84/89.15 | 82.58/85.69/87.01 | 80.82/91.05/85.30 |
| VM-UNet | 82.71/84.28/89.52 | 84.27/85.87/86.45 | 83.87/91.52/85.64 |
| U-Mamba | 85.31/87.10/90.43 | 84.31/86.17/85.79 | 84.14/90.71/84.76 |
| SliceMamba | 87.09/88.01/90.13 | 86.01/88.25/86.98 | 85.34/90.99/85.53 |
| DB-MSMUNet | 89.47/90.24/92.04 | 87.59/88.98/89.02 | 89.02/92.34/91.72 |
Ablation analysis on the NIH dataset demonstrates the individual and cumulative benefit of each architectural component:
| Configuration | Dice (%) |
|---|---|
| MSMM only | 86.23 |
| EEP+MLD | 86.19 |
| MSMM+MLD | 86.68 |
| MSMM+EEP | 87.95 |
| MSMM+EEP+MLD | 88.99 |
| Full model | 89.47 |
This shows that each module (MSMM, EEP, MLD) contributes incrementally, with the largest single-module gains coming from MSMM and EEP.
6. Component-Level Analysis and Significance
- Multi-scale Mamba Module (MSMM): The integration of deformable convolutions enables geometric adaptation to irregular pancreas morphologies, while Mamba SSM blocks encode global dependencies vital for overcoming regions of indistinct contrast.
- Edge Enhancement Path (EEP): EEP provides explicit modeling and supervision of boundary cues, directly addressing fuzzy or indistinguishable organ contours.
- Multi-layer Decoder (MLD): MLD's multi-scale upsampling and attentive fusion preserve subtle lesion details and accommodate the complex shape variability of the pancreas.
- Auxiliary Deep Supervision (ADS): Early and intermediate deep supervision stabilizes training, crucially benefiting the segmentation of small or challenging targets.
The architectural synergy among these elements is directly reflected in superior segmentation metrics and qualitative outputs. The design demonstrates high generalizability and robustness across both public and clinical CT datasets.
7. Implications and Generalizability
DB-MSMUNet achieves state-of-the-art segmentation accuracy on challenging pancreatic CT data, outperforming prior UNet variants, transformer-based, and Mamba-based frameworks (Guan et al., 8 Jan 2026). The dual-branch model structure and multi-scale design are specifically beneficial where target structures are small, ambiguous, or heavily deformed. A plausible implication is potential extensibility of DB-MSMUNet (or its core modules) to other medical image segmentation tasks characterized by similar anatomical variability and low-contrast tissues. The model's quantitative and ablation outcomes substantiate its architectural innovations as meaningful advances in clinical segmentation pipelines.