
DB-MSMUNet: Pancreatic CT Segmentation

Updated 15 January 2026
  • The paper achieves state-of-the-art pancreas segmentation metrics by integrating deformable convolutions, dual decoders, and auxiliary deep supervision.
  • DB-MSMUNet is designed with a U-shaped encoder–decoder and Multi-scale Mamba Modules, enabling robust feature extraction and handling low tissue contrast.
  • Empirical evaluations across multiple datasets show improved Dice, precision, and recall scores over conventional UNet and transformer-based methods.

DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet) is a neural network architecture developed for robust and precise segmentation of the pancreas and its lesions in CT images, addressing key challenges such as low tissue contrast, blurry anatomical boundaries, irregular organ shapes, and the small size of pathological regions. This model integrates deformable convolutions, structured state-space modeling, dual decoder branches, and auxiliary deep supervision to enhance segmentation accuracy, edge preservation, and robustness across multiple datasets (Guan et al., 8 Jan 2026).

1. Architectural Overview

The core of DB-MSMUNet is a U-shaped encoder–decoder structure featuring two parallel decoders, termed the Edge Enhancement Path (EEP) and the Multi-layer Decoder (MLD). The pipeline consists of:

  • Stem block: 3×3 convolution (stride 1), batch normalization, ReLU activation, followed by 2×2 max pooling, extracting the initial feature representation from each input slice.
  • Encoder Backbone: Four sequential Multi-scale Mamba Modules (MSMM₁ to MSMM₄), each producing progressively deeper feature maps (denoted $\{X_1,\ldots,X_4\}$) at increasingly reduced spatial resolutions.
  • Dual Decoders:
    • EEP: Specializes in extracting and refining organ boundary cues.
    • MLD: Focuses on reconstructing high-resolution area maps, recovering detailed small lesion structures via multi-scale fusion.
  • Auxiliary Deep Supervision (ADS): Prediction heads at multiple scales in both decoders inject early gradient feedback to stabilize and improve training, especially relevant for small and hard-to-segment regions.

This dual-path approach explicitly decouples contour localization from region segmentation, efficiently leveraging both global anatomical context and fine local cues.
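
As a rough illustration of the pipeline's spatial geometry, the sketch below traces feature-map sizes through the stem and the four encoder stages. It assumes a 512×512 input slice and a stride-2 reduction per MSMM stage; the paper states only that resolution is progressively reduced, so the exact factors here are assumptions.

```python
# Hypothetical shape trace for the DB-MSMUNet encoder path.
# Assumption: the stem's 2x2 max pool halves resolution, and each
# MSMM stage halves it again; neither factor is stated explicitly.
def encoder_shape_trace(input_hw=512, stages=4):
    """Return spatial sizes of X_1..X_4 after the stem and each MSMM stage."""
    hw = input_hw // 2            # stem: 3x3 conv (stride 1) + 2x2 max pool
    sizes = []
    for _ in range(stages):       # MSMM_1 .. MSMM_4
        hw //= 2                  # assumed stride-2 reduction per stage
        sizes.append(hw)
    return sizes

print(encoder_shape_trace(512))   # [128, 64, 32, 16]
```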

2. Multi-scale Mamba Module (MSMM)

The MSMM is the fundamental component of the encoder, employing deformable convolutions in tandem with multi-scale structured state-space (Mamba) layers and residual connections.

  • Deformable Convolutions: For input feature $X_i$, three separate deformable convolutions with kernel sizes $3\times3$, $5\times5$, and $7\times7$ generate global context streams $U_{i,k}$, defined as:

$$U_{i,k} = F^{\mathrm{def}_{k\times k}}(X_i), \quad k = 3, 5, 7$$

Each output adapts the receptive field dynamically for anatomical variations.

  • Structured State-Space (Mamba) Layers: Each $U_{i,k}$ is processed by two Mamba-SSM layers to capture long-range spatial dependencies. The discrete SSM dynamics are given by:

$$h_{t+1} = \bar{A} h_t + \bar{B} x_t, \quad y_t = C h_t$$

with $\bar{A} = e^{\Delta A}$ and $\bar{B} = (\Delta A)^{-1}(e^{\Delta A}-I)\Delta B$.

  • Local Residual Stream: A parallel residual block extracts localized details, $L_i = \mathrm{ResBlock}(X_i)$.
  • Aggregation: The next encoder output is obtained by channel-wise concatenation:

$$X_{i+1} = \mathrm{Concat}(L_i,\;G_{i,3},\;G_{i,5},\;G_{i,7})$$

where $G_{i,k}$ is the output of the Mamba layers for each scale.

This hybrid design enables the encoder to simultaneously attend to both localized deformations and global contextual structure.
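
The SSM recurrence described above can be sketched numerically. The snippet below implements the zero-order-hold discretization and the scan for a single 1-D sequence, assuming a diagonal state matrix $A$ (as in Mamba implementations) so that $e^{\Delta A}$ reduces to an elementwise exponential; all shapes and parameter values are illustrative, not the paper's.

```python
import numpy as np

# Minimal 1-D sketch of the discretized SSM from Section 2.
# Assumption: diagonal A, so exp(dt*A) is elementwise and the
# recurrence runs as a simple sequential scan.
def ssm_scan(x, A, B, C, delta):
    """x: (T,) input; A, B, C: (N,) diagonal state params; delta: step size."""
    A_bar = np.exp(delta * A)                          # A_bar = e^{dt A}
    B_bar = (A_bar - 1.0) / (delta * A) * (delta * B)  # B_bar = (dt A)^-1 (e^{dt A} - I) dt B
    h = np.zeros_like(A)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t                    # h_{t+1} = A_bar h_t + B_bar x_t
        ys.append(C @ h)                               # y_t = C h_t
    return np.array(ys)

rng = np.random.default_rng(0)
y = ssm_scan(rng.standard_normal(16),
             A=-np.ones(4), B=np.ones(4), C=np.ones(4), delta=0.1)
print(y.shape)  # (16,)
```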

3. Dual Decoder Composition

3.1 Edge Enhancement Path (EEP)

The EEP is tailored to address imprecise or ambiguous anatomical boundaries typical of pancreatic CT scans:

  1. Attention Gating: An attention mechanism selects edge-relevant activations via a series of $1\times1$ convolutions and a sigmoid mapping:

$$A_i = \sigma\Bigl(\mathrm{Conv}_{1\times1}\bigl(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}(X_i)+X_i)\bigr)\Bigr)$$

The masked output $\tilde{X}_i = A_i \odot X_i$ is then forwarded.

  2. Boundary Refinement: A sequence of two residual blocks, with inter-scale fusion, sharpens boundaries and improves continuity.
  3. Edge Prediction: The final edge map $P_e$ is produced and supervised by an edge-aware binary cross-entropy loss against Canny-detected targets:

$$\mathcal{L}_\mathrm{edge} = -\sum_{x,y}\left[w_0\,E(x,y)\,\log P_e(x,y) + w_1\,(1-E(x,y))\,\log(1-P_e(x,y))\right]$$

where $E$ is the binary edge target and $w_0$, $w_1$ are class weights.

Auxiliary edge losses at multiple scales facilitate sharper and more consistent contour learning.
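
A toy NumPy rendering of the two EEP computations above, the sigmoid attention gate and the class-weighted edge BCE, with the $1\times1$ convolutions collapsed to identity maps purely to keep the sketch short (a real implementation would learn those projections):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(X):
    # A_i = sigmoid(Conv1x1(ReLU(Conv1x1(X) + X))); the 1x1 convs are
    # replaced by identity maps here, which is an illustrative shortcut.
    A = sigmoid(np.maximum(X + X, 0.0))
    return A * X  # masked output X~_i = A_i * X_i (elementwise)

def edge_bce(P_e, E, w0=1.0, w1=1.0, eps=1e-7):
    # Class-weighted binary cross-entropy against a binary edge map E.
    P_e = np.clip(P_e, eps, 1.0 - eps)
    return -np.sum(w0 * E * np.log(P_e) + w1 * (1 - E) * np.log(1 - P_e))

X = np.array([[0.5, -1.0], [2.0, 0.0]])
X_masked = attention_gate(X)          # same shape as X, gated per pixel
loss = edge_bce(P_e=np.full((2, 2), 0.5), E=np.eye(2))
print(round(loss, 4))                 # 4 pixels at p=0.5 -> 4*ln(2) = 2.7726
```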

3.2 Multi-layer Decoder (MLD)

MLD targets precise area segmentation, especially for small lesions and irregular morphologies:

  • Dual-Attention Processing: Each $X_i$ is processed by a dual-attention module, followed by a Multi-scale Dilated Re-parameterization Block (MSDRB) using cascaded dilated convolutions $(9, 7, 5)$.
  • Upsampling and Fusion: Each scale's output $D_i$ is upsampled (bilinear), then all four are concatenated and fused by a $1\times1$ convolution to generate the final region segmentation:

$$P_a = \mathrm{Conv}_{1\times1}(\mathrm{Concat}(D_1,\ldots,D_4))$$

  • Loss Function: The segmentation mask is supervised with a Dice loss:

$$\mathcal{L}_\mathrm{area} = 1 - \frac{2\sum P_a\,T}{\sum P_a + \sum T}$$

where $T$ is the ground-truth mask.

Auxiliary losses at each decoder level promote effective gradient flow.
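
The Dice loss above is easy to check numerically; the prediction and target below are arbitrary toy masks, not data from the paper:

```python
import numpy as np

def dice_loss(P, T, eps=1e-7):
    """Dice loss: 1 - 2*sum(P*T) / (sum(P) + sum(T)); eps avoids 0/0."""
    return 1.0 - (2.0 * np.sum(P * T)) / (np.sum(P) + np.sum(T) + eps)

P = np.array([[1.0, 1.0], [0.0, 0.0]])  # predicted mask
T = np.array([[1.0, 0.0], [0.0, 0.0]])  # ground-truth mask
# overlap = 1, sum(P) = 2, sum(T) = 1  ->  1 - 2/3
print(round(dice_loss(P, T), 4))  # 0.3333
```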

4. Multi-scale Deep Supervision and Optimization

Auxiliary Deep Supervision (ADS) attaches prediction heads after scales 2, 3, and 4 on both decoders, providing explicit losses for both edge and area outputs. The total training loss is

$$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{area} + \sum_{s=2}^{4} w_s\,\mathcal{L}_\mathrm{aux\_area}^{(s)} + \mathcal{L}_\mathrm{edge} + \sum_{s=2}^{4} w_s\,\mathcal{L}_\mathrm{aux\_edge}^{(s)}$$

with $w_2 = \alpha$, $w_3 = \beta$, $w_4 = \gamma$.
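
The total objective combines cleanly in code. The sketch below uses placeholder loss values, and the weights standing in for $\alpha$, $\beta$, $\gamma$ are hypothetical, since their actual values are not given in this section:

```python
# Pure-Python sketch of the ADS total loss: main area and edge losses plus
# weighted auxiliary terms from decoder scales 2-4.
def total_loss(area, edge, aux_area, aux_edge, weights):
    """aux_area/aux_edge: dicts {scale: loss} for scales 2..4; weights likewise."""
    aux = sum(weights[s] * (aux_area[s] + aux_edge[s]) for s in (2, 3, 4))
    return area + edge + aux

loss = total_loss(
    area=0.30, edge=0.20,                      # placeholder main losses
    aux_area={2: 0.4, 3: 0.5, 4: 0.6},         # placeholder auxiliary losses
    aux_edge={2: 0.3, 3: 0.4, 4: 0.5},
    weights={2: 0.5, 3: 0.25, 4: 0.125},       # hypothetical alpha, beta, gamma
)
print(round(loss, 4))  # 0.5 + 0.35 + 0.225 + 0.1375 = 1.2125
```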

The model is trained with the AdamW optimizer and a cosine-annealing schedule for 300 epochs (batch size 14, initial learning rate $5 \times 10^{-4}$). Data augmentations include spatial flips, rotations, Gaussian noise, contrast jitter, and histogram shift. Hounsfield Unit values are clipped and normalized per dataset specification.

5. Empirical Evaluation and Comparison

Experiments span three datasets:

  • NIH Pancreas: 82 CTs, 7,309 slices
  • MSD Pancreas: 281 CTs, 9,073 slices
  • Clinical Tumor: 89 CTs, 1,476 slices

Results are reported as mean Dice, Precision, and Recall across four-fold cross-validation.

| Model | NIH (Dice/Prec/Rec) | MSD (Dice/Prec/Rec) | Clinical (Dice/Prec/Rec) |
|---|---|---|---|
| UNet | 80.14 / 83.64 / 78.64 | 81.46 / 83.64 / 78.64 | 77.03 / 83.33 / 82.24 |
| nnU-Net | 85.34 / 85.68 / 88.32 | 85.38 / 87.12 / 88.07 | 85.91 / 87.06 / 89.11 |
| TransUNet | 83.18 / 84.84 / 89.15 | 82.58 / 85.69 / 87.01 | 80.82 / 91.05 / 85.30 |
| VM-UNet | 82.71 / 84.28 / 89.52 | 84.27 / 85.87 / 86.45 | 83.87 / 91.52 / 85.64 |
| U-Mamba | 85.31 / 87.10 / 90.43 | 84.31 / 86.17 / 85.79 | 84.14 / 90.71 / 84.76 |
| SliceMamba | 87.09 / 88.01 / 90.13 | 86.01 / 88.25 / 86.98 | 85.34 / 90.99 / 85.53 |
| DB-MSMUNet | 89.47 / 90.24 / 92.04 | 87.59 / 88.98 / 89.02 | 89.02 / 92.34 / 91.72 |

Ablation analysis on the NIH dataset demonstrates the individual and cumulative benefit of each architectural component:

| Configuration | Dice (%) |
|---|---|
| MSMM only | 86.23 |
| EEP+MLD | 86.19 |
| MSMM+MLD | 86.68 |
| MSMM+EEP | 87.95 |
| MSMM+EEP+MLD | 88.99 |
| Full model | 89.47 |

This shows that the modules (MSMM, EEP, MLD) contribute complementary gains, with the full combination achieving the highest Dice score.

6. Component-Level Analysis and Significance

  • Multi-scale Mamba Module (MSMM): The integration of deformable convolutions enables geometric adaptation to irregular pancreas morphologies, while Mamba SSM blocks encode global dependencies vital for overcoming regions of indistinct contrast.
  • Edge Enhancement Path (EEP): EEP provides explicit modeling and supervision of boundary cues, directly addressing fuzzy or indistinguishable organ contours.
  • Multi-layer Decoder (MLD): MLD's multi-scale upsampling and attentive fusion preserve subtle lesion details and accommodate the complex shape variability of the pancreas.
  • Auxiliary Deep Supervision (ADS): Early and intermediate deep supervision stabilizes training, crucially benefiting the segmentation of small or challenging targets.

The architectural synergy among these elements is directly reflected in superior segmentation metrics and qualitative outputs. The design demonstrates high generalizability and robustness across both public and clinical CT datasets.

7. Implications and Generalizability

DB-MSMUNet achieves state-of-the-art segmentation accuracy on challenging pancreatic CT data, outperforming prior UNet variants, transformer-based, and Mamba-based frameworks (Guan et al., 8 Jan 2026). The dual-branch model structure and multi-scale design are specifically beneficial where target structures are small, ambiguous, or heavily deformed. A plausible implication is potential extensibility of DB-MSMUNet (or its core modules) to other medical image segmentation tasks characterized by similar anatomical variability and low-contrast tissues. The model's quantitative and ablation outcomes substantiate its architectural innovations as meaningful advances in clinical segmentation pipelines.
