DSVM-UNet: Dual Self-distillation in VM-UNet
- The paper introduces dual self-distillation to enforce global semantic consistency and local feature alignment, boosting segmentation accuracy.
- It employs Vision-State-Space blocks within a U-Net framework to efficiently capture long-range spatial dependencies.
- Benchmark results on ISIC and Synapse datasets demonstrate improved Dice, sensitivity, and computational efficiency without extra inference cost.
Dual Self-distillation for VM-UNet (DSVM-UNet) is a feature alignment strategy integrated into Vision Mamba-based U-shaped networks for medical image segmentation. DSVM-UNet enhances feature learning at multiple scales by leveraging two intra-network self-distillation mechanisms—global and local—without increasing inference complexity or parameter count. By imposing global semantic consistency and enforcing local representational smoothness, DSVM-UNet delivers consistent improvements on standard segmentation benchmarks while maintaining high computational efficiency (Shao et al., 27 Jan 2026).
1. Architectural Foundation: VM-UNet and Vision Mamba
The underlying backbone, VM-UNet, adheres to the canonical U-Net paradigm with a symmetric encoder–decoder architecture and skip connections. The distinctive element of VM-UNet is its integration of Vision-State-Space (VSS) blocks, inherited from the Vision Mamba family, throughout both encoder and decoder pathways. VSS blocks model long-range spatial dependencies by treating the tokenized 2D input as a state-space system, enabling linear-time global context modeling via a recurrence of the form $h_t = A h_{t-1} + B x_t$, $y_t = C h_t$, with appropriately parameterized matrices $A$, $B$, $C$.
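The linear-time scan behind this recurrence can be sketched with generic, untrained matrices; this is a minimal illustration of a state-space scan over a token sequence, not the paper's selective-scan implementation:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discretized state-space recurrence over a token sequence.

    x: (L, d_in) token sequence; A: (d_state, d_state) transition;
    B: (d_state, d_in) input map; C: (d_out, d_state) output map.
    Each step costs O(d_state^2), so the whole scan is linear in L.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]   # h_t = A h_{t-1} + B x_t
        ys.append(C @ h)       # y_t = C h_t
    return np.stack(ys)

# Toy run: 6 tokens, 4-dim state, scalar input and output.
rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=(6, 1)), 0.9 * np.eye(4),
             rng.normal(size=(4, 1)), rng.normal(size=(1, 4)))
print(y.shape)  # (6, 1)
```

Because the state `h` is updated sequentially, the cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.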
After a patch-embedding layer tokenizes an input $x \in \mathbb{R}^{H \times W \times C}$, the encoder computes feature tensors $E_1, \dots, E_S$, and the decoder produces $D_i$ at each stage $i$. Final segmentation logits are derived from the highest-resolution decoder feature using a projection head.
2. Motivation for Dual Self-distillation
Previous VM-UNet variants improved segmentation predominantly by architectural expansion—depth, width, or complex skip connections. However, these elaborations alone do not guarantee robust feature alignment or semantic consistency between layers; intermediate features can remain redundant or inconsistent. DSVM-UNet addresses this by introducing self-distillation, forcing shallower layers to learn from deeper, semantically richer representations. Dual self-distillation is applied via:
- Global (projection) self-distillation: All encoder and decoder features are projected and aligned with the deepest decoder feature map.
- Local (progressive) self-distillation: Adjacent encoder and decoder stages are trained for local consistency via feature alignment.
This dual mechanism enriches multi-scale feature representations without augmenting the network's inference graph or parameter footprint.
3. Formal Definition of Dual Self-distillation Losses
Let $S$ denote the stage count; $E_i$ and $D_i$ are the stage-$i$ encoder and decoder features.
3.1. Global (Projection) Feature Alignment
All stage features $F_i \in \{E_i, D_i\}$ undergo linear projection and channel reduction to produce

$\hat{F}_i = \rho(\sigma(F_i)),$

where $\sigma$ rescales to a fixed spatial size and $\rho$ reduces channels. The global distillation objective aligns each projected feature with the projected deepest decoder output:

$\mathcal{L}_{\mathrm{global}} = \sum_i \big\lVert \hat{F}_i - \hat{D}_S \big\rVert_2^2.$
3.2. Local (Progressive) Feature Alignment
For every adjacent stage pair, the deeper feature is upsampled and channel-matched, $\tilde{F}_{i+1} = \mathrm{Up}(F_{i+1})$, with the resulting local loss

$\mathcal{L}_{\mathrm{local}} = \sum_{i=1}^{S-1} \big\lVert F_i - \tilde{F}_{i+1} \big\rVert_2^2.$
3.3. Standard Segmentation Loss
For binary segmentation, a balanced sum of binary cross-entropy ($\mathcal{L}_{\mathrm{BCE}}$) and Dice loss ($\mathcal{L}_{\mathrm{Dice}}$) is used:

$\mathcal{L}_{\mathrm{seg}} = \mathcal{L}_{\mathrm{BCE}} + \mathcal{L}_{\mathrm{Dice}}.$

For multiclass segmentation, $\mathcal{L}_{\mathrm{BCE}}$ is replaced by cross-entropy.
3.4. Composite Training Objective
The composite objective combines the segmentation loss with both distillation terms,

$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \alpha \, \mathcal{L}_{\mathrm{global}} + \beta \, \mathcal{L}_{\mathrm{local}},$

with the weights $\alpha$ and $\beta$ chosen empirically. No soft-label or temperature scaling is applied; distillation operates via direct MSE on raw feature maps.
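The two distillation terms can be sketched as follows. The `project` helper is a toy stand-in for the paper's learned rescaling and channel reduction (and the local term is computed in the same projected space rather than via the paper's upsample-and-channel-match step); the loss weights in `total` are illustrative, not the paper's values:

```python
import numpy as np

def pool2x(f):
    """2x2 average pooling over the spatial axes. f: (C, H, W)."""
    C, H, W = f.shape
    return f.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def project(f, size=4):
    """Toy stand-in for the learned projection: pool each feature map
    down to (size, size), then average over channels."""
    while f.shape[1] > size:
        f = pool2x(f)
    return f.mean(axis=0)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def dual_distillation_losses(feats):
    """feats: shallow-to-deep list of (C_i, H_i, W_i) feature maps.
    Global term: every stage is pulled toward the deepest feature.
    Local term: each stage is pulled toward its deeper neighbour.
    Both are plain MSE on feature maps, as in the paper."""
    proj = [project(f) for f in feats]
    loss_global = sum(mse(p, proj[-1]) for p in proj[:-1])
    loss_local = sum(mse(proj[i], proj[i + 1]) for i in range(len(proj) - 1))
    return loss_global, loss_local

rng = np.random.default_rng(0)
feats = [rng.normal(size=s) for s in [(8, 16, 16), (16, 8, 8), (32, 4, 4)]]
lg, ll = dual_distillation_losses(feats)
# Illustrative weights; in practice the segmentation loss is added too.
total = 1.0 * lg + 1.0 * ll
print(lg > 0 and ll > 0)  # True
```

Both terms vanish only when all projected stages agree, which is exactly the multi-scale consistency the method enforces during training.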
4. Pipeline Integration and Training Protocol
- After each VSS block, in both the encoder and the decoder, features are tapped for distillation.
- Projection distillation compares all stages to the deepest decoder feature.
- Progressive distillation connects every adjacent pair.
- Distillation terms act solely during training; the inference-time model remains identical to baseline VM-UNet, preserving computational efficiency.
Training configuration uses batch size 32, the AdamW optimizer with a cosine-annealing learning-rate schedule, and 300 epochs. Pretrained VMamba-S weights (ImageNet-1k) initialize encoder/decoder parameters. All experiments are implemented in PyTorch and executed on NVIDIA RTX A40 hardware.
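The cosine-annealing schedule follows a standard closed form; the sketch below uses an illustrative base learning rate of 1e-3, since the paper's value is not restated in this summary:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate: lr_max at step 0, lr_min at the end."""
    t = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# 300 epochs as in the training setup; lr_max is illustrative only.
schedule = [cosine_lr(e, 300, lr_max=1e-3) for e in range(300)]
print(round(schedule[0], 6), schedule[-1])  # 0.001 0.0
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` wrapped around the AdamW optimizer.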
5. Evaluation on Medical Segmentation Benchmarks
5.1. Datasets and Metrics
| Benchmark | Modality | Size / Tasks | Metrics |
|---|---|---|---|
| ISIC2017 | Dermoscopy (2D) | 2,750 images | mean IoU, Dice, Accuracy, Spec., Sens. |
| ISIC2018 | Dermoscopy (2D) | 2,594 images | same as above |
| Synapse | Multi-organ CT | 30 volumes (8 organs) | avg. Dice, HD95 |
Augmentation includes random flips and rotations, applied identically to each image and its mask.
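A minimal version of this augmentation, assuming single-channel 2D arrays and applying the same random transform to image and mask:

```python
import numpy as np

def augment(img, mask, rng):
    """Apply the same random flip/rotation to an image and its mask.
    Assumes 2D (H, W) arrays; multi-channel images would flip/rotate
    over the spatial axes only."""
    if rng.random() < 0.5:                 # horizontal flip
        img, mask = img[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                 # vertical flip
        img, mask = img[::-1], mask[::-1]
    k = int(rng.integers(4))               # rotate by k * 90 degrees
    return np.rot90(img, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
img, mask = np.arange(16.).reshape(4, 4), np.eye(4)
aug_img, aug_mask = augment(img, mask, rng)
print(aug_img.shape, aug_mask.shape)  # (4, 4) (4, 4)
```

Keeping image and mask in lockstep is the essential detail: a flip applied to only one of the two would corrupt the supervision signal.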
5.2. Main Results
| Dataset | mIoU/DSC | Accuracy | Specificity | Sensitivity | HD95 |
|---|---|---|---|---|---|
| ISIC2017 | 82.57% / 90.62% | — | 98.34% | 92.08% | — |
| ISIC2018 | 81.51% / 90.45% | 95.43% | 97.66% | 91.26% | — |
| Synapse | — / 81.68% | — | — | — | 19.32 px |
Relative to VM-UNetV2, DSVM-UNet yields up to +0.4% Dice, up to +2.6% Sensitivity, and lower Hausdorff distances (Shao et al., 27 Jan 2026).
Ablation on ISIC2018 reveals that global-only and local-only self-distillation both improve mIoU over baseline, with the dual approach providing the highest gain (from 81.31% to 81.51%).
6. Computational Efficiency
For a single forward pass at the benchmark input resolution:
| Model | Params (M) | FLOPs (G) |
|---|---|---|
| VM-UNet | 27.42 | 4.11 |
| VM-UNetV2 | 22.77 | 4.40 |
| DSVM-UNet | 22.63 | 3.65 |
DSVM-UNet matches or improves accuracy with marginally lower FLOPs and no additional parameter overhead, due to all distillation being confined to training (Shao et al., 27 Jan 2026).
7. Discussion, Limitations, and Prospective Extensions
DSVM-UNet demonstrates that explicitly aligning hierarchical feature representations via dual self-distillation is effective for maximizing the representational capacity of Vision Mamba-based U-Nets without architectural bloat. The approach secures global semantic consistency and local granularity, aligning well with the structural priors of medical images.
Current limitations include reliance on MSE for distillation. A plausible implication is that employing more advanced perceptual or attention-based distillation mechanisms could further refine feature consistency. Only intra-network (self) distillation is addressed; cross-model or cross-modal variants remain open for exploration. Future directions include soft-label distillation with temperature scaling, adapting the method to 3D volumetric and multi-modal segmentation, and learning adaptive per-layer or per-task weights for distillation components (Shao et al., 27 Jan 2026).
Table: Summary of Key Distillation Mechanisms in DSVM-UNet
| Mechanism | Aligned Features | Alignment Target | Loss Type |
|---|---|---|---|
| Global (projection) | All encoder/decoder stages | Deepest decoder feature | MSE |
| Local (progressive) | Each adjacent stage pair | Deeper neighbour of the pair | MSE |
This table organizes the two primary self-distillation pathways in DSVM-UNet.
DSVM-UNet defines a state-of-the-art training regime for Vision Mamba-based medical image segmentation, achieving high accuracy with efficient architectures and direct, multi-scale feature alignment (Shao et al., 27 Jan 2026).