
Diffusion-Empowered AutoMedSAM

Updated 25 November 2025
  • The paper introduces AutoMedSAM, an end-to-end framework that automates semantic segmentation via a diffusion-based dual-branch prompt encoder.
  • It leverages a joint uncertainty-aware multi-loss strategy and adapts MedSAM’s backbone to optimize class-specific mask prediction.
  • Empirical evaluations across CT, MR, and X-ray modalities demonstrate superior segmentation accuracy and robust cross-dataset generalization.

Diffusion-Empowered AutoPrompt MedSAM (AutoMedSAM) is an end-to-end medical image segmentation framework that extends the Segment Anything Model (SAM) and its medical adaptation, MedSAM. Addressing the notable challenges of manual prompt dependency and lack of semantic labeling in conventional MedSAM, AutoMedSAM integrates a diffusion-based dual-branch prompt encoder to automate class-conditioned segmentation. This framework enables fully automated mask prediction with semantic association, optimized via a joint uncertainty-aware multi-loss strategy, and demonstrates superior segmentation accuracy and generalization across multiple clinical imaging modalities (Huang et al., 5 Feb 2025).

1. Architecture Overview

AutoMedSAM retains the architectural backbone of MedSAM, composed of a frozen image encoder E_I and a mask decoder D_M, while fundamentally replacing the manual prompt encoder with a diffusion-based class prompt encoder E_P. The input image I ∈ ℝ^{h×w×3} is encoded to feature maps F_I = E_I(I), with F_I ∈ ℝ^{B×C×H×W}. Given an anatomical class index c, the encoder E_P generates two prompt embeddings, (P_s^{(c)}, P_d^{(c)}) = E_P(F_I, c), where the sparse prompt P_s^{(c)} encodes global cues and the dense prompt P_d^{(c)} encodes local features. The mask decoder D_M then combines the image features, a positional embedding PE, and both prompt vectors to predict the segmentation mask, M^{(c)} = D_M(F_I, PE, P_s^{(c)}, P_d^{(c)}). This pipeline eliminates the need for manual clicks, boxes, or scribbles and embeds semantic class information directly into the segmentation masks (Huang et al., 5 Feb 2025).
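As a shape-level sketch of this encoder → prompt encoder → decoder pipeline (all module bodies below are hypothetical stand-ins and tensor sizes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: batch B, channels C, feature grid H x W.
B, C, H, W = 1, 256, 64, 64

def image_encoder(image):
    # Stand-in for the frozen MedSAM image encoder E_I: maps an
    # h x w x 3 image to a B x C x H x W feature map.
    return rng.standard_normal((B, C, H, W))

def class_prompt_encoder(feats, class_idx):
    # Stand-in for the diffusion-based prompt encoder E_P: returns a
    # sparse (global) and a dense (local) prompt embedding for class c.
    p_sparse = rng.standard_normal((B, 1, C))    # token-like global cue
    p_dense = rng.standard_normal((B, C, H, W))  # spatial local cue
    return p_sparse, p_dense

def mask_decoder(feats, pos_emb, p_sparse, p_dense):
    # Stand-in for D_M: fuses features, positional embedding, and the
    # dense prompt into per-class mask logits. (The sparse token would
    # enter via attention; that step is omitted in this sketch.)
    fused = feats + p_dense + pos_emb
    return fused.mean(axis=1)                    # B x H x W logits

image = rng.standard_normal((1024, 1024, 3))
feats = image_encoder(image)
p_s, p_d = class_prompt_encoder(feats, class_idx=2)
mask = mask_decoder(feats, np.zeros_like(feats), p_s, p_d)
assert mask.shape == (B, H, W)
```

No user interaction enters anywhere in this flow: the class index alone drives prompt generation.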

2. Diffusion-Based Class Prompt Encoder Design

AutoMedSAM’s class prompt encoder E_P operates as a conditional diffusion model. The class index c is projected and reshaped into a conditioning embedding that is combined with the image features, yielding a class-conditioned feature map F_c. For forward diffusion, isotropic Gaussian noise ε ∼ N(0, I) under a variance schedule {β_t} is added to the image feature,

z_t = √(ᾱ_t) F_c + √(1 − ᾱ_t) ε,  where ᾱ_t = ∏_{s=1}^{t} (1 − β_s).

This forms the noisy, class-conditioned embedding.
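The forward noising step can be sketched in a few lines of NumPy (the linear β schedule and tensor sizes here are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard DDPM-style forward noising applied to a class-conditioned
# feature map F_c. Linear beta schedule is an assumption.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

def q_sample(f_c, t):
    """Sample z_t = sqrt(abar_t) * F_c + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(f_c.shape)
    return np.sqrt(alpha_bar[t]) * f_c + np.sqrt(1.0 - alpha_bar[t]) * eps

f_c = rng.standard_normal((1, 256, 64, 64))
z_early, z_late = q_sample(f_c, 10), q_sample(f_c, 990)

# Early steps stay close to the signal; late steps approach pure noise.
assert np.corrcoef(f_c.ravel(), z_early.ravel())[0, 1] > 0.9
assert abs(np.corrcoef(f_c.ravel(), z_late.ravel())[0, 1]) < 0.2
```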

The reverse diffusion employs a U-Net structure, processing z_t through convolutional layers with class re-injection at each layer. The encoder branches into:

  • Dense/local branch: An element-wise (spatial) attention map is computed over the features, followed by masked feature multiplication and upsampling to produce the dense prompt P_d^{(c)}.

  • Sparse/global branch: Channel attention leverages spatially average-pooled features and produces the sparse prompt P_s^{(c)} via channelwise scaling.

The final prompt embeddings are concatenated, P^{(c)} = [P_s^{(c)}; P_d^{(c)}], enabling integration of both fine-grained and global context within the prompt representation (Huang et al., 5 Feb 2025).
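A minimal sketch of the two branches, assuming a sigmoid-gated spatial map for the dense branch and SE-style channel gating for the sparse branch (the paper's exact attention layers may differ; only the roles are reproduced):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
B, C, H, W = 1, 256, 64, 64
feats = rng.standard_normal((B, C, H, W))  # class-conditioned features

# Dense/local branch: an element-wise (spatial) attention map gates the
# features; the gated map stands in for the dense prompt P_d.
w_spatial = rng.standard_normal((C,)) / np.sqrt(C)  # hypothetical 1x1 conv
attn_map = sigmoid(np.einsum("bchw,c->bhw", feats, w_spatial))  # B x H x W
p_dense = feats * attn_map[:, None, :, :]

# Sparse/global branch: channel attention over spatially average-pooled
# features yields channelwise scales for the sparse prompt P_s.
pooled = feats.mean(axis=(2, 3))  # B x C "squeeze"
scales = sigmoid(pooled)          # B x C "excitation"-style gating
p_sparse = pooled * scales

assert p_dense.shape == (B, C, H, W) and p_sparse.shape == (B, C)

# Concatenated prompt representation mixing global and local context.
prompt = np.concatenate([p_sparse, p_dense.mean(axis=(2, 3))], axis=1)
assert prompt.shape == (B, 2 * C)
```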

3. Prompt Integration and Segmentation Mask Generation

The mask decoder D_M incorporates the prompt embeddings (P_s^{(c)}, P_d^{(c)}) via cross-attention: prompt tokens attend to the image features through scaled dot-product attention, Attn(Q, K, V) = softmax(QKᵀ/√d) V, while the dense prompt is fused with the image embedding. Semantic prompt features are thereby injected into the decoder’s latent space, ensuring that output masks M^{(c)} encode both object shape and class semantics. This design provides fully automated semantic segmentation for specified anatomical classes, broadening utility for both clinical and non-expert contexts (Huang et al., 5 Feb 2025).
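The cross-attention step can be illustrated with plain scaled dot-product attention (a generic form; SAM-style decoders use two-way transformer blocks with additional learned projections):

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
C, H, W = 256, 64, 64

# Prompt tokens attend to the flattened image features; each token pulls
# in the image context relevant to its class-conditioned query.
img_tokens = rng.standard_normal((H * W, C))
prompt_tokens = rng.standard_normal((2, C))  # e.g. sparse + output token
updated = cross_attention(prompt_tokens, img_tokens, img_tokens)
assert updated.shape == (2, C)
```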

4. Joint Training with Uncertainty-Aware Loss Balancing

AutoMedSAM is optimized with a joint objective comprising five loss components:

  1. Sparse prompt MSE: L_sparse = ‖P_s^{(c)} − P̂_s^{(c)}‖₂², the squared error between the generated sparse prompt and its target embedding.

  2. Dense prompt MSE: L_dense = ‖P_d^{(c)} − P̂_d^{(c)}‖₂², the analogous term for the dense prompt.

  3. Dice loss: L_Dice = 1 − 2Σᵢ mᵢ gᵢ / (Σᵢ mᵢ + Σᵢ gᵢ), over predicted mask probabilities mᵢ and ground-truth labels gᵢ.

  4. Cross-entropy loss: L_CE = −Σᵢ [gᵢ log mᵢ + (1 − gᵢ) log(1 − mᵢ)].

  5. Shape-distance loss: L_shape, a boundary-aware term penalizing the distance between predicted and ground-truth mask shapes.

Loss terms are dynamically weighted using the uncertainty weighting framework of Tsai et al., L_total = Σᵢ (1/(2σᵢ²)) Lᵢ + log σᵢ, with learnable uncertainty parameters σᵢ. This obviates manual tuning of loss weights and facilitates balanced learning across heterogeneous objectives (Huang et al., 5 Feb 2025).
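Uncertainty-based loss balancing can be sketched with the common log-variance parameterization (the paper's exact variant may differ; the loss values below are arbitrary placeholders):

```python
import numpy as np

def uncertainty_weighted_total(losses, log_vars):
    """Combine losses as sum_i exp(-s_i)/2 * L_i + s_i/2, where
    s_i = log sigma_i^2 would be a learnable parameter. This is the
    standard homoscedastic-uncertainty formulation, written with
    log-variances for numerical stability."""
    losses = np.asarray(losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    return float(np.sum(0.5 * np.exp(-s) * losses + 0.5 * s))

# Five components: sparse MSE, dense MSE, Dice, CE, shape-distance.
losses = [0.8, 0.5, 0.3, 0.4, 0.6]
total = uncertainty_weighted_total(losses, log_vars=[0.0] * 5)
assert abs(total - 0.5 * sum(losses)) < 1e-9  # s_i = 0 -> plain half-sum
```

A component with high learned uncertainty (large s_i) is automatically down-weighted, which is what removes the need for hand-tuned loss coefficients.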

5. Training Procedure

During training, the image encoder E_I remains frozen while E_P and D_M are updated. Optimization employs AdamW with learning rate 5e-4, a reduce-on-plateau scheduler (factor 0.9, patience 5), batch size 16, and up to 100 epochs. This strategy enables efficient transfer of MedSAM’s pre-trained image representations while adapting the prompt encoder and mask decoder to the fully automated, class-specific task (Huang et al., 5 Feb 2025).
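The stated reduce-on-plateau schedule (initial learning rate 5e-4, factor 0.9, patience 5) can be sketched as follows (a minimal re-implementation for illustration, not the framework's own scheduler class):

```python
class ReduceOnPlateau:
    """Decay the learning rate by `factor` after `patience` consecutive
    epochs without improvement in the monitored validation loss."""

    def __init__(self, lr=5e-4, factor=0.9, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

sched = ReduceOnPlateau()
# Six non-improving epochs in a row trigger exactly one decay step.
lrs = [sched.step(1.0) for _ in range(7)]
assert abs(lrs[-1] - 5e-4 * 0.9) < 1e-12
```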

6. Empirical Evaluation

AutoMedSAM is evaluated across diverse medical imaging datasets: AbdomenCT-1K (CT, 5 organs), BraTS (MR-FLAIR, tumor), Kvasir-SEG (endoscopy, polyp), Chest-XML (X-ray, lung), and in cross-dataset scenarios (AMOS, BraTS-T1CE). Performance is measured using Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD).
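For reference, the Dice Similarity Coefficient for binary masks reduces to a few lines (NSD additionally requires surface extraction and a tolerance band, which is omitted here):

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """DSC = 2|P ∩ G| / (|P| + |G|) for binary masks P and G."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

# Two overlapping 4x4 squares: 16 px each, 9 px overlap -> DSC = 18/32.
pred = np.zeros((8, 8), dtype=int); pred[2:6, 2:6] = 1
gt = np.zeros((8, 8), dtype=int); gt[3:7, 3:7] = 1
assert abs(dice_coefficient(pred, gt) - 2 * 9 / 32) < 1e-6
```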

Representative Quantitative Results (AbdomenCT-1K):

Method          | DSC (%) | NSD (%)
MedSAM          | 93.505  | 92.969
SurgicalSAM     | 75.505  | 70.119
AutoMedSAM (O)  | 94.580  | 95.148

On single-object datasets (BraTS, Kvasir, Chest-XML), AutoMedSAM outperforms all baselines by 1–5 points in DSC and NSD. Cross-dataset evaluation (train: AbdomenCT, test: AMOS) yields DSC 71.14% for AutoMedSAM vs. 56.93% for SurgicalSAM. Ablation studies demonstrate the benefits of dual-branch prompts, diffusion, and uncertainty weighting (Huang et al., 5 Feb 2025).

7. Strengths, Limitations, and Future Directions

AutoMedSAM delivers a fully automated, semantically labeled segmentation workflow, eliminating manual prompt annotation and enabling class-aware mask generation. The dual-branch diffusion encoder captures both global and local context, and uncertainty weighting harmonizes joint optimization. Nevertheless, computational overhead from diffusion steps is nontrivial, and current deployments require a predefined class index set, precluding open-vocabulary extension. There may be performance degradation on extremely small or highly noisy structures. Future work will target lightweight diffusion models, open-set recognition, and scaling to 3D volumetric data.

AutoMedSAM establishes a state-of-the-art, prompt-free, and semantically explicit segmentation paradigm for clinical and non-expert end users (Huang et al., 5 Feb 2025).
