- The paper presents SAMatch, which couples SAM-based segmentation with teacher-student semi-supervised learning to generate high-quality pseudo-labels under limited annotation.
- The framework employs a three-module architecture with automatic prompt extraction and joint Dice and cross-entropy loss optimization to enhance segmentation performance.
- Experimental results on ACDC, BUSI, and MRLiver datasets show that SAMatch achieves near full-supervision performance and robust boundary localization with improved Dice scores.
A SAM-Guided and Match-Based Semi-Supervised Segmentation Framework for Medical Imaging
Introduction and Background
Semantic segmentation in medical imaging is critical for delineating anatomical structures and pathologies, underpinning robust diagnostic and therapeutic pipelines. Deep architectures such as U-Net and DeepLab, trained predominantly under full supervision, deliver high-fidelity segmentations but are constrained by the scarcity and cost of annotated data. Semi-supervised learning (SSL) approaches, especially those based on consistency regularization (e.g., Mean Teacher, FixMatch, UniMatch), reduce the annotation burden by leveraging unlabeled data via pseudo-labeling. However, the primary failure mode of these frameworks is the propagation of low-quality pseudo-labels, which violates the consistency assumption and degrades model calibration.
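The core pseudo-labeling mechanism shared by FixMatch-style methods can be sketched as follows. This is a minimal numpy illustration of confidence-thresholded consistency, not code from the paper; the function name and the threshold value `tau=0.95` are illustrative assumptions.

```python
import numpy as np

def pseudo_label_loss(weak_probs, strong_probs, tau=0.95, eps=1e-8):
    """FixMatch-style consistency: hard pseudo-labels from the weak-view
    predictions supervise the strong view, but only at positions where
    the weak-view confidence exceeds the threshold tau."""
    # weak_probs, strong_probs: (N, C) softmax outputs for the two views
    conf = weak_probs.max(axis=1)          # per-position confidence
    pseudo = weak_probs.argmax(axis=1)     # hard pseudo-labels
    keep = conf >= tau                     # confident positions only
    if not keep.any():
        return 0.0
    # cross-entropy of the strong view against the retained pseudo-labels
    ce = -np.log(strong_probs[keep, pseudo[keep]] + eps)
    return float(ce.mean())
```

Positions below the threshold contribute nothing, which is exactly why low-quality pseudo-labels that *pass* the threshold are the dominant failure mode noted above.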
Recently, the Segment Anything Model (SAM), a large-scale foundation model for segmentation, has demonstrated remarkable generalization across domains when given informative prompts. SAM-based solutions in medical imaging produce high-quality segmentation masks but are bottlenecked by the need for prompt engineering, which is typically manual and does not scale in data-limited clinical settings. Existing prompt-automation efforts for SAM (e.g., AutoSAM, YOLOv8-driven detectors) either rely on abundant annotated data or are not designed to be trained jointly with pseudo-label generation pipelines.
Methodology: The SAMatch Framework
The proposed SAMatch framework integrates SAM-based models with Match-based semi-supervised pipelines so that each strengthens the other. The architecture comprises three principal modules: (1) Match-based teacher-student networks, using differential (weak/strong) augmentations and mean-teacher weight updates; (2) a SAM-based backbone fine-tuned for medical domains (e.g., MedSAM variants); and (3) an automatic, differentiable prompt-extraction loop.
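The mean-teacher weight update used by the first module can be sketched in a few lines. This is a generic EMA sketch under the usual Mean Teacher formulation, with an illustrative decay `alpha=0.99`; the paper's exact hyperparameters are not assumed.

```python
import numpy as np

def ema_update(teacher_params, student_params, alpha=0.99):
    """Mean-teacher update: teacher weights are an exponential moving
    average of the student's, yielding a smoother, more stable source
    of pseudo-labels than the raw student."""
    return {name: alpha * teacher_params[name]
                  + (1.0 - alpha) * student_params[name]
            for name in teacher_params}
```

Because the teacher is never trained by gradient descent directly, it lags the student and averages out its noisy updates.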
The training protocol is split into a warm-up phase and an interaction phase. In the warm-up phase, standard Match-based training is performed: the student network minimizes supervised and unsupervised (consistency) objectives, and the teacher weights are updated via EMA. Concurrently, the SAM-based model is fine-tuned on the same labeled stream using pseudo-prompts derived from labels or high-confidence predictions. In the interaction phase, the Match-based teacher generates prediction masks from weakly augmented unlabeled data, from which geometric or point-based prompts are automatically extracted. These prompts steer the SAM-based network to produce high-quality masks (pseudo-labels), which in turn supervise the Match-based student on strongly augmented versions of the same images. In effect, pseudo-label quality is decoupled from the biases of the internal teacher-student architecture, and the prompt-to-mask transformation becomes robust and scalable.
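One simple way to turn a teacher mask into SAM-style prompts is to take the foreground centroid as a point prompt and the tight bounding box as a box prompt. The sketch below assumes this rule for illustration; the paper's confidence-driven selection procedure may differ, and the function name `extract_prompts` is hypothetical.

```python
import numpy as np

def extract_prompts(mask):
    """Auto-extract a point prompt (foreground centroid) and a box
    prompt (tight bounding box) from a binary teacher mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None, None                      # nothing to prompt on
    point = (int(ys.mean()), int(xs.mean()))   # (row, col) centroid
    box = (int(ys.min()), int(xs.min()),       # (y0, x0, y1, x1)
           int(ys.max()), int(xs.max()))
    return point, box
```

A centroid point is cheap but carries minimal context, while a box constrains extent; this is the trade-off behind the per-backbone prompt choices described next.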
The entire system is trained end-to-end with joint losses: Dice and cross-entropy losses for both labeled and unlabeled partitions, weighted adaptively. Prompt types (points for SAM, boxes for MedSAM) are selected in a deterministic, confidence-driven manner to maximize informative coverage while minimizing misalignment. The framework remains agnostic to the specific SAM or Match-based variants chosen, permitting pluggable experimentation.
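The joint objective can be sketched as a weighted sum of soft Dice and cross-entropy on foreground probabilities. The equal weights `w_dice = w_ce = 0.5` are illustrative placeholders, since the source describes the weighting only as adaptive.

```python
import numpy as np

def dice_ce_loss(probs, target, w_dice=0.5, w_ce=0.5, eps=1e-6):
    """Joint objective: soft Dice plus binary cross-entropy.
    probs, target: flat arrays of foreground probability / {0,1} labels."""
    inter = (probs * target).sum()
    # soft Dice: 1 - 2|P∩T| / (|P| + |T|), smoothed by eps
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    # pixel-wise binary cross-entropy
    ce = -(target * np.log(probs + eps)
           + (1 - target) * np.log(1 - probs + eps)).mean()
    return w_dice * dice + w_ce * ce
```

Dice counteracts class imbalance at the region level while cross-entropy supplies dense per-pixel gradients, which is why the two are commonly combined in medical segmentation.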
Experimental Validation
Evaluation spans three datasets: ACDC (cardiac MRI), BUSI (breast ultrasound), and a proprietary multi-sequence MRLiver dataset. Baseline comparisons include adversarial (DAN, ADVENT), classical consistency-based (ICT, Mean Teacher, UA-MT, URPC), and advanced Match-based (U2PL, FixMatch, UniMatch) methods. SAMatch is implemented in four variants, reflecting combinations of Match-based (FixMatch/UniMatch) and SAM-based (SAM/MedSAM) backbones. Performance is reported primarily as Dice and the 95th-percentile Hausdorff distance (HD95).
Key numerical results include:
- ACDC (semi-supervised, 3 labels): Uni-MedSAM achieves a mean Dice of 89.36%, approaching the fully-supervised UNet (91.47%), and outperforms all other baselines by a statistically significant margin (p<0.05).
- BUSI (semi-supervised, 30 labels): Uni-SAM yields an object Dice score of 59.35%, only 1.2% below the fully supervised regime.
- MRLiver (semi-supervised, 3 labels): Uni-MedSAM registers a Dice of 80.04% (HD95 = 21.04), denoting strong generalization with minimal annotation burden.
In all instances, integrating the SAM-based backbone leads to consistent and substantial improvements over vanilla Match-based pipelines; MedSAM outperforms vanilla SAM, reflecting the value of downstream medical-domain adaptation. Visualizations corroborate quantitative findings, with SAMatch exhibiting superior focus on pathological regions and more precise boundary localization than prior art.
Theoretical and Practical Implications
The SAMatch framework delivers several notable contributions to the theory and practice of SSL in medical imaging:
- Automatic prompt generation eliminates the major pipeline bottleneck in deploying SAM-based methods in low-label regimes, obviating manual engineering or reliance on abundant meta-annotations.
- Decoupling pseudo-label generation from internal architecture biases (teacher-student homogeneity) via the SAM-based assistant increases the robustness and cross-domain transferability of SSL frameworks.
- The system's plug-and-play modularity positions it as a flexible platform for future research, enabling rapid benchmarking across combinations of foundational segmentation models and evolving SSL architectures.
- In practice, SAMatch significantly reduces annotation budgets required for high-quality segmentation, facilitating accelerated adoption of AI-assisted clinical pipelines, especially in resource-constrained environments.
Limitations and Future Directions
Despite pronounced performance gains, certain limitations and avenues for improvement are acknowledged:
- Prompt misalignment and over-segmentation remain concerns, particularly with minimal-context prompts (points vs. boxes). Incorporating structural priors or active prompt regularization may address these failure cases.
- The current instantiation is 2D; extending to 3D or video segmentation is a natural next step, leveraging recent segmentation foundation models (e.g., SAM 2, MedSAM-2).
- Knowledge distillation and multi-view feature transfer (from SAM to student networks) present promising avenues for further compression and domain adaptation.
Conclusion
SAMatch demonstrates that coupling high-capacity foundation models like SAM with consistency-driven semi-supervised learning yields state-of-the-art segmentation under extreme label scarcity. The intrinsic modularity, automatic prompt generation, and robust pseudo-labeling pipeline contribute both scientifically and practically to the development of annotation-efficient clinical image analysis workflows. Future research may extend this paradigm to 3D/temporal domains and integrate more advanced foundations for even broader utility.