Semi-Supervised Cervical Segmentation
- The paper demonstrates that semi-supervised learning bridges the performance gap to fully supervised methods by using limited labeled data and large unlabeled datasets in cervical imaging.
- It employs robust methodologies such as pseudo-labeling, consistency regularization, and human-in-the-loop strategies to enhance segmentation accuracy across ultrasound, cytology, and CT modalities.
- Empirical results from benchmarks like FUGC show high Dice scores and improved anatomical robustness, effectively overcoming annotation scarcity challenges.
Semi-supervised learning (SSL) in cervical segmentation refers to the set of techniques that exploit both limited labeled and abundant unlabeled data to train models capable of accurately segmenting cervical anatomical structures in medical images, particularly where full annotation is prohibitively expensive or infeasible. This approach leverages a variety of architectural, algorithmic, and statistical mechanisms—including pseudo-labeling, consistency regularization, knowledge distillation, and anatomical priors—to bridge the performance gap with fully supervised baselines using a fraction of the manual labels. As demonstrated by contemporary SSL benchmarks such as the Fetal Ultrasound Grand Challenge (FUGC), robust semi-supervised approaches have achieved near-parity with fully supervised methods in cervical segmentation across ultrasound, cytology, and radiotherapy imaging modalities (Bai et al., 22 Jan 2026).
1. Datasets and Benchmarking Paradigms
Recent progress in SSL for cervical segmentation is closely linked to the availability of curated datasets and evaluation protocols designed to stress-test learning under annotation scarcity. The FUGC benchmark represents the first semi-supervised testbed for 2D transvaginal ultrasound (TVS) cervical segmentation, consisting of 890 anonymized TVS frames (544×336 px ROI): 500 training (50 gold standard labeled, 450 unlabeled), 90 fully labeled validation, and 300 fully labeled test images. Annotations target anterior and posterior cervical lips, with gold masks produced via expert refinement of a SAM model. Unlabeled frames enable methods to exploit the underlying data distribution despite limited manual supervision (Bai et al., 22 Jan 2026).
In cytology, instance-level cervical cell segmentation leverages labeled crops of Pap smears (82 patients, 961 training, 98 test, ~9,000 cell/nucleus masks) supplemented with >4,000 unlabeled slides (Zhou et al., 2020). In the context of radiotherapy planning, large but imprecise CT archives with missing organ-at-risk (OAR) labels require SSL approaches with annotation imputation and uncertainty handling (1,170 CTs, 134 fully-labeled "clean," 984 with partial OARs) (Grewal et al., 2023).
2. Core Methodological Approaches
2.1 Pseudo-labeling and Consistency-driven SSL
A common paradigm involves leveraging model predictions on unlabeled data (pseudo-labels) as additional supervision. For example, the highest-performing FUGC approaches utilize nnUNet or U-Net backbones to generate initial pseudo-labels, which are iteratively refined via morphological post-processing and, optionally, human correction. The supervised loss on gold-labeled samples, $\mathcal{L}_{\text{sup}}$, is combined with a down-weighted unsupervised loss on pseudo-labels, $\mathcal{L}_{\text{unsup}}$, in a composite objective:

$\mathcal{L} = \mathcal{L}_{\text{sup}} + \lambda \, \mathcal{L}_{\text{unsup}},$

where $\lambda$ is tuned to suppress noise from unreliable pseudo-labels (Bai et al., 22 Jan 2026).
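This composite objective can be sketched as follows — a minimal NumPy illustration with a generic cross-entropy term; the function names and the value of the weight `lam` are illustrative, not taken from any FUGC entry:

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-7):
    """Pixel-wise cross-entropy between softmax maps and one-hot targets.
    pred, target: arrays of shape (H, W, n_classes)."""
    return float(-np.mean(np.sum(target * np.log(pred + eps), axis=-1)))

def composite_loss(pred_labeled, gold, pred_unlabeled, pseudo, lam=0.3):
    """L = L_sup(gold labels) + lam * L_unsup(pseudo-labels).
    lam < 1 down-weights supervision from unreliable pseudo-labels."""
    return cross_entropy(pred_labeled, gold) + lam * cross_entropy(pred_unlabeled, pseudo)
```

Setting `lam = 0` recovers the purely supervised objective, which is the usual sanity check before enabling the pseudo-label term.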
Consistency regularization further constrains model predictions to be invariant under plausible perturbations (e.g., augmentations, dropout). Mean Teacher (MT) frameworks utilize an exponential moving average (EMA) of student weights to define a teacher model, with a consistency loss enforcing agreement between student and teacher predictions on unlabeled data, optionally weighted by an entropy-adaptive schedule (Bai et al., 22 Jan 2026, V et al., 2023, Zhou et al., 2020).
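The two core Mean Teacher ingredients — the EMA weight update and the student-teacher consistency term — can be sketched generically (this assumes a flat list of weight arrays and an MSE consistency loss, not any specific FUGC implementation):

```python
import numpy as np

def ema_update(teacher_w, student_w, alpha=0.99):
    """Teacher weights as an exponential moving average (EMA) of student
    weights: theta_t <- alpha * theta_t + (1 - alpha) * theta_s."""
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher_w, student_w)]

def consistency_loss(student_pred, teacher_pred):
    """MSE between student and teacher softmax maps on unlabeled data."""
    return float(np.mean((student_pred - teacher_pred) ** 2))
```

With `alpha` close to 1, the teacher changes slowly and acts as a temporal ensemble of recent student states, which is what makes its pseudo-targets more stable than the student's own predictions.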
2.2 Human-in-the-loop and Ensemble Strategies
FUGC's rank-1 solution integrates a human-in-the-loop framework, wherein pseudo-labels are iteratively corrected by annotators and incorporated back into the training set, effectively blending iterative distillation with expert-in-the-loop refinement (Bai et al., 22 Jan 2026). Ensemble voting (e.g., majority voting among nnUNet/UNet clones) further stabilizes segmentation via model diversity.
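Pixel-wise majority voting over an ensemble of binary masks can be sketched as follows (a generic illustration of the voting step, not the teams' exact pipeline):

```python
import numpy as np

def majority_vote(masks):
    """Pixel-wise majority vote over an ensemble of binary masks.
    masks: array-like of shape (n_models, H, W) with values in {0, 1}.
    A pixel is foreground iff more than half the models predict it."""
    masks = np.asarray(masks)
    return (masks.sum(axis=0) * 2 > masks.shape[0]).astype(np.uint8)
```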
2.3 Dual/Co-training Networks and Contrastive Learning
Dual-network architectures utilize two distinct segmentation backbones (e.g., U-Net and Swin-UNet) that symmetrically cross-supervise each other on unlabeled samples, each using the other's prediction as a hard pseudo-label. No explicit confidence filtering is applied; both network predictions serve as mutual regularizers. Self-supervised contrastive learning (InfoNCE loss) on deep embeddings further encourages discrimination among samples from different images, improving representation learning, particularly in the unlabeled regime (Wang et al., 21 Mar 2025).
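The contrastive component can be sketched with a single-anchor InfoNCE term on L2-normalized embeddings (the temperature `tau` and the single-positive form are common defaults, assumed here rather than taken from the paper):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor embedding: pull the positive view close,
    push embeddings from other images away. All inputs L2-normalized.
    anchor, positive: (d,); negatives: (n_neg, d)."""
    pos = np.exp(np.dot(anchor, positive) / tau)
    neg = np.exp(negatives @ anchor / tau).sum()
    return float(-np.log(pos / (pos + neg)))
```

The loss is minimized when the anchor is far more similar to its positive than to any negative, which is what drives inter-image discrimination in the unlabeled regime.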
2.4 Knowledge Distillation and Region-specific Consistency
Instance-level cervical cell segmentation employs Mask-guided Mean Teacher (MMT) frameworks, enforcing dual-level consistency:
- Semantic-level: Teacher pseudo-labels are generated via self-ensembling of multiple augmentations, then sharpened to reduce entropy and focus on confident samples.
- Feature-level: Foreground-masked distillation ensures that only model activations within predicted cell regions are aligned via an adaptation loss, suppressing noise from background regions (Zhou et al., 2020).
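The sharpening step can be illustrated with standard temperature sharpening (a common operator in consistency-based SSL; the MMT paper's exact operator may differ):

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature-sharpen a probability vector: raising to 1/T with T < 1
    lowers entropy, so the consistency target focuses on confident classes."""
    q = p ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)
```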
Perturbation-sensitive sample mining (PSM) refines which background proposals contribute to unsupervised loss by selecting the most uncertain (high-variance) predictions, dramatically improving semantic consistency on difficult cases.
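A generic instance of this mining step — selecting the proposals whose predictions vary most across perturbed forward passes — can be sketched as (the variance statistic and top-k selection are illustrative assumptions):

```python
import numpy as np

def perturbation_sensitive_mining(pred_stack, k):
    """Select the k proposals whose class predictions vary most across
    perturbed forward passes (highest mean per-class variance).
    pred_stack: (n_perturbations, n_proposals, n_classes).
    Returns indices of the k most uncertain proposals."""
    variance = np.var(pred_stack, axis=0).mean(axis=-1)  # (n_proposals,)
    return np.argsort(variance)[-k:]
```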
2.5 Anatomically-aware and Uncertainty-guided SSL
Uncertainty estimation is increasingly integrated to address label noise and anatomical implausibility. Anatomically-aware frameworks learn a denoising autoencoder (DAE) that defines a low-dimensional shape manifold of plausible cervical anatomies. The deviation between a model prediction and its projection onto the manifold yields a per-voxel uncertainty map used to weight consistency losses. This approach enables single-pass (efficient) uncertainty mapping and robust leveraging of scarce gold masks (V et al., 2023).
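This manifold-deviation idea can be sketched as follows, with `dae_project` standing in for a trained denoising autoencoder (a minimal assumption-laden illustration, not the authors' implementation):

```python
import numpy as np

def anatomical_uncertainty(pred, dae_project):
    """Per-voxel uncertainty as the deviation between a prediction and its
    projection onto a learned shape manifold (dae_project is a stand-in
    for a trained denoising autoencoder)."""
    return np.abs(pred - dae_project(pred))

def weighted_consistency(student, teacher, uncertainty):
    """Down-weight the consistency loss where predictions deviate from
    the anatomically plausible manifold."""
    w = 1.0 - uncertainty
    return float(np.mean(w * (student - teacher) ** 2))
```

With an identity projection (predictions already on the manifold), the uncertainty vanishes and the weighted term reduces to plain MSE consistency.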
Teacher-student imputation frameworks fill missing mask annotations using a multi-head teacher ensemble and train a student on both real and imputed (pseudo) labels. An uncertainty-weighted cross-entropy loss, whose per-voxel weights decrease as teacher-ensemble uncertainty grows, attenuates the influence of highly uncertain voxels during student learning (Grewal et al., 2023).
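One plausible form of such a loss is sketched below; the specific weighting `w = 1 - uncertainty` is an assumption for illustration, not the paper's exact formula:

```python
import numpy as np

def uncertainty_weighted_ce(pred, target, uncertainty, eps=1e-7):
    """Cross-entropy with per-voxel weights that decay with teacher-ensemble
    uncertainty, so imputed labels on uncertain voxels contribute less.
    pred, target: (n_voxels, n_classes); uncertainty: (n_voxels,) in [0, 1]."""
    w = 1.0 - uncertainty
    ce = -np.sum(target * np.log(pred + eps), axis=-1)
    return float(np.sum(w * ce) / (np.sum(w) + eps))
```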
3. Training Objectives, Loss Functions, and Hyperparameters
SSL methods universally utilize composite losses that balance supervised and unsupervised (or consistency) components. Common ingredients include:
- Dice similarity and cross-entropy losses for segmentation overlap and pixel-wise classification on labeled data.
- Mean squared error (MSE) or cross-entropy consistency on unlabeled data, according to pseudo-labels or teacher-student predictions.
- Dynamic ramp-up schedules for the unsupervised or consistency loss weight, deferring unsupervised regularization until models are stable.
- Weighting functions based on uncertainty (entropy, deviation from anatomical prior) or perturbation-sensitivity variance.
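The ramp-up schedules above are commonly implemented as a Gaussian/sigmoid ramp in the Mean Teacher literature; the length of 40 epochs below is an illustrative default, not a value from any of the cited papers:

```python
import numpy as np

def sigmoid_rampup(epoch, rampup_length=40.0):
    """Gaussian ramp-up from ~0 to 1 over `rampup_length` epochs, used to
    scale the unsupervised loss weight while the model stabilizes."""
    if rampup_length == 0:
        return 1.0
    t = np.clip(epoch / rampup_length, 0.0, 1.0)
    return float(np.exp(-5.0 * (1.0 - t) ** 2))
```

The unsupervised weight at epoch `e` is then typically `lam_max * sigmoid_rampup(e)`.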
Data augmentation (rigid transforms, elastic deformations, color jitter, intensity normalization) is critical for both anatomical priors and robust segmentation under limited manual annotation (V et al., 2023, Wang et al., 21 Mar 2025).
4. Empirical Results and Performance Benchmarks
Table: Selected Performance on Cervical Segmentation Benchmarks
| Method & Reference | Modality | Labeled Data | mDSC (%) | mHD (px or mm) | Inference Time (ms) |
|---|---|---|---|---|---|
| T1 ("Human-in-the-Loop") | TVS (FUGC) | 50/500 | 90.08 | 40.39 | 315.9 |
| T4 ("Pseudo-Labels+Voting") | TVS (FUGC) | 50/500 | 90.26 | 38.88 | 652.4 |
| T6 ("Co-Training MT") | TVS (FUGC) | 50/500 | 85.76 | 63.62 | 32.86 |
| Dual-Net+Contrastive (Wang et al., 21 Mar 2025) | US (muscle) | 50 | 86.0 | 46.44 | 16.95 |
| MMT-PSM (Zhou et al., 2020) | Cytology | 96/961+4,371 | 63.45* | – | – |
| TS-Uncertainty (Grewal et al., 2023) | CT (OARs) | 134/1,170 | 87.16 | 9.92 (HD95) | – |
*AJI average, see paper for details.
FUGC demonstrates that with only 50 gold standard TVS segmentations, multiple teams matched the Dice overlap of fully supervised models (90–93%) trained on 100–300 labels, indicating substantial label-efficiency gains (Bai et al., 22 Jan 2026). Dual-network and contrastive approaches on muscle ultrasound also surpassed conventional models in both Dice and Hausdorff metrics (Wang et al., 21 Mar 2025). Instance segmentation of cervical cells showed improvements of up to +2.98% AJI and +2.92% mAP with only 10% labeled data when compared to supervised baselines (Zhou et al., 2020). In radiotherapy segmentation, a combination of robust annotation imputation, uncertainty-weighting, and aggressive data cleaning yields a mean Dice of 87.16% on OARs, outperforming standard 3D U-Net baselines (Grewal et al., 2023).
5. Challenges, Limitations, and Common Failure Modes
- Pseudo-label reliability varies spatially, particularly for anatomically variable subregions (e.g., posterior lip in the cervix), which can compromise generalization if not properly down-weighted or filtered (Bai et al., 22 Jan 2026).
- Annotation scarcity complicates inter-observer variability assessment and limits multi-center generalization.
- Domain shift (rare pathological cases or scanner variance) remains an open challenge, with limited evidence for robustness provided in current single-center benchmarks.
- Noise from trivial background predictions (e.g., in cell proposals) is addressed by perturbation-sensitive selection or foreground-masked feature alignment (Zhou et al., 2020).
- Highly uncertain or anatomically implausible predictions are now handled by adaptive loss-weighting, but require reliable estimation of per-voxel or proposal-level uncertainty—a nontrivial problem under severe label constraints (V et al., 2023, Grewal et al., 2023).
6. Future Directions and Open Research Questions
Emerging directions informed by recent SSL benchmarks include:
- Aggregation and harmonization of multi-expert or multi-center annotations to quantify label uncertainty and variance, potentially via ensemble methods or federated learning (Bai et al., 22 Jan 2026).
- Incorporation of advanced pretraining (self-supervised, multi-modal) to mitigate domain shift and facilitate adaptation to rare clinical subtypes.
- Dynamic, uncertainty-aware loss schedules, and refined anatomically-based consistency objectives to further suppress the impact of label noise and anatomical outliers.
- Hardware-aware model compression and automated architecture search targeting real-time inference for edge deployment in clinical environments (Bai et al., 22 Jan 2026).
- Extension of perturbation- and uncertainty-driven sample mining protocols beyond cytology to ultrasound and cross-sectional modalities.
A plausible implication is that the convergence of anatomical priors, robust pseudo-labeling, scalable ensemble strategies, and domain-adaptive consistency frameworks sets the stage for clinically viable, highly data-efficient AI systems in maternal-fetal medicine and gynecologic oncology.
References:
- FUGC Benchmark: "FUGC: Benchmarking Semi-Supervised Learning Methods for Cervical Segmentation" (Bai et al., 22 Jan 2026)
- Anatomical uncertainty: "Anatomically-aware Uncertainty for Semi-supervised Image Segmentation" (V et al., 2023)
- Knowledge distillation (cytology): "Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation" (Zhou et al., 2020)
- Dual framework: "Semi-supervised Cervical Segmentation on Ultrasound by A Dual Framework for Neural Networks" (Wang et al., 21 Mar 2025)
- Radiotherapy OARs: "Clinically Acceptable Segmentation of Organs at Risk in Cervical Cancer Radiation Treatment from Clinically Available Annotations" (Grewal et al., 2023)