Fetal Ultrasound Grand Challenge (FUGC)

Updated 29 January 2026

The Fetal Ultrasound Grand Challenge (FUGC) is a coordinated benchmark initiative that standardizes tasks such as segmentation, landmark localization, and biometry in obstetric ultrasound.
It provides extensive public datasets with rigorous annotations to address variability in image quality and label scarcity, enabling reproducible evaluation across multi-center data.
FUGC drives multi-institutional collaboration and diverse algorithmic approaches to enhance clinical applications such as labor progress assessment, fetal growth monitoring, and preterm birth risk prediction.

The Fetal Ultrasound Grand Challenge (FUGC) is a coordinated series of international benchmarks, public datasets, and algorithmic competitions dedicated to advancing automated analysis of fetal and maternal structures in obstetric ultrasound. Through multi-institutional collaborations, FUGC defines standardized tasks, metrics, and evaluation protocols for problems such as structure segmentation, anatomical landmark localization, biometric measurement, and semi-supervised learning under label scarcity in both transabdominal and transvaginal ultrasound modalities. Building on a lineage of prior challenges, FUGC’s recent editions have focused on robust segmentation and biometry across varied imaging hardware and populations, efficient learning from limited or weak supervision, and translation of AI models to clinically relevant endpoints such as preterm birth risk and labor progression assessment.

1. Scope and Rationale

FUGC aims to provide reproducible, clinically pertinent benchmarks for core tasks in fetal and maternal ultrasound analysis. The challenge structure addresses several persistent barriers in the field:

Variability in acquisition: Operator-dependent image quality, anatomical heterogeneity, and protocol differences impede model generalization.
Limited expert annotation: Manual labeling is labor-intensive, leading to label scarcity, especially for large or multi-center datasets.
Critical clinical endpoints: Reliable quantification of fetal biometry, cervical length, or labor progress requires both robust segmentation and measurement consistency across centers and devices.

Tasks range from pubic symphysis and fetal head segmentation in intrapartum imaging to landmark-based biometry in standardized planes, cervical segmentation for preterm birth risk, and zero/few-shot evaluation with foundation models.

2. Datasets and Annotation Protocols

FUGC makes available several large-scale, rigorously annotated datasets. Notable datasets and protocols include:

PSFHS (Pubic Symphysis–Fetal Head Segmentation): 5,101 intrapartum US images (256×256 BMP; 1,175 women, ≥ 37 weeks) from multiple devices and sites, annotated by two raters and adjudicated by experts. Contours are provided for the pubic symphysis (PS) and fetal head (FH); intra- and inter-rater Dice scores are approximately 87–90% (Bai et al., 2024).
Landmark-based Biometry: 4,513 standard-plane frames (head, abdomen, femur) from 1,904 subjects across three European centers and seven ultrasound devices, with manual and ellipse-derived anatomical landmark annotation, plus comprehensive quality control and cross-center subject-disjoint splits (Vece et al., 18 Dec 2025).
Cervical Segmentation (TVS): 890 transvaginal ultrasound images, of which 50 are expertly labeled and 450 are unlabeled training images for semi-supervised learning; the remainder form separate validation and test sets. Annotations delineate the anterior and posterior cervical lips, guided by the ISUOG cervical length protocol (Bai et al., 22 Jan 2026, Le et al., 14 Apr 2025).

Annotation tools, rater training, and adjudication protocols are documented, and datasets are made publicly available with evaluation scripts.
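The subject-disjoint splits mentioned for the biometry dataset can be illustrated with a minimal sketch. It assumes a hypothetical mapping from frame identifiers to subject identifiers; the function name, the 20% test fraction, and the seed are illustrative choices, not the challenge's actual split procedure.

```python
import random
from collections import defaultdict


def subject_disjoint_split(frame_to_subject, test_fraction=0.2, seed=0):
    """Split frame ids so that no subject contributes to both train and test.

    frame_to_subject: dict mapping a frame id to its subject id (hypothetical layout).
    """
    # Group frames by subject so whole subjects are assigned to one side.
    by_subject = defaultdict(list)
    for frame_id, subject_id in frame_to_subject.items():
        by_subject[subject_id].append(frame_id)

    # Shuffle subjects (not frames) with a fixed seed for reproducibility.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    train = [f for s in subjects if s not in test_subjects for f in by_subject[s]]
    test = [f for s in subjects if s in test_subjects for f in by_subject[s]]
    return sorted(train), sorted(test)
```

Splitting at the subject level rather than the frame level prevents near-duplicate frames from the same examination from leaking between training and evaluation sets.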

3. Benchmark Tasks and Evaluation Metrics

FUGC defines a suite of standardized tasks and metrics:

Segmentation: Pixel-wise delineation of anatomical targets (e.g., PS, FH, cervix) using metrics such as Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), and Average Surface Distance (ASD). Combined and per-class scores are reported (Bai et al., 2024, Bai et al., 22 Jan 2026).
Landmark Localization & Biometry: Precise placement of anatomical landmarks for head, abdomen, and femur measurements. Performance is evaluated via mean absolute error (MAE), standard deviation (STD), normalized mean error (NME), and cumulative error distributions, converted to millimeters using provided pixel scaling factors (Vece et al., 18 Dec 2025).
Classification: Frame-level or video-level recognition of standard anatomical planes (e.g., standard-plane vs. non-standard), assessed by accuracy, F1, AUC, and MCC (Ramesh et al., 20 May 2025).
Semi-Supervised Learning: For tasks with extreme label scarcity, methods must efficiently utilize large quantities of unlabeled data. Evaluation couples overlap (mDSC), boundary (mHD), and run-time (RT) with weighted final scores (Bai et al., 22 Jan 2026).
Biometric Computation: Angle of Progression (AoP) and Head–Symphysis Distance (HSD) are derived automatically from segmentations, enabling direct comparison against ground truth via mean absolute error in degrees or millimeters (Xia et al., 4 Jun 2025, Ramesh et al., 20 May 2025).
Domain Generalization: Challenge designs include in-domain, cross-domain, and multi-center tracks, directly quantifying how methods perform under domain shift (Vece et al., 18 Dec 2025).
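As a rough illustration of the segmentation metrics listed above, the following sketch computes the Dice Similarity Coefficient and a symmetric Hausdorff distance for binary masks stored as nested lists of 0/1 values. It is a simplified stand-in, not the official evaluation script: distances are in pixels, the Hausdorff distance is taken over all foreground pixels rather than extracted boundaries, and the function names are illustrative.

```python
import math


def foreground(mask):
    """Return the set of (row, col) coordinates where the mask is nonzero."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred, truth):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|)."""
    p, t = foreground(pred), foreground(truth)
    if not p and not t:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2 * len(p & t) / (len(p) + len(t))


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance (in pixels) between two foreground sets."""
    p, t = foreground(pred), foreground(truth)
    if not p or not t:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(a, b):
        # Largest distance from any point in a to its nearest point in b.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, t), directed(t, p))
```

The brute-force nearest-neighbor search is quadratic in the number of foreground pixels, which is acceptable for small masks but would normally be replaced by a distance transform or a spatial index in practice.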

These metrics collectively assess not only algorithmic accuracy but also robustness to clinical variability and computational practicality.
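For the landmark task, the conversion from pixel errors to millimeters can be sketched as follows. This is an illustrative example only: it measures error as the Euclidean distance between predicted and reference landmarks, and it assumes that the normalized mean error divides by the reference distance between the two endpoints of a measurement. The normalization actually used by the benchmark is not stated here, and all function and parameter names are hypothetical.

```python
import math


def landmark_errors_mm(pred, truth, mm_per_pixel):
    """Per-landmark Euclidean error, converted from pixels to millimeters."""
    return [math.dist(p, t) * mm_per_pixel for p, t in zip(pred, truth)]


def mean_and_std(values):
    """Mean and population standard deviation of a list of errors."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalized_mean_error(pred, truth):
    """Mean landmark error divided by the reference endpoint distance.

    Assumes truth holds exactly the two endpoints of one measurement.
    """
    reference_length = math.dist(truth[0], truth[1])
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(errors) / len(errors) / reference_length
```

Because the normalized error is a ratio of two pixel distances, it does not depend on the pixel scaling factor, which makes it convenient for comparing results across devices with different resolutions.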

4. Algorithms and Winning Approaches

FUGC encourages methodological diversity, including convolutional, transformer, and semi-supervised frameworks. Representative approaches include:

Model/Team	Backbone	Task(s)	Notable Features
Aloha (PSFHS)	Segment Anything (ViT-H)	PSFH segmentation	LoRA fine-tuning, ensembling, strong augmentation, post-processing
angle_avengers	nnU-Net (residual)	PSFH segmentation	Hausdorff+Focal loss, 5-fold ensemble
CQUT-Smart	DSSAU-Net	PS, FH segmentation	Dual Sparse Selection Attention, multi-scale fusion
FetalCLIP	ViT-L (CLIP-based)	Multi-task	Contrastive vision-language pre-training, zero/few-shot
HDC	ResNet50, single-teacher	Cervical SSL	Correlation guidance + mutual information distillation
T1 (FUGC-ISBI)	U-Net, human-in-the-loop	Cervical SSL	Multi-stage pseudo-labeling, ensembling, Label Studio
BiometryNet	HRNet-W18	Biometry	Heatmap regression, orientation vector for endpoint order

The best reported segmentation scores (Dice for PSFH, mDSC for the cervix) approach 92–93% against expert annotations. In cross-domain biometry, models trained on multi-center data attain the lowest NME (Bai et al., 2024, Vece et al., 18 Dec 2025, Bai et al., 22 Jan 2026).

5. Clinical and Scientific Impact

FUGC outputs directly advance core aims in perinatal medicine:

Labor Progress Assessment: Automated PS and FH segmentation enables reproducible computation of AoP and HSD. These measurements are strongly correlated with labor outcome and reduce reliance on subjective digital examinations (Xia et al., 4 Jun 2025, Bai et al., 2024).
Fetal Growth Monitoring: Standardized biometric landmark detection supports cross-center, cross-device fetal growth assessment; the benchmark results also highlight the need for domain adaptation before clinical deployment (Vece et al., 18 Dec 2025).
Preterm Birth Risk: Robust, label-efficient cervical segmentation supports scalable screening for preterm birth in both resource-rich and low-resource settings, overcoming bottlenecks due to annotation cost and inter-operator variability (Bai et al., 22 Jan 2026, Le et al., 14 Apr 2025).
Foundation Models in US: Application of large-scale multimodal pre-training (e.g., FetalCLIP) achieves superior zero/few-shot generalization on ultrasound compared to prior task-specific models or generic VLMs, with implications for transferability and annotation-efficiency (Maani et al., 20 Feb 2025).

The standardized datasets, codebases, and metric suites provided by FUGC have become reference resources for both the computer vision and clinical communities.
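To make the labor progress measurements more concrete, the sketch below estimates the Angle of Progression from geometry already extracted from a segmentation. AoP is conventionally defined as the angle between the long axis of the pubic symphysis and the line drawn from its inferior edge tangent to the fetal skull. The example assumes that the two symphysis endpoints and a list of fetal head contour points are already available, that the head lies entirely on one side of the symphysis axis, and that the tangent on the leading edge of the skull corresponds to the largest angle. The function names are hypothetical and this is not the pipeline used by any cited team.

```python
import math


def angle_between_deg(u, v):
    """Unsigned angle in degrees between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def angle_of_progression(ps_superior, ps_inferior, head_contour):
    """Approximate AoP with its vertex at the inferior symphysis endpoint.

    The tangent to the head is approximated by the contour point whose ray
    from the vertex makes the largest angle with the symphysis axis.
    """
    axis = (ps_superior[0] - ps_inferior[0], ps_superior[1] - ps_inferior[1])
    best = 0.0
    for point in head_contour:
        ray = (point[0] - ps_inferior[0], point[1] - ps_inferior[1])
        if ray == (0, 0):
            continue  # skip a contour point that coincides with the vertex
        best = max(best, angle_between_deg(axis, ray))
    return best
```

With a densely sampled contour the discrete maximum closely approximates the true tangent angle; a sparse contour would underestimate it slightly.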

6. Limitations, Open Challenges, and Future Directions

Despite progress, several limitations persist:

Most segmentation tasks have been limited to 2D; 3D+t (spatiotemporal) analysis and video-based biometry remain underexplored (Bai et al., 2024, Xia et al., 4 Jun 2025).
Pretraining often overrepresents specific gestational ages or normal anatomy, potentially limiting generalization to early/late gestation or pathological cases (Maani et al., 20 Feb 2025).
Inter-operator and inter-device annotation variability is incompletely characterized; multi-rater annotation releases and calibration are needed for robust benchmarking (Vece et al., 18 Dec 2025, Bai et al., 22 Jan 2026).
Real-time integration remains challenging. High-accuracy transformer models often exhibit increased inference time or resource usage, motivating lightweight distillation and architectural pruning (Bai et al., 22 Jan 2026).
Downstream clinical endpoints—functional assessment, risk stratification, integration into electronic health records—require ongoing validation, especially in prospective, multi-institutional studies (Ramesh et al., 20 May 2025, Xia et al., 4 Jun 2025).

Future FUGC editions are expected to address video/sequence labeling, 3D segmentation, detailed biometric regression, and joint modeling of multiple obstetric structures.

7. Resources and Community Engagement

FUGC datasets, code, and evaluation platforms are openly released to facilitate reproducible research:

Data splits on Zenodo (e.g., PSFHS, BiometryNet datasets)
Challenge leaderboards, code containers, and evaluation tools (e.g., BIAs, ChallengeR, official CodaLab leaderboards)
Model implementations (e.g., DSSAU-Net, FetalCLIP, pipeline codebases for cervical SSL and biometry)

Active community engagement is fostered via MICCAI, ISBI, and other conferences, with results tracked in standardized leaderboards and recurrent evaluation cycles.

FUGC serves as a central resource for fetal and maternal ultrasound AI. It provides the scale, rigor, and transparency required for systematic benchmarking, and it supports the translation of algorithmic advances into clinical use across diverse real-world settings (Maani et al., 20 Feb 2025, Bai et al., 2024, Vece et al., 18 Dec 2025, Ramesh et al., 20 May 2025, Bai et al., 22 Jan 2026, Xia et al., 4 Jun 2025, Le et al., 14 Apr 2025).