ULS23 Challenge Leaderboard for Lesion Segmentation
- ULS23 Challenge Test-Phase Leaderboard is a benchmarking platform that ranks fully automatic 3D lesion segmentation methods on CT data using a multi-metric composite score.
- It employs a robust scoring system combining segmentation performance, measurement accuracy, and stability to drive methodological innovations in lesion evaluation.
- The leaderboard enforces standardized Docker-based submissions and detailed reporting under controlled hardware conditions, promoting reproducible research.
The ULS23 Challenge Test-Phase Leaderboard ranks algorithms for 3D universal lesion segmentation in chest-abdomen-pelvis CT data, providing a standardized infrastructure, benchmarking dataset, and robust evaluation protocol for clinically relevant lesion segmentation across diverse anatomical sites. The leaderboard reports combined challenge scores (CS) derived from multiple segmentation and measurement metrics, offering a rigorous comparative framework for fully automatic methods in realistic, multi-type lesion environments (Grauw et al., 2024).
1. Dataset Composition and Challenge Scope
The ULS23 test-phase dataset comprises 268 patients and 725 lesions (after validation-set exclusion), sourced from the Radboudumc and JBZ centers. Lesions span lymph nodes, kidney, lung, liver, pancreas, colon, bone, peritoneal implants, and breast; 51.4 % were clinically targeted in radiology reports. The dataset documentation reports the scanner-manufacturer distribution and sex balance (53.5 % male, 46.5 % female). Each lesion is annotated for volumetric segmentation and linear measurement, supporting evaluation of both size measurement and segmentation consistency. The dataset enforces anatomical and scanner diversity, underpinning reliable model generalization (Grauw et al., 2024).
2. Evaluation Metrics and Mathematical Framework
Leaderboard ranking is predicated on a composite challenge score (CS), a weighted combination of the following components:
- Segmentation performance: mean Dice coefficient over all lesions,
- Long- and short-axis measurement errors, computed as the symmetric mean absolute percentage error (SMAPE),
- Robustness: mean pairwise Dice across predictions for VOI re-centerings (“clicks”).
This multifactorial scoring schema prioritizes segmentation accuracy but incorporates geometric measurement precision and robustness to VOI centering variability (Grauw et al., 2024). For ULS+, a related robustness metric is reported as mean pairwise Dice over three VOI crops (Weber et al., 6 Jan 2026).
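The per-lesion components above can be sketched as follows, assuming binary NumPy masks and axis measurements in millimetres; the exact CS weighting is defined by the challenge organizers and is not reproduced here, and the function names are illustrative rather than the official evaluation code.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def smape(pred_mm: float, gt_mm: float) -> float:
    """Symmetric mean absolute percentage error for one axis measurement."""
    denom = abs(pred_mm) + abs(gt_mm)
    return abs(pred_mm - gt_mm) / denom if denom > 0 else 0.0

def mean_pairwise_dice(masks: list) -> float:
    """Robustness: mean Dice over all unordered pairs of predictions
    produced for re-centered VOIs of the same lesion."""
    n = len(masks)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([dice(masks[i], masks[j]) for i, j in pairs]))
```

These components are then aggregated (with weights set by the organizers) into the single CS value reported on the leaderboard.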
3. Submission Requirements and Infrastructure
Leaderboard participation mandates submission of a Docker container implementing fully automatic, non-interactive inference on 32-bit NIfTI input (VOI size 256 × 256 × 128). Each entry processes the hidden test set under standardized hardware constraints—NVIDIA T4 GPU (16 GB VRAM), 8 vCPUs, 32 GB RAM—with a time limit of 9 minutes per 100 lesions. Algorithms must be described in a manuscript covering architecture, data sources, loss functions, and hyperparameters; use of external data is allowed if properly documented. Submissions are ranked on the CS metric without tie-break rules; open benchmarking continues beyond the original event (Grauw et al., 2024).
4. Test-Phase Leaderboard Results
The leaderboard, as summarized in the report on ULS+ (Weber et al., 6 Jan 2026), contains a single ranked method:
| Method | Rank | Challenge Score (Dice-based) |
|---|---|---|
| ULS+ | 1 | 0.749 |
ULS+ achieved the highest challenge score, improving mean Dice from 0.74 ± 0.20 (ULS baseline) to 0.78 ± 0.15 (p < 0.0001), denoting a ~ 5 % relative gain. Robustness to click-point variability increased from 0.81 ± 0.24 (ULS) to 0.86 ± 0.20 (ULS+), reflecting more stable predictions under VOI center shifts. The original ULS baseline (test Dice 0.703 ± 0.240 (Grauw et al., 2024)) and detailed per-type scores confirm consistent improvements across anatomical regions.
5. Methodological Innovations in ULS+
ULS+ introduces two principal advancements:
- Incorporation of six additional fully annotated public CT-lesion datasets, expanding lesion type diversity.
- Train-time augmentation via random “click shifts” within each lesion, increasing model resilience to VOI center misplacement.
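The click-shift augmentation can be sketched as below; `random_click_shift_crop` is a hypothetical helper, not the authors' implementation, and assumes the image is at least as large as the VOI along each axis (crops are clipped at the volume borders).

```python
import numpy as np

def random_click_shift_crop(image, lesion_mask, voi_shape=(64, 128, 128), rng=None):
    """Crop a VOI centered on a random voxel inside the lesion mask,
    simulating an imprecise radiologist 'click' at train time."""
    rng = rng or np.random.default_rng()
    coords = np.argwhere(lesion_mask)             # candidate click points
    center = coords[rng.integers(len(coords))]    # random in-lesion voxel
    starts = [int(np.clip(c - s // 2, 0, dim - s))
              for c, s, dim in zip(center, voi_shape, image.shape)]
    slices = tuple(slice(st, st + s) for st, s in zip(starts, voi_shape))
    return image[slices]
```

Training on such randomly re-centered crops exposes the model to the VOI-center misplacement it will encounter at inference, which is what the robustness metric rewards.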
The technical workflow uses a reduced VOI size (128 × 128 × 64 voxels) with nnUNet v2 and a residual encoder (size L), resampling disabled, and test-time augmentation over three crops (the lesion centroid plus two random points, fused via majority voting/averaging into the final mask). Inference executes in ≈ 0.5 s per VOI on an NVIDIA A100 SXM4. Robustness is assessed as the mean pairwise Dice between segmentations obtained from the normal and augmented VOI centers, quantifying output stability (Weber et al., 6 Jan 2026).
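The fusion step over the three test-time crops can be sketched as a voxel-wise majority vote; this assumes the per-crop predictions have already been resampled back to a common reference frame, and `majority_vote` is an illustrative name rather than the authors' code.

```python
import numpy as np

def majority_vote(masks):
    """Fuse multiple binary predictions of the same VOI: a voxel is
    foreground if at least half of the predictions mark it foreground."""
    stack = np.stack(masks).astype(np.float32)
    return (stack.mean(axis=0) >= 0.5).astype(np.uint8)
```

With three predictions, this keeps exactly the voxels that at least two crops agree on, which damps out spurious differences introduced by the shifted VOI centers.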
6. Prospective Improvements and Research Directions
Planned enhancements for future iterations include targeted loss weighting or over-sampling for large/complex lesions, multi-reader segmentations and consensus annotation protocols to reduce subjectivity, and exploration of performance–speed tradeoffs via ensemble and augmentation strategies. Metric expansion covers rare lesion types, other modalities (e.g., MRI), and advanced spatial measures such as the Hausdorff distance. Further work may investigate downstream clinical impacts, including workflow efficiency gains and reductions in inter-observer variability (Grauw et al., 2024).
7. Context, Limitations, and Significance
The ULS23 leaderboard provides a transparent, extensible platform for benchmarking universal lesion segmentation algorithms. While initial leaderboard results are limited to ULS+ and direct baseline comparison, this framework anchors iterative development and validation, with an open submission policy and rigorous protocol design. The focus on multi-type, clinically annotated lesions and robust, multi-metric scores positions ULS23 as a reference standard for future segmentation research. The absence of additional ranked methods in reported results suggests an ongoing need for wider benchmarking and cross-team participation. Systematic evaluation of robustness, measurement fidelity, and anatomical coverage underpins the leaderboard’s relevance for real-world, radiology-centered algorithm assessment (Weber et al., 6 Jan 2026, Grauw et al., 2024).