WMH Segmentation Challenge at MICCAI
- WMH segmentation is the automated process of delineating white matter hyperintensities on MRI, critical for monitoring vascular aging and small vessel disease.
- The challenge provided a rigorously curated multi-center dataset with standardized preprocessing and evaluation metrics like DSC, HD95, and lAVD to ensure fair comparison among methods.
- Leading strategies leverage ensemble U-Net architectures, multi-scale aggregation, and advanced data augmentation to achieve state-of-the-art performance and robust inter-scanner generalization.
White matter hyperintensities (WMH) are radiological lesions visible on MRI, commonly associated with vascular aging and small vessel disease. Precise segmentation of WMH is essential for clinical research and disease monitoring. The WMH Segmentation Challenge at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference was established as the first multi-center/multi-scanner public benchmark to systematically evaluate automatic WMH segmentation methods using standardized data, metrics, and blind testing (Kuijf et al., 2019).
1. Challenge Structure, Dataset, and Evaluation Protocol
The MICCAI WMH Segmentation Challenge provided a rigorously curated dataset—60 co-registered T1-weighted and 2D multi-slice FLAIR MRI volumes from three centers for training, with 110 test volumes sequestered from five scanners (including two not present in training) for final evaluation. All FLAIR images were resampled to a common slice thickness (3 mm), bias-corrected, skull-stripped, and registered to their respective T1 volumes. Manual WMH segmentations were produced by multiple raters under the STRIVE criteria, with peer review to minimize annotation bias. Submissions were strictly Dockerized to ensure reproducibility and enforce container-level isolation during blinded assessment (Kuijf et al., 2019).
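One preprocessing step described above, resampling FLAIR volumes to a common 3 mm slice thickness, can be sketched minimally with scipy (the function name and interface are illustrative assumptions, not the challenge's actual pipeline):

```python
import numpy as np
from scipy.ndimage import zoom

def resample_slice_thickness(volume, z_spacing, target_thickness=3.0):
    """Resample the axial (z) axis of an MRI volume to a target slice
    thickness, keeping in-plane resolution unchanged (linear interpolation)."""
    factor = z_spacing / target_thickness
    return zoom(volume, (1.0, 1.0, factor), order=1)
```

For example, a FLAIR volume acquired with 1.5 mm slices is reduced to half as many slices; the real pipeline additionally applied bias correction, skull stripping, and T1 registration.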
Evaluation employed five principal metrics:
- Dice similarity coefficient (DSC): DSC = 2|S ∩ G| / (|S| + |G|), the volumetric overlap between the predicted segmentation S and the ground truth G.
- 95th-percentile modified Hausdorff distance (HD95): the 95th percentile of boundary-to-boundary distances between S and G, which discounts outlier surface points.
- Absolute volume difference (lAVD): |V_S − V_G| / V_G × 100%, the lesion-volume error expressed as a percentage of the reference volume.
- Lesion-level recall (sensitivity): fraction of detected lesions out of all ground-truth components.
- Lesion-level F1-score: precision/recall harmonic mean for individual lesion detection.
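The overlap, volume, and lesion-level metrics above can be computed from binary masks with numpy and scipy connected-component labeling; a minimal sketch (function names are illustrative, not the official evaluation code):

```python
import numpy as np
from scipy import ndimage

def dsc(seg, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(seg, gt).sum()
    return 2.0 * inter / (seg.sum() + gt.sum())

def lavd(seg, gt):
    """Absolute volume difference as a percentage of the reference volume."""
    return abs(float(seg.sum()) - float(gt.sum())) / float(gt.sum()) * 100.0

def lesion_recall_f1(seg, gt):
    """Lesion-level recall and F1 via connected components."""
    gt_lab, n_gt = ndimage.label(gt)
    seg_lab, n_seg = ndimage.label(seg)
    # a ground-truth lesion counts as detected if any predicted voxel overlaps it
    detected = sum(1 for i in range(1, n_gt + 1) if seg[gt_lab == i].any())
    # a predicted lesion is a true positive if it overlaps any ground-truth voxel
    tp_pred = sum(1 for i in range(1, n_seg + 1) if gt[seg_lab == i].any())
    recall = detected / n_gt if n_gt else 0.0
    precision = tp_pred / n_seg if n_seg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1
```

HD95 additionally requires boundary-distance computation (e.g., via distance transforms) and is omitted here for brevity.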
Final ranking averaged normalized ranks across these metrics over all test cases and included an “inter-scanner robustness” score by computing standard deviation of per-scanner median performance (Kuijf et al., 2019).
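The rank-averaging scheme can be sketched as follows (a simplified illustration; the official ranking also folds in the per-scanner robustness term):

```python
import numpy as np

def average_normalized_ranks(metric_scores, higher_is_better):
    """Average each method's normalized rank (0 = best, 1 = worst) over metrics.

    metric_scores: dict mapping metric name -> list of per-method values.
    higher_is_better: dict mapping metric name -> bool.
    """
    n_methods = len(next(iter(metric_scores.values())))
    total = np.zeros(n_methods)
    for name, values in metric_scores.items():
        v = np.asarray(values, dtype=float)
        key = -v if higher_is_better[name] else v  # smaller key = better
        ranks = key.argsort().argsort()            # 0-based rank per method
        total += ranks / (n_methods - 1)           # normalize to [0, 1]
    return total / len(metric_scores)
```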
2. Leading Automated Segmentation Methodologies
Early challenge-winning strategies converged on fully convolutional, U-Net–style encoder–decoder architectures taking both FLAIR and T1 as inputs. Notable high-performing pipelines included:
- FCN Ensembles: Li et al. implemented deep 2D U-Nets (19 layers; initial kernels 5×5; five encoding/decoding levels) trained on two-channel axial slices (Li et al., 2018). Each network operated on preprocessed (bias-corrected, normalized, skull-stripped) data. Multiple models with different initializations and augmentation seeds were trained and ensembled by averaging outputs. This approach yielded the highest mean DSC (0.80), recall (0.84), and robust HD (6.3 mm) on the blind test set, with intra-scanner performance stability and competitive results even for unseen scanners (Li et al., 2018, Kuijf et al., 2019).
- Multi-Scale Aggregation: Stack-Net (Li et al., 2018) replaced the initial U-Net blocks with deep convolutional stacks (L × 3×3 or 5×5 convolutions) to preserve small-lesion features before pooling. Parallel Stack-Nets with different kernel sizes were fused post hoc by averaging. This model increased small-lesion recall by 8% over baseline ensembles and achieved top lesion recall and F1 metrics, highlighting the benefit of preserving fine detail for highly discontinuous WMH (Li et al., 2018).
- Multi-Sized Patch FCN Ensembles: Separate 3D FCNs were trained on patches of varying spatial extents (6×10×6 to 24×40×24), followed by a two-stage ensemble: small and large lesions were processed independently, and a specially-designed “SinAct” activation function in the ensemble-net stage improved the sharpness of thresholding in the final mask, thus optimizing DSC and reducing Hausdorff errors (Wang et al., 2018).
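The ensembling step common to the pipelines above, averaging per-model probability maps and then thresholding, can be sketched as follows (illustrative; each paper's exact fusion differs):

```python
import numpy as np

def ensemble_segment(prob_maps, threshold=0.5):
    """Fuse WMH probability maps from independently trained models by
    voxelwise averaging, then binarize at a fixed threshold."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return (mean_prob >= threshold).astype(np.uint8)
```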
More recent advances further integrated multimodal strategies, attention mechanisms, and robust training objectives (see Sections 4 and 5).
3. Key Innovations and Quantitative Performance
Innovative strategies that distinguish top-performing WMH segmentation methods include:
- Data Augmentation: Aggressive geometric transformations (rotation, scaling, shear, elastic deformation), photometric distortions, and simulated motion artifacts were crucial for cross-scanner generalization. Augmentation yielded measurable gains in HD95 (improvements on the order of 0.57–0.58 mm), recall, and F1-score (Li et al., 2018, Hassan et al., 30 Oct 2025).
- Architectural Adaptations: Large initial kernels (5×5) increased receptive field, aiding anatomical context capture. Multi-scale stacks or parallel branches furnished sensitivity to both fine and broad lesion morphologies (Li et al., 2018, Wang et al., 2018).
- Learning Objectives and Activation: Dice-based losses were universally adopted to accommodate severe class imbalance. The “SinAct” activation, a sine-based output nonlinearity, concentrates gradient flow around the binarization threshold, enabling sharper lesion boundaries in the final mask (Wang et al., 2018).
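The Dice-based objective referenced in the list can be written as a soft Dice loss; a minimal numpy sketch (real training code would use a framework's autograd, and the SinAct form is not reproduced here):

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss: 1 minus the Dice overlap between predicted
    probabilities and a binary target; eps guards against empty masks."""
    intersection = float((probs * target).sum())
    denom = float(probs.sum()) + float(target.sum()) + eps
    return 1.0 - (2.0 * intersection + eps) / denom
```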
Performance summary on the 110-case hidden test set (Kuijf et al., 2019, Hassan et al., 30 Oct 2025):
| Method | DSC | F1 | Recall | HD95 (mm) | lAVD (%) |
|---|---|---|---|---|---|
| FCN Ensembles (Li et al., 2018) | 0.80 | 0.76 | 0.84 | 6.3 | 21.9 |
| Stack-Net Aggregation | 0.80 | 0.77 | 0.87 | — | — |
| SYNAPSE-Net (Hassan et al., 30 Oct 2025) | 0.831 | 0.816 | 0.84 | 3.03 | 13.46 |
| ResU-Net (Jin et al., 2018) | 0.75 | 0.69 | 0.81 | 7.35 | 27.3 |
SYNAPSE-Net leveraged multi-stream CNNs with late fusion, Swin Transformer bottlenecks, adaptive cross-modal attention, and hierarchical gated decoding—achieving state-of-the-art performance with sharply reduced HD95 and variance (Hassan et al., 30 Oct 2025).
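A highly simplified sketch of the gated-fusion idea (not SYNAPSE-Net's actual implementation; the scalar gate parameters here stand in for learned, typically per-channel, weights):

```python
import numpy as np

def gated_fusion(flair_feat, t1_feat, w_gate=1.0, b_gate=0.0):
    """Blend FLAIR and T1 feature maps with a sigmoid gate; in a trained
    network w_gate and b_gate would be learned parameters."""
    gate = 1.0 / (1.0 + np.exp(-(w_gate * (flair_feat + t1_feat) + b_gate)))
    return gate * flair_feat + (1.0 - gate) * t1_feat
```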
4. Generalization Across Scanners and Domain Robustness
Explicit focus on cross-site/cross-scanner robustness characterizes the challenge. FCN ensemble models trained on multi-scanner datasets with intensive augmentation generalized successfully to both seen and unseen test scanners (DSC = 0.745, recall = 0.87 on new scanners; Li et al., 2018, Kuijf et al., 2019).
Domain generalization (DG) research extended this, employing methods such as:
- Domain-Adversarial Training (DANN): Feature extractors are trained adversarially to produce domain-invariant representations by fooling a domain discriminator attached early in the U-Net, minimizing H-divergence between source domains (Zhao et al., 2021).
- Mixup: Virtual samples constructed by convex combinations of input-label pairs from different domains encourage the model to ignore domain-specific cues (Zhao et al., 2021).
- MixDANN: Synergistic combination of DANN and mixup, yielding DSC improvements over vanilla U-Net baselines and sharp reductions in HD95 and absolute volume difference (Zhao et al., 2021).
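The mixup step described above can be sketched as follows (illustrative; alpha and the sampling scheme follow the standard mixup recipe rather than the exact settings of Zhao et al.):

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Form a virtual training sample as a convex combination of two
    (image, label) pairs, e.g. drawn from different scanner domains."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix
```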
A plausible implication is that robust multi-site WMH segmentation is critically dependent on architectural invariance, training strategy, and explicit domain generalization objectives.
5. Multimodality, Multi-task Learning, and Recent Model Extensions
Multimodal fusion consistently enhances segmentation accuracy. Concatenated FLAIR+T1 inputs improve DSC by 0.02–0.15 over unimodal pipelines, indicating the value of combining high lesion contrast (FLAIR) with anatomical context (T1) (Machnio et al., 27 Jun 2025, Li et al., 2018). Models supporting modality-agnostic inference (e.g., averaging softmax outputs in modality-missing cases) are robust to data incompleteness, a key clinical need (Machnio et al., 27 Jun 2025).
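The modality-missing strategy mentioned above (averaging softmax outputs from whichever branches are available) can be sketched as:

```python
import numpy as np

def modality_agnostic_predict(branch_softmaxes):
    """Average class-probability maps from the available modality branches;
    entries for missing modalities are passed as None and skipped."""
    available = [p for p in branch_softmaxes if p is not None]
    return np.mean(np.stack(available), axis=0)
```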
Recent efforts explored multi-task architectures for joint lesion and regional white matter segmentation. However, shared-encoder multi-task U-Nets underperformed their single-task counterparts (lesion DSC dropped from 0.74 to 0.43), suggesting representational conflict between coarse (lesion) and fine (anatomy) tasks and indicating a need for more sophisticated feature decoupling (Machnio et al., 27 Jun 2025). SYNAPSE-Net’s dynamic cross-modal attention and hierarchical gating effectively integrate multimodal and global contextual information (Hassan et al., 30 Oct 2025).
6. Lessons, Limitations, and Future Directions
The WMH Segmentation Challenge established that:
- Ensemble-based, robustly regularized U-Net architectures remain a performant foundation when combined with strong preprocessing and multimodal inputs.
- Difficulty-aware sampling, composite losses (e.g., Focal-Tversky + Boundary loss), and pathology-specific augmentations deliver state-of-the-art performance and consistency (Hassan et al., 30 Oct 2025).
Limitations across studies include heuristic component sizes for ensemble fusions, persistent trade-off between sensitivity to small lesions and false positives, and unresolved challenges in simultaneous multi-task learning (Wang et al., 2018, Machnio et al., 27 Jun 2025). Improvements may arise from:
- Task-decoupled encoders, end-to-end multi-scale feature pyramids, and learnable lesion grouping thresholds.
- Uncertainty estimation, domain-adaptive harmonization, and joint segmentation of WMH with other cerebrovascular pathologies.
- Incorporation of 3D context for improved volumetric consistency and clinical utility in large-scale, longitudinal cohorts.
The WMH Segmentation Challenge and subsequent research set a public, reproducible benchmark, driving algorithmic advances and providing a foundation for robust clinical translation (Kuijf et al., 2019, Hassan et al., 30 Oct 2025).