Fitzpatrick17k–C: Clean Benchmark for Dermatology
- Fitzpatrick17k–C is a benchmark dataset providing high-fidelity, curated annotations to evaluate cleaning methods in dermatological image data.
- It employs a large-scale crowdsourcing and expert adjudication process to systematically identify off-topic images, near-duplicates, and label errors.
- Standardized evaluation metrics reveal dramatic post-cleaning performance drops, underscoring the need for rigorous data cleaning in medical AI.
Fitzpatrick17k–C, frequently referred to as CleanPatrick, is a rigorously curated benchmark dataset for image data cleaning, built upon the original Fitzpatrick17k dermatology corpus. Designed to enable systematic comparison of data cleaning algorithms in the presence of realistic, naturally occurring noise, Fitzpatrick17k–C provides high-fidelity, expert-validated issue annotations for off-topic content, near-duplicates, and label errors. In both scope and methodology, Fitzpatrick17k–C addresses key deficiencies in synthetic or narrowly scoped cleaning benchmarks and is an essential resource for data-centric artificial intelligence in medical imaging (Gröger et al., 16 May 2025, Abhishek et al., 2024).
1. Origin and Motivation
The original Fitzpatrick17k dataset comprises 16,577 clinical photographs of cutaneous disease from DermNet and Atlas Dermatológico, labeled into 114 diagnostic classes via weak supervision (atlas captions) and annotated with Fitzpatrick skin types (I–VI). Despite its wide adoption for skin lesion analysis, Fitzpatrick17k is characterized by substantial contamination: naturally occurring data quality issues span off-topic images, label misassignments, and duplicate representations, affecting ~16–30% of the data. Notably, Groh et al. did not provide a true held-out test split, compounding the risk of data leakage and evaluation bias (Gröger et al., 16 May 2025, Abhishek et al., 2024).
Fitzpatrick17k–C was constructed as the first exhaustively annotated, realistically contaminated benchmark for data cleaning in the image domain, specifically addressing:
- Real-world distribution of off-topic samples, near-duplicates, and mislabels.
- The need for standardized evaluation and ranking metrics for cleaning methods across these issue types.
- Improved reproducibility and comparability for downstream model training and benchmarking.
2. Exhaustive Annotation and Cleaning Pipeline
Medical Crowdsourcing and Expert Adjudication
A critical methodological innovation is the large-scale annotation campaign executed through Centaur Labs / DiagnosUs, engaging 933 screened medical crowd workers. For each image (or image pair), binary yes/no annotations were collected across three orthogonal tasks:
- Off-topic detection: “Should this image be in a skin-lesion dataset?”
- Near-duplicate detection: “Are these two images near-duplicates?”
- Label-error detection: “Is the assigned diagnosis clearly wrong?”
In total, 496,377 binary annotations were amassed, corresponding to an average of ~10 votes per sample (range 1–225). Definitions and adjudicated prevalence rates, post-aggregation and expert calibration, are summarized below:
| Issue Type | Definition | Prevalence |
|---|---|---|
| Off-topic | Not a human skin lesion (e.g., slides, histology, text, etc.) | 613 / 16,577 ≈ 4% |
| Near-duplicate | Alternate captures of a lesion (thumbnails, angle, lighting) | 3,556 / 16,577 ≈ 21% |
| Label error | Clearly incorrect diagnosis given image content | 3,666 / 16,577 ≈ 22% |
Expert validation was performed by three board-certified dermatologists on stratified random samples (400 items per issue type), yielding Krippendorff's α of 0.91 (near-duplicate), 0.60 (off-topic), and 0.42 (label errors). Agreement between crowd consensus and expert review was 96% for off-topic/near-duplicate and 67% for label errors.
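Inter-rater reliability of this kind is measured with Krippendorff's α. As an illustration only (not the authors' code), a minimal implementation for nominal data, with units as rows and raters as columns, might look like:

```python
import numpy as np

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    ratings: 2D list/array of shape (units, raters); np.nan marks missing votes.
    """
    ratings = np.asarray(ratings, dtype=float)
    values = np.unique(ratings[~np.isnan(ratings)])
    k = len(values)
    o = np.zeros((k, k))  # coincidence matrix
    for unit in ratings:
        vals = unit[~np.isnan(unit)]
        m = len(vals)
        if m < 2:
            continue  # units with a single rating carry no pairing information
        for a in range(m):
            for b in range(m):
                if a != b:
                    i = int(np.where(values == vals[a])[0][0])
                    j = int(np.where(values == vals[b])[0][0])
                    o[i, j] += 1.0 / (m - 1)
    n_c = o.sum(axis=1)
    n = n_c.sum()
    d_o = n - np.trace(o)                      # observed disagreement
    d_e = (n ** 2 - (n_c ** 2).sum()) / (n - 1)  # expected disagreement
    return 1.0 - d_o / d_e
```

For example, four items rated by two annotators with one disagreement, `[[1, 1], [1, 0], [0, 0], [0, 0]]`, yield α = 8/15 ≈ 0.53; perfect agreement yields α = 1.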
IRT-Inspired Aggregation
Final labels were computed using a generative model adapted from item-response theory (GLAD):

$$P(y_{ij} = z_j \mid \alpha_i, \beta_j) = \sigma(\alpha_i \beta_j),$$

where $y_{ij}$ is the binary label from annotator $i$ for item $j$, $z_j$ is the latent true label, $\alpha_i$ encodes annotator ability, $\beta_j$ item difficulty/orientation, and $\sigma$ is the sigmoid function. Posterior inference was performed via variational mean-field optimization (Adam, lr=0.1, 10,000 steps, Pyro framework). The final positive-class probability was computed as

$$p_j = \frac{1}{S} \sum_{s=1}^{S} z_j^{(s)},$$

where $z_j^{(s)}$ are the $S$ posterior samples. Binary labels were assigned by thresholding $p_j$ at expert-calibrated values for off-topic, near-duplicate, and label-error respectively (Gröger et al., 16 May 2025).
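To illustrate how the GLAD likelihood turns individual votes into a consensus probability, the following sketch applies Bayes' rule for a single item given already-fitted annotator abilities. The ability values are hypothetical; this is not the paper's Pyro implementation, which infers $\alpha_i$ and $\beta_j$ jointly via variational inference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def posterior_true_label(votes, alpha, beta, prior=0.5):
    """P(z = 1 | votes) for one item under the GLAD likelihood.

    votes: binary votes y_i (0/1), one per annotator
    alpha: per-annotator ability parameters (hypothetical, assumed fitted)
    beta:  item difficulty/orientation parameter
    """
    p_correct = sigmoid(alpha * beta)  # P(y_i = z) for each annotator
    # Likelihood of the observed votes under each latent label
    like_z1 = np.where(votes == 1, p_correct, 1 - p_correct).prod()
    like_z0 = np.where(votes == 0, p_correct, 1 - p_correct).prod()
    return prior * like_z1 / (prior * like_z1 + (1 - prior) * like_z0)

votes = np.array([1, 1, 0, 1])
alpha = np.array([2.0, 1.5, 0.2, 1.0])  # hypothetical fitted abilities
p = posterior_true_label(votes, alpha, beta=1.0)
```

Note how the single dissenting vote comes from the lowest-ability annotator (α = 0.2), so the posterior remains strongly positive.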
3. Dataset Structure, Splits, and Statistics
Fitzpatrick17k–C retains all 16,577 de-duplicated images as the starting point for its annotation framework; however, the alternate pipeline in (Abhishek et al., 2024) applies aggressive duplicate/outlier elimination for a downstream subset with 11,394 images and standardized 70:10:20 train:val:test splits by diagnosis label. CleanPatrick provides the following metadata per image:
- "image_path" or "image" (PIL object/array)
- "diagnosis" (class label in )
- "skin_type" (I–VI)
- "off_topic_score", "off_topic" (probability, binary flag)
- "near_duplicate_score", "near_duplicate"
- "label_error_score", "label_error"
- "duplicate_group_id" (near-duplicate cluster membership)
- Raw crowd voting table (optional)
No new train/val/test split is imposed by CleanPatrick itself; the original Fitzpatrick17k's splits (~80% train, ~10% val, ~10% test) can be adopted. Stratified sampling approximately preserves the distributions of diagnostic classes and Fitzpatrick skin types (I–VI) across splits (Abhishek et al., 2024).
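Given the per-image metadata fields listed above, a cleaned subset can be derived by dropping flagged items and keeping one representative per near-duplicate cluster. A minimal sketch over plain record dictionaries (the field names follow the metadata list; the selection policy here is illustrative, not the paper's resolution-based choice):

```python
def clean_subset(records):
    """Drop off-topic and label-error images; keep one image per duplicate group.

    records: iterable of dicts with keys "off_topic", "label_error",
             and "duplicate_group_id" (None when the image has no duplicates).
    """
    seen_groups = set()
    kept = []
    for r in records:
        if r["off_topic"] or r["label_error"]:
            continue  # flagged by the benchmark annotations
        gid = r.get("duplicate_group_id")
        if gid is not None:
            if gid in seen_groups:
                continue  # another member of this cluster is already kept
            seen_groups.add(gid)
        kept.append(r)
    return kept
```

In practice the representative within a cluster would be chosen by image resolution, as in the cleaning pipeline described below, rather than by iteration order.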
4. Benchmarking Methodology and Performance
CleanPatrick formalizes data cleaning as a ranking problem. For each contamination type, a method assigns a real-valued score $s(x_i)$ to each item (or $s(x_i, x_j)$ to each pair), and practitioners audit the top-$k$ ranked items.
Key evaluation metrics:
- Precision@k: $\mathrm{P@}k = \frac{1}{k}\sum_{i=1}^{k} y_{(i)}$, where $y_{(i)}$ is the ground-truth issue flag of the $i$-th highest-scored item
- Recall@k: $\mathrm{R@}k = \frac{1}{N_{\mathrm{pos}}}\sum_{i=1}^{k} y_{(i)}$, where $N_{\mathrm{pos}}$ is the total number of true issues
- Average Precision (AP): $\mathrm{AP} = \frac{1}{N_{\mathrm{pos}}}\sum_{k=1}^{N} \mathrm{P@}k \cdot y_{(k)}$
- AUROC: area under the true-positive-rate versus false-positive-rate curve
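These ranking metrics can be implemented in a few lines of NumPy; the sketch below assumes binary ground-truth flags and uses the rank-based (Mann–Whitney U) formulation for AUROC, which ignores score ties:

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Fraction of true issues among the k highest-scored items."""
    order = np.argsort(-scores)
    return labels[order][:k].mean()

def recall_at_k(scores, labels, k):
    """Fraction of all true issues recovered within the top k."""
    order = np.argsort(-scores)
    return labels[order][:k].sum() / labels.sum()

def average_precision(scores, labels):
    """Mean of P@k over the ranks k where a true issue appears."""
    y = labels[np.argsort(-scores)]
    precisions = np.cumsum(y) / np.arange(1, len(y) + 1)
    return (precisions * y).sum() / y.sum()

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For a toy ranking with scores `[0.9, 0.8, 0.3, 0.1]` and issue flags `[1, 0, 1, 0]`, this gives P@2 = 0.5, R@2 = 0.5, AP ≈ 0.83, and AUROC = 0.75.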
Selected baseline results (Gröger et al., 16 May 2025):
| Issue Type | Prevalence | Best Method (AUROC, AP/Prec@100) |
|---|---|---|
| Off-topic | 3.7% | IForest AUROC=0.77, AP=0.16; SelfClean P@100=0.52 |
| Near-duplicate | 21.4% | SelfClean AUROC=0.92, AP=0.88; pHash/SSIM ≈0.50 |
| Label error | 22.1% | SelfClean AUROC=0.57, AP=0.27; NoiseRank ≈0.48 |
Self-supervised representation-based approaches excel at near-duplicate detection; classical methods (e.g., IForest) achieve competitive off-topic detection under constrained annotation budgets. Label-error detection remains challenging in fine-grained medical tasks.
Notably, post-cleaning performance drops dramatically: for VGG-16 on random stratified splits, test accuracy on Fitzpatrick17k–C decreases substantially once the contaminated data are cleaned, confirming severe benchmark inflation due to prior contamination. All skin-tone subgroups show 5–30 pp accuracy drops (Abhishek et al., 2024).
5. Cleaning Methodology and Best Practices
The cleaning methodology applied in (Abhishek et al., 2024) employs metric-learning (fastdup, cleanvision) for initial duplicate discovery, cosine similarity clustering via union-find, and rigorous manual/algorithmic conflict resolution on label and skin type. Clusters with homogeneous labels retain only the highest-resolution representative; those with label conflicts are entirely removed. Erroneous or outlier images are identified via nearest-neighbor similarity thresholds and removed. A stratified split into train:val:test sets ensures label/site balance and eliminates data leakage.
Best practices derived from this process include:
- Use self-supervised metrics for duplicate detection and verify candidates above a chosen cosine-similarity threshold.
- Apply union-find for cluster merging and strict removal of label-inconsistent clusters.
- Remove non-dermatological and structurally outlier images before model evaluation.
- Release mapping metadata, code, and standardized splits for reproducibility.
- Quantify the cleaning effect as the difference between test accuracy on the cleaned and the original data.
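The union-find clustering step above can be sketched concisely. This is an illustrative implementation under the stated assumptions (L2-normalised embeddings, a fixed cosine-similarity threshold), not the fastdup/cleanvision internals; the quadratic pairwise loop would be replaced by approximate nearest-neighbour search at Fitzpatrick17k scale.

```python
import numpy as np

class UnionFind:
    """Disjoint-set structure with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def duplicate_clusters(embeddings, threshold=0.96):
    """Group images whose embedding cosine similarity exceeds threshold."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    n = len(emb)
    uf = UnionFind(n)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                uf.union(i, j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values())
```

Each returned cluster would then be resolved as described above: keep the highest-resolution member if its labels agree, or drop the cluster entirely on label conflict.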
6. Access, Usage, and Community Impact
Fitzpatrick17k–C is hosted on Hugging Face Datasets, facilitating standardized programmatic access:
```python
from datasets import load_dataset

ds = load_dataset("Digital-Dermatology/CleanPatrick")
print(ds)  # shows splits (usually 'train' with all images)
sample = ds["train"][0]
print(sample.keys())
```
All images are distributed as PIL.Image objects or file paths, with scores and binary flags per contamination type, and near-duplicate clusters indexed by duplicate_group_id. Release of this resource enables transparent experimental protocols and high-fidelity benchmarking for data-centric workflows in dermatology AI and can inform broader cleaning efforts in other image domains (Gröger et al., 16 May 2025).
A plausible implication is that systematic cleaning of other large-scale medical datasets could reveal similar inflation in reported benchmarks and necessitate recalibration of progress in medical AI model evaluation. The rigorous, expert-grounded annotation and aggregation framework employed here establishes a standard for future work in data-centric machine learning and medical imaging corpus curation.