Fitzpatrick17k–C: Clean Benchmark for Dermatology

Updated 23 February 2026
  • Fitzpatrick17k–C is a benchmark dataset providing high-fidelity, curated annotations to evaluate cleaning methods in dermatological image data.
  • It employs a large-scale crowdsourcing and expert adjudication process to systematically identify off-topic images, near-duplicates, and label errors.
  • Standardized evaluation metrics reveal dramatic post-cleaning performance drops, underscoring the need for rigorous data cleaning in medical AI.

Fitzpatrick17k–C, frequently referred to as CleanPatrick, is a rigorously curated benchmark dataset for image data cleaning, built upon the original Fitzpatrick17k dermatology corpus. Designed to enable systematic comparison of data cleaning algorithms in the presence of realistic, naturally occurring noise, Fitzpatrick17k–C provides high-fidelity, expert-validated issue annotations for off-topic content, near-duplicates, and label errors. In both scope and methodology, Fitzpatrick17k–C addresses key deficiencies in synthetic or narrowly scoped cleaning benchmarks and is an essential resource for data-centric artificial intelligence in medical imaging (Gröger et al., 16 May 2025, Abhishek et al., 2024).

1. Origin and Motivation

The original Fitzpatrick17k dataset comprises 16,577 clinical photographs of cutaneous disease from DermNet and Atlas Dermatológico, labeled into 114 diagnostic classes via weak supervision (atlas captions) and annotated with Fitzpatrick skin types (I–VI). Despite its wide adoption for skin lesion analysis, Fitzpatrick17k is characterized by substantial contamination: naturally occurring data quality issues span off-topic images, label misassignments, and duplicate representations, affecting ~16–30% of the data. Notably, Groh et al. did not provide a true held-out test split, compounding the risk of data leakage and evaluation bias (Gröger et al., 16 May 2025, Abhishek et al., 2024).

Fitzpatrick17k–C was constructed as the first exhaustively annotated, realistically contaminated benchmark for data cleaning in the image domain, specifically addressing:

  • Real-world distribution of off-topic samples, near-duplicates, and mislabels.
  • The need for standardized evaluation and ranking metrics for cleaning methods across these issue types.
  • Improved reproducibility and comparability for downstream model training and benchmarking.

2. Exhaustive Annotation and Cleaning Pipeline

Medical Crowdsourcing and Expert Adjudication

A critical methodological innovation is the large-scale annotation campaign executed through Centaur Labs / DiagnosUs, engaging 933 screened medical crowd workers. For each image (or image pair), binary yes/no annotations were collected across three orthogonal tasks:

  • Off-topic detection: “Should this image be in a skin-lesion dataset?”
  • Near-duplicate detection: “Are these two images near-duplicates?”
  • Label-error detection: “Is the assigned diagnosis clearly wrong?”

In total, 496,377 binary annotations were amassed, corresponding to an average of ~10 votes per sample (range 1–225). Definitions and adjudicated prevalence rates, post-aggregation and expert calibration, are summarized below:

| Issue type | Definition | Prevalence |
|----------------|--------------------------------------------------------------|----------------------|
| Off-topic | Not a human skin lesion (e.g., slides, histology, text) | 613 / 16,577 ≈ 4% |
| Near-duplicate | Alternate captures of a lesion (thumbnails, angle, lighting) | 3,556 / 16,577 ≈ 21% |
| Label error | Clearly incorrect diagnosis given image content | 3,666 / 16,577 ≈ 22% |

Expert validation was performed by three board-certified dermatologists on stratified random samples (400 items per issue type), yielding Krippendorff’s α of 0.91 (near-duplicate), 0.60 (off-topic), and 0.42 (label error). Agreement between crowd consensus and expert review was 96% for off-topic/near-duplicate and 67% for label errors.

IRT-Inspired Aggregation

Final labels were computed using a generative model adapted from item-response theory (GLAD):

P(y_{a,i} = 1 \mid c_a, b_i) = \sigma(c_a\, b_i)

where $y_{a,i}$ is the binary label from annotator $a$ for item $i$, $c_a$ encodes annotator ability, $b_i$ item difficulty/orientation ($c_a \sim N(0,1)$, $b_i \sim N(0, \sigma_b^2)$, $\sigma_b = 10^3$), and $\sigma(\cdot)$ is the sigmoid function. Posterior inference was performed via variational mean-field optimization (Adam, lr = 0.1, 10,000 steps, Pyro framework). The final positive-class probability $\bar p_i$ was computed as

\bar p_i = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}\bigl[b_i^{(m)} > 0\bigr]

where $M = 1{,}000$ posterior samples. Binary labels were assigned by thresholding $\bar p_i$ at expert-calibrated values $t_{OT}$, $t_{ND}$, and $t_{LE}$ for off-topic, near-duplicate, and label error, respectively (Gröger et al., 16 May 2025).
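As a minimal illustration of the final aggregation step (not the full Pyro model), the sketch below computes $\bar p_i$ from posterior samples of $b_i$ and thresholds it; the sample values and the threshold of 0.5 are toy assumptions for demonstration only.

```python
import numpy as np

def positive_probability(b_samples):
    """Fraction of posterior samples with item parameter b_i > 0
    (the paper's estimator of the positive-class probability)."""
    return (np.asarray(b_samples) > 0).mean(axis=0)

def binarize(p_bar, threshold):
    """Threshold the positive-class probability at an (expert-calibrated) value."""
    return (p_bar >= threshold).astype(int)

# Toy example: M = 1000 posterior draws for 3 hypothetical items whose
# posteriors are centered at -2, 0, and +2 respectively.
rng = np.random.default_rng(0)
b = rng.normal(loc=[[-2.0, 0.0, 2.0]], scale=1.0, size=(1000, 3))
p_bar = positive_probability(b)
labels = binarize(p_bar, threshold=0.5)
```

In the real pipeline the threshold differs per issue type ($t_{OT}$, $t_{ND}$, $t_{LE}$) and is calibrated against expert review rather than fixed at 0.5.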

3. Dataset Structure, Splits, and Statistics

Fitzpatrick17k–C retains all 16,577 de-duplicated images as the starting point for its annotation framework; however, the alternate pipeline in (Abhishek et al., 2024) applies aggressive duplicate/outlier elimination for a downstream subset with 11,394 images and standardized 70:10:20 train:val:test splits by diagnosis label. CleanPatrick provides the following metadata per image:

  • "image_path" or "image" (PIL object/array)
  • "diagnosis" (class label in $\{1, \dots, 114\}$)
  • "skin_type" (I–VI)
  • "off_topic_score", "off_topic" (probability, binary flag)
  • "near_duplicate_score", "near_duplicate"
  • "label_error_score", "label_error"
  • "duplicate_group_id" (near-duplicate cluster membership)
  • Raw crowd voting table (optional)

No new train/val/test split is imposed by CleanPatrick itself; the original Fitzpatrick17k’s splits (~80% train, ~10% val, ~10% test) can be adopted. The distributions of diagnostic classes and Fitzpatrick skin types are approximately preserved by stratified sampling, e.g., in one split $[19.81, 19.26, 22.64, 25.84, 28.78, 24.68]\%$ of test images across skin types I–VI (Abhishek et al., 2024).
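A 70:10:20 diagnosis-stratified split like the one used for the cleaned subset can be sketched as follows; the toy labels and the scikit-learn two-stage approach are illustrative assumptions, not the released split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy metadata: 100 items drawn from 4 diagnosis classes.
rng = np.random.default_rng(42)
idx = np.arange(100)
diagnosis = rng.integers(0, 4, size=100)

# 70:10:20 train:val:test split stratified by diagnosis. Two stages are
# needed because train_test_split only produces binary partitions.
train_idx, rest_idx = train_test_split(
    idx, test_size=0.30, stratify=diagnosis, random_state=0)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=2/3, stratify=diagnosis[rest_idx], random_state=0)
```

Stratifying both stages keeps each class's proportion approximately constant across all three partitions, which is what preserves the diagnosis and skin-type distributions noted above.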

4. Benchmarking Methodology and Performance

CleanPatrick formalizes data cleaning as a ranking problem. For each contamination type, a method $f$ assigns a real-valued score $s(x)$ (or $s(x, x')$ for pairs), and practitioners audit the top-$k$ items.

Key evaluation metrics:

  • Precision@k: $P@k = \frac{1}{k} \sum_{i=1}^{k} \mathbb{I}[\hat y_{(i)} = 1]$
  • Recall@k: $R@k = \frac{1}{P^+} \sum_{i=1}^{k} \mathbb{I}[\hat y_{(i)} = 1]$
  • Average Precision: $\mathrm{AP} = \sum_{k=1}^{N} P(k)\,\Delta r(k)$, with $\Delta r(k) = \mathbb{I}[\hat y_{(k)} = 1] / P^+$
  • AUROC: area under the true vs. false positive curve
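These ranking metrics are straightforward to implement; the sketch below follows the definitions above, with hypothetical toy scores and issue labels.

```python
import numpy as np

def precision_at_k(scores, labels, k):
    """P@k: fraction of true issues among the k highest-scored items."""
    order = np.argsort(scores)[::-1]
    return labels[order][:k].mean()

def recall_at_k(scores, labels, k):
    """R@k: fraction of all true issues recovered in the top k."""
    order = np.argsort(scores)[::-1]
    return labels[order][:k].sum() / labels.sum()

def average_precision(scores, labels):
    """AP = sum_k P(k) * delta-r(k), i.e. precision accumulated at each hit."""
    order = np.argsort(scores)[::-1]
    y = labels[order]
    precisions = np.cumsum(y) / np.arange(1, len(y) + 1)
    return (precisions * y).sum() / y.sum()

# Toy ranking: two true issues, one ranked first and one ranked third.
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
labels = np.array([1, 0, 1, 0, 0])
```

Here $P@2 = 0.5$ and $\mathrm{AP} = (1 + 2/3)/2 \approx 0.83$, illustrating how AP rewards placing true issues near the top of the audit queue.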

Selected baseline results (Gröger et al., 16 May 2025):

| Issue type | Prevalence | Best method (AUROC, AP / P@100) |
|----------------|------------|---------------------------------------------------|
| Off-topic | 3.7% | IForest AUROC = 0.77, AP = 0.16; SelfClean P@100 = 0.52 |
| Near-duplicate | 21.4% | SelfClean AUROC = 0.92, AP = 0.88; pHash/SSIM ≈ 0.50 |
| Label error | 22.1% | SelfClean AUROC = 0.57, AP = 0.27; NoiseRank ≈ 0.48 |

Self-supervised representation-based approaches excel at near-duplicate detection; classical methods (e.g., IForest) achieve competitive off-topic detection under constrained annotation budgets. Label-error detection remains challenging in fine-grained medical tasks.

Notably, post-cleaning performance drops dramatically: for VGG-16 on random stratified splits, accuracy on Fitzpatrick17k–C decreases from 67.3% (original, contaminated data) to 22.25% ± 0.60% after cleaning (ΔAcc = −45.05 pp), confirming severe benchmark inflation due to prior contamination. All skin-tone subgroups show 5–30 pp accuracy drops (Abhishek et al., 2024).

5. Cleaning Methodology and Best Practices

The cleaning methodology applied in (Abhishek et al., 2024) employs metric-learning (fastdup, cleanvision) for initial duplicate discovery, cosine similarity clustering via union-find, and rigorous manual/algorithmic conflict resolution on label and skin type. Clusters with homogeneous labels retain only the highest-resolution representative; those with label conflicts are entirely removed. Erroneous or outlier images are identified via nearest-neighbor similarity thresholds and removed. A stratified split into train:val:test sets ensures label/site balance and eliminates data leakage.

Best practices derived from this process include:

  • Use self-supervised metrics for duplicate detection and verify candidates above similarity thresholds (e.g., τ ≥ 0.95).
  • Apply union-find for cluster merging and strict removal of label-inconsistent clusters.
  • Remove non-dermatological and structurally outlier images before model evaluation.
  • Release mapping metadata, code, and standardized splits for reproducibility.
  • Quantify the cleaning effect using ΔAcc (cleaned versus original test accuracy).
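The union-find merging and label-conflict removal described above can be sketched as follows; the image indices, similar pairs, and diagnosis labels are toy assumptions, and representative selection by resolution is left as a later step.

```python
class UnionFind:
    """Minimal union-find for merging near-duplicate pairs into clusters."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def duplicate_clusters(n_images, similar_pairs, labels):
    """Merge high-similarity pairs, then split clusters into label-homogeneous
    ones (keep a representative) and label-conflicting ones (remove whole)."""
    uf = UnionFind(n_images)
    for a, b in similar_pairs:
        uf.union(a, b)
    clusters = {}
    for i in range(n_images):
        clusters.setdefault(uf.find(i), []).append(i)
    keep, drop = [], []
    for members in clusters.values():
        if len({labels[m] for m in members}) == 1:
            keep.append(members)   # homogeneous labels: keep one representative
        else:
            drop.append(members)   # label conflict: remove entire cluster
    return keep, drop

# Toy example: images 0-1-2 are duplicates with matching labels,
# 3-4 are duplicates with conflicting labels, 5 is a singleton.
keep, drop = duplicate_clusters(
    6, [(0, 1), (1, 2), (3, 4)],
    ["eczema", "eczema", "eczema", "psoriasis", "acne", "acne"])
```

Union-find makes cluster merging near-linear in the number of candidate pairs, which is why it scales to the thousands of near-duplicate pairs flagged in this dataset.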

6. Access, Usage, and Community Impact

Fitzpatrick17k–C is hosted on Hugging Face Datasets, facilitating standardized programmatic access:

from datasets import load_dataset
ds = load_dataset("Digital-Dermatology/CleanPatrick")
print(ds)  # shows splits (usually 'train' with all images)
sample = ds["train"][0]
print(sample.keys())

All images are distributed as PIL.Image objects or file paths, with scores and binary flags per contamination type, and near-duplicate clusters indexed by duplicate_group_id. Release of this resource enables transparent experimental protocols and high-fidelity benchmarking for data-centric workflows in dermatology AI and can inform broader cleaning efforts in other image domains (Gröger et al., 16 May 2025).
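A common downstream use is filtering a clean subset with the per-image flags. The records below are toy stand-ins for rows of the real dataset (field names taken from the metadata list above; in practice they come from `load_dataset("Digital-Dermatology/CleanPatrick")`).

```python
# Toy stand-in records carrying the three binary issue flags.
records = [
    {"image_path": "a.jpg", "off_topic": 0, "near_duplicate": 0, "label_error": 0},
    {"image_path": "b.jpg", "off_topic": 1, "near_duplicate": 0, "label_error": 0},
    {"image_path": "c.jpg", "off_topic": 0, "near_duplicate": 1, "label_error": 0},
    {"image_path": "d.jpg", "off_topic": 0, "near_duplicate": 0, "label_error": 1},
]

ISSUE_FLAGS = ("off_topic", "near_duplicate", "label_error")

def clean_subset(rows):
    """Drop every image flagged for any of the three issue types."""
    return [r for r in rows if not any(r[f] for f in ISSUE_FLAGS)]

clean = clean_subset(records)
```

The same predicate can be passed to `datasets`' `filter` method to materialize the cleaned corpus directly from the Hugging Face release.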

A plausible implication is that systematic cleaning of other large-scale medical datasets could reveal similar inflation in reported benchmarks and necessitate recalibration of progress in medical AI model evaluation. The rigorous, expert-grounded annotation and aggregation framework employed here establishes a standard for future work in data-centric machine learning and medical imaging corpus curation.
