DERM12345 Dermatoscopic Imaging Dataset
- DERM12345 is a comprehensive, multi-institutional dataset featuring 12,345 high-resolution dermatoscopic images annotated via a four-tier hierarchical diagnostic taxonomy.
- The dataset supports a range of tasks from binary malignancy detection to 40-way fine-grained subclass classification, enabling rigorous evaluation of machine learning models.
- Collected over 12 years across tertiary clinics in Türkiye, it includes expert annotations, stratified splits, and preprocessing guidelines for robust deep learning applications.
The DERM12345 dataset is a multi-institutional dermatoscopic image resource comprising 12,345 high-resolution images of skin lesions, annotated across a hierarchical diagnostic taxonomy. Designed to enable nuanced evaluation of machine learning models in dermatological imaging, DERM12345 supports both coarse malignancy detection and fine-grained differential diagnosis with 40 annotated lesion subclasses. Data collection spanned 12 years across tertiary clinics in Türkiye, encompassing a diverse patient population and multiple imaging modalities. The dataset serves as the basis for advanced benchmarking, including hierarchical model evaluation frameworks in recent foundation model studies (Yilmaz et al., 2024; Yuceyalcin et al., 18 Jan 2026).
1. Dataset Structure and Taxonomy
DERM12345 contains dermatoscopic photographs, each corresponding to a unique lesion, sampled from 2008 to 2020 in Manisa and Istanbul. The image resolutions are device-dependent (ranging from 2000 × 1500 to 3840 × 2160 pixels), requiring downstream resizing for deep learning pipelines.
Images are annotated according to a four-tier clinical taxonomy:
| Level | # Labels | Examples |
|---|---|---|
| Subclass | 40 | Acral Nodular Melanoma, Seborrheic Keratosis, BCC, SCC |
| Main Class | 15 | Compound Nevus, Melanoma |
| Superclass (4 types) | 4 | Melanocytic Benign/Malignant, Non-melanocytic Benign/Malignant |
| Binary Malignancy | 2 | Malignant, Benign |
The subclass definitions enable fine-grained tasks crucial for differential diagnosis, moving beyond traditional binary (malignant vs benign) paradigms. These tiers underpin the hierarchical benchmarking protocols instituted in downstream studies (Yuceyalcin et al., 18 Jan 2026).
2. Data Acquisition and Sources
Image acquisition utilized a combination of MoleMax HD, FotoFinder® videodermatoscopes, and 3Gen DermLite DL4 devices attached to either mobile or DSLR cameras. All cases originated from clinical workflows in three dermatology centers:
- Celal Bayar University, Manisa
- Istinye University – Liv Hospital Vadistanbul, Istanbul
- University of Health Sciences – Haydarpaşa Numune Hospital, Istanbul
No publicly available datasets were included, ensuring patient diversity typical of the Europe–Asia transition zone (Fitzpatrick skin types II–IV). Modalities included both polarized and non-polarized dermoscopy.
3. Annotation Protocol and Expert Agreement
Initial annotation combined automated extraction and manual metadata review, performed by trained engineers. Board-certified dermatologists with over 20 years of dermoscopy experience (G.G., S.P.Y.) rendered consensus diagnoses on all cases. Malignant lesions were mandatorily biopsy-confirmed; benign and dysplastic diagnoses were validated by either ≥ 2 years of digital follow-up or clinical consensus.
Discrepancies were adjudicated directly; a formal Cohen's κ calculation is possible by random subsampling:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ and $p_e$ denote observed and expected agreement, respectively.
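As a minimal sketch of the agreement calculation, the following computes Cohen's κ from two raters' labels on a subsample. The function name `cohen_kappa` and the toy labels are illustrative assumptions and are not part of the dataset release.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap implied by each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (hypothetical, for illustration only).
a = ["nevus", "nevus", "melanoma", "bcc", "nevus", "melanoma"]
b = ["nevus", "melanoma", "melanoma", "bcc", "nevus", "nevus"]
kappa = cohen_kappa(a, b)
print(f"kappa = {kappa:.3f}")
```

In practice the two label lists would come from independent reads of the same randomly drawn subsample of images.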
4. Metadata Schema and File Format
All images are stored in JPEG (.jpg, 8-bit) with standardized naming:
DERM38_<CenterCode><DeviceCode><PatientID>_<ImageID>.jpg
Metadata is maintained in CSV, with these principal fields:
- file_name
- super_class
- main_class
- subclass_label
- patient_id (anonymized)
- lesion_location (e.g., "dorsum of hand")
- device_type (e.g., "MoleMaxHD")
- capture_date
- biopsy_confirmed (yes/no)
- follow_up_months
This rich schema supports flexible downstream stratification and cohort selection.
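A cohort selection over these fields might look like the sketch below. The column names follow the schema above, but the rows, the `"DermLite"` device string, and the 24-month threshold (standing in for the two-year follow-up criterion) are assumptions made for illustration.

```python
import pandas as pd

# Toy rows mimicking the metadata schema; every value here is invented.
meta = pd.DataFrame({
    "file_name": ["img_001.jpg", "img_002.jpg", "img_003.jpg", "img_004.jpg"],
    "subclass_label": ["melanoma_a", "nevus_a", "nevus_a", "nevus_b"],
    "device_type": ["MoleMaxHD", "MoleMaxHD", "DermLite", "DermLite"],
    "biopsy_confirmed": ["yes", "no", "no", "yes"],
    "follow_up_months": [0, 30, 6, 0],
})

# Keep cases verified by biopsy or by at least 24 months of follow-up.
biopsy = meta["biopsy_confirmed"].str.lower().eq("yes")
followed = meta["follow_up_months"] >= 24
verified = meta[biopsy | followed]

# Stratify the selected cohort by acquisition device.
per_device = verified.groupby("device_type").size()
print(per_device)
```

With the real CSV, `meta` would instead be loaded via `pd.read_csv`, and the same boolean masks apply unchanged if the columns are encoded as assumed here.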
5. Data Splitting and Preprocessing
A stratified data partition is recommended to preserve subclass frequency distribution. Commonly implemented splits are:
- Training: 70% of images
- Validation: 15%
- Test: 15%
Alternatively, dedicated benchmarks employ a 9,860/2,485 split (train/test) with five-fold cross-validation on the training set (stratification by subclass label) (Yuceyalcin et al., 18 Jan 2026).
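The five-fold, subclass-stratified cross-validation can be sketched with scikit-learn's `StratifiedKFold`. The labels below are synthetic stand-ins with an imbalanced class distribution; the class counts and the random seed are assumptions, not values from the benchmark.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels standing in for subclass labels of a training set.
class_counts = [100, 60, 30, 10]
labels = np.repeat(np.arange(len(class_counts)), class_counts)
rng = np.random.default_rng(0)
rng.shuffle(labels)
indices = np.arange(len(labels))

# Five folds, each preserving the per-class proportions of the full set.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(indices, labels))

for fold_id, (train_idx, val_idx) in enumerate(folds):
    val_counts = np.bincount(labels[val_idx], minlength=len(class_counts))
    print(fold_id, len(train_idx), len(val_idx), val_counts.tolist())
```

A subclass with fewer than five images cannot be represented in every validation fold, which matters for the rarest categories.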
Preprocessing steps for deep learning include resizing images to target resolutions (e.g., 224×224 or 256×256), followed by normalization (ImageNet mean/std or model-specific constants). During benchmarking, embeddings are precomputed from fixed crops; no real-time augmentation or cropping is utilized at the embedding step.
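A minimal sketch of the resize-and-normalize step, using Pillow and NumPy, is given below. The `preprocess` helper is hypothetical, the constants are the widely used ImageNet statistics, and the direct resize to a square (which changes the aspect ratio) is a simplifying assumption rather than the benchmark's documented cropping procedure.

```python
import numpy as np
from PIL import Image

# Commonly used ImageNet channel statistics (RGB order, values in [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(img, size=224):
    """Resize to size x size and return a normalized float32 array in CHW layout."""
    img = img.convert("RGB").resize((size, size), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32) / 255.0  # HWC, scaled to [0, 1]
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD       # per-channel normalization
    return arr.transpose(2, 0, 1)                    # HWC -> CHW


# Solid-color stand-in for a 2000 x 1500 dermatoscopic image.
dummy = Image.new("RGB", (2000, 1500), color=(128, 64, 32))
x = preprocess(dummy)
print(x.shape, x.dtype)
```

Model-specific encoders may expect different constants or input sizes, so the mean, standard deviation, and `size` should be replaced accordingly.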
6. Hierarchical Benchmarking and Evaluation
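DERM12345 underpins a four-level hierarchical evaluation, as formalized in recent foundation model studies (Yuceyalcin et al., 18 Jan 2026). Models are trained at the finest subclass level (40-way), with aggregate predictions derived via probability summation over child subclasses per parent label:

$$P(y_{\text{parent}} = p \mid x) = \sum_{c \in \mathcal{C}(p)} P(y_{\text{sub}} = c \mid x)$$

where $\mathcal{C}(p)$ denotes the set of subclasses belonging to parent label $p$.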
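The probability-summation step can be sketched in NumPy as follows. The helper `aggregate_to_parents` and the five-subclass, two-parent toy hierarchy are hypothetical; the real mapping would come from the dataset's 40-subclass taxonomy.

```python
import numpy as np


def aggregate_to_parents(sub_probs, child_to_parent, n_parents):
    """Sum subclass probabilities into parent-level probabilities.

    sub_probs: (n_samples, n_subclasses) array of subclass probabilities.
    child_to_parent: length n_subclasses sequence of parent indices.
    """
    sub_probs = np.asarray(sub_probs, dtype=float)
    child_to_parent = np.asarray(child_to_parent)
    n_sub = sub_probs.shape[1]
    # Membership matrix: entry [c, p] is 1 when subclass c belongs to parent p.
    membership = np.zeros((n_sub, n_parents))
    membership[np.arange(n_sub), child_to_parent] = 1.0
    return sub_probs @ membership


# Toy hierarchy (hypothetical): subclasses 0-1 -> parent 0, subclasses 2-4 -> parent 1.
child_to_parent = [0, 0, 1, 1, 1]
sub_probs = np.array([
    [0.35, 0.05, 0.25, 0.20, 0.15],
    [0.05, 0.05, 0.40, 0.30, 0.20],
])
parent_probs = aggregate_to_parents(sub_probs, child_to_parent, n_parents=2)
parent_pred = parent_probs.argmax(axis=1)
print(parent_probs, parent_pred)
```

In the first toy row the most probable subclass belongs to parent 0, yet the summed mass favors parent 1, which is why aggregating probabilities can differ from simply mapping the top subclass to its parent.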
Evaluation metrics are:
- Primary: Weighted F1-Score

  $$F1_{\text{weighted}} = \sum_{i=1}^{K} w_i \cdot F1_i$$

  where $K$ is the number of classes and $w_i$ is the normalized support of class $i$.
- Secondary: Balanced Accuracy (identifies failures on rare subclasses).
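Both metrics are available in scikit-learn; the sketch below computes them on toy labels (invented for illustration) and recomputes the weighted F1 by hand from per-class F1 scores and normalized support.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

# Toy labels: class 0 is frequent, class 2 is rare and always missed.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
bal_acc = balanced_accuracy_score(y_true, y_pred)

# Manual check: support-weighted sum of per-class F1 scores.
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
support = np.bincount(y_true) / len(y_true)
manual_weighted_f1 = float(np.sum(support * per_class_f1))

print(f"weighted F1 = {weighted_f1:.4f}, balanced accuracy = {bal_acc:.4f}")
```

Because balanced accuracy averages per-class recall without weighting by support, the missed rare class lowers it more sharply than it lowers the weighted F1.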
Key benchmark findings include the "granularity gap": general-purpose medical vision models (e.g., MedImageInsights) achieve high binary screening accuracy (97.52% Weighted F1) but lower subclass discrimination (~65.5%), while models pretrained for dermatology (e.g., Derm Foundation, MedSigLip, MONET) reach higher subclass accuracy (~69.5%) yet trail in coarse labels.
7. Code Examples and Practical Usage
The sample below prepares the stratified 70/15/15 split with pandas and scikit-learn; the resulting data frames can then feed a PyTorch or TensorFlow data pipeline:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the metadata table (one row per image).
meta = pd.read_csv("DERM38_metadata.csv")

# Hold out 30% of the images, stratified by subclass label.
train_df, temp_df = train_test_split(
    meta, test_size=0.30, stratify=meta["subclass_label"], random_state=42
)

# Split the held-out 30% evenly into validation and test sets (15% each).
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["subclass_label"], random_state=42
)
```
8. Key Insights and Limitations
DERM12345 exhibits significant class imbalance: benign nevi (Compound, Junctional) comprise ~40% of cases, while rare malignancies individually constitute <1%. Notable subclass confusions, such as between dysplastic and banal compound nevi (the "blob problem"), persist even in top-performing models, with ≥25% embedding overlap observed.
For clinical applications, coarse-grained tasks (binary, superclass) are approachable by general medical vision encoders, but fine-grained classification (40-way) demonstrably requires specialized dermatology pretraining or large-scale medical embedding strategies. Comparing diverse adapter architectures (MLP, XGBoost, SVM) is critical to assessing representation quality; MLP adapters consistently yield the best results for fine-grained classification.
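The adapter comparison can be illustrated with a small sketch that fits two lightweight classifiers on frozen embeddings and scores them with weighted F1. Everything here is an assumption for illustration: the embeddings are synthetic Gaussian clusters, the hyperparameters are arbitrary, and XGBoost is left out only to avoid an extra dependency.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for precomputed embeddings: 4 classes, 32 dimensions.
rng = np.random.default_rng(0)
n_classes, dim, n_per_class = 4, 32, 60
centers = rng.normal(scale=3.0, size=(n_classes, dim))
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

# Hold out every fourth sample for evaluation (15 per class).
test_mask = np.arange(len(y)) % 4 == 0
X_tr, y_tr = X[~test_mask], y[~test_mask]
X_te, y_te = X[test_mask], y[test_mask]

# Two adapter types trained on the same frozen embeddings.
adapters = {
    "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
    "svm": SVC(kernel="rbf", C=1.0),
}
scores = {}
for name, clf in adapters.items():
    clf.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, clf.predict(X_te), average="weighted")

print(scores)
```

On real embeddings the gap between adapters is what carries information about representation quality; the synthetic clusters here are well separated, so both adapters score similarly.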
A plausible implication is that model selection in algorithmic dermatology must be domain- and task-specific, matching representation strategy to diagnostic granularity demanded by the target workflow.
DERM12345 represents a rigorous, hierarchically structured, and expertly annotated benchmark resource for dermatologic machine learning, facilitating standardized evaluation across the full spectrum of clinical diagnostic tasks (Yilmaz et al., 2024, Yuceyalcin et al., 18 Jan 2026).