AML-Cytomorphology Dataset for Hematopathology
- The AML-Cytomorphology dataset is a comprehensive, expertly annotated collection of peripheral blood single-cell images focused on diagnosing AML and related neoplasms.
- It comprises over 1 million high-quality images from a rigorously curated cohort of patients and healthy donors, annotated using WHO 2022 criteria.
- The dataset underpins advanced AI studies, offering robust evaluation protocols and explainability tools through transformer-based and quantum machine learning methodologies.
The AML-Cytomorphology dataset is a large-scale, expertly annotated collection of peripheral blood single-cell images acquired for the diagnosis and research of hematological malignancies, with a primary focus on acute myeloid leukemia (AML) and related neoplasms. This dataset underpins recent advances in explainable AI-based diagnostics, classical deep learning, and quantum machine learning for hematopathology, as demonstrated in "cAItomorph: Transformer-Based Hematological Malignancy Prediction from Peripheral Blood Smears in a Real-Word Cohort" (Dasdelen et al., 23 Sep 2025) and in the quantum benchmarks of Bano et al. (26 Jan 2026). It is characterized by broad diagnostic coverage, multimodal ground truth, rigorous split protocols, and reproducible public availability.
1. Structural Composition and Cohort Definition
The dataset comprises peripheral blood single-cell images sourced from the Munich Leukemia Laboratory (MLL) between 2021 and 2022. In its most comprehensive form, the raw cohort includes 6,115 newly diagnosed patients and 495 healthy donors, totaling 6,610 cases (Dasdelen et al., 23 Sep 2025). After exclusion of post-chemotherapy follow-ups, uncertain or ambiguous diagnoses, double/in-between diagnostic entities, and rare/undetectable conditions, the cleaned cohort for model development consists of 2,059 patients, contributing 1,003,702 single-cell images. An extended test set partially reintroduces borderline cases totalling 1,386 patients.
Detailed diagnostic labeling encompasses 168 initial diagnosis codes, algorithmically grouped by domain experts into 22 refined classes (e.g., AML, B-cell neoplasm [CLL], CMML) and 7 coarse categories:
- Acute leukemia (n=231)—AML forms 81.8% of this class,
- Lymphoma (n=286),
- Myelodysplastic syndrome (n=194),
- MDS/MPN overlap (n=113),
- Myeloproliferative neoplasms (n=241),
- Plasma cell neoplasm (n=277),
- No malignancy (n=712; includes healthy donors, reactive states).
Patient-level metadata includes age, sex, and standard hematological parameters (WBC, RBC, platelets).
2. Annotation Protocol and Ground Truth Construction
Image-level diagnostic labels are assigned by consensus among board-certified hematopathologists adhering to international standards, including WHO 2022 criteria (Dasdelen et al., 23 Sep 2025, Bano et al., 26 Jan 2026). Ground truth incorporates multimodal evidence from bone marrow cytomorphology, immunophenotyping (flow cytometry), cytogenetics (karyotyping, FISH), and molecular genetics (NGS panels). AML cases include all defined WHO subtypes, such as AML with RUNX1::RUNX1T1 and NPM1-mutated forms.
Quality assurance is maintained by excluding poor-quality smears (<50 images per patient) and by real-time domain expert review of the automated segmentation and color adjustment scripts. No formal inter-rater agreement statistics are reported; consensus labeling is trusted on the basis of longstanding MLL protocols.
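The per-patient image-count filter can be sketched as follows. This is a minimal illustration, assuming image records are available as `(patient_id, image_path)` tuples; the record layout, the function name `filter_patients`, and the toy file paths are hypothetical and not taken from the cited studies.

```python
from collections import Counter

MIN_IMAGES_PER_PATIENT = 50  # exclusion threshold stated in the text


def filter_patients(image_records, min_images=MIN_IMAGES_PER_PATIENT):
    """Keep only images from patients with at least `min_images` images.

    `image_records` is a list of (patient_id, image_path) tuples; this
    layout is an assumption made for illustration.
    """
    counts = Counter(pid for pid, _ in image_records)
    kept_patients = {pid for pid, n in counts.items() if n >= min_images}
    return [(pid, path) for pid, path in image_records if pid in kept_patients]


# Toy example: patient "P1" has 60 images, patient "P2" has only 10.
records = [("P1", f"P1/cell_{i}.tif") for i in range(60)]
records += [("P2", f"P2/cell_{i}.tif") for i in range(10)]
kept = filter_patients(records)
```

Filtering at the patient level, rather than dropping individual images, keeps every retained patient's image bag complete.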
3. Imaging Acquisition and Preprocessing Procedures
Peripheral blood smears are stained with Wright–Giemsa for cAItomorph (Dasdelen et al., 23 Sep 2025) or May–Grünwald–Giemsa for the quantum studies (Bano et al., 26 Jan 2026), followed by automated microscopy:
- Overview scans at 10× objective (Metasystems Metafer platform),
- Single-cell acquisition at 40× or 100× oil-immersion objectives.
Images are extracted using Metafer, which applies threshold-based segmentation and logarithmic color transformations, yielding high-quality center-cropped TIFF patches of size 144×144 pixels (~478±83 cells/patient) for cAItomorph; for quantum benchmarking, images are resized to 64×64 pixels (grayscale). No additional normalization or geometric augmentation is performed for cAItomorph; quantum studies apply implicit pixel normalization.
Engineered feature vectors (20-dimensional) for quantum experiments comprise intensity statistics, GLCM-based texture, cell morphology, edge metrics via Sobel filtering, and frequency characteristics derived from the 2D FFT. PCA reduces the dimensionality to 4 for variational quantum circuits, and the reduced features are rescaled before encoding.
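A reduced version of this feature pipeline can be sketched with NumPy alone. The sketch below is an assumption-laden illustration, not the published feature set: it computes only seven features (intensity statistics, Sobel edge statistics, and one FFT-based high-frequency ratio), omits the GLCM and morphology features, and uses random arrays as stand-ins for 64×64 grayscale cell images. The function names and the low-frequency radius are hypothetical.

```python
import numpy as np


def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)


def extract_features(img):
    """Return a small hand-crafted feature vector for one grayscale image."""
    img = img.astype(float)
    edges = sobel_magnitude(img)
    # Power spectrum with the zero frequency shifted to the image centre.
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)
    low = spectrum[radius <= min(h, w) / 8].sum()  # assumed low-frequency disc
    total = spectrum.sum()
    high_ratio = 1.0 - low / total if total > 0 else 0.0
    return np.array([
        img.mean(), img.std(), img.min(), img.max(),  # intensity statistics
        edges.mean(), edges.std(),                    # Sobel edge metrics
        high_ratio,                                   # frequency characteristic
    ])


def pca_reduce(X, n_components=4):
    """Project mean-centred rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


rng = np.random.default_rng(0)
images = rng.random((12, 64, 64))  # random stand-ins for cell images
X = np.stack([extract_features(im) for im in images])
Z = pca_reduce(X, n_components=4)  # one 4-D vector per image
```

In practice the PCA projection would be fitted on the training split only and then applied unchanged to the test split, so that no test information leaks into the reduced features.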
4. Label Grouping, Cohort Splitting, and Data Subsets
Diagnosis codes are mapped to hierarchical classes: 168 raw codes to 22 detailed and 7 coarse labels. AML is defined per WHO, while ambiguous or overlapping cases are re-evaluated in extended test cohorts.
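The two-level grouping can be represented as a pair of lookup tables. The sketch below is illustrative only: the raw code strings are hypothetical placeholders (the 168 real codes are not listed here), and the assignment of detailed classes to coarse categories is an assumption based on the category names above, not the published mapping.

```python
# Hypothetical raw diagnosis codes; placeholders for the 168 real codes.
RAW_TO_DETAILED = {
    "raw_code_001": "AML",
    "raw_code_002": "AML",
    "raw_code_003": "B-cell neoplasm [CLL]",
    "raw_code_004": "CMML",
}

# Assumed detailed-to-coarse assignment for three of the 22 detailed classes.
DETAILED_TO_COARSE = {
    "AML": "Acute leukemia",
    "B-cell neoplasm [CLL]": "Lymphoma",
    "CMML": "MDS/MPN overlap",
}


def to_labels(raw_code):
    """Return (detailed_label, coarse_label) for one raw diagnosis code."""
    detailed = RAW_TO_DETAILED[raw_code]
    return detailed, DETAILED_TO_COARSE[detailed]


labels = [to_labels(code) for code in sorted(RAW_TO_DETAILED)]
```

Keeping the two tables separate means the coarse grouping can be revised without touching the raw-code assignments.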
For model development, five-fold cross-validation is strictly organized at the patient level to eliminate data leakage: the cleaned cohort is partitioned into equal patient subsets; for each fold, 60% is used for training, 20% for validation, and 20% for testing. Patients appear in only one test set per run (Dasdelen et al., 23 Sep 2025). For quantum benchmarks, random stratified subsampling is conducted over image-level balanced sets, with sizes varying per class and a fixed seed for reproducibility, followed by an 80/20 train/test split (Bano et al., 26 Jan 2026).
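A patient-level five-fold scheme with a 60/20/20 split per fold can be sketched as below. This is one possible reading of the protocol, assuming that the validation chunk is simply the chunk following the test chunk; the function name, the seed, and the synthetic patient IDs are hypothetical.

```python
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Split patients into folds; each fold uses 1 chunk for test,
    1 chunk for validation, and the remaining chunks for training."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    # Interleaved slicing yields n_folds chunks of (near-)equal size.
    chunks = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val_k = (k + 1) % n_folds  # assumed rotation for the validation chunk
        train = [p for j, chunk in enumerate(chunks)
                 if j not in (k, val_k) for p in chunk]
        splits.append({"train": train, "val": chunks[val_k], "test": chunks[k]})
    return splits


patients = [f"P{i:04d}" for i in range(100)]  # synthetic patient IDs
splits = patient_level_folds(patients)
```

Because the split is made over patient IDs rather than images, all images of a patient land in exactly one of train, validation, or test within a fold, and each patient is tested exactly once across the five folds.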
5. Feature Encoding, Instance Aggregation, and Model Integration
Three feature encoders have been benchmarked on the dataset (Dasdelen et al., 23 Sep 2025):
- ResNet-34, pretrained on 21k WBC images (tensor output),
- Vision Transformer (ViT), supervised pretrained (vector output),
- DinoBloom foundation model, self-supervised on 380k+ WBC images (vector output; primary encoder for cAItomorph).
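Multiple instance learning aggregates cell embeddings into patient-level vectors using attention-weighted sums:
$$z = \sum_{i=1}^{N} a_i h_i,$$
with normalized attention weights,
$$a_i = \frac{\exp(s_i)}{\sum_{j=1}^{N} \exp(s_j)},$$
where $h_i \in \mathbb{R}^{d}$ is the embedding of cell $i$ and $s_i$ is its learned attention score ($N$ cells per patient, $d$ embedding dimension).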
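Attention-weighted pooling of this kind can be sketched in NumPy. The scoring function used here, a `tanh` projection followed by a dot product, is a commonly used choice and is an assumption; the exact form used in the cited work is not given here. The matrices `V` and `w` are random stand-ins for learned parameters, and the sizes are arbitrary.

```python
import numpy as np


def attention_pool(H, V, w):
    """Pool N cell embeddings into one patient vector.

    H: (N, d) cell embeddings, V: (k, d) projection, w: (k,) scoring vector.
    Returns the pooled vector z of shape (d,) and the weights a of shape (N,).
    """
    scores = np.tanh(H @ V.T) @ w          # one scalar score per cell
    scores = scores - scores.max()         # stabilise the softmax
    a = np.exp(scores) / np.exp(scores).sum()
    z = a @ H                              # attention-weighted sum of embeddings
    return z, a


rng = np.random.default_rng(0)
N, d, k = 50, 16, 8                        # arbitrary illustrative sizes
H = rng.normal(size=(N, d))
V = rng.normal(size=(k, d))
w = rng.normal(size=k)
z, a = attention_pool(H, V, w)
```

The weights `a` are non-negative and sum to one, so they can be read directly as the relative contribution of each cell to the patient-level vector.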
Transformer-based aggregation in cAItomorph takes cell tokens plus a [CLS] token, using a 2-layer transformer with 8 attention heads per layer, producing the final [CLS] output as a patient vector.
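A [CLS]-token aggregator with 2 layers and 8 heads can be sketched in PyTorch as follows. This is a minimal illustration rather than the cAItomorph implementation: the class name, the default embedding dimension of 384, the feed-forward width (PyTorch default), and the linear classification head over 7 classes are all assumptions.

```python
import torch
import torch.nn as nn


class CellTransformerAggregator(nn.Module):
    """Aggregate a bag of cell embeddings into one patient vector via [CLS]."""

    def __init__(self, embed_dim=384, num_heads=8, num_layers=2, num_classes=7):
        super().__init__()
        # Learnable [CLS] token, prepended to every bag of cell tokens.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)  # assumed classifier head

    def forward(self, cells):
        # cells: (batch, n_cells, embed_dim)
        cls = self.cls_token.expand(cells.size(0), -1, -1)
        tokens = torch.cat([cls, cells], dim=1)
        encoded = self.encoder(tokens)
        patient_vec = encoded[:, 0]        # final [CLS] output
        return self.head(patient_vec), patient_vec


model = CellTransformerAggregator(embed_dim=64).eval()
with torch.no_grad():
    bags = torch.randn(2, 10, 64)          # 2 patients, 10 cells each (toy data)
    logits, patient_vec = model(bags)
```

No positional encoding is added in this sketch, which treats the cells of a patient as an unordered set.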
Quantum learning studies generate 20-dimensional engineered features, further reduced by PCA to 4D, which serve as input for 4-qubit variational quantum circuits (VQCs), or directly for energy-based Equilibrium Propagation (EP) approaches.
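To make the 4-qubit setting concrete, the sketch below simulates a small variational circuit with a plain NumPy statevector instead of a quantum SDK. The circuit layout is an assumption chosen for illustration (RY angle encoding of the four PCA features, one layer of trainable RY rotations, a ring of CNOTs, and a Pauli-Z readout on qubit 0); the circuits used in the cited benchmarks may differ.

```python
import numpy as np

N_QUBITS = 4


def ry(theta):
    """Single-qubit RY rotation matrix (real-valued)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])


def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q of a state stored as a (2,)*N tensor."""
    state = np.tensordot(gate, state, axes=([1], [q]))
    return np.moveaxis(state, 0, q)


def apply_cnot(state, control, target):
    """Flip the target qubit on the control = 1 slice of the state tensor."""
    new = state.copy()
    idx = [slice(None)] * N_QUBITS
    idx[control] = 1
    sub = state[tuple(idx)]
    # The control axis is dropped by indexing, so later axes shift down by one.
    t = target - 1 if target > control else target
    new[tuple(idx)] = np.flip(sub, axis=t)
    return new


def vqc_expectation(x, thetas):
    """Return <Z> on qubit 0 for input features x and parameters thetas."""
    state = np.zeros((2,) * N_QUBITS)
    state[(0,) * N_QUBITS] = 1.0                    # start in |0000>
    for q in range(N_QUBITS):
        state = apply_1q(state, ry(x[q]), q)        # data encoding
    for q in range(N_QUBITS):
        state = apply_1q(state, ry(thetas[q]), q)   # trainable rotations
    for q in range(N_QUBITS):
        state = apply_cnot(state, q, (q + 1) % N_QUBITS)  # entangling ring
    probs = state ** 2                              # amplitudes are real here
    return probs[0].sum() - probs[1].sum()          # P(q0 = 0) - P(q0 = 1)


rng = np.random.default_rng(0)
x = rng.uniform(0, np.pi, size=N_QUBITS)            # toy 4-D input
thetas = rng.uniform(0, 2 * np.pi, size=N_QUBITS)   # toy parameters
value = vqc_expectation(x, thetas)
```

The expectation value lies in [-1, 1] and could be thresholded or passed through a loss function for binary classification; multi-class use would need additional readout qubits or post-processing.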
6. Evaluation Protocol, Performance Metrics, and Explainability
cAItomorph is evaluated over the 7 coarse disease classes by top-1 classification accuracy (mean ± s.d., 5-fold CV), by per-class F1 scores, highlighted for acute leukemia, myeloproliferative neoplasms, and no malignancy, and by top-2 accuracy, which exceeds top-1 accuracy (Dasdelen et al., 23 Sep 2025). Quantum methods, under severe data and resolution constraints, achieve 86.4% accuracy for EP and a stable 83.0% accuracy for VQC even at 50 samples/class, compared to 98% for classical CNNs at 250 samples/class (Bano et al., 26 Jan 2026).
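Top-k accuracy, the metric behind the top-1 and top-2 figures, can be computed as in the sketch below. The probabilities and labels are made-up toy values with three classes, used only to show the calculation; they are not results from the dataset.

```python
def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for p, y in zip(probs, labels):
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        hits += y in ranked[:k]
    return hits / len(labels)


# Toy predictions over 3 classes (illustrative values only).
probs = [
    [0.70, 0.20, 0.10],
    [0.35, 0.40, 0.25],
    [0.10, 0.30, 0.60],
]
labels = [0, 0, 1]

top1 = top_k_accuracy(probs, labels, k=1)
top2 = top_k_accuracy(probs, labels, k=2)
```

Top-2 accuracy can never be lower than top-1 accuracy, since every top-1 hit is also a top-2 hit.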
Explainability in cAItomorph includes cell-level attention maps via Attention Rollout and pixel-level heatmaps via Score-CAM; quantum studies derive feature importances by construction. Calibration is assessed with reliability diagrams (ECE ≈ 5.35%) computed on the softmax output probabilities. The bone marrow aspiration recommender uses per-class sensitivity thresholds chosen so that sensitivity for acute leukemia stays at 100%, lowering the false discovery rate (FDR) without sacrificing recall.
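Two of the quantities above, expected calibration error and a score threshold that preserves a target sensitivity, can be sketched in plain Python. The equal-width binning, the helper names, and all toy scores are assumptions for illustration and do not reproduce the cited evaluation.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins: weighted mean |accuracy - confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(o for _, o in b) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece


def threshold_for_sensitivity(scores, is_positive, target=1.0):
    """Largest threshold t such that flagging score >= t reaches the target sensitivity."""
    pos = sorted((s for s, y in zip(scores, is_positive) if y), reverse=True)
    needed = max(1, math.ceil(target * len(pos)))
    return pos[needed - 1]


def false_discovery_rate(scores, is_positive, threshold):
    """FDR = FP / (FP + TP) among cases flagged at the given threshold."""
    flagged = [y for s, y in zip(scores, is_positive) if s >= threshold]
    return (len(flagged) - sum(flagged)) / len(flagged)


# Toy calibration data: predicted confidence and whether the prediction was right.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])

# Toy acute-leukemia scores: two true cases, three non-cases.
scores = [0.9, 0.4, 0.7, 0.2, 0.5]
is_positive = [1, 1, 0, 0, 0]
t = threshold_for_sensitivity(scores, is_positive, target=1.0)
fdr = false_discovery_rate(scores, is_positive, t)
```

Fixing the sensitivity target first and then reading off the FDR makes the trade-off explicit: in the toy data, catching both true cases requires flagging two non-cases as well.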
7. Accessibility, Licensing, and Reproducibility Assurance
All code, dataset splits, pretrained model weights, and configuration details for cAItomorph are slated for release under permissive academic licenses (CC BY-NC or MIT; see the GitHub repository at https://github.com/marrlab/cAItomorph), with fixed random seeds and fully documented environments for reproducibility (Dasdelen et al., 23 Sep 2025). The modeling stack relies only on open-source frameworks (e.g., PyTorch, scikit-learn) applied to Metafer image exports. The quantum learning benchmarks use Qiskit for classical simulation, with split protocols based on random stratification and early stopping on an internal validation set (Bano et al., 26 Jan 2026).
A plausible implication is that the AML-Cytomorphology dataset sets a new standard for multimodal, reproducible hematopathology benchmarking, with proven utility for both classical and quantum model paradigms, robust cross-validation standards, broad diagnostic coverage, and open accessibility for the technical community.