Multicenter Lymphoma Benchmarking Dataset
- The dataset is a rigorously curated, multi-institutional resource standardizing data acquisition, annotation, and preprocessing for lymphoma diagnosis and biomarker discovery.
- It integrates histopathology, PET/CT imaging, and gene expression data with modality-specific pipelines to ensure reproducibility and robust benchmarking.
- Benchmarking protocols employing cross-validation and hierarchical modeling facilitate robust comparison of machine learning approaches amid challenges like domain shift and annotation noise.
The Multicenter Lymphoma Benchmarking Dataset refers to a family of rigorously curated, multi-institutional data resources designed to support the systematic evaluation of machine learning and computational analysis methodologies for lymphoma diagnosis and biomarker discovery across imaging and genomic modalities. These datasets represent a concerted effort to standardize data acquisition, annotation, preprocessing, benchmarking, and reporting practices for robust comparison and reproducible research in lymphoma informatics, particularly addressing generalizability and domain shift across centers and platforms.
1. Compositional Structure and Acquisition Modalities
Multicenter lymphoma benchmarking datasets exist for histopathology, PET/CT imaging, and gene expression profiling. Each variant is assembled from multiple academic hospitals or consortia, emphasizing diversity in acquisition hardware, clinical protocol, population, and disease subtype representation.
Histopathology
The first multicenter whole slide imaging (WSI) benchmark for lymphoma subtyping from H&E-stained slides comprises 999 slides from 609 patients across four German academic medical centers: Munich (Technical University of Munich), Kiel (University Hospital Kiel), Augsburg (University Hospital Augsburg), and Erlangen (University Hospital Erlangen). Each slide is associated with one of five classes: chronic lymphocytic leukemia (CLL), follicular lymphoma (FL), mantle cell lymphoma (MCL), diffuse large B-cell lymphoma (DLBCL), or healthy control tissue (NEG). Slides were scanned at 40× using Aperio AT2, Hamamatsu, or Philips scanners, with downstream analysis also performed at 20× and 10× via downsampling. Per-institution class distributions vary: all sites provide tumor classes, but negative controls are present only in Munich and Kiel. Labels are assigned either via board-certified pathologist audit (Munich, Kiel, Erlangen) or inferred from immunohistochemistry (IHC; Augsburg), introducing known annotation noise (Umer et al., 16 Dec 2025).
PET/CT Imaging
Multicenter PET/CT resources integrate 18F-FDG PET and CT data from multiple sites, such as BC Cancer Vancouver and Seoul St. Mary's Hospital, with up to 611 scans (≈511 unique patients) in a representative benchmark (Ahamed et al., 2023, Ahamed et al., 2024). PET volumes are standardized to SUV units; CT images use Hounsfield units (HU). Axial slices are resampled and intensity-clipped for network input (e.g., 224×224 pixels for classification tasks). Slice or volume-level annotations are derived from expert delineations using tools like PETEdge or consensus protocols (STAPLE).
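The SUV standardization mentioned above follows the conventional body-weight formula, SUV_bw = tissue activity concentration / (injected dose / body weight). A minimal sketch (assuming the PET volume is already decay-corrected and in Bq/mL, and a tissue density of 1 g/mL so the result is dimensionless):

```python
import numpy as np

def suv_bw(activity_bq_ml: np.ndarray, injected_dose_bq: float,
           body_weight_kg: float) -> np.ndarray:
    """Convert a decay-corrected PET activity map (Bq/mL) to body-weight SUV.

    SUV_bw = activity / (injected_dose / body_weight), with body weight in
    grams so the units cancel (assuming 1 g/mL tissue density).
    """
    body_weight_g = body_weight_kg * 1000.0
    return activity_bq_ml / (injected_dose_bq / body_weight_g)

# Toy example: 5 kBq/mL uptake, 370 MBq injected dose, 74 kg patient
vol = np.full((2, 2, 2), 5000.0)      # Bq/mL
suv = suv_bw(vol, 370e6, 74.0)
print(suv[0, 0, 0])                   # 5000 / (370e6 / 74000) = 1.0
```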
Gene Expression
Genomic benchmarks for diffuse large B-cell lymphoma (DLBCL) consist of integrated microarray expression data from 11 studies (n=2046 samples) from multiple North American and European sources (Bilgrau et al., 2015). Raw CEL files, reannotated and uniformly preprocessed, support downstream network meta-analysis and covariance modeling.
2. Annotation, Curation, and Quality Control
Annotation protocols are defined by modality and cohort.
- Histopathology: Pathologists review H&E sections for label assignment (Munich, Kiel, Erlangen); Augsburg labels derive from IHC, introducing label noise. Artifact detection and exclusion are performed at the patch level (e.g., via Trident with DeepLabV3).
- PET/CT: Expert nuclear medicine readers employ PETEdge for tumor segmentation; multi-reader cases are fused by STAPLE. Labels (e.g., positive slice, tumor mask) are explicitly defined and curated.
- Gene Expression: Annotations originate from medical records and database curation, with meta-data including array platform, cohort, and, typically, requisite clinical parameters.
An overview of per-modality annotation and artifact handling:
| Modality | Primary Annotation | Artifact/Noise Mitigation |
|---|---|---|
| Histopathology | Pathologist/IHC | Patch-level exclusion |
| PET/CT | Expert contouring (PETEdge) | Intensity/hardware standardization, STAPLE fusion |
| Gene Expression | Meta-data/clinical curation | Quantile normalization |
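The STAPLE fusion referenced in the table is an EM algorithm that jointly estimates a consensus segmentation and per-reader sensitivity/specificity. A simplified binary-label sketch (fixed prior, flattened masks; the published algorithm adds spatial priors and other refinements):

```python
import numpy as np

def staple(masks: np.ndarray, prior: float = 0.5, n_iter: int = 30) -> np.ndarray:
    """Simplified binary STAPLE for fusing R reader masks of N voxels each.

    masks: (R, N) binary array, one flattened mask per reader.
    Returns the per-voxel posterior probability of the consensus label.
    """
    R, N = masks.shape
    p = np.full(R, 0.9)   # per-reader sensitivity, initial guess
    q = np.full(R, 0.9)   # per-reader specificity, initial guess
    w = np.full(N, prior)
    for _ in range(n_iter):
        # E-step: posterior probability that the true label is 1 at each voxel
        a = prior * np.prod(np.where(masks == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(masks == 1, 1 - q[:, None], q[:, None]), axis=0)
        w = a / np.maximum(a + b, 1e-12)
        # M-step: re-estimate reader performance against the soft consensus
        p = (masks @ w) / np.maximum(w.sum(), 1e-12)
        q = ((1 - masks) @ (1 - w)) / np.maximum((1 - w).sum(), 1e-12)
    return w

readers = np.array([[1, 1, 0, 0, 1],
                    [1, 1, 0, 1, 1],
                    [1, 0, 0, 0, 1]])
consensus = staple(readers) > 0.5
```

With three readers, the fused mask follows the weighted majority once per-reader reliabilities are estimated.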
3. Preprocessing Pipelines and Data Standardization
Dataset preprocessing is tightly protocolized and tailored to each modality.
- WSI: Trident (DeepLabV3-based) for tissue segmentation; 256×256 px non-overlapping patches at various magnifications; minor color normalization via encoder augmentations only.
- PET/CT: CT and PET images clipped within biologically motivated intensity ranges, resampled to uniform 2.0 mm³ voxels; body cropping trims extraneous volume; for slice-level tasks, PET and CT are fused or stacked into 2D multi-channel arrays.
- Gene Expression: RMA log2 transformation, quantile normalization across platforms, feature set restriction to 11,573 common (Ensembl-mapped) genes; top variance genes selected for benchmarking.
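The quantile normalization step above can be sketched in a few lines: each sample (column) is mapped onto a shared reference distribution, taken as the mean across samples of the sorted per-sample values (tie handling here is by sort order, a simplification of the usual rank-averaging):

```python
import numpy as np

def quantile_normalize(expr: np.ndarray) -> np.ndarray:
    """Quantile-normalize a genes x samples expression matrix.

    Every column is forced onto the same reference distribution: the
    row-wise mean of the column-sorted matrix.
    """
    order = np.argsort(expr, axis=0)                # per-column sort order
    ranks = np.argsort(order, axis=0)               # rank of each gene per column
    reference = np.sort(expr, axis=0).mean(axis=1)  # mean of sorted columns
    return reference[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
# After normalization, all columns share an identical value distribution
```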
Automated command-line pipelines (e.g., https://github.com/RaoUmer/LymphomaMIL (Umer et al., 16 Dec 2025), https://github.com/microsoft/lymphoma-segmentation-dnn (Ahamed et al., 2023)) implement consistent preprocessing, patch extraction, feature embedding, and evaluation routines.
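The 256×256 non-overlapping WSI patching described above can be sketched as follows; the crude brightness threshold here is only a stand-in for the DeepLabV3-based tissue segmentation used in the actual pipelines, and the threshold values are illustrative:

```python
import numpy as np

def extract_patches(wsi: np.ndarray, size: int = 256,
                    min_tissue_frac: float = 0.5) -> list:
    """Tile an RGB slide image into non-overlapping size x size patches,
    keeping only patches whose tissue fraction exceeds a threshold.

    Tissue detection is a simple brightness cutoff (non-white = tissue),
    standing in for a learned segmentation model.
    """
    h, w, _ = wsi.shape
    tissue = wsi.mean(axis=2) < 220          # non-white pixels ~ tissue
    patches = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            if tissue[y:y + size, x:x + size].mean() >= min_tissue_frac:
                patches.append(((y, x), wsi[y:y + size, x:x + size]))
    return patches

# Toy slide: white background with one 256 x 256 "tissue" region
slide = np.full((512, 512, 3), 255, dtype=np.uint8)
slide[0:256, 0:256] = 120
kept = extract_patches(slide)
print(len(kept))                              # only the tissue patch survives
```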
4. Ground-Truth, Metadata, and Benchmarking Protocols
Each instance in the benchmarking datasets is paired with comprehensive metadata:
- WSI: Patient ID, center, scanner, magnification, and slide-level label.
- PET/CT: Demographics (age, sex), scanner vendor, lesion characteristics (number, TMTV, TLG, SUV), detailed segmentation masks.
- Gene Expression: Accession numbers, cohort/study-specific identifiers, raw/processed expression matrices.
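The lesion characteristics listed for PET/CT follow standard definitions: TMTV (total metabolic tumor volume) is the segmented voxel count times the voxel volume, and TLG (total lesion glycolysis) is the SUVmean over the mask times TMTV. A minimal sketch, assuming a SUV volume, a binary mask, and 2.0 mm isotropic voxels:

```python
import numpy as np

def tmtv_tlg(suv: np.ndarray, mask: np.ndarray,
             voxel_mm=(2.0, 2.0, 2.0)):
    """Compute TMTV (mL) and TLG (SUV*mL) from a SUV volume and binary mask.

    TMTV = segmented voxel count * voxel volume;
    TLG  = SUVmean over the mask * TMTV.
    """
    voxel_ml = np.prod(voxel_mm) / 1000.0      # mm^3 -> mL
    n = int(mask.sum())
    tmtv = n * voxel_ml
    tlg = float(suv[mask.astype(bool)].mean()) * tmtv if n else 0.0
    return tmtv, tlg

suv = np.zeros((10, 10, 10))
mask = np.zeros_like(suv, dtype=bool)
mask[2:4, 2:4, 2:4] = True                     # 8 voxels of 0.008 mL each
suv[mask] = 5.0
tmtv, tlg = tmtv_tlg(suv, mask)                # TMTV = 0.064 mL, TLG = 0.32
```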
Evaluation protocols are standardized to facilitate robust intra- and inter-institutional comparisons:
- Histopathology: Five-fold patient-wise cross-validation (in-distribution, Munich) and external OOD validation (Kiel, Augsburg, Erlangen).
- PET/CT: Patient- or slice-level training/testing splits, center-aware vs. center-agnostic regimes, internal/external test distinction.
- Gene Expression: Study-level meta-analysis, treating each cohort as a batch; hierarchical random covariance modeling for cross-study effect adjustment.
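The patient-wise splitting used in these protocols ensures that all slides (or slices) from one patient fall into the same fold, preventing leakage between train and test sets. A minimal sketch with hypothetical patient IDs:

```python
import numpy as np

def patient_folds(patient_ids, n_folds=5, seed=0):
    """Assign each sample to a fold by patient, so no patient spans folds."""
    rng = np.random.default_rng(seed)
    patients = sorted(set(patient_ids))
    rng.shuffle(patients)                      # randomize patient order
    fold_of = {p: i % n_folds for i, p in enumerate(patients)}
    return [fold_of[p] for p in patient_ids]

# Two slides for P1 and P3; each patient stays within a single fold
slides = ["P1", "P1", "P2", "P3", "P3", "P4", "P5", "P6"]
folds = patient_folds(slides)
```

scikit-learn's `GroupKFold` implements the same idea with balanced fold sizes.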
5. Baseline Algorithms, Metrics, and Comparative Analysis
Modern multicenter lymphoma benchmarks enable rigorous model validation across a range of architectural settings.
Feature Extractors and MIL Aggregators (WSI)
- Six frozen feature extractors (ResNet50 plus the foundation models H-optimus-1, H0-mini, Virchow2, UNI2, and Titan), with no fine-tuning.
- Aggregators: attention-based MIL (AB-MIL) and transformer-based TransMIL/BEL.
Performance Metrics
- Histopathology: Area under the ROC curve (AUC), macro-averaged F1-score, and balanced accuracy (BACC), defined as the mean per-class recall: BACC = (1/C) Σ_{c=1}^{C} TP_c / (TP_c + FN_c).
- PET/CT: Accuracy, sensitivity, specificity, precision, negative predictive value (NPV), F1-score, balanced accuracy, AUROC, AUPRC (formulas given explicitly (Ahamed et al., 2024)).
- Gene Expression: Meta-analytic covariance/correlation modeling, cophenetic correlation, Kullback-Leibler divergence.
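The balanced accuracy defined above (mean per-class recall) can be computed directly; this sketch skips classes absent from the ground truth, matching the usual convention:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, n_classes):
    """BACC = (1/C) * sum_c TP_c / (TP_c + FN_c), i.e. mean per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        true_c = y_true == c
        if true_c.sum() == 0:
            continue                            # class absent from ground truth
        recalls.append((y_pred[true_c] == c).mean())
    return float(np.mean(recalls))

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(balanced_accuracy(y_true, y_pred, 3))    # (0.75 + 1.0 + 0.5) / 3 = 0.75
```

Unlike plain accuracy, BACC weights each class equally, which matters for the imbalanced subtype distributions described above.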
Empirical Observations
- WSI models consistently achieve in-distribution BACC ≥ 80% (best ~82%, AUC ~0.96 for Titan+AB-MIL) across all tested magnifications; OOD test performance drops to ~60% BACC, with Augsburg degrading further due to IHC-based noisy labels.
- PET/CT benchmarks reveal overestimation under slice-level splits; patient-level, center-agnostic models best generalize by metrics such as AUROC and AUPRC.
- In gene expression, hierarchical random covariance models outperform simple pooled estimators for common network structure inference and reveal nontrivial inter-study heterogeneity (Bilgrau et al., 2015).
6. Limitations, Access, and Future Directions
Data accessibility is modality- and site-dependent:
- Histopathology: WSIs available upon request (subject to standard agreements), open-source pipelines and recipes released.
- PET/CT: Some cohorts (e.g. AutoPET) are public (via TCIA); other datasets require institutional approval. Code and weights for benchmarking models are publicly accessible.
- Gene Expression: All processed and raw data are deposited in public repositories (GEO), with code and documentation available for reproducibility.
Key limitations persist, notably:
- Generalization Gap: Pronounced OOD degradation (~20% drop in BACC); scanner and stain variability, cohort shifts, and annotation noise remain unsolved.
- Label Noise: Automated label inference (Augsburg IHC, some PET/CT sites) introduces systematic noise and underperforms versus pathologist consensus.
- Subsampling and Class Imbalance: Rare subtypes are underrepresented, necessitating augmentation or data rebalancing.
- Normalization and Domain Adaptation: Stain and scanner variations suggest opportunities for model domain-adaptation and normalization modules.
Recommended future activities include expansion to rarer subtypes, prospective external clinical validation, integration of normalization/domain-adaptation strategies, and hierarchical modeling of cross-center effects.
7. Significance and Influence on Lymphoma Informatics
The multicenter lymphoma benchmarking datasets establish a reference standard for computational lymphoma research, permitting robust comparison of learning algorithms, aggregation methods, and generalization capacity across centers and clinical environments. These resources and associated pipelines enable the community to evaluate model robustness, address annotation noise, dissect cohort effects, and benchmark clinical utility in a reproducible manner, catalyzing developments in automated lymphoma subtyping, lesion detection, and prognostic biomarker discovery (Umer et al., 16 Dec 2025, Ahamed et al., 2023, Bilgrau et al., 2015).