
LC25000 Histopathology Dataset

Updated 10 January 2026
  • LC25000 is a public repository of 25,000 de-identified, RGB H&E-stained histopathology images evenly split into five classes for cancer research.
  • It features standardized preprocessing and augmentation protocols, including center cropping, rotations, and flips, to enhance ML model robustness.
  • Researchers leverage LC25000 to benchmark deep learning models, achieving high classification accuracies and advancing computational pathology.

The LC25000 dataset is a public repository comprising 25,000 de-identified color histopathology images, curated for machine learning and deep learning research in cancer pathology. Released by Borkowski et al. (Borkowski et al., 2019), the dataset consists of five balanced classes: colon adenocarcinoma, benign colon tissue, lung adenocarcinoma, lung squamous cell carcinoma, and benign lung tissue. All images are RGB, H&E-stained patches (768×768 pixels) that have been annotated by pathologists and de-identified in a HIPAA-compliant manner, making LC25000 a prominent benchmark for automated histopathological image analysis.

1. Dataset Composition and Class Semantics

LC25000 contains 25,000 JPEG histopathology images, evenly distributed across five classes ($N_{c_i} = 5{,}000$ for $i = 1, \dots, 5$):

| Class Index | Histologic Entity | Folder Name | N (images) |
|---|---|---|---|
| c₁ | Colon adenocarcinoma | colon_aca | 5,000 |
| c₂ | Benign colon tissue | colon_n | 5,000 |
| c₃ | Lung adenocarcinoma | lung_aca | 5,000 |
| c₄ | Lung squamous carcinoma | lung_scc | 5,000 |
| c₅ | Benign lung tissue | lung_n | 5,000 |

Images are stored in a hierarchical directory (lung_colon_image_set/) with organ-specific subfolders. Every class has exactly 20% representation, such that $p(c_i) = 0.20$ for all $i$ (Borkowski et al., 2019, Ezuma et al., 3 Jan 2026, Saremi et al., 2 Sep 2025, Garg et al., 2021).
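The class balance described above can be checked directly on a local copy of the archive. The following is a minimal sketch, assuming the five class folder names from the table and assuming that images carry a `.jpg` or `.jpeg` extension; the names of the organ-specific subfolders are not assumed, so the search is recursive. The helper names `count_images_per_class` and `class_priors` are illustrative and not part of the dataset release.

```python
from collections import Counter
from pathlib import Path

# Class folder names as listed in the table above.
CLASS_FOLDERS = ("colon_aca", "colon_n", "lung_aca", "lung_scc", "lung_n")


def count_images_per_class(root):
    """Count JPEG files whose parent directory is one of the class folders.

    The search is recursive, so the intermediate organ-specific
    subfolders do not need to be known in advance.
    """
    counts = Counter({name: 0 for name in CLASS_FOLDERS})
    for path in Path(root).rglob("*"):
        is_jpeg = path.suffix.lower() in {".jpg", ".jpeg"}  # assumed extensions
        if path.is_file() and is_jpeg and path.parent.name in counts:
            counts[path.parent.name] += 1
    return dict(counts)


def class_priors(counts):
    """Convert per-class counts into empirical class proportions p(c_i)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}
```

On an intact copy of the dataset, `class_priors(count_images_per_class("lung_colon_image_set"))` would be expected to return 0.2 for each of the five classes.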

2. Image Acquisition, File Format, and Metadata

Original images were scanned from de-identified, HIPAA-compliant pathology glass slides. The acquisition protocol includes:

  • Modality: bright-field (light microscopy)
  • Staining: hematoxylin and eosin (H&E)
  • Initial resolution: scanned at 1024×768 px; final standard crop is 768×768 px
  • Color channels: standard RGB ($C = 3$)
  • File format: JPEG (typical quality ≈ 90%); some studies report lossless PNG/TIFF as alternatives (Ezuma et al., 3 Jan 2026)
  • No explicit magnification level specified in the dataset release

All images are stripped of metadata; no patient identifiers remain. Final tensor form: $\mathbf{I} \in \mathbb{R}^{768 \times 768 \times 3}$ (Borkowski et al., 2019).

3. Data Preprocessing and Augmentation Protocols

LC25000 was constructed from 1,250 raw images (250 per class, i.e., 750 lung and 500 colon images), expanded through stochastic augmentation as follows:

  • Center-crop to 768×768 px
  • Random rotation: angle $\theta \sim \mathcal{U}(-25^\circ, +25^\circ)$ with probability 1.0
  • Horizontal and vertical flips: $p = 0.5$ each
  • No stain normalization or color-space augmentation beyond rotations/flips (Borkowski et al., 2019, Mangal et al., 2020)
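The protocol listed above can be illustrated by sampling the augmentation parameters themselves, without touching any pixels. This is a sketch under stated assumptions, not the authors' pipeline: the function names are hypothetical, and the crop is assumed to be exactly centered on the 1024×768 scan.

```python
import random


def center_crop_box(width, height, size=768):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    if width < size or height < size:
        raise ValueError("image is smaller than the crop size")
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)


def sample_augmentation(rng):
    """Draw one set of parameters following the listed protocol."""
    return {
        # Rotation is applied with probability 1.0, angle uniform in [-25, 25].
        "angle_deg": rng.uniform(-25.0, 25.0),
        # Each flip is applied independently with probability 0.5.
        "hflip": rng.random() < 0.5,
        "vflip": rng.random() < 0.5,
    }


rng = random.Random(0)  # fixed seed so the draws are reproducible
print(center_crop_box(1024, 768))  # (128, 0, 896, 768)
draws = [sample_augmentation(rng) for _ in range(5)]
```

Each dictionary in `draws` describes one augmented variant; an image library of choice would then apply the crop, rotation, and flips with these values.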

Downstream experiments frequently adapt images for model input by resizing to 224×224 px (Garg et al., 2021, Guo et al., 2022, Saremi et al., 2 Sep 2025, Adekunle et al., 18 Oct 2025) or to 299×299 px for Inception architectures (Ezuma et al., 3 Jan 2026). Additional preprocessing steps include pixel-value scaling to $[0, 1]$ and ImageNet-style normalization.
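A minimal NumPy sketch of this input preparation follows. It assumes an 8-bit RGB array as input and uses the widely used ImageNet channel means and standard deviations; the nearest-neighbor resize is a dependency-free stand-in chosen for brevity, whereas the cited studies may well use bilinear or bicubic interpolation. The function names are illustrative.

```python
import numpy as np

# Commonly used ImageNet channel statistics (RGB order, images scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def resize_nearest(image, size):
    """Nearest-neighbor resize of an H x W x C array to size x size x C."""
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size  # source row for each output row
    cols = np.arange(size) * width // size   # source column for each output column
    return image[rows[:, None], cols]


def to_model_input(image_uint8, size=224):
    """Resize, scale to [0, 1], then apply per-channel normalization."""
    small = resize_nearest(image_uint8, size)
    scaled = small.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD
```

Passing `size=299` gives the input shape reported for Inception-style architectures.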

More elaborate training-time augmentation pipelines combine geometric transforms (rotation, flipping, scaling, shear) with photometric transforms (contrast, brightness, color jitter, Gaussian blur/noise) (Garg et al., 2021, Saremi et al., 2 Sep 2025). Some studies forgo augmentation entirely, processing images “as is” (Adekunle et al., 18 Oct 2025).

4. Annotation, Validation, and Ground Truth Integrity

Images in LC25000 are annotated per-class by board-certified pathologists. The dataset documentation states “validation” by experts, but does not publish inter-observer reliability statistics such as Cohen’s $\kappa$ (Borkowski et al., 2019). No region-of-interest masks, bounding boxes, or patient-level metadata are included; only categorical, image-level class labels are provided, and studies use them uniformly (Ezuma et al., 3 Jan 2026, Adekunle et al., 18 Oct 2025, Chaddad et al., 2 Jul 2025).

5. Data Partitioning for Machine Learning Experiments

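The official LC25000 data release does not prescribe explicit training/validation/test splits. Consequently, studies have adopted a variety of stratified partitioning schemes, including 60/20/20 (Ezuma et al., 3 Jan 2026), 68/17/15 (Adekunle et al., 18 Oct 2025), and 80/20 with a further validation split (Garg et al., 2021).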

All partitioning approaches preserve strict class balance.
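One way to obtain such a class-balanced partition is to shuffle and slice the indices of each class separately. The sketch below uses only the standard library; the function name, default fractions, and seed are illustrative choices rather than a protocol taken from any of the cited studies.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/val/test lists, class by class."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

With 5,000 images per class and the default fractions, each class contributes 3,000 training, 1,000 validation, and 1,000 test indices. Because the released images are augmented copies of a smaller raw set, a purely random split like this one cannot guarantee that variants of the same source image stay within a single partition.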

6. Applications and Model Benchmarking

The LC25000 dataset underpins a large body of deep learning research for cancer histopathology. Notable performance benchmarks are listed in the following table:

| Paper (arXiv) | Model/Strategy | Classification Accuracy (%) | Test Protocol |
|---|---|---|---|
| (Ezuma et al., 3 Jan 2026) | InceptionResNet-v2 | 96.01 | 5-class, 60/20/20 split |
| (Ezuma et al., 3 Jan 2026) | Deep features + NN | 99.84 | HOG + deep feature fusion |
| (Saremi et al., 2 Sep 2025) | HG-TNet (hybrid) | 96.0 | 5-class, aggressive augmentation |
| (Chaddad et al., 2 Jul 2025) | ResNet+ (subset) | 98.14 | 3-class, train/val/test split |
| (Adekunle et al., 18 Oct 2025) | ResNet-50 (lung) | 98.8 | 3-class, 68/17/15 split |
| (Guo et al., 2022) | Vision Transformer | 100.0 | 3-class, 5-shot finetuning |
| (Garg et al., 2021) | 8 CNNs | 96–100 | 5-class, 80/20 + val split |

Evaluation metrics include precision, recall, F1-score (macro/weighted), AUROC, and confusion matrices. Multi-class ROC AUCs consistently exceed 0.97 per class (Saremi et al., 2 Sep 2025, Ezuma et al., 3 Jan 2026, Guo et al., 2022).
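Several of these metrics can be derived from a confusion matrix alone. The following sketch computes accuracy, per-class precision, recall, and F1, and the macro-averaged F1; it assumes the convention that rows index the true class and columns the predicted class, and the function names are illustrative.

```python
def per_class_metrics(confusion):
    """Precision, recall, and F1 for each class.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                          # rest of row k
        fp = sum(confusion[i][k] for i in range(n)) - tp     # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_metrics(confusion)
    return sum(m["f1"] for m in scores) / len(scores)


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total
```

Because every LC25000 class has the same number of images, macro and weighted F1 coincide whenever the evaluation split is itself class-balanced.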

7. Accessibility, Licensing, and Usage Restrictions

LC25000 is distributed as a ~1.85 GB archive via multiple public repositories.

The dataset is freely available to AI researchers for non-clinical research and educational use, with no explicit open-source license file. Attribution to Borkowski et al. (Borkowski et al., 2019) is requested in publications. All images are de-identified and HIPAA-compliant; no patient re-identification is possible under U.S. regulatory standards.


Research employing LC25000 typically capitalizes on its uniform class balance, expert annotation, and rich augmentation protocols to benchmark classifier performance, develop domain-adapted pipelines, and advance computational pathology methodologies. The dataset remains a central resource for validating multi-class, organ-specific cancer detection algorithms and for methodological experimentation in digital histopathology.