LC25000 Histopathology Dataset

Updated 10 January 2026

LC25000 is a public repository of 25,000 de-identified, RGB H&E-stained histopathology images evenly split into five classes for cancer research.
It features standardized preprocessing and augmentation protocols, including center cropping, rotations, and flips, to enhance ML model robustness.
Researchers leverage LC25000 to benchmark deep learning models, achieving high classification accuracies and advancing computational pathology.

The LC25000 dataset is a public repository comprising 25,000 de-identified color histopathology images, curated for machine learning and deep learning research in cancer pathology. Released by Borkowski et al. (2019), the dataset consists of five balanced classes: colon adenocarcinoma, benign colon tissue, lung adenocarcinoma, lung squamous cell carcinoma, and benign lung tissue. All images are RGB, H&E-stained patches (768×768 pixels) that have undergone expert annotation and HIPAA-compliant de-identification, making LC25000 a prominent benchmark for automated histopathological image analysis.

1. Dataset Composition and Class Semantics

LC25000 contains 25,000 JPEG histopathology images, evenly distributed across five classes ($N_{c_i} = 5{,}000$, $i = 1, \dots, 5$):

Class Index	Histologic Entity	Folder Name	N (images)
c₁	Colon adenocarcinoma	colon_aca	5,000
c₂	Benign colon tissue	colon_n	5,000
c₃	Lung adenocarcinoma	lung_aca	5,000
c₄	Lung squamous cell carcinoma	lung_scc	5,000
c₅	Benign lung tissue	lung_n	5,000

Images are stored in a hierarchical directory (lung_colon_image_set/) with organ-specific subfolders. Every class has exactly 20% representation, such that $p(c_i) = 0.20$ for all $i$ (Borkowski et al., 2019, Ezuma et al., 3 Jan 2026, Saremi et al., 2 Sep 2025, Garg et al., 2021).
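The class balance can be checked directly after extraction. The following is a minimal sketch, assuming the archive has been unpacked locally and that each class folder name from the table appears somewhere below the chosen root; the function names, the `.jpeg`/`.jpg` extension filter, and the root path are illustrative assumptions, not part of the dataset release.

```python
from pathlib import Path

# Class folder names as listed in the table above.
CLASS_FOLDERS = ("colon_aca", "colon_n", "lung_aca", "lung_scc", "lung_n")

# Assumption: image files end in .jpeg or .jpg (case-insensitive).
IMAGE_SUFFIXES = {".jpeg", ".jpg"}


def count_images_per_class(root):
    """Count image files whose parent directory is one of the class folders."""
    counts = {name: 0 for name in CLASS_FOLDERS}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            if path.parent.name in counts:
                counts[path.parent.name] += 1
    return counts


def class_priors(counts):
    """Convert raw counts into empirical class probabilities p(c_i)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}
```

On a complete copy of the dataset, `class_priors(count_images_per_class("lung_colon_image_set"))` would be expected to return 0.20 for every class; any other value suggests an incomplete download or a different folder layout.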

2. Image Acquisition, File Format, and Metadata

Original images were scanned from de-identified, HIPAA-compliant pathology glass slides. The acquisition protocol includes:

Modality: bright-field (light microscopy)
Staining: hematoxylin and eosin (H&E)
Initial resolution: scanned at 1024×768 px; final standard crop is 768×768 px
Color channels: standard RGB ($C = 3$)
File format: JPEG (typical quality ≈ 90%); some studies report lossless PNG/TIFF as alternatives (Ezuma et al., 3 Jan 2026)
No explicit magnification level specified in the dataset release

All images are stripped of metadata; no patient identifiers remain. Final tensor form: $\mathbf{I} \in \mathbb{R}^{768 \times 768 \times 3}$ (Borkowski et al., 2019).

3. Data Preprocessing and Augmentation Protocols

LC25000 was constructed from 1,250 raw images (250 per class: 750 lung and 500 colon), expanded twentyfold to 5,000 images per class through stochastic augmentation as follows:

Center-crop to 768×768 px
Random rotation: angle $\theta \sim \mathcal{U}(-25^\circ, +25^\circ)$ with probability 1.0
Horizontal and vertical flips: $p=0.5$ each
No stain normalization or color-space augmentation beyond rotations/flips (Borkowski et al., 2019, Mangal et al., 2020)
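The geometric parameters listed above can be sketched in a few lines. This is an illustrative sketch only: it computes the center-crop box for a 1024×768 scan and samples one set of rotation and flip parameters, without touching pixel data. The function names and the fixed seed are assumptions made for the example, and the sketch does not claim to reproduce the tooling used to build the dataset.

```python
import random


def center_crop_box(width, height, size=768):
    """Return (left, top, right, bottom) for a centered size x size crop."""
    if size > width or size > height:
        raise ValueError("crop size exceeds image dimensions")
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


def sample_augmentation(rng):
    """Draw one set of augmentation parameters following the listed protocol."""
    return {
        # Rotation is always applied (probability 1.0); angle in [-25, 25] degrees.
        "angle_deg": rng.uniform(-25.0, 25.0),
        # Each flip is applied independently with probability 0.5.
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
    }


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(center_crop_box(1024, 768))  # (128, 0, 896, 768)
print(sample_augmentation(rng))
```

The crop removes 128 pixels from each side of the 1024-pixel width and leaves the 768-pixel height untouched.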

Downstream experiments frequently adapt images for model input, resizing them to 224×224 px (Garg et al., 2021, Guo et al., 2022, Saremi et al., 2 Sep 2025, Adekunle et al., 18 Oct 2025) or to 299×299 px for Inception architectures (Ezuma et al., 3 Jan 2026). Additional preprocessing steps include pixel-value scaling to $[0,1]$ and ImageNet-style normalization.
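The scaling and normalization steps amount to simple per-channel arithmetic. The sketch below applies them to a single RGB pixel for clarity, assuming the widely used ImageNet channel statistics; the cited studies may use different constants, and a real pipeline would apply the same arithmetic to whole image arrays. The function names are illustrative.

```python
# Widely used ImageNet channel statistics (RGB order) for inputs in [0, 1].
# Assumption: the cited studies may use other values.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def scale_to_unit(rgb):
    """Map 8-bit channel values (0..255) to floats in [0, 1]."""
    return tuple(v / 255.0 for v in rgb)


def imagenet_normalize(rgb_unit):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple(
        (v - mean) / std
        for v, mean, std in zip(rgb_unit, IMAGENET_MEAN, IMAGENET_STD)
    )


pixel = (255, 128, 0)  # made-up example pixel
print(imagenet_normalize(scale_to_unit(pixel)))
```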

More complex augmentation pipelines combine geometric transforms (rotation, flipping, scaling, shear) with color/photometric transforms (contrast, brightness, color jitter, Gaussian blur/noise) during training (Garg et al., 2021, Saremi et al., 2 Sep 2025). Some studies eschew augmentation entirely, processing images “as is” (Adekunle et al., 18 Oct 2025).

4. Annotation, Validation, and Ground Truth Integrity

Images in LC25000 are annotated per class by board-certified pathologists. The dataset documentation states that the images were “validated” by experts, but does not publish inter-observer reliability statistics such as Cohen’s $\kappa$ (Borkowski et al., 2019). No region-of-interest masks, bounding boxes, or patient-level metadata are included; only categorical, image-level class labels are provided, and studies use them as given (Ezuma et al., 3 Jan 2026, Adekunle et al., 18 Oct 2025, Chaddad et al., 2 Jul 2025).

5. Data Partitioning for Machine Learning Experiments

The official LC25000 data release does not prescribe explicit training/validation/test splits. Consequently, studies have adopted a variety of stratified partitioning schemes:

Hold-out splits: e.g., 60% train, 20% val, 20% test (Ezuma et al., 3 Jan 2026, Guo et al., 2022)
Three-way splits (lung-only): e.g., 68% train, 17% val, 15% test (Adekunle et al., 18 Oct 2025)
Per-class splits (e.g., 75%/9%/15% train/val/test on random subsets) (Chaddad et al., 2 Jul 2025)
K-fold cross-validation or leave-one-scanner-out procedures (recommended for batch-effect or domain generalization studies) (Borkowski et al., 2019)

Because these schemes are stratified, each partition retains the class balance of the data it is drawn from.
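A stratified hold-out split of the kind listed above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the function name, the 60/20/20 default, and the seed are illustrative, and individual studies may construct their splits differently.

```python
import random


def stratified_split(items_by_class, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split each class independently so every partition keeps the class ratio.

    items_by_class maps a class label to a list of items (e.g. file paths).
    Returns a dict with 'train', 'val', and 'test' lists of (item, label) pairs.
    """
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values summing to 1")
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in sorted(items_by_class.items()):
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(fractions[0] * len(shuffled))
        n_val = round(fractions[1] * len(shuffled))
        splits["train"] += [(x, label) for x in shuffled[:n_train]]
        splits["val"] += [(x, label) for x in shuffled[n_train:n_train + n_val]]
        # Whatever remains after train and val goes to the test partition.
        splits["test"] += [(x, label) for x in shuffled[n_train + n_val:]]
    return splits
```

With 5,000 images per class, the default fractions yield 3,000/1,000/1,000 images per class. Since the release does not indicate which augmented images derive from the same source image, a random split like this one cannot guarantee that variants of one original stay within a single partition.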

6. Applications and Model Benchmarking

The LC25000 dataset underpins a large body of deep learning research for cancer histopathology:

CNN and transformer-based architectures (ResNet, InceptionResNet-v2, VGG, Xception, MobileNet, DenseNet169, Vision Transformer) (Garg et al., 2021, Guo et al., 2022, Adekunle et al., 18 Oct 2025, Ezuma et al., 3 Jan 2026)
Hybrid architectures combining convolutional networks with transformer modules, graph-attention, capsule networks (Saremi et al., 2 Sep 2025)
Feature engineering: integration of HOG descriptors with deep features, block-wise histogramming, and concatenation protocols (Ezuma et al., 3 Jan 2026)

Notable performance benchmarks are summarized in the following table:

Paper (arXiv)	Model/Strategy	Classification Accuracy (%)	Test Protocol
(Ezuma et al., 3 Jan 2026)	InceptionResNet-v2	96.01	5-class, 60/20/20 split
(Ezuma et al., 3 Jan 2026)	Deep features + NN	99.84	HOG + deep feature fusion
(Saremi et al., 2 Sep 2025)	HG-TNet (hybrid)	96.0	5-class, aggressive augmentation
(Chaddad et al., 2 Jul 2025)	ResNet+ (subset)	98.14	3-class, train/val/test split
(Adekunle et al., 18 Oct 2025)	ResNet-50 (lung)	98.8	3-class, 68/17/15 split
(Guo et al., 2022)	Vision Transformer	100.0	3-class, 5-shot finetuning
(Garg et al., 2021)	8 CNNs	96–100	5-class, 80/20 + val split

Evaluation metrics include precision, recall, F1-score (macro/weighted), AUROC, and confusion matrices. Multi-class ROC AUCs consistently exceed 0.97 per class (Saremi et al., 2 Sep 2025, Ezuma et al., 3 Jan 2026, Guo et al., 2022).
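Several of these metrics follow directly from a confusion matrix. The sketch below shows one way to derive accuracy, per-class precision, recall, and F1, and the macro-averaged F1; the function names are illustrative and the example matrix is made up, not a reported result.

```python
def per_class_metrics(confusion):
    """Compute precision, recall, and F1 per class from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true != k
        fn = sum(confusion[k]) - tp                       # true k, predicted != k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_metrics(confusion)
    return sum(m["f1"] for m in scores) / len(scores)


def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total


# Made-up 2-class matrix, for illustration only.
example = [[8, 2],
           [1, 9]]
print(accuracy(example))  # 0.85
print(macro_f1(example))
```

When every class has the same number of test samples, as in a stratified split of a balanced dataset, the macro and weighted F1 averages coincide.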

7. Accessibility, Licensing, and Usage Restrictions

LC25000 is distributed as a ~1.85 GB archive via multiple public repositories:

Original GitHub: https://github.com/tampapath/lung_colon_image_set/
Figshare mirror: https://figshare.com/articles/dataset/Lung_and_Colon_Cancer_Histopathological_Image_Dataset_LC25000_/11504065

The dataset is freely available to AI researchers for non-clinical research and educational use; the release includes no explicit open-source license file. Attribution to Borkowski et al. (2019) is requested in publications. All images are de-identified and HIPAA-compliant; no patient re-identification is possible under U.S. regulatory standards.

Research employing LC25000 typically capitalizes on its uniform class balance, expert annotation, and rich augmentation protocols to benchmark classifier performance, develop domain-adapted pipelines, and advance computational pathology methodologies. The dataset remains a central resource for validating multi-class, organ-specific cancer detection algorithms and for methodological experimentation in digital histopathology.