HEST-1k Spatial Omics Dataset

Updated 17 February 2026
  • HEST-1k is a comprehensive multimodal dataset that pairs spatial transcriptomic profiles with high-resolution histological images across 26 organs.
  • It provides over 2.1 million expression–morphology pairs and quantifies approximately 76 million nuclei using cutting-edge segmentation and registration pipelines.
  • The dataset is complemented by the HEST-Library, a dedicated Python package that streamlines data access, processing, and benchmarking for integrative spatial omics analyses.

HEST-1k is a large-scale, multi-organ resource that integrates spatial transcriptomic profiles with high-resolution histological images, enabling computational approaches to the joint analysis of gene expression and tissue morphology. Assembled from 153 public and internal cohorts and covering 26 organs from human and mouse, HEST-1k pairs 1,229 spatial transcriptomics datasets with corresponding H&E-stained whole-slide images (WSIs) and systematically curated, standardized metadata. The dataset supports multimodal investigation at unprecedented scale, incorporating 2.1 million localized expression–morphology pairs and quantifying more than 76 million nuclei with detailed subclass annotation. HEST-1k is distributed with HEST-Library, a dedicated Python package for data access, tissue segmentation, registration, patch extraction, and nuclei quantification, implemented using state-of-the-art pipelines (Jaume et al., 2024).

1. Dataset Composition and Metadata Structure

HEST-1k consists of 1,229 paired spatial transcriptomic datasets and WSIs derived from 26 distinct organs, encompassing human and mouse samples and a wide spectrum of tissue states (healthy, tumor, non-cancer pathological, post-compound treated, and genetically modified/knock-out). It includes 367 cancer samples spanning 25 OncoTree-defined subtypes, with major representation from IDC, PRAD, PAAD, SKCM, COAD, READ, ccRCC, LUAD, and additional cancer types. Metadata is organized into three principal blocks: (1) generic sample descriptors (including sample_id, cohort, species, organ, OncoTree code, disease category, DOI, download link, year, and license), (2) spatial transcriptomics parameters (technology, number of genes/spots, spot metrics, total reads, gene panels), and (3) histology-related fields (file format, magnification, pixel size, tissue preparation, spatial coordinates in both pixel and tissue domains, and tissue region annotations via GeoJSON). Tissue sample provenance spans public repositories (10x Genomics, NCBI GEO, Mendeley, Zenodo, GitHub) and internal sources, with all WSIs unified to pyramidal TIFF format.
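The three metadata blocks can be pictured as one record per sample. The sketch below is illustrative only: it uses plain Python rather than HEST-Library, the field names are a hypothetical subset of those listed above, and the three records hold made-up values, not entries from the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleMeta:
    # Block 1: generic sample descriptors
    sample_id: str
    species: str
    organ: str
    oncotree_code: Optional[str]  # None for non-cancer samples
    # Block 2: spatial transcriptomics parameters
    technology: str
    n_spots: int
    # Block 3: histology-related fields
    pixel_size_um: float


def select(samples, **criteria):
    """Return the samples whose fields equal every given criterion."""
    return [s for s in samples
            if all(getattr(s, key) == value for key, value in criteria.items())]


# Toy records with made-up values, for illustration only.
samples = [
    SampleMeta("TOY-001", "Homo sapiens", "Breast", "IDC", "Visium", 4000, 0.5),
    SampleMeta("TOY-002", "Homo sapiens", "Kidney", "CCRCC", "Xenium", 9000, 0.25),
    SampleMeta("TOY-003", "Mus musculus", "Brain", None, "Visium", 2500, 0.5),
]

human_idc = select(samples, species="Homo sapiens", oncotree_code="IDC")
visium_ids = [s.sample_id for s in select(samples, technology="Visium")]
```

Filtering on exact field equality is the simplest choice here; a real metadata table would more likely be queried through a dataframe.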

2. Spatial Registration and Multimodal Pair Generation

All spatial transcriptomic samples are systematically registered to their corresponding histology images, enabling pixel-level spatial correlation. Visium datasets utilize automated fiducial detection (YOLOv8-based) and four-corner primitive matching for affine alignment between spot coordinates (μm) and WSI pixel space, governed by

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \mathbf{A} \begin{pmatrix} x_i \\ y_i \end{pmatrix} + \mathbf{b}$$

where $(x_i, y_i)$ and $(u_i, v_i)$ are respectively the spot center in tissue μm and WSI pixel units, $\mathbf{A}$ encodes scaling/rotation, and $\mathbf{b}$ translation. For Xenium, VALIS registers DAPI transcript images onto the WSI. At each registered spot, a 224×224 px (20× equivalent) patch is extracted, yielding 2.1 million rigorously linked expression–morphology pairs. Xenium samples are tiled into 55×55 μm non-overlapping “pseudo-Visium” bins. Each pair consists of an H&E patch and an AnnData transcript-count vector, facilitating patch-level transcriptomic prediction and correlation analyses.
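As a minimal sketch of the affine model above, the snippet below estimates the matrix A and the vector b by least squares from four corner correspondences, maps a spot center into pixel space, and computes the top-left corner of a 224×224 px patch centered on it. The corner coordinates, the helper names (`fit_affine`, `apply_affine`, `patch_top_left`), and the assumption that the image is already at the target resolution are illustrative; this is not the HEST-Library implementation.

```python
import numpy as np


def fit_affine(src_um, dst_px):
    """Least-squares fit of dst = A @ src + b from N >= 3 point pairs."""
    src = np.asarray(src_um, dtype=float)
    dst = np.asarray(dst_px, dtype=float)
    # Design matrix with rows [x, y, 1]; the solution is a 3x2 parameter matrix.
    X = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)
    A = params[:2].T  # 2x2 scaling/rotation part
    b = params[2]     # translation, length 2
    return A, b


def apply_affine(A, b, pts_um):
    """Map (x, y) points in micrometres to (u, v) pixel coordinates."""
    return np.asarray(pts_um, dtype=float) @ A.T + b


def patch_top_left(center_px, size=224):
    """Top-left pixel of a size x size patch centered on center_px."""
    return np.round(np.asarray(center_px) - size / 2).astype(int)


# Hypothetical fiducial-frame corners: tissue micrometres -> WSI pixels.
corners_um = [(0, 0), (6500, 0), (6500, 6500), (0, 6500)]
corners_px = [(100, 200), (13100, 200), (13100, 13200), (100, 13200)]

A, b = fit_affine(corners_um, corners_px)
spot_px = apply_affine(A, b, [(100, 50)])[0]
top_left = patch_top_left(spot_px)
```

With four exact correspondences the system is overdetermined but consistent, so the least-squares solution reproduces the underlying scale (2 px/μm here) and offset.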

3. Nuclei Segmentation, Classification, and Quantification

HEST-1k incorporates nuclear instance segmentation and classification using the CellViT model, a Vision Transformer-based segmentation network pretrained on PanNuke, with postprocessing for small-object removal and mask refinement. Five histological nuclear classes are annotated per instance: neoplastic epithelial, non-neoplastic epithelial, inflammatory, stromal, and necrotic. Across 1,229 slides, 76.4 million nuclei are quantified (mean ≈62.1k nuclei/slide). Approximate distribution: 17.6 million neoplastic, 21.5 million stromal, 4.9 million non-neoplastic epithelial, 15.4 million inflammatory, and 0.08 million necrotic nuclei. This high-resolution structural annotation supports detailed morphometric and cell-compositional analyses at tissue, spot, or patch level.
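A patch-level cell-composition analysis can be sketched as counting classified nuclei whose centroids fall inside a patch. The example below assumes nuclei are available as centroid coordinates plus integer class indices; that layout, the class ordering, and the function name `composition_in_patch` are assumptions made for illustration, not the format released with the dataset.

```python
import numpy as np

# Class order is an assumption for this sketch.
CLASSES = (
    "neoplastic epithelial",
    "non-neoplastic epithelial",
    "inflammatory",
    "stromal",
    "necrotic",
)


def composition_in_patch(centroids_px, class_ids, top_left, size=224):
    """Count nuclei per class whose centroid lies inside a square patch."""
    centroids = np.asarray(centroids_px, dtype=float)
    class_ids = np.asarray(class_ids, dtype=int)
    x0, y0 = top_left
    inside = (
        (centroids[:, 0] >= x0) & (centroids[:, 0] < x0 + size)
        & (centroids[:, 1] >= y0) & (centroids[:, 1] < y0 + size)
    )
    counts = np.bincount(class_ids[inside], minlength=len(CLASSES))
    return dict(zip(CLASSES, counts.tolist()))


# Toy nuclei: three fall inside the patch at (0, 0), two fall outside.
centroids = [(10, 10), (100, 50), (223, 223), (224, 10), (300, 300)]
labels = [0, 3, 2, 0, 1]
composition = composition_in_patch(centroids, labels, top_left=(0, 0))
```

The half-open interval (inclusive lower edge, exclusive upper edge) ensures a nucleus on a shared border is counted in exactly one of two adjacent patches.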

4. Processing Pipeline and HEST-Library

HEST-1k is the product of a multi-stage computational pipeline:

  1. Data harvest and WSI standardization (OpenSlide conversion).
  2. Automatic tissue and artifact masking (DeepLabV3+ ResNet50).
  3. Spatial transcriptomics data conversion (AnnData/Scanpy).
  4. Spot-to-image registration (YOLOv8-based fiducial detection, VALIS alignment).
  5. Patch extraction around each spot.
  6. Instance-level nuclear segmentation and classification (CellViT).
  7. Dataset release with patch-expression pairs, nuclei masks, and full per-sample metadata.
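For Xenium samples, the conversion step also involves aggregating individual transcripts into the 55×55 μm pseudo-Visium bins described in Section 2. The following is a minimal sketch of such binning, assuming transcripts arrive as non-negative (x, y) positions in micrometres with integer gene indices; the function name `bin_transcripts` and the input layout are hypothetical.

```python
import numpy as np


def bin_transcripts(xy_um, gene_ids, n_genes, bin_um=55.0):
    """Aggregate transcripts into a (rows, cols, genes) count array.

    Assumes non-negative coordinates measured from the grid origin.
    """
    xy = np.asarray(xy_um, dtype=float)
    genes = np.asarray(gene_ids, dtype=int)
    # Integer (col, row) index of the bin containing each transcript.
    ij = np.floor(xy / bin_um).astype(int)
    n_cols, n_rows = ij.max(axis=0) + 1
    counts = np.zeros((n_rows, n_cols, n_genes), dtype=int)
    # np.add.at accumulates correctly when the same bin/gene index repeats.
    np.add.at(counts, (ij[:, 1], ij[:, 0], genes), 1)
    return counts


# Toy transcripts: three land in bin (row 0, col 0), one in each neighbor.
xy = [(10, 10), (20, 30), (60, 10), (10, 60), (54.9, 54.9)]
gene_ids = [0, 1, 0, 1, 0]
counts = bin_transcripts(xy, gene_ids, n_genes=2)
```

Each bin's count vector then plays the same role as a Visium spot's expression profile.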

HEST-Library, a Python package built upon Scanpy/AnnData and OpenSlide, exposes modules for metadata querying (hest.download), WSI and expression loading (hest.io), spot alignment, tissue/patch extraction and segmentation (hest.segmentation/hest.nuclei), and batch effect analysis (hest.batch). The toolkit supports both interactive (Python) and command-line use, including patch-level visualization, gene-based filtering, and direct integration with numpy, pandas, and PyTorch workflows. Example usage demonstrates metadata filtering, WSI loading, gene-count extraction, heatmap visualization, and segmentation-based sample stratification.

5. Foundational Benchmarks and Use Cases

HEST-1k underpins three primary computational use cases:

  • Expression Prediction from Histology (HEST-Benchmark): Nine multivariate regression tasks (top 50 highly variable genes) are defined for major cancer subtypes. Patient-stratified cross-validation is employed. Foundation models (11 encoders, including H-Optimus-0, UNIv1.5, Virchow2; ViT architectures) are benchmarked with regression heads (PCA+Ridge, Ridge, XGBoost). Best performance (PCA+Ridge, H-Optimus-0): mean $r = 0.4146$. A logarithmic scaling law is observed between model size and Pearson correlation score ($R = 0.81$, $p < 0.01$).
  • Biomarker Exploration: Example analysis overlays GATA3 expression on extracted neoplastic nuclei (IDC type), linking nuclear area (and other morphometrics) with gene counts (e.g., GATA3, FLNB, TPD52, FOXA1; $0.45 < r < 0.47$, $p < 10^{-4}$), facilitating spatially resolved in situ biomarker mapping.
  • Multimodal Representation Learning: Patch-level image-expression pairs are used to contrastively fine-tune vision (CONCH, ViT-B) and expression encoders (MLP), optimizing InfoNCE loss and supporting downstream tasks such as ER/PR/HER2 status prediction (e.g., for ER: AUC=0.884, balanced accuracy=0.752).
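The PCA+Ridge regression head and the Pearson-based score used in the expression-prediction benchmark can be illustrated with a small NumPy sketch. Everything below is an assumption made for illustration: the data are synthetic stand-ins for patch embeddings and gene expression, the number of components and the ridge penalty are arbitrary, and a single train/test split replaces the patient-stratified cross-validation described above.

```python
import numpy as np


def fit_pca_ridge(X_train, Y_train, n_components, alpha=1.0):
    """Project features onto top principal components, then fit ridge."""
    mu_x = X_train.mean(axis=0)
    mu_y = Y_train.mean(axis=0)
    Xc = X_train - mu_x
    # Rows of Vt are the principal directions of the centered features.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T  # (n_features, n_components)
    Z = Xc @ P
    # Closed-form ridge solution on the reduced features.
    W = np.linalg.solve(Z.T @ Z + alpha * np.eye(n_components),
                        Z.T @ (Y_train - mu_y))
    return mu_x, P, W, mu_y


def predict(model, X):
    mu_x, P, W, mu_y = model
    return (X - mu_x) @ P @ W + mu_y


def mean_pearson(Y_true, Y_pred):
    """Pearson correlation per gene (column), averaged over genes."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    r = (yt * yp).sum(axis=0) / np.sqrt((yt ** 2).sum(axis=0)
                                        * (yp ** 2).sum(axis=0))
    return float(r.mean())


# Synthetic data: 32-d "embeddings" driven by 8 latent factors, 5 "genes".
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))
X = latent @ rng.normal(size=(8, 32)) + 0.05 * rng.normal(size=(200, 32))
Y = latent @ rng.normal(size=(8, 5)) + 0.1 * rng.normal(size=(200, 5))

model = fit_pca_ridge(X[:150], Y[:150], n_components=8)
score = mean_pearson(Y[150:], predict(model, X[150:]))
```

Reducing dimensionality before the ridge fit keeps the regression well conditioned when embeddings are much wider than the number of training patches, which is one plausible reason for pairing PCA with Ridge.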

6. Access, Licensing, and Data Governance

HEST-1k and its supporting tools (HEST-Library, benchmark suite) are available via GitHub (https://github.com/mahmoodlab/hest); full datasets (>1 TB) are hosted on HuggingFace with comprehensive installation documentation. All materials are licensed under CC BY-NC-SA 4.0, restricting usage to non-commercial research, requiring attribution, and enforcing share-alike terms. No deployment in clinical or diagnostic applications is permitted under this license.

7. Research Context and Significance

HEST-1k represents a convergent resource for computational spatial biology, addressing prior limitations in the scale, heterogeneity, and multimodal depth of public spatial transcriptomics data. It enables benchmarking of histopathology foundation models, robust spatial biomarker discovery, and cross-modal machine learning tasks that link molecular and morphological phenotypes in health, disease, and intervention settings. By providing standardized processing and cross-sample metadata schemas, HEST-1k facilitates reproducibility for comparative studies across organs, sample preparation protocols, and spatial transcriptomics technologies. The scale and rigor of the nuclear quantification as well as spatially and transcriptionally resolved spot-patch pairs position HEST-1k as a foundation for future advances in expression–morphology modeling, spatial omics benchmarking, and integrative multi-scale analyses (Jaume et al., 2024).
