
Montipora capitata Thermal Stress Dataset

Updated 7 January 2026
  • The dataset is a multi-omics resource capturing coral responses to controlled heat stress, addressing P ≫ N challenges with 90,579 features and 13 complete samples.
  • Data from transcriptomics, proteomics, metabolomics, and microbiome layers are processed using neural-network intrinsic gradient-saliency scoring and biological weighting to reduce features to 1,300.
  • Benchmarking using LOOCV and robust metrics like AUROC validates the dataset's utility in testing machine learning models tailored for small-sample, high-dimensional biological data.

The Montipora capitata Thermal Stress Dataset is a multi-omics resource originating from a controlled experiment on the reef-building coral Montipora capitata, designed to support research into the molecular response to marine heatwaves. Featuring four omics modalities collected across multiple laboratories, the dataset exemplifies extreme $P \gg N$ conditions (90,579 features, 13 complete samples). It has served as a central benchmarking resource for novel machine learning frameworks in small-sample, high-dimensional settings, particularly for domain-aware feature selection and federated learning approaches in biological data (Victor, 31 Dec 2025).

1. Experimental Design and Treatment Regimes

Fragments from four adult Montipora capitata colonies were subjected to two temperature conditions: ambient control (≈26 °C) and heat-stress (ramped to ≈32 °C) to simulate a bleaching-inducing marine heatwave. At three matched timepoints (pre-stress, mid-stress, post-stress), biological replicates (2–3 per colony per condition) were sampled, resulting in four parallel omics cohorts. Each sample was immediately flash-frozen post-collection and allocated to one of four specialist laboratories for omics profiling. The supervised learning task is a binary classification: specimens are labeled “stressed” or “control” by treatment assignment.

2. Sample Cohorts and Negative Controls

The dataset's final multi-omics intersection comprises 13 Montipora capitata fragments with non-missing data across all four omics layers. The full original collection included:

| Omics Layer | Initial Cohort Size | Platform |
|---|---|---|
| Transcriptomics | 21 | RNA-seq |
| Proteomics | 14 | LC-MS/MS |
| Metabolomics | 12 | GC-MS |
| Microbiome | 12 | 16S rRNA sequencing |

A simple set intersection yielded 13 samples with complete data. Negative controls consisted of ambient-temperature fragments harvested at each matching timepoint, serving as controls for heat-stress and bleaching.
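The intersection step can be sketched as follows. This is a minimal illustration only: the sample IDs and the `complete_samples` helper are hypothetical and do not reproduce the real cohort.

```python
from functools import reduce

# Hypothetical sample IDs per omics layer (illustrative, not the real IDs).
layer_samples = {
    "transcriptomics": {"S01", "S02", "S03", "S04", "S05"},
    "proteomics": {"S01", "S02", "S04", "S05"},
    "metabolomics": {"S02", "S03", "S04", "S05"},
    "microbiome": {"S01", "S02", "S04", "S05", "S06"},
}


def complete_samples(layers):
    """Return the sorted sample IDs that appear in every omics layer."""
    return sorted(reduce(set.intersection, layers.values()))


matched = complete_samples(layer_samples)
print(matched)
```

Only samples profiled in every layer survive, so the matched cohort can never be larger than the smallest single-layer cohort.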

3. Data Modalities, Feature Space, and Dimensionality

The raw dataset spans $P = 90{,}579$ features distributed across four omics types:

  • Transcriptomics: 62,038 genes (expression counts via RNA-seq)
  • Proteomics: 4,054 protein abundances (LC-MS/MS)
  • Metabolomics: 12,055 metabolites (GC-MS)
  • Microbiome: 12,432 operational taxonomic units (OTUs, post–0.1% abundance filter)

After preprocessing and domain-aware feature selection, the feature set is reduced by ≈98.6% to 1,300 features. Complete dimensionality details are outlined below.

| Modality | Raw Features | Post-selection Features |
|---|---|---|
| Transcriptomics | 62,038 | ≤ 500 |
| Proteomics | 4,054 | ≤ 500 |
| Metabolomics | 12,055 | ≤ 500 |
| Microbiome | 12,432 | ≤ 500 |
| Combined | 90,579 | 1,300 |

This reduction scheme addresses the statistical challenges of the $P \gg N$ scenario.
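The quoted totals can be checked with a few lines of arithmetic. The per-modality counts come from the text above; the variable names are illustrative.

```python
# Raw feature counts per modality, as reported in this section.
raw_features = {
    "transcriptomics": 62_038,
    "proteomics": 4_054,
    "metabolomics": 12_055,
    "microbiome": 12_432,
}
selected_total = 1_300  # combined post-selection feature count

raw_total = sum(raw_features.values())
reduction = 1 - selected_total / raw_total

print(raw_total)            # total raw features P
print(f"{reduction:.1%}")   # fraction of features removed
```

The sum reproduces the 90,579 raw features, and the removed fraction rounds to the stated ≈98.6%.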

4. Preprocessing and Quality Control

Data were exported from the original spreadsheets (Data S1–S5) to TSV format via scripted conversions, retaining full numeric precision. Feature name harmonization was performed to remove special characters while preserving biological identifiers. Sample alignment across modalities relied on intersection of sample IDs, enforcing $N = 13$ matched specimens. Microbiome tables were filtered to exclude ASVs below a 0.1% abundance threshold, reducing features from 27,807 to 12,432. Missing values in the proteomics layer (131 values) were imputed as zero, and normalization was performed per-omics using a StandardScaler (zero mean, unit variance), fitted exclusively on the training fold in leave-one-out cross-validation (LOOCV). Notably, no log-transform or additional variance filtering was applied before feature selection.
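A minimal sketch of the leak-free scaling described above, using only the Python standard library in place of scikit-learn's StandardScaler. The toy matrix, the function names, and the single held-out sample per fold are assumptions for illustration; the separate validation sample used in the actual evaluation is omitted here.

```python
from statistics import mean, pstdev


def impute_zero(rows):
    """Replace missing entries (None) with 0.0, mirroring zero imputation."""
    return [[0.0 if v is None else v for v in row] for row in rows]


def fit_scaler(rows):
    """Column-wise mean and population std (ddof=0), as StandardScaler uses."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # constant column: divide by 1
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def loocv_scaled_folds(X):
    """Yield (scaled_train, scaled_held_out) with the scaler fit on train only."""
    for held_out in range(len(X)):
        train = [row for i, row in enumerate(X) if i != held_out]
        means, stds = fit_scaler(train)
        yield transform(train, means, stds), transform([X[held_out]], means, stds)[0]


# Toy data: 3 samples, 2 features, one missing value (hypothetical numbers).
X = impute_zero([[1.0, None], [3.0, 4.0], [5.0, 8.0]])
folds = list(loocv_scaled_folds(X))
first_train, first_held_out = folds[0]
print(first_held_out)
```

Because the mean and standard deviation are recomputed inside each fold, the held-out sample never influences the statistics used to scale it.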

5. Domain-aware Feature Selection

To address the $P \gg N$ regime, Victor (2025) employed a neural-network–intrinsic gradient-saliency feature selection procedure independently on each omics layer:

  1. Gradient-saliency scoring: For each omics layer $k$, a shallow encoder $\mathcal{E}_k$ (input → 128 → 64) was initialized. For every training sample $i$, the gradients of all embedding components with respect to the input features were computed, i.e., $\nabla_{\mathbf{x}} \mathcal{E}_k(\mathbf{x}^{(i)}) \in \mathbb{R}^{d_k \times 64}$, where $d_k$ is the number of input features in layer $k$. Per-feature raw saliency scores $I_j^{(k)}$ were obtained by averaging the absolute gradient magnitude across training samples:

$$I_{j}^{(k)} = \frac{1}{N_{\mathrm{train}}} \sum_{i=1}^{N_{\mathrm{train}}} \left| \nabla_{x_j} \mathcal{E}_k(\mathbf{x}^{(i)}) \right|$$

  2. Biological weighting: Layer-specific prior weights $w_k$ reflected mechanistic distance to transcriptional regulation:
    • Transcriptomics ($w_T = 1.5$)
    • Proteomics ($w_P = 1.0$)
    • Metabolomics ($w_M = 0.8$)
    • Microbiome ($w_\mu = 0.5$)

Weighted saliency: $\tilde{I}_j^{(k)} = w_k\,I_j^{(k)}$

  3. Top-K selection: Features within each layer were ranked by $\tilde{I}_j^{(k)}$, and the top $K$ per omics layer were retained, with $K = 500$ chosen to balance performance and model complexity. Aggregation across the four layers yielded 1,300 combined features.

This procedure produced robust top-feature sets across parameter settings.
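The three steps can be sketched on a toy scale as follows. This is not the author's implementation: the tiny layer sizes, the tanh hidden activation, the linear embedding layer, the random data, and the use of the Euclidean norm over embedding components as the "gradient magnitude" are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random


def make_encoder(d_in, d_hidden, d_out, rng):
    """Random weights for a toy input -> hidden -> embedding encoder."""
    w1 = [[rng.gauss(0, 1 / math.sqrt(d_in)) for _ in range(d_in)]
          for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 1 / math.sqrt(d_hidden)) for _ in range(d_hidden)]
          for _ in range(d_out)]
    return w1, w2


def encode(encoder, x):
    """Forward pass: embedding = W2 @ tanh(W1 @ x)."""
    w1, w2 = encoder
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def input_jacobian(encoder, x):
    """Analytic Jacobian d(embedding_m)/d(x_j), shape d_out x d_in."""
    w1, w2 = encoder
    pre = [sum(w * v for w, v in zip(row, x)) for row in w1]
    slope = [1 - math.tanh(z) ** 2 for z in pre]  # derivative of tanh
    return [[sum(row2[h] * slope[h] * w1[h][j] for h in range(len(w1)))
             for j in range(len(x))] for row2 in w2]


def saliency(encoder, samples):
    """Per-feature gradient norm over embedding components, averaged over samples."""
    d_in = len(samples[0])
    totals = [0.0] * d_in
    for x in samples:
        jac = input_jacobian(encoder, x)
        for j in range(d_in):
            totals[j] += math.sqrt(sum(row[j] ** 2 for row in jac))
    return [t / len(samples) for t in totals]


def top_k(scores, weight, k):
    """Indices of the k largest weighted scores, best first."""
    weighted = [weight * s for s in scores]
    return sorted(range(len(weighted)), key=lambda j: weighted[j], reverse=True)[:k]


rng = random.Random(42)
encoder = make_encoder(d_in=6, d_hidden=4, d_out=3, rng=rng)
samples = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(5)]
scores = saliency(encoder, samples)
selected = top_k(scores, weight=1.5, k=2)  # 1.5 is the transcriptomics weight
print(selected)
```

In this sketch a single positive per-layer weight does not change the ranking inside a layer; it would only matter if weighted scores were compared across layers.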

6. Model Evaluation, Metrics, and Controls

Classification performance was quantified primarily using area under the ROC curve (AUROC) as

$$\mathrm{AUROC} = \frac{1}{N_+ N_-} \sum_{i:y_i=1} \sum_{j:y_j=0} \left[ \mathbf{1}\{ s_i > s_j \} + \tfrac{1}{2}\,\mathbf{1}\{ s_i = s_j \} \right]$$

with true labels $y_i \in \{0,1\}$, model scores $s_i$, and $N_+$ and $N_-$ the numbers of positive and negative samples.
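The pairwise definition translates directly into code. The sketch below is a plain-Python illustration with made-up labels and scores; the function name `auroc` is not taken from the source.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 3 stressed (1) and 2 control (0) samples.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.4, 0.1]
value = auroc(labels, scores)
print(value)
```

Here five of the six positive-negative pairs are ordered correctly and one is tied, giving 5.5/6 ≈ 0.917. The double loop is quadratic in sample count, which is negligible for 13 samples.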

Secondary metrics included accuracy, precision, recall, and $F_1$-score. Models were evaluated in LOOCV (13 folds: 11 train, 1 validation, 1 test per fold). Statistical comparisons applied paired two-tailed $t$-tests on AUROC across five random seeds $\{42, 123, 456, 789, 1337\}$, corroborated by Wilcoxon signed-rank tests. Effect size was measured using Cohen’s $d$:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pooled}}}, \quad s_{\mathrm{pooled}} = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}}$$
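As a small worked example of the effect-size formula, the sketch below computes Cohen's d for two sets of per-seed AUROC values. The numbers are hypothetical, and the use of the sample variance (n − 1 denominator) for each group is an assumption, since the source does not state which estimator was used.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with the pooled SD taken as the root mean of both variances."""
    s_pooled = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / s_pooled


# Hypothetical AUROC values for two methods over five seeds (not real results).
method_a = [0.90, 0.92, 0.94, 0.96, 0.98]
method_b = [0.80, 0.82, 0.84, 0.86, 0.88]
d = cohens_d(method_a, method_b)
print(round(d, 3))
```

Both groups here have variance 0.001 and the means differ by 0.10, so d is 0.10 / sqrt(0.001), roughly 3.16.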

Negative-control experiments (label permutation) resulted in AUROC ≈ 0.26, indicating no data leakage and that signal detection arises from genuine biological labeling.

7. Public Availability and Reproducibility

Processed Montipora capitata multi-omics data are available via Zenodo (DOI 10.5281/zenodo.6861688) and all code, preprocessing scripts, federated learning configuration files, and feature masks are provided at https://github.com/samvictordr/domain-aware-vfl. This enables full reproduction of outcomes and facilitates further method benchmarking.

A plausible implication is that the dataset constitutes a reference benchmark for testing feature selection, dimensionality reduction, and federated learning algorithms in the multi-omics, small-sample regime. The experimental design and annotation facilitate negative control and signal validation crucial for studies aiming to distinguish methodological artifacts from bona fide biological phenomena (Victor, 31 Dec 2025).
