Montipora capitata Thermal Stress Dataset
- The dataset is a multi-omics resource capturing coral responses to controlled heat stress, addressing P ≫ N challenges with 90,579 features and 13 complete samples.
- Data from transcriptomics, proteomics, metabolomics, and microbiome layers are processed using neural-network intrinsic gradient-saliency scoring and biological weighting to reduce features to 1,300.
- Benchmarking using LOOCV and robust metrics like AUROC validates the dataset's utility in testing machine learning models tailored for small-sample, high-dimensional biological data.
The Montipora capitata Thermal Stress Dataset is a multi-omics resource originating from a controlled experiment on the reef-building coral Montipora capitata, designed to support research into the molecular response to marine heatwaves. Featuring four omics modalities collected across multiple laboratories, the dataset exemplifies extreme P ≫ N conditions (90,579 features, 13 complete samples). It has served as a central benchmarking resource for novel machine learning frameworks in small-sample, high-dimensional settings, particularly for domain-aware feature selection and federated learning approaches in biological data (Victor, 31 Dec 2025).
1. Experimental Design and Treatment Regimes
Fragments from four adult Montipora capitata colonies were subjected to two temperature conditions: ambient control (≈26 °C) and heat-stress (ramped to ≈32 °C) to simulate a bleaching-inducing marine heatwave. At three matched timepoints (pre-stress, mid-stress, post-stress), biological replicates (2–3 per colony per condition) were sampled, resulting in four parallel omics cohorts. Each sample was immediately flash-frozen post-collection and allocated to one of four specialist laboratories for omics profiling. The supervised learning task is a binary classification: specimens are labeled “stressed” or “control” by treatment assignment.
2. Sample Cohorts and Negative Controls
The dataset's final multi-omics intersection comprises 13 Montipora capitata fragments with non-missing data across all four omics layers. The full original collection included:
| Omics Layer | Initial Cohort Size | Platform |
|---|---|---|
| Transcriptomics | 21 | RNA-seq |
| Proteomics | 14 | LC-MS/MS |
| Metabolomics | 12 | GC-MS |
| Microbiome | 12 | 16S rRNA sequencing |
A simple set intersection yielded 13 samples with complete data. Negative controls consisted of ambient-temperature fragments harvested at each matching timepoint, serving as controls for heat-stress and bleaching.
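The alignment step amounts to a set intersection over sample identifiers. The following is a minimal sketch only: the sample IDs are made up for illustration, and `complete_samples` is a hypothetical helper, not a function from the dataset's code.

```python
from functools import reduce


def complete_samples(ids_by_layer):
    """Return the sorted sample IDs that appear in every omics layer."""
    id_sets = [set(ids) for ids in ids_by_layer.values()]
    return sorted(reduce(set.intersection, id_sets))


# Hypothetical sample IDs, for illustration only.
ids_by_layer = {
    "transcriptomics": ["S01", "S02", "S03", "S04", "S05"],
    "proteomics": ["S01", "S02", "S04", "S05"],
    "metabolomics": ["S02", "S03", "S04", "S05"],
    "microbiome": ["S02", "S04", "S05", "S06"],
}

shared = complete_samples(ids_by_layer)
print(shared)  # ['S02', 'S04', 'S05']
```

Only samples present in all four layers survive, so the smallest or least-overlapping cohort bounds the size of the final intersection.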
3. Data Modalities, Feature Space, and Dimensionality
The raw dataset spans features distributed across four omics types:
- Transcriptomics: 62,038 genes (expression counts via RNA-seq)
- Proteomics: 4,054 protein abundances (LC-MS/MS)
- Metabolomics: 12,055 metabolites (GC-MS)
- Microbiome: 12,432 operational taxonomic units (OTUs, post–0.1% abundance filter)
After preprocessing and domain-aware feature selection, the feature set is reduced by ≈98.6% to 1,300 features. Complete dimensionality details are outlined below.
| Modality | Raw Features | Post-selection Features |
|---|---|---|
| Transcriptomics | 62,038 | ⩽500 |
| Proteomics | 4,054 | ⩽500 |
| Metabolomics | 12,055 | ⩽500 |
| Microbiome | 12,432 | ⩽500 |
| Combined | 90,579 | 1,300 |
This reduction scheme addresses the statistical challenges of the P ≫ N scenario.
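The totals and the reduction percentage follow directly from the per-modality counts listed above. A short arithmetic check, using only numbers stated in this section (the variable names are illustrative):

```python
raw_features = {
    "transcriptomics": 62_038,
    "proteomics": 4_054,
    "metabolomics": 12_055,
    "microbiome": 12_432,
}
n_samples = 13
n_selected = 1_300

total_raw = sum(raw_features.values())
reduction_pct = 100 * (1 - n_selected / total_raw)

print(total_raw)                # 90579
print(round(reduction_pct, 1))  # 98.6
print(n_selected / n_samples)   # 100.0 selected features per sample
```

Even after selection there are still 100 features per sample, so the problem remains high-dimensional relative to the sample count.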
4. Preprocessing and Quality Control
Data were exported from the original spreadsheets (Data S1–S5) to TSV format via scripted conversions, retaining full numeric precision. Feature name harmonization was performed to remove special characters while preserving biological identifiers. Sample alignment across modalities relied on intersection of sample IDs, enforcing matched specimens. Microbiome tables were filtered to exclude ASVs below a 0.1% abundance threshold, reducing features from 27,807 to 12,432. Missing values in the proteomics layer (131 values) were imputed as zero, and normalization was performed per-omics using a StandardScaler (zero mean, unit variance), fitted exclusively on the training fold in leave-one-out cross-validation (LOOCV). Notably, no log-transform or additional variance filtering was applied before feature selection.
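The key point of the normalization step is that scaling statistics come from the training fold only, so the held-out sample never influences them. The sketch below illustrates this with the standard library. It simplifies the fold structure to "training rows versus one held-out row", represents missing values as `None`, and uses hypothetical function names; it is not the authors' pipeline. Like scikit-learn's `StandardScaler`, it uses the population standard deviation and leaves constant features unscaled.

```python
from statistics import fmean, pstdev


def impute_zero(matrix):
    """Replace missing entries (None) with 0.0."""
    return [[0.0 if v is None else float(v) for v in row] for row in matrix]


def fit_scaler(train_rows):
    """Per-feature mean and population std, computed from training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    # Constant features have std 0; fall back to a scale of 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def loocv_scaled_folds(matrix):
    """Yield (scaled_train, scaled_held_out) for each leave-one-out fold."""
    data = impute_zero(matrix)
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        means, stds = fit_scaler(train)
        yield transform(train, means, stds), transform([data[i]], means, stds)[0]


# Toy matrix (3 samples x 2 features) with one missing value.
toy = [[1.0, None], [3.0, 4.0], [5.0, 8.0]]
train0, held0 = next(loocv_scaled_folds(toy))
print(train0)  # [[-1.0, -1.0], [1.0, 1.0]]
print(held0)   # [-3.0, -3.0]
```

Fitting the scaler on all samples before splitting would leak information from the held-out sample into the training features, which matters a great deal with only 13 samples.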
5. Domain-aware Feature Selection
To address the P ≫ N regime, Victor (2024) employed a neural-network–intrinsic gradient-saliency feature selection procedure independently on each omics layer:
- Gradient-saliency scoring: For each omics layer $m$, a shallow encoder $f_m$ (input → 128 → 64) was initialized. For every training sample $x_i^{(m)}$, forward propagation computed gradients of all embedding components with respect to input features, i.e., $\partial f_m(x_i^{(m)}) / \partial x_i^{(m)}$. Per-feature raw saliency scores were obtained by averaging the absolute gradient magnitude across the $N_{\text{train}}$ training samples:
  $$s_j^{(m)} = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \left| \frac{\partial f_m(x_i^{(m)})}{\partial x_{i,j}^{(m)}} \right|$$
- Biological weighting: Layer-specific prior weights $w_m$ reflected mechanistic distance to transcriptional regulation:
  - Transcriptomics ($w_{\text{transcriptomics}}$)
  - Proteomics ($w_{\text{proteomics}}$)
  - Metabolomics ($w_{\text{metabolomics}}$)
  - Microbiome ($w_{\text{microbiome}}$)

  Weighted saliency:
  $$\tilde{s}_j^{(m)} = w_m \, s_j^{(m)}$$
- Top-K selection: Features within each layer were ranked by $\tilde{s}_j^{(m)}$; the top $K$ per omics layer were retained. The chosen $K$ balanced performance and model complexity. Aggregation across the four layers yielded 1,300 combined features.
This procedure produced robust top-feature sets across parameter settings.
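The scoring, weighting, and top-K steps can be sketched in plain Python on a toy encoder. Several details here are assumptions rather than facts from the source: the `tanh` activation, the random initialization scheme, summing absolute gradients over embedding components, the tiny layer sizes (the source uses input → 128 → 64), and the placeholder values for the layer weight and `k`. All function names are hypothetical.

```python
import math
import random


def init_encoder(n_in, n_hidden, n_out, seed=0):
    """Randomly initialize a two-layer encoder z = W2 * tanh(W1 * x)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
          for _ in range(n_hidden)]
    w2 = [[rng.gauss(0, 1 / math.sqrt(n_hidden)) for _ in range(n_hidden)]
          for _ in range(n_out)]
    return w1, w2


def input_gradient_magnitude(w1, w2, x):
    """For one sample, sum over embedding components k of |dz_k / dx_j|."""
    pre = [sum(w * v for w, v in zip(row, x)) for row in w1]
    dact = [1.0 - math.tanh(a) ** 2 for a in pre]  # derivative of tanh
    mags = [0.0] * len(x)
    for out_row in w2:
        for j in range(len(x)):
            grad = sum(out_row[h] * dact[h] * w1[h][j] for h in range(len(w1)))
            mags[j] += abs(grad)
    return mags


def saliency_scores(w1, w2, train_rows):
    """Average the per-sample gradient magnitudes over the training samples."""
    totals = [0.0] * len(train_rows[0])
    for x in train_rows:
        for j, m in enumerate(input_gradient_magnitude(w1, w2, x)):
            totals[j] += m / len(train_rows)
    return totals


def top_k(scores, weight, k):
    """Rank features by weight * score and return the indices of the top k."""
    weighted = [weight * s for s in scores]
    order = sorted(range(len(weighted)), key=lambda j: weighted[j], reverse=True)
    return order[:k]


# Toy data: 11 training samples with 6 features (sizes chosen for illustration).
w1, w2 = init_encoder(n_in=6, n_hidden=8, n_out=4, seed=0)
rng = random.Random(1)
train = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(11)]

scores = saliency_scores(w1, w2, train)
selected = top_k(scores, weight=1.0, k=3)  # weight and k are placeholder values
print(selected)
```

Because a positive layer weight is constant within a layer, it does not change the per-layer ranking; in this sketch it only matters if weighted scores are compared across layers.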
6. Model Evaluation, Metrics, and Controls
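Classification performance was quantified primarily using area under the ROC curve (AUROC), which in its pairwise-ranking form can be written as
$$\mathrm{AUROC} = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} \mathbb{1}\left[\hat{y}_i > \hat{y}_j\right]$$
with true labels $y_i \in \{0, 1\}$ and model scores $\hat{y}_i$, where $\mathcal{P} = \{i : y_i = 1\}$ and $\mathcal{N} = \{i : y_i = 0\}$ index the positive and negative samples.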
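Secondary metrics included accuracy, precision, recall, and $F_1$-score. Models were evaluated in LOOCV (13 folds: 11 train, 1 validation, 1 test per fold). Statistical comparisons applied paired two-tailed $t$-tests on AUROC across five random seeds, corroborated by Wilcoxon signed-rank tests. Effect size was measured using Cohen's $d$:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$
where $\bar{x}_k$, $s_k^2$, and $n_k$ are the mean, sample variance, and number of AUROC values for method $k$.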
Negative-control experiments (label permutation) resulted in AUROC ≈ 0.26, indicating no data leakage and that signal detection arises from genuine biological labeling.
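The evaluation quantities in this section can be sketched with the standard library. This is an illustration under stated assumptions: AUROC is computed in its pairwise-ranking form with ties counted as one half, Cohen's d uses the pooled-standard-deviation form, and the rule for choosing the validation sample in each fold is invented here, since the source does not specify it. The function names and toy numbers are hypothetical.

```python
import math
from statistics import fmean, variance


def auroc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / math.sqrt(pooled_var)


def loocv_splits(n):
    """Yield (train, val, test) index lists: n - 2 train, 1 validation, 1 test."""
    for test in range(n):
        val = (test + 1) % n  # assumed rule for picking the validation sample
        train = [i for i in range(n) if i not in (test, val)]
        yield train, [val], [test]


# Toy labels and scores, for illustration only.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.1, 0.5]
print(auroc(labels, scores))             # 0.75

# Toy values standing in for per-seed results of two methods.
print(cohens_d([2, 4, 6], [1, 3, 5]))    # 0.5

splits = list(loocv_splits(13))
print(len(splits), len(splits[0][0]))    # 13 11
```

With a single test sample per fold, AUROC cannot be computed within a fold; one common approach, assumed here rather than taken from the source, is to pool the held-out scores from all 13 folds and compute AUROC once over the pooled predictions.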
7. Public Availability and Reproducibility
Processed Montipora capitata multi-omics data are available via Zenodo (DOI 10.5281/zenodo.6861688) and all code, preprocessing scripts, federated learning configuration files, and feature masks are provided at https://github.com/samvictordr/domain-aware-vfl. This enables full reproduction of outcomes and facilitates further method benchmarking.
A plausible implication is that the dataset constitutes a reference benchmark for testing feature selection, dimensionality reduction, and federated learning algorithms in the multi-omics, small-sample regime. The experimental design and annotation facilitate negative control and signal validation crucial for studies aiming to distinguish methodological artifacts from bona fide biological phenomena (Victor, 31 Dec 2025).