SWECO25 Data Cube: SDM Benchmark Dataset
- SWECO25 Data Cube is a comprehensive, high-resolution multimodal dataset for species distribution modeling across continental Europe using diverse environmental covariates.
- It integrates heterogeneous data sources—including presence-only and presence-absence records, satellite imagery, and climatic rasters—for scalable and reproducible ecological research.
- Benchmark tasks using ensemble fusion and top-K estimation demonstrate enhanced predictive accuracy, emphasizing robust evaluation across multiple data modalities.
The SWECO25 Data Cube is a high-resolution, multimodal, spatiotemporal dataset for large-scale species distribution modeling (SDM) across continental Europe, built to predict plant communities from heterogeneous, multi-scale environmental and Earth observation covariates. It applies standardized data protocols to combine presence-only and presence-absence survey records, environmental raster layers, and satellite remote sensing time series, and it provides open benchmark infrastructure for reproducible research (Picek et al., 2024).
1. Geographic and Temporal Scope
The SWECO25 Data Cube covers all of continental Europe, spanning 38 countries and all eight major biogeographic zones: Alpine, Atlantic, Boreal, Continental, Mediterranean, Pannonian, Steppic, and Arctic. The full climatological and environmental raster extent is defined in WGS84/EPSG:4326 (min_lon = -32.26°, min_lat = 26.63°; max_lon = +35.58°, max_lat = 72.18°). Spatial resolution varies by modality: Sentinel-2 image patches are 128×128 pixels at 10 m GSD (1.28 km × 1.28 km), Landsat time series are extracted per point at nominal 30 m and stored as CSV, climatic rasters and soil properties are at ~1 km, MODIS land cover at ~500 m, human footprint at 1 km, and ASTER elevation at ~30 m. Temporal coverage includes annual Sentinel-2 composites matched to the survey date, quarterly Landsat median time series (six bands, 84 quarters, 1999–2020), monthly climatic variables for 2000–2019, and human footprint layers for two selected years (1993 and 2009).
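As a small worked example of the figures above, the following sketch checks whether a coordinate falls inside the quoted extent and reproduces the 1.28 km patch footprint. The function names are illustrative and not part of any released tooling.

```python
# Bounding box quoted in the text (WGS84 / EPSG:4326, degrees).
MIN_LON, MIN_LAT = -32.26, 26.63
MAX_LON, MAX_LAT = 35.58, 72.18


def in_extent(lon: float, lat: float) -> bool:
    """Return True if the point lies inside the raster extent (edges inclusive)."""
    return MIN_LON <= lon <= MAX_LON and MIN_LAT <= lat <= MAX_LAT


def patch_footprint_km(pixels: int = 128, gsd_m: float = 10.0) -> float:
    """Side length in km of a square patch of `pixels` at `gsd_m` metres per pixel."""
    return pixels * gsd_m / 1000.0


print(in_extent(2.35, 48.86))   # a point near Paris
print(patch_footprint_km())     # side length of a Sentinel-2 patch in km
```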
2. Data Modalities
The multimodal framework integrates three principal sources:
- Species Observations
  - Presence-Only (PO): 5,079,797 records of 9,709 species, aggregated from 13 GBIF datasets. PO data span 2017–2021 and cover opportunistic and curated resources (Pl@ntNet, iNaturalist, national atlases, herbaria).
  - Presence-Absence (PA): 93,703 exhaustive vegetation-plot surveys of 5,016 species, sourced from 29 EVA datasets. PA records are partitioned into 88,987 train and 4,716 test observations via 10 × 10 km spatial block splits to reduce spatial autocorrelation between the two sets.
- Environmental Covariates
  - Climatic: 19 bioclimatic variables and 960 monthly rasters.
  - SoilGrids: Nine physicochemical depth-layered rasters (resampled to 1 km).
  - Elevation: ASTER GDEM v3 at ~30 m.
  - Land Cover: MODIS MCD12Q1 (IGBP & LCCS), 13 bands.
  - Human Footprint: 16 pressure-variable rasters for two time slices, plus aggregated indices.
- Remote Sensing Inputs
  - Sentinel-2: RGB+NIR patches of 128×128 pixels, stored as uint8 after clipping values at 10,000, rescaling to [0, 1], gamma correction (2.5), and rescaling to [0, 255].
  - Landsat: Point time series (six bands × 84 seasons, 1999–2020), stored as CSV and as 3D tensors.
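The Sentinel-2 preprocessing chain listed above (clip, rescale, gamma-correct, quantize) can be sketched for a flat list of raw pixel values as follows. This is a minimal illustration, not the released pipeline. In particular, it assumes that "gamma correction (2.5)" means raising the rescaled value to the power 1/2.5, which brightens dark pixels; the source does not state the direction of the exponent.

```python
def s2_to_uint8(values, clip_max=10_000, gamma=2.5):
    """Convert raw Sentinel-2 band values to integers in the 0..255 range."""
    out = []
    for v in values:
        v = min(max(v, 0), clip_max)   # clip to [0, clip_max]
        x = v / clip_max               # rescale to [0, 1]
        x = x ** (1.0 / gamma)         # ASSUMPTION: exponent is 1/gamma
        out.append(int(round(x * 255)))  # quantize to [0, 255]
    return out


print(s2_to_uint8([0, 2_500, 10_000, 20_000]))
```

Values above the clipping threshold saturate at 255, so very bright surfaces such as clouds or snow lose detail in exchange for more contrast over vegetation.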
3. Data Organization, Storage, and Metadata
All raster data is in GeoTIFF format, EPSG:4326, cropped to the continental extent. Image patches are provided as JPEG/PNG or TIFF with canonical R,G,B,NIR band order. Time series and point extractions are available as CSV tables and ready-to-use NumPy/PyTorch tensors. Tabular metadata links each observation or image patch to corresponding species, coordinates, date, spatial uncertainty, and provenance, plus file pointers for associated data assets. The Malpolon Python library facilitates loading by organizing each data modality independently and indexing all assets via unified CSVs. This modular design supports scalable benchmarking and transparent provenance.
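The idea of a unified CSV index that links each observation to its per-modality assets can be illustrated with the standard library alone. The column names (`surveyId`, `s2_patch`, `landsat_cube`) and file paths below are hypothetical placeholders, and the sketch does not use or reproduce the Malpolon API.

```python
import csv
import io

# Hypothetical index table; real column names and paths may differ.
INDEX_CSV = """surveyId,lon,lat,year,s2_patch,landsat_cube
1001,6.14,46.21,2019,patches/1001.tiff,landsat/1001.npy
1002,13.40,52.52,2020,patches/1002.tiff,landsat/1002.npy
"""


def load_index(text):
    """Map each survey id to its metadata and per-modality file pointers."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        index[int(row["surveyId"])] = {
            "lon": float(row["lon"]),
            "lat": float(row["lat"]),
            "year": int(row["year"]),
            "assets": {
                "sentinel2": row["s2_patch"],
                "landsat": row["landsat_cube"],
            },
        }
    return index


index = load_index(INDEX_CSV)
print(index[1001]["assets"]["sentinel2"])
```

Keeping the file pointers in one table means a modality can be added or dropped by changing a column rather than reorganizing the files themselves.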
4. Benchmark Tasks, Evaluation Metrics, and Baseline Models
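Benchmarking centers on multi-label community prediction for held-out PA survey blocks. The primary metrics are the average per-species AUC, the sample-averaged F₁ (F₁s), and Recall@K. Formally, $F_1^{s} = \frac{1}{N}\sum_{i=1}^{N} \frac{TP_i}{TP_i + (FP_i + FN_i)/2}$, with $TP_i = |\hat{Y}_i \cap Y_i|$, $FP_i = |\hat{Y}_i \setminus Y_i|$, and $FN_i = |Y_i \setminus \hat{Y}_i|$, based on comparison between the predicted species set $\hat{Y}_i$ and the actual species set $Y_i$ of survey $i$. Recall@K is the fraction of the species in $Y_i$ that appear among the $K$ highest-scoring predictions. Baselines include: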
| Modality/Model | Architecture | AUC (%) | Recall (%) | F₁s (%) |
|---|---|---|---|---|
| Climate (ResNet-6) | Deep CNN | 91.8 | 37.5 | 26.2 |
| Landsat (ResNet-6) | Deep CNN | 92.1 | 44.8 | 30.3 |
| Sentinel-2 (ResNet-6) | Deep CNN | 87.3 | 32.1 | 22.0 |
| XGBoost (4 predictors) | Gradient Boost | 90.4 | 48.8 | 28.7 |
| MaxEnt | Classical | — | — | ~0.18 |
| Multimodal Ensemble (Clim+L) | Deep, fused | 93.6 | 49.3 | 33.8 |
| Multimodal Ensemble (Clim+L+S) | Deep, fused | 94.0 | 49.7 | 34.1 |
| MME + Top-K regressor | Deep, fused | 94.0 | ~45 | ~36.2 |
This structure enables direct comparison across modalities and methodologies, revealing the additive value of ensemble fusion and top-K estimation for richness-adaptive prediction (Picek et al., 2024).
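The two set-based metrics reported in the table can be sketched in a few lines. The sample-averaged F₁ below uses the standard per-sample form TP / (TP + (FP + FN) / 2), averaged over surveys. Returning 0 for an empty denominator or an empty true set is a convention chosen for this sketch and may differ from the benchmark's evaluation code.

```python
def sample_f1(pred_sets, true_sets):
    """Mean over samples of TP / (TP + (FP + FN) / 2)."""
    scores = []
    for pred, true in zip(pred_sets, true_sets):
        tp = len(pred & true)   # species predicted and present
        fp = len(pred - true)   # species predicted but absent
        fn = len(true - pred)   # species present but not predicted
        denom = tp + 0.5 * (fp + fn)
        scores.append(tp / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def recall_at_k(scores, true_set, k):
    """Fraction of true species found among the k highest-scoring species."""
    if not true_set:
        return 0.0
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & true_set) / len(true_set)


f1 = sample_f1([{"a", "b"}, {"c"}], [{"a", "c"}, {"c"}])
rec = recall_at_k({"a": 0.9, "b": 0.8, "c": 0.1}, {"a", "c"}, k=2)
print(f1, rec)
```

In this toy case the first survey scores 0.5 and the second 1.0, so the sample-averaged F₁ is 0.75, while Recall@2 is 0.5 because only one of the two true species ranks in the top two.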
5. Protocols for Integration and Harmonization
The SWECO25 Data Cube standard informs harmonization for benchmarks such as EcoWikiRS. Sites and patches are defined in the native CRS (WGS84), image patches follow the 10 m, 128×128 convention, and JSON/CSV index files link each site to its assets in a consistent way. Image preprocessing follows percentile-based clipping, band rescaling, and gamma correction. Time series are aggregated by median per season or month, environmental rasters are standardized per band, and spatially missing values are handled by imputation or masking. For model transfer and benchmarking, weights pretrained on SWECO25 are fine-tuned on the target dataset under strict spatial block cross-validation to control for autocorrelation, and the same metrics are reported: per-species AUC, F₁s, Recall@K, and optionally the True Skill Statistic ($\mathrm{TSS} = \text{sensitivity} + \text{specificity} - 1$).
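Spatial block splitting can be illustrated by assigning each site to a grid cell and holding out whole cells. The sketch below uses a simple equirectangular approximation with a fixed reference latitude to convert degrees to kilometres. That projection, the 50° reference latitude, and the function names are assumptions made for illustration; the source does not specify how its 10 × 10 km blocks were constructed.

```python
import math

KM_PER_DEG = 111.32  # approximate length of one degree of latitude


def block_id(lon, lat, block_km=10.0, ref_lat=50.0):
    """Grid-cell index on an equirectangular approximation (illustrative only)."""
    x_km = lon * KM_PER_DEG * math.cos(math.radians(ref_lat))
    y_km = lat * KM_PER_DEG
    return (math.floor(x_km / block_km), math.floor(y_km / block_km))


def split_by_block(points, test_blocks):
    """Send every point whose block is in `test_blocks` to the test set."""
    train, test = [], []
    for lon, lat in points:
        target = test if block_id(lon, lat) in test_blocks else train
        target.append((lon, lat))
    return train, test


points = [(6.140, 46.210), (6.141, 46.211), (13.40, 52.52)]
held_out = {block_id(6.140, 46.210)}
train, test = split_by_block(points, held_out)
print(len(train), len(test))
```

Because whole blocks are held out, two nearby sites never end up on opposite sides of the split, which is the property that limits leakage through spatial autocorrelation.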
6. Model Architecture and Training Procedures
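Each modality is processed by an independent encoder (typically ResNet-6, with roughly six residual blocks per branch). The encoder outputs are concatenated into a joint feature vector $\mathbf{z} = [\mathbf{z}_{\text{clim}}; \mathbf{z}_{\text{Landsat}}; \mathbf{z}_{\text{S2}}]$, and the multi-label output is produced by a linear layer with sigmoid activation: $\hat{\mathbf{y}} = \sigma(W\mathbf{z} + \mathbf{b})$. Species-class imbalance is addressed by positive-class weighting in a binary cross-entropy loss: $\mathcal{L} = -\frac{1}{S}\sum_{s=1}^{S}\left[w_s\, y_s \log \hat{y}_s + (1 - y_s)\log(1 - \hat{y}_s)\right]$, where $w_s$ is the positive-class weight of species $s$ and $S$ is the number of species. The site-specific $K$ is estimated using a regressor over community richness quantiles; the predicted mean plus five species is used for top-K prediction. Optimization uses AdamW (initial learning rate ≈ 1e-4) with batch size ≈ 32, learning rate decay on plateau, and early stopping on F₁s.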
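The positively weighted binary cross-entropy and the richness-adaptive top-K rule can be sketched without a deep learning framework. The loss is written in terms of logits through a numerically stable softplus, using the identities log σ(z) = -softplus(-z) and log(1 - σ(z)) = -softplus(z). The averaging over species, the weights, and the function names are assumptions for illustration rather than the authors' training code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce(logits, targets, pos_weights):
    """Mean over species of -[w * y * log(p) + (1 - y) * log(1 - p)], p = sigmoid(z)."""
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        total += w * y * softplus(-z) + (1 - y) * softplus(z)
    return total / len(logits)


def top_k_species(probs, predicted_richness, offset=5):
    """Return the K = round(predicted richness) + offset most probable species."""
    k = max(1, int(round(predicted_richness)) + offset)
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]


loss = weighted_bce([0.0, 2.0], [1, 0], [2.0, 2.0])
selected = top_k_species({"a": 0.9, "b": 0.2, "c": 0.7}, 1.2, offset=1)
print(loss, selected)
```

Setting a weight above 1 for a rare species increases the penalty for missing its presences while leaving the penalty for false presences unchanged, which is the intended effect of positive-class weighting.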
7. Infrastructure, Open Science, and Impact
All assets, including datasets, pretrained weights, and baseline notebooks, are distributed via Kaggle, which supports rapid onboarding and cross-dataset comparability. Benchmark protocols, preprocessing pipelines, and modeling scripts are openly accessible, promoting reproducible and scalable research. By following the standardized, multimodal, spatially blocked approach of SWECO25, derivative benchmarks such as EcoWikiRS inherit its practices for data fusion, modeling, and evaluation, and remain interoperable with emerging deep SDM research in remote sensing ecology (Picek et al., 2024).