Poseidon Dataset: Multi-Domain Scientific Data
- Poseidon Dataset is a collection of open-source, physics-informed datasets covering global seismology, PDE simulations, and exoplanet aerosol properties.
- It leverages large-scale data, including 2.8 million seismic events and millions of PDE training pairs, to facilitate reproducible research and robust model generalization.
- Researchers can integrate these standardized datasets via accessible formats and detailed metadata to enhance machine learning workflows and domain-specific analyses.
The term "Poseidon Dataset" designates several major open-source scientific datasets across varied domains, most notably large-scale global earthquake catalogs for physics-informed seismology (Kriuk et al., 5 Jan 2026), simulation-based operator learning corpora for partial differential equations (PDEs) (Herde et al., 2024), and compositional aerosol optical property libraries for exoplanet atmospheric retrievals (Mullens et al., 2024). Each dataset is architected for rigorous, data-driven modeling, frequently integrated with domain-specific physical constraints or equations to maximize scientific utility and reproducibility.
1. Definitions and Scope
The Poseidon Dataset for seismology comprises the largest openly accessible earthquake catalog for machine learning and physics-informed analyses, aggregating 2,833,766 seismic events over 30 years (1990–2019) (Kriuk et al., 5 Jan 2026). In computational physics and machine learning, the Poseidon Dataset refers to a collection of simulation-generated trajectories for operator learning, featuring millions of solutions for diverse PDEs, facilitating generalization across PDE classes and heterogeneous physical regimes (Herde et al., 2024). In planetary science, the Poseidon aerosol database provides comprehensive Mie-scattering optical properties for ∼100 condensate species from laboratory measurements, enabling detailed atmospheric retrieval modeling for exoplanet spectra (Mullens et al., 2024).
2. Structure and Contents
a. Seismology: Global Earthquake Catalog
- Event Coverage: 2.8 million events, full latitude (−90° to +90°) and longitude (−180° to +180°), magnitude continuum M0.0–M9.1.
- Temporal Resolution: Uniform sampling, ISO 8601 timestamps.
- Attributes (per event): 30 fields, including core identifiers (event_id, time, location, depth, magnitude), metadata (event_type, tsunami_flag, review_status), and quantitative observation metrics (nsta, az_gap, rms, error estimates, significance).
- Energy Features: Pre-computed columns E and log10_E (in joules), derived from Gutenberg–Richter energy–magnitude scaling, which converts magnitude into a physical energy scale for direct use in modeling.
- Spatial Indices: Discretized onto 180×360 grid (1-degree resolution).
- Format: Apache Parquet; sub-tables for energy features and grid indices, directly loadable via Pandas/pyarrow or HuggingFace Datasets API.
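The energy and grid columns described above can be approximated as follows. This is a minimal sketch, assuming the commonly quoted Gutenberg–Richter energy–magnitude relation log10(E) = 1.5·M + 4.8 (E in joules) and a floor-based one-degree binning. The constants and index convention actually used in the released catalog may differ, and the function names are hypothetical.

```python
import numpy as np

def magnitude_to_log10_energy(magnitude):
    """log10 of radiated energy in joules, assuming log10(E) = 1.5*M + 4.8."""
    return 1.5 * np.asarray(magnitude, dtype=float) + 4.8

def grid_indices(lat, lon):
    """Map coordinates to (row, col) on a 180 x 360 one-degree grid.

    Assumed convention: row 0 starts at -90 deg latitude, col 0 at -180 deg
    longitude; the +90 / +180 edges are clipped into the last cell.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    row = np.clip(np.floor(lat + 90.0), 0, 179).astype(int)
    col = np.clip(np.floor(lon + 180.0), 0, 359).astype(int)
    return row, col

# Toy values spanning the catalog's stated magnitude and coordinate ranges.
mags = np.array([0.0, 5.0, 9.1])
log10_E = magnitude_to_log10_energy(mags)
E = 10.0 ** log10_E
rows, cols = grid_indices([-90.0, 0.5, 90.0], [-180.0, 12.3, 180.0])
```

Working with log10_E rather than E keeps the feature on a numerically manageable scale, since E spans many orders of magnitude across the M0.0–M9.1 range.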
b. Physics-Informed Operator Learning for PDEs
- Data Splits:
  - Pretraining Suite: 6 fluid-dynamics solution operators (Euler and Navier–Stokes).
  - Downstream Benchmarks: 15 heterogeneous PDE tasks spanning fluid, wave, reaction-diffusion, elliptic, and aerodynamics problems.
- Size: ~29,280 pretraining trajectories, 11–21 snapshots per trajectory, effective training pairs expanded by semigroup all-2-all augmentation to >5.1 million.
- Governing Equations: Full spectrum of PDEs—Navier–Stokes, Euler, Allen–Cahn, wave, Poisson, Helmholtz, airfoil flow.
- Discretizations: Spatial grids 128×128 (with select tasks up/downsampled); periodic, Dirichlet, or freestream boundary conditions; separate time-step settings for pretraining and downstream tasks (the latter ranging up to $0.1$).
- File Layout: HDF5 or NumPy (.npy/.npz) files holding time-dependent solution tensors, with metadata for parameter labels and coordinate grids.
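The all-2-all augmentation mentioned above pairs snapshots from the same trajectory so that an earlier state is the input and a later state is the target. The sketch below enumerates such index pairs. Whether identity pairs (i = j) are kept is an assumption exposed as a flag, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement

def all2all_pairs(num_snapshots, include_identity=True):
    """Return index pairs (i, j) with i <= j (or i < j) over one trajectory."""
    pairs = combinations_with_replacement(range(num_snapshots), 2)
    if include_identity:
        return list(pairs)
    return [(i, j) for i, j in pairs if i < j]

def count_pairs(num_snapshots, include_identity=True):
    """Closed-form pair count: k(k+1)/2 with identity pairs, k(k-1)/2 without."""
    k = num_snapshots
    return k * (k + 1) // 2 if include_identity else k * (k - 1) // 2

# One trajectory with 11 snapshots; each pair (i, j) is one training example
# with input u(t_i), target u(t_j), and lead time t_j - t_i.
pairs_11 = all2all_pairs(11)
```

Because the pair count grows quadratically with the number of snapshots, a modest number of trajectories yields a much larger pool of training examples.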
c. Exoplanet Aerosol Mie-Scattering Library
- Species: ~100 types, seven categories (super-hot condensates, M–L/T–Y dwarf clouds, Fe/Mg/Si/oxide phases, ices, soots, hazes).
- Optical Properties: ASCII refractive-index files; precomputed Mie databases (HDF5) with arrays for wavelength, particle radius, and the derived optical quantities, including single-scattering albedo and asymmetry parameter.
- Radius/Wavelength Grids: a particle-radius grid (log-linear, 1000 points) and a wavelength grid (~29,000 points).
- Directory Tree: reference_data/aerosols/{database_index.json, species.ri, species.miesim.h5}; JSON metadata for coverage.
- Integration: Retrieval codes interpolate these properties in log–log space for stability and support slab, fuzzy-deck, and hybrid cloud parameterizations.
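Log–log interpolation of a tabulated optical property can be sketched as below. This is an illustrative helper, not the retrieval code's own routine: the function name and toy power-law table are assumptions. It applies only to strictly positive quantities and refuses to extrapolate beyond the tabulated grid.

```python
import numpy as np

def loglog_interp(x_new, x_grid, y_grid):
    """Interpolate positive y(x) linearly in (log10 x, log10 y)."""
    x_new = np.asarray(x_new, dtype=float)
    x_grid = np.asarray(x_grid, dtype=float)
    y_grid = np.asarray(y_grid, dtype=float)
    if np.any(x_new < x_grid[0]) or np.any(x_new > x_grid[-1]):
        raise ValueError("x_new lies outside the tabulated grid; refusing to extrapolate")
    log_y = np.interp(np.log10(x_new), np.log10(x_grid), np.log10(y_grid))
    return 10.0 ** log_y

# Toy table (not from the database): a power law y = 3 * x**-4,
# which is a straight line in log-log space.
radius = np.logspace(-3, 1, 50)
sigma = 3.0 * radius ** -4.0
value = loglog_interp(0.5, radius, sigma)
```

For quantities that vary as power laws over several decades, interpolating the logarithms avoids the large errors that linear interpolation would introduce between widely spaced grid points.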
3. Data Generation, Augmentation, and Quality Metrics
- Seismic Catalog: No event filtering is applied in the public release; users commonly apply thresholds (nsta ≥ 4, az_gap ≤ 180°, rms ≤ 1.0 s, err_depth ≤ 10 km) to refine the catalog for high-precision analysis. Completeness varies by region and epoch: events below M3.5 in the southern hemisphere and events in regions with sparse network coverage are less completely recorded.
- PDE Solution Dataset: Data are synthesized with high-fidelity solvers (spectral, finite volume, finite difference/element). Initial conditions are randomized (Fourier, Gaussians, Riemann problems, Brownian bridges). Semigroup augmentation exploits time-ordered snapshot pairs, expanding each sampled trajectory into many input-output pairs.
- Aerosol Database: Refractive indices are drawn from laboratory measurements, with documented temperature, polymorph, and wavelength bounds. Each species' metadata includes measurement provenance for reproducibility.
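The quality thresholds quoted for the seismic catalog can be applied with a boolean mask. This is a sketch under the assumption that the Parquet columns are named nsta, az_gap, rms, and err_depth as written above. The function name and toy rows are hypothetical.

```python
import pandas as pd

def filter_high_quality(events: pd.DataFrame) -> pd.DataFrame:
    """Keep events meeting the commonly used quality thresholds.

    Rows with missing (NaN) metrics fail the comparisons and are dropped.
    """
    mask = (
        (events["nsta"] >= 4)          # at least 4 recording stations
        & (events["az_gap"] <= 180.0)  # azimuthal gap in degrees
        & (events["rms"] <= 1.0)       # travel-time residual in seconds
        & (events["err_depth"] <= 10.0)  # depth uncertainty in km
    )
    return events.loc[mask].reset_index(drop=True)

# Toy rows: only event "a" passes every threshold.
toy = pd.DataFrame({
    "event_id": ["a", "b", "c", "d"],
    "nsta": [12, 3, 8, 20],
    "az_gap": [95.0, 60.0, 210.0, 40.0],
    "rms": [0.4, 0.3, 0.5, 1.6],
    "err_depth": [2.0, 1.0, 3.0, 4.0],
})
kept = filter_high_quality(toy)
```

Dropping rows with missing metrics is a conservative choice; an analysis that needs maximal coverage might instead treat missing values separately.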
4. Canonical Use Cases and Modeling Protocols
- Seismology: Designed for aftershock sequence identification, tsunami-potential screening, and foreshock pattern detection. Standard workflow involves event quality filtering, energy feature normalization, grid-based aggregation for convolutional architectures, and weighted sampling or focal loss to address tsunami-event imbalance.
- PDE Operator Learning: Enables sample-efficient learning and generalization. Users normalize the data, mask unused channels, and embed parameters/coordinates for meta-learning tasks. Downstream evaluations span a taxonomically broad PDE suite (parabolic, hyperbolic, elliptic; steady and unsteady; linear and nonlinear).
- Exoplanet Retrievals: Aerosol models calibrated to physical cloud structures, matching spectral signatures in transmission/emission/reflection. Best practices include verifying refractive-index coverage, appropriate cloud parameterization (slab vs. deck), interpolation in log–log space, and explicit treatment of thermal plus starlight multiple scattering.
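For the tsunami-event imbalance noted in the seismology workflow above, a binary focal loss down-weights easy examples so that the rare positive class contributes more to training. The NumPy sketch below implements the standard form FL = −α_t (1 − p_t)^γ log(p_t). The default values of gamma and alpha are illustrative hyperparameters, not settings confirmed by the dataset authors.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss.

    probs  : predicted probability of the positive class (e.g. tsunami flag)
    labels : 0/1 ground truth
    gamma  : focusing parameter; gamma = 0 recovers alpha-weighted cross-entropy
    alpha  : weight on the positive class (1 - alpha on the negative class)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    p_t = np.where(labels == 1, probs, 1.0 - probs)       # prob. of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)   # per-example class weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([0.9], [1])   # confident, correct prediction
hard = binary_focal_loss([0.1], [1])   # confident, wrong prediction
```

In practice the same expression would be written in the training framework's tensor library so that gradients flow through it; the NumPy version is only meant to make the weighting explicit.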
5. Known Limitations and Extension Strategies
- Seismology: Potential biases arise from early network gaps, completeness variation, and depth uncertainty (particularly for offshore events). Users should apply a magnitude-of-completeness threshold, for example a single global cutoff, for accurate magnitude-frequency analysis. For further refinement, cross-matching with ISC/USGS for missing arrivals and stricter filtering of anthropogenic sources (quarry blasts, explosions) are recommended. Augmentation with local stress-transfer or Coulomb-failure metrics is possible for higher-order physics constraints.
- PDE Datasets: Downsampling/interpolation may affect fine-scale dynamics, especially when transferring between mesh resolutions. Choice of normalization and channel masking impacts multi-task/transfer learning.
- Aerosol Library: Wavelength and particle-size grid bounds must be carefully respected; extrapolation risks modeling artifacts. Uncertainty in laboratory refractive-index data may propagate to retrievals; transparent provenance mitigates interpretational errors.
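One common way to choose the magnitude-of-completeness threshold mentioned for the seismic catalog is the maximum-curvature method, which takes the most populated magnitude bin as the estimate. The sketch below pairs it with Aki's maximum-likelihood b-value estimator. Both are standard techniques swapped in for illustration, not procedures prescribed by the dataset authors. The bin width, function names, and toy magnitudes are assumptions.

```python
import numpy as np

def mc_max_curvature(mags, bin_width=0.1):
    """Estimate Mc as the centre of the most populated magnitude bin.

    Ties resolve to the lowest-magnitude bin.
    """
    mags = np.asarray(mags, dtype=float)
    bins = np.round(mags / bin_width).astype(int)
    values, counts = np.unique(bins, return_counts=True)
    return float(values[np.argmax(counts)] * bin_width)

def b_value_aki(mags, mc, bin_width=0.1):
    """Aki maximum-likelihood b-value with the half-bin correction."""
    mags = np.asarray(mags, dtype=float)
    complete = mags[mags >= mc - bin_width / 2]
    return float(np.log10(np.e) / (complete.mean() - (mc - bin_width / 2)))

# Toy magnitudes (not from the catalog): the M1.5 bin is the most populated.
toy_mags = np.array([1.0] * 5 + [1.5] * 20 + [2.0] * 10 + [3.0] * 2)
mc = mc_max_curvature(toy_mags)
b = b_value_aki(toy_mags, mc)
```

The maximum-curvature estimate is often reported to underestimate the true completeness magnitude, so a positive correction is sometimes added; regional estimates may also be preferable to a single global value given the coverage variations described above.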
6. Public Access, Data Format, and Community Adoption
All Poseidon datasets are released for unrestricted scientific use. The seismology catalog is hosted at https://huggingface.co/datasets/BorisKriuk/Poseidon, supporting Parquet and full Python ecosystem compatibility (Kriuk et al., 5 Jan 2026). The PDE operator learning suite and downstream tasks are available as part of the PDEgym benchmark collection at https://huggingface.co/camlab-ethz and via GitHub (Herde et al., 2024). The aerosol database is distributed with self-describing files and documented metadata for fast access and integration (Mullens et al., 2024).
This widespread accessibility, rigorous metadata documentation, and adherence to domain-specific physical principles have established the Poseidon datasets as reference standards for physics-informed machine learning in seismology, computational physics, and atmospheric sciences.