ABC Database for Atmospheric and Exoplanetary Studies

Updated 12 January 2026

The Atmospheric Big Challenge (ABC) Database is a community-curated corpus that aggregates synthetic, experimental, and model-based datasets to support research in atmospheric chemistry and exoplanet habitability.
Its datasets are rigorously structured with CF-compliant NetCDF4/HDF5 formats and ML-ready features, enabling systematic benchmarking of physical models and retrieval algorithms.
The database drives practical applications from climate modeling and habitability mapping to plasma source diagnostics, fostering advancements in both theoretical and applied atmospheric research.

The Atmospheric Big Challenge (ABC) Database is a community-curated, publicly accessible corpus of synthetic, experimentally derived, and model-based datasets designed to support advanced research on atmospheric composition, chemical diversity, radiative processes, and planetary habitability. It functions as a scalable infrastructure for benchmarking ML workflows, developing statistical inference algorithms, and probing the multi-dimensional parameter space of atmospheric physics and chemistry. Major ABC initiatives cover Earth and exoplanetary atmospheres, atmospheric organic molecular databases, plasma sources at atmospheric pressure, and radiative transfer modeling, each with rigorously defined variable sets and metadata. Together they underpin applications ranging from climate model development to exoplanet atmospheric retrievals and ML-driven aerosol analysis (Chopra et al., 2023, Sandström et al., 2024, Changeat et al., 2022, Horia-Eugen et al., 2020, Hoz, 2014).

1. Scope and Structure of the ABC Database

The ABC Database aggregates a hierarchy of datasets reflecting distinct aspects and scales of atmospheric research:

Bulk atmospheric models: PyATMOS provides 124,314 1-D radiative-photochemical-climate-converged atmospheric profiles for solar-analog/Earth-analog exoplanets, spanning systematic variations in O₂, CO₂, H₂O, CH₄, H₂, and N₂ (Chopra et al., 2023).
Organic compound chemistries: Atmospheric organics (e.g., Wang, Gecko, Quinone) are curated with explicit molecular-descriptor coverage, atomic ratios, and functional group frequencies, emphasizing ML compatibility (Sandström et al., 2024).
Spectral retrieval benchmarks: The ESA-Ariel ABC subset comprises 105,887 forward spectra, 26,109 of which have corresponding multi-species Bayesian retrieval posteriors, mapped to a realistic population of exoplanets with Tier-2 Ariel-simulated noise (Changeat et al., 2022).
Plasma device and chemical source models: Atmospheric-pressure plasma source measurements and models, including electron/gas temperature diagnostics and species-resolved kinetics, are encoded for micro-ICP design and atmospheric processing applications (Horia-Eugen et al., 2020).
Radiative background and climate datasets: Datasets encompass traditional solar/IR metrics and cosmic microwave background (CMB) contributions, compliant with World Meteorological Organization (WMO) references and supporting Bayesian environmental modeling (Hoz, 2014).

The database structure typically employs CF-compliant NetCDF4/HDF5 primary archives; relational/CSV index tables for quick querying; and explicit, API-accessible metadata fields for chemical, physical, contextual, and provenance attributes.

2. Parameterization, Grid Sampling, and Molecular Diversification

Bulk Atmospheres

The core PyATMOS grid scans five gases (O₂, CO₂, H₂O, CH₄, H₂) linearly in mixing ratio, with N₂ as a fill gas, yielding a six-nested-loop, non-logarithmic parameterization. Table 1 lists the scan domains:

Constituent	Range (mixing ratio)	Increment
O₂	0.00–0.30	0.02
O₂	0.30–1.00	0.05
CO₂	0.00–0.10	0.01
CO₂	0.10–1.00	0.05
H₂O	0.00–0.90	0.05
CH₄	0.00–0.10	0.005
H₂	0.00–1×10⁻⁷	1×10⁻⁹

No log-sampling is used; linear grids allow systematic, high-resolution exploration of steady-state atmospheric regimes (Chopra et al., 2023).
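The piecewise-linear scan in Table 1 can be sketched in a few lines. This is a minimal illustration rather than the PyATMOS implementation: the helper names `linear_grid` and `n2_fill` are hypothetical, joining the two O₂ segments at the shared 0.30 endpoint is one reading of the table, and treating N₂ as the remainder to unity is an assumption based on its description as a fill gas.

```python
from fractions import Fraction


def linear_grid(start, stop, step):
    """Inclusive linear grid, built with exact rationals to avoid float drift."""
    start, stop, step = (Fraction(str(v)) for v in (start, stop, step))
    n_intervals = int((stop - start) / step)
    return [float(start + i * step) for i in range(n_intervals + 1)]


# O2 scan from Table 1: two linear segments; the shared 0.30 endpoint is kept once.
o2_grid = linear_grid("0.00", "0.30", "0.02") + linear_grid("0.30", "1.00", "0.05")[1:]


def n2_fill(*mixing_ratios):
    """Assumed fill rule: N2 takes the remainder; None if the scanned gases exceed 1."""
    remainder = 1.0 - sum(mixing_ratios)
    return remainder if remainder >= -1e-12 else None


print(len(o2_grid), o2_grid[:3], o2_grid[-1])
print(n2_fill(0.21, 0.0004))
```

Using exact rationals for the grid points avoids the accumulated rounding error that repeated floating-point addition of increments such as 0.02 would introduce.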

Organic Compounds

Atmospheric organic datasets (e.g., Gecko, Wang, Quinone) are computationally fingerprinted using 2048-bit RDKit topological and 167-bit MACCS key vectors, which makes similarity between molecules machine-computable. Coverage is quantified by mean non-H atom count and atomic ratios. Representative functional groups (hydroxyl, carbonyl, carboxylic acid, hydroperoxide, nitrate) are tabulated by frequency and commonly appear in more than 10% of molecules, which systematically distinguishes atmospheric from non-atmospheric molecular sets (Sandström et al., 2024).
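To make the similarity metric on such fingerprints concrete, the sketch below computes the Tanimoto coefficient for two bit-vector fingerprints represented as sets of on-bit indices. It does not call RDKit; the fingerprints `fp_x` and `fp_y` are toy values chosen for illustration, and returning 0.0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits divided by on-bits set in either fingerprint."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have similarity 0.0.
    return len(a & b) / union if union else 0.0


# Toy fingerprints (indices of set bits); not taken from any ABC dataset.
fp_x = {3, 17, 42, 101}
fp_y = {3, 42, 77, 101, 150}

print(tanimoto(fp_x, fp_y))  # 3 shared bits out of 6 distinct bits -> 0.5
```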

Spectral Grids and Retrievals

The ESA-Ariel ABC dataset draws H₂O, CO, CO₂, CH₄, and NH₃ abundances from log-uniform priors spanning 7–8 orders of magnitude, uses isothermal temperature profiles discretized over 100 pressure layers, and simulates each planet's spectrum at 52 wavelengths (Changeat et al., 2022).
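Sampling from a log-uniform prior amounts to drawing the base-10 logarithm uniformly and exponentiating. The following sketch shows the idea. The bounds in `ASSUMED_BOUNDS` are hypothetical placeholders spanning seven orders of magnitude, not the published prior limits, and the same bounds are applied to every species purely for brevity.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose log10 is uniform on [log10(low), log10(high)]."""
    return 10.0 ** rng.uniform(math.log10(low), math.log10(high))


# Hypothetical bounds (7 orders of magnitude); the published priors may differ.
ASSUMED_BOUNDS = (1e-9, 1e-2)
SPECIES = ("H2O", "CO", "CO2", "CH4", "NH3")

rng = random.Random(0)  # fixed seed so the draw is reproducible
draw = {name: sample_log_uniform(*ASSUMED_BOUNDS, rng) for name in SPECIES}

for name, value in draw.items():
    print(f"{name}: {value:.3e}")
```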

3. Computation, Modeling, and Data Products

Physical and Chemical Modeling

PyATMOS-ATMOS chain: Steady-state solution of coupled 1-D photochemistry and radiative-convective equilibrium, using hydrostatic balance,

$\frac{dP}{dz} = -\rho(z)\,g,$

two-stream radiative transfer, and vertical eddy diffusion,

$\Phi_i(z) = -K_{zz}(z)\,\rho(z)\,\frac{\partial}{\partial z}\left[\frac{n_i(z)}{\rho(z)}\right].$

Each atmosphere is iterated until the $\Delta T < 1$ K convergence criterion is met (Chopra et al., 2023).
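As a standalone illustration of the hydrostatic-balance relation, the sketch below integrates $dP/dz = -\rho g$ with a simple Euler step and compares the result to the analytic exponential profile. It is not the PyATMOS solver: the ideal-gas density, the isothermal column, and the Earth-like input values are all assumptions made for this example, and the function name `integrate_hydrostatic` is hypothetical.

```python
import math

R_GAS = 8.314462618  # molar gas constant, J mol^-1 K^-1


def integrate_hydrostatic(p_surface, temperature, molar_mass, gravity, z_top, n_steps):
    """Euler-integrate dP/dz = -rho * g with rho = P * M / (R * T) (assumed ideal gas)."""
    dz = z_top / n_steps
    pressure = p_surface
    profile = [pressure]
    for _ in range(n_steps):
        rho = pressure * molar_mass / (R_GAS * temperature)
        pressure -= rho * gravity * dz
        profile.append(pressure)
    return profile


# Assumed Earth-like inputs: 101325 Pa, 288 K, 0.02897 kg/mol, 9.81 m/s^2, 10 km column.
profile = integrate_hydrostatic(101325.0, 288.0, 0.02897, 9.81, 10_000.0, 10_000)

# For an isothermal ideal gas the exact solution is P0 * exp(-z / H), with H = R T / (M g).
scale_height = R_GAS * 288.0 / (0.02897 * 9.81)
analytic_top = 101325.0 * math.exp(-10_000.0 / scale_height)

print(profile[-1], analytic_top)
```

In a converged radiative-convective model the temperature varies with height, so the density must be re-evaluated from the current $T(z)$ at each level instead of from a single constant temperature.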

Bayesian spectral retrievals: Nested sampling (MultiNest, 200 live points) is used to sample the posteriors, with the Gaussian likelihood

$L(D|\Theta) = \prod_{j=1}^{N_\lambda} \frac{1}{\sqrt{2\pi\sigma_j^2}}\exp\left[-\frac{(D_j-S_j(\Theta))^2}{2\sigma_j^2}\right]$

and Bayesian evidence computed via cumulative prior-volume weights (Changeat et al., 2022).
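In practice the Gaussian likelihood above is evaluated in log form to avoid numerical underflow when many wavelength bins are multiplied together. A minimal sketch follows; the function name and the toy input values are illustrative and are not taken from the retrieval code used to build the dataset.

```python
import math


def gaussian_log_likelihood(observed, model, sigma):
    """ln L(D|Theta) for independent Gaussian noise in each wavelength bin."""
    if not (len(observed) == len(model) == len(sigma)):
        raise ValueError("observed, model, and sigma must have the same length")
    total = 0.0
    for d_j, s_j, sigma_j in zip(observed, model, sigma):
        # Log of the normalization term plus the chi-square-like residual term.
        total += -0.5 * math.log(2.0 * math.pi * sigma_j**2)
        total += -((d_j - s_j) ** 2) / (2.0 * sigma_j**2)
    return total


# Toy transit depths and uncertainties, for illustration only.
data = [0.0101, 0.0103, 0.0102]
noise = [1e-4, 1e-4, 1e-4]
good_model = [0.0101, 0.0103, 0.0102]
poor_model = [0.0105, 0.0100, 0.0104]

print(gaussian_log_likelihood(data, good_model, noise))
print(gaussian_log_likelihood(data, poor_model, noise))
```

A model that matches the data exactly leaves only the normalization terms, so any mismatch can only lower the log-likelihood.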

Similarity-based curation: Organic datasets are compared using Tanimoto similarity on topological and MACCS fingerprints in a high-dimensional fingerprint space, then PCA-reduced and embedded with t-SNE for visualization; gap regions are flagged for candidate data augmentation (Sandström et al., 2024).

Output Variables

Atmospheric profiles: Temperature $T(z)$, pressure $P(z)$, major and trace species mixing ratios $X_i(z)$, radiative fluxes $F_{\uparrow,\downarrow}(z)$, and sustaining surface fluxes $\Phi_i(0)$ (Chopra et al., 2023).
Molecular properties: Fingerprints, atomic ratios, functional group vectors, vapor pressures, partition coefficients, experimental/computational method tags (Sandström et al., 2024).
Spectra and retrievals: Transit depth arrays, σ(λ), true parameters, and posterior traces (Changeat et al., 2022).
Plasma source characteristics: nₑ(P_abs), gas temperature T_g(P_abs), device geometry, matching-network data, and diagnostic method tags (Horia-Eugen et al., 2020).
Radiative balance variables: Direct, diffuse, global solar radiation, CMB monopole and anisotropy coefficients, ozone column, and Bayesian radiative balance components (Hoz, 2014).

4. Data Access, Formatting, and Metadata Conventions

File structures prioritize scientific reuse:

Full-profile physical models: NetCDF4/HDF5, with paths such as /ABC-Atmos/model_xxxxx/profile.nc and summary.json; index tables provide rapid query capability (Chopra et al., 2023).
ML dataset organization: All_data.csv and observations.hdf5 for bulk metadata and spectra, all_targets.hdf5 for posterior traces; Level 2 auxiliary CSVs for competition-grade benchmarking (Changeat et al., 2022).

Metadata follows CF-1.6 conventions, with explicit units, variable names (e.g., mol_fraction_O2), method tags, and provenance, supporting pipeline-integrated querying and API access. Example programmatic access is via the NASA Exoplanet Archive’s TAP API or Astroquery TAP interface. Versioning, schema compliance, and provenance tracking are recommended for organic molecular data (Sandström et al., 2024).
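A relational/CSV index table lets a pipeline pick candidate models before opening any NetCDF4/HDF5 archive. The sketch below filters such a table by a composition column. The table contents, the `model_id` column, and the identifiers `model_a` to `model_c` are hypothetical placeholders; only the `mol_fraction_O2` naming pattern comes from the conventions described above.

```python
import csv
import io

# Hypothetical index table; real ABC index tables may use different columns.
INDEX_CSV = """model_id,mol_fraction_O2,mol_fraction_CO2
model_a,0.20,0.01
model_b,0.02,0.10
model_c,0.22,0.05
"""


def select_models(index_file, column, low, high):
    """Return the model_id of every row whose `column` value lies in [low, high]."""
    reader = csv.DictReader(index_file)
    return [row["model_id"] for row in reader if low <= float(row[column]) <= high]


selected = select_models(io.StringIO(INDEX_CSV), "mol_fraction_O2", 0.18, 0.25)
print(selected)  # ['model_a', 'model_c']
```

The same function accepts any open text file, so an on-disk index table can be passed in place of the in-memory string used here.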

5. Applications and Use Cases

The ABC Database is foundational for several research agendas:

Habitability mapping: Rapid mapping of $(X_{\rm CO2}, X_{\rm CH4}) \mapsto T_{\rm surf}$ allows first-pass screening for exoplanetary surface habitability (Chopra et al., 2023).
Atmospheric retrieval benchmarking: Enables systematic evaluation of retrieval frameworks (nested sampling, ML emulators, hybrid approaches) on physically realistic data (Changeat et al., 2022).
Machine learning in atmospheric chemistry: ML-ready fingerprints, property annotation, and chemical space coverage quantification foster the development and transfer-domain testing of ML models in atmospheric molecule identification and aerosol formation studies (Sandström et al., 2024).
Plasma source modeling/chemistry: Micro-ICP datasets inform atmospheric-pressure chemical source design, plasma diagnostics, and simulation validation (Horia-Eugen et al., 2020).
Climate/radiative analysis: CMB data integration and statistically rigorous (Bayesian) data/model fusion improve radiative climate modeling, enabling quantification of background radiative influences (Hoz, 2014).
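The habitability-mapping use case can be pictured as a filter over $(X_{\rm CO2}, X_{\rm CH4}, T_{\rm surf})$ records. The sketch below is a first-pass screen under stated assumptions: the 273.15 to 373.15 K window (liquid water at roughly 1 bar) is an assumed criterion rather than one given in the source, and the records are synthetic placeholders, not values from the database.

```python
def screen_habitable(records, t_min=273.15, t_max=373.15):
    """Keep (X_CO2, X_CH4) pairs whose surface temperature lies in [t_min, t_max] K."""
    return [(x_co2, x_ch4) for x_co2, x_ch4, t_surf in records if t_min <= t_surf <= t_max]


# Synthetic placeholder records (X_CO2, X_CH4, T_surf in K); not ABC data.
placeholder_records = [
    (0.01, 0.000, 260.0),
    (0.05, 0.005, 290.0),
    (0.50, 0.010, 400.0),
]

print(screen_habitable(placeholder_records))  # [(0.05, 0.005)]
```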

6. Recommendations for Database Construction and Expansion

Key lessons and recommendations from ABC-related literature:

Molecular datasets: Prioritize multi-property coverage, computational/experimental labeling, standard metadata, and open/vetted APIs. Careful data augmentation based on similarity metrics (S ≥ 0.1), with chemical property constraints, is essential to avoid domain-drift artifacts (Sandström et al., 2024).
Physical/chemical models: Ensure steady-state convergence, well-documented parameter grids, and traceability to published physical/chemical models (Chopra et al., 2023).
Open infrastructure: Favor version control, collaborative curation, and deposition to public repositories (e.g., Aerosolomics, NASA Exoplanet Archive) (Chopra et al., 2023, Sandström et al., 2024).
Validation: Cross-validation, comparison with satellite/ground-based observations, and modular update mechanisms are recommended for all ABC branches (Hoz, 2014).
Schema and access: Use extensible, well-documented schemas (JSON/CSV/NetCDF4), with REST and TAP APIs to maximize accessibility and downstream integration (Chopra et al., 2023, Changeat et al., 2022).
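The similarity-constrained augmentation recommended for molecular datasets can be sketched as a simple admission filter. The interpretation used here is an assumption: a candidate is admitted when its best Tanimoto similarity to any reference molecule reaches the threshold S ≥ 0.1. Fingerprints are stored as Python integers used as bit masks, and the 8-bit values and names `cand_1` and `cand_2` are toy examples.

```python
def tanimoto_bits(a, b):
    """Tanimoto similarity of two fingerprints stored as integer bit masks."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def admit_candidates(candidates, reference, threshold=0.1):
    """Names of candidates whose best similarity to the reference set meets the threshold."""
    admitted = []
    for name, fingerprint in candidates.items():
        best = max((tanimoto_bits(fingerprint, ref) for ref in reference), default=0.0)
        if best >= threshold:
            admitted.append(name)
    return admitted


# Toy 8-bit fingerprints; real fingerprints would be 2048-bit or 167-bit vectors.
reference = [0b11110000, 0b00111100]
candidates = {"cand_1": 0b11100000, "cand_2": 0b00000011}

print(admit_candidates(candidates, reference))  # ['cand_1']
```

A chemical-property constraint, as the recommendation suggests, would be applied as an additional condition alongside the similarity test.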

A plausible implication is that continued community-driven expansion of the ABC Database, guided by similarity-based gap analysis and convergence-tested physical modeling, could accelerate both ML-driven and physically motivated discovery in atmospheric science and planetary studies.