Earth System Transformers
- Earth System Transformers are transformer-based architectures tailored for modeling and forecasting coupled, high-dimensional Earth system processes using geospatial and multimodal data.
- They incorporate innovations such as geotokens and spherical rotary position embeddings to capture true geodesic relationships, enhancing forecasting and super-resolution tasks.
- Advanced attention mechanisms, including field-space and ensemble interaction attention, allow these models to enforce physical constraints and boost prediction accuracy.
A class of machine learning architectures termed Earth System Transformers encompasses transformer-based neural networks specifically adapted for modeling, forecasting, and reconstructing data arising from the coupled, high-dimensional, and physically structured processes of the Earth system. These models are characterized by their integration of geospatial, geophysical, and multimodal datasets and employ specialized attention mechanisms, position encodings, and architectural constraints to address unique challenges in Earth observation, prediction, and super-resolution tasks.
1. Representations and Input Structures
Unlike in natural language processing, conventional sequential ordering is generally not meaningful in Earth system domains. Two foundational innovations address this:
- Geotokens: A geotoken comprises an input feature vector (e.g., text or CNN features) together with its precise latitude and longitude. Their semantics are determined by spatial coordinates rather than sequence, and the corresponding embedding pipeline is partitioned into a modality-specific extractor and a spatial encoder that injects spherical coordinates. This structure enables attention mechanisms to respect geodesic rather than sequential relationships (Unlu, 2024).
- Multimodal Tokens for Subsurface Modeling: In multimodal geotransformer models such as "Transparent Earth," each token incorporates features, spatial coordinates (including depth), and text-derived modality embeddings. Unified positional encodings using sinusoidal bases (plus depth extension) allow integration over irregular spatial grids and support arbitrary numbers of continuous, categorical, and directional modalities (Mazumder et al., 2 Sep 2025).
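The geotoken construction above — a modality-specific feature extractor concatenated with a spatial encoder that injects spherical coordinates — can be sketched as follows. This is a minimal, illustrative version assuming a sinusoidal spatial encoder; the function names and the encoding dimensionality are hypothetical, not taken from the cited papers.

```python
import numpy as np

def spatial_encoding(lat_deg, lon_deg, dim=16):
    """Illustrative sinusoidal encoding of spherical coordinates."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    freqs = 2.0 ** np.arange(dim // 4)          # frequency bands
    return np.concatenate([
        np.sin(freqs * lat), np.cos(freqs * lat),
        np.sin(freqs * lon), np.cos(freqs * lon),
    ])

def make_geotoken(features, lat_deg, lon_deg):
    """A geotoken = modality features plus injected spherical coordinates."""
    return np.concatenate([features, spatial_encoding(lat_deg, lon_deg)])

# 32 modality features (e.g. CNN output) tagged with a lat/lon position:
tok = make_geotoken(np.random.randn(32), lat_deg=48.8566, lon_deg=2.3522)
print(tok.shape)  # (48,)
```

Attention over such tokens then depends on where observations sit on the sphere rather than on any sequence order.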
2. Position Encoding and Spherical Geometry
Spatial context is injected via geometry-aware position encodings:
- Spherical Rotary Position Embedding (Spherical RoPE): Standard RoPE (Su et al., 2021) imparts position by rotating token embeddings based on the sequence index. Spherical RoPE generalizes this principle, substituting the token index with geographic latitude ($\theta$) and longitude ($\lambda$), represented as 3D Euler rotations tiled across the embedding dimensions:

$$\mathbf{q}' = \mathbf{R}(\theta, \lambda)\,\mathbf{q}, \qquad \mathbf{R}(\theta, \lambda) = \mathbf{R}_z(\lambda)\,\mathbf{R}_y(\theta)$$
This induces a monotonic mapping between attention-weight inner products and true geodesic distances on the sphere, such that two tokens close on Earth remain closely aligned in query/key space. No learned absolute or index-based positional embeddings are required; real-world spherical angles suffice (Unlu, 2024).
- Sinusoidal and Depth-Extended Positional Encodings: Alternate models encode spatial location using a frequency-band sinusoidal basis expanded to include depth for volumetric data. Frequencies are determined to satisfy the Nyquist–Shannon criterion at the grid’s spatial resolution (Mazumder et al., 2 Sep 2025).
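A minimal sketch of the Spherical RoPE idea: apply the same latitude/longitude Euler rotation to every 3-dimensional sub-block of a query or key vector, so that attention inner products depend on relative spherical geometry. The rotation convention here (z-rotation by longitude composed with y-rotation by latitude) is an assumption for illustration; the paper's exact convention may differ.

```python
import numpy as np

def euler_rotation(lat_deg, lon_deg):
    """3D rotation encoding a position on the unit sphere (illustrative convention)."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    Ry = np.array([[ np.cos(lat), 0, np.sin(lat)],
                   [ 0,           1, 0          ],
                   [-np.sin(lat), 0, np.cos(lat)]])
    Rz = np.array([[np.cos(lon), -np.sin(lon), 0],
                   [np.sin(lon),  np.cos(lon), 0],
                   [0,            0,           1]])
    return Rz @ Ry

def spherical_rope(x, lat_deg, lon_deg):
    """Apply the position's rotation to every 3-dim block of the embedding."""
    R = euler_rotation(lat_deg, lon_deg)
    blocks = x.reshape(-1, 3)   # embedding dim must be a multiple of 3
    return (blocks @ R.T).reshape(-1)

q, k = np.random.randn(12), np.random.randn(12)
# Query/key inner products are invariant under a common longitude shift,
# i.e. they depend on relative geometry, not absolute position:
a = spherical_rope(q, 10.0, 20.0) @ spherical_rope(k, 30.0, 50.0)
b = spherical_rope(q, 10.0, 120.0) @ spherical_rope(k, 30.0, 150.0)
print(np.isclose(a, b))  # True
```

The invariance follows from $\mathbf{R}_1^\top \mathbf{R}_2 = \mathbf{R}_y(\theta_1)^\top \mathbf{R}_z(\lambda_2 - \lambda_1)\, \mathbf{R}_y(\theta_2)$, which depends only on the longitude difference.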
3. Attention Mechanisms and Structure-Preserving Architectures
A diverse range of Earth system transformers implement adaptations of self-attention to support spatial, temporal, and ensemble dependencies:
- Field-Space Attention: Field-Space Transformers (FSTs) compute attention directly in the physical (field) domain rather than in a latent vector space, operating on multi-scale decompositions (hierarchies over HEALPix grids) of geophysical fields. These attention operations preserve physical units, spatial locality, and conservation laws, enabling the enforcement of priors such as divergence-free structure or energy conservation. Layer updates correspond to structure-preserving deformations of the multi-scale field state.
FSTs maintain intermediate model states as interpretable fields at each layer, allowing for direct visualization and embedding of scientific constraints (Witte et al., 23 Dec 2025).
- Ensemble Interaction Attention: Self-Attentive Ensemble Transformers (SAET) apply attention over ensemble members (rather than across space or time), allowing dynamic, member-wise post-processing of probabilistic Earth system model outputs. The network directly outputs an ensemble of spatially coherent fields, preserving the multivariate spatial correlation structure and calibrating ensemble spread using a non-parametric, permutation-invariant attention mechanism (Finn, 2021).
- Spatiotemporal Cuboid Attention: Models such as Earthformer apply blockwise ("cuboid") space-time attention, partitioning high-dimensional data tensors into local spatiotemporal cuboids and applying self-attention in parallel, supplemented by global vectors to capture long-range dependencies but with reduced complexity versus full quadratic self-attention (Gao et al., 2022).
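The cuboid-attention idea — partition a space-time tensor into local blocks and run self-attention independently inside each block — can be sketched in a few lines. This is a single-head toy version without the learned projections or global vectors of the full Earthformer layer; all shapes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cuboid_attention(x, cuboid=(2, 4, 4)):
    """Self-attention restricted to local space-time cuboids (sketch only)."""
    T, H, W, d = x.shape
    ct, ch, cw = cuboid
    # Rearrange (T, H, W, d) -> (num_cuboids, cuboid_size, d)
    x5 = x.reshape(T // ct, ct, H // ch, ch, W // cw, cw, d)
    x5 = x5.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, ct * ch * cw, d)
    # Scaled dot-product attention within each cuboid in parallel
    scores = softmax(x5 @ x5.transpose(0, 2, 1) / np.sqrt(d))
    out = scores @ x5
    # Undo the rearrangement
    out = out.reshape(T // ct, H // ch, W // cw, ct, ch, cw, d)
    return out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, d)

x = np.random.randn(4, 8, 8, 16)   # (time, lat, lon, channels)
y = cuboid_attention(x)
print(y.shape)  # (4, 8, 8, 16)
```

Because attention cost is quadratic only in the cuboid size rather than in the full tensor size, this decomposition scales to high-resolution spatiotemporal data.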
4. Architectural Variants and Scaling Laws
Recent advances have produced highly scalable transformer architectures for continental-to-global modeling:
- Dense Vision Transformers with Multi-Channel Inputs: ORBIT, a 113B-parameter foundation model, uses a ViT backbone where each channel corresponds to a distinct climate variable (surface and vertical levels). Channel-wise patch embeddings are cross-attended to aggregate multimodal physical input, and additional LayerNorm is applied to stabilize attention for extremely large models (Wang et al., 2024).
- Hybrid Parallelism for Exascale Scaling: To enable training at this scale, Hybrid Sharded Tensor-Data Orthogonal Parallelism (Hybrid-STOP) shards both model weights and mini-batches across thousands of GPUs, efficiently decomposing consecutive matrix multiplications in both the forward and backward passes. This achieves sustained throughput up to 1.6 exaFLOPS and scales up with strong efficiency across heterogeneous supercomputing infrastructure (Wang et al., 2024).
- Autoregressive Earth Observation Transformers: EarthPT and derivatives leverage autoregressive self-supervision over remotely sensed time series, with each timestamp-and-location pair forming a token. These models incorporate explicit temporal sine/cosine date embeddings, eliminate learned positional embeddings, and operate in a purely autoregressive, causal-attention regime. Training datasets can, in principle, reach the quadrillion-token scale, making these models amenable to Chinchilla-style scaling and opening a pathway to near-unbounded parameter growth for Earth system modeling (Smith et al., 2023).
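Two ingredients of this autoregressive setup — explicit sine/cosine date embeddings attached to each observation token, and a causal mask so each token attends only to earlier observations — can be sketched as follows. This is an illustrative reconstruction, not EarthPT's actual code; the embedding dimensionality and feature sizes are assumptions.

```python
import numpy as np

def date_embedding(day_of_year, dim=8):
    """Explicit sine/cosine embedding of the observation date."""
    t = 2 * np.pi * day_of_year / 365.25
    k = np.arange(1, dim // 2 + 1)
    return np.concatenate([np.sin(k * t), np.cos(k * t)])

def causal_mask(n):
    """Each token may attend only to itself and earlier observations."""
    return np.tril(np.ones((n, n), dtype=bool))

days = np.array([10, 45, 120, 300])          # irregular observation dates
tokens = np.stack([np.concatenate([np.random.randn(16), date_embedding(d)])
                   for d in days])           # 16 spectral features + 8 date dims
mask = causal_mask(len(days))
print(tokens.shape, mask.sum())  # (4, 24) 10
```

Because the date is embedded explicitly, no index-based positional embedding is needed, and irregularly sampled time series are handled naturally.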
5. Super-Resolution, Downscaling, and Spectral Bias Mitigation
Transformers have been systematically extended for Earth system super-resolution and downscaling:
- Hybrid ViT–SIREN/INR Models: ViSIR and ViFOR combine standard ViT patch embedding and multi-layer self-attention with neural implicit representation subnetworks (SIREN, FOREN). These subnetworks use trainable sinusoidal activation functions, explicitly modeling both low- and high-frequency components:

$$\sigma(x) = \sin(\omega x + \varphi),$$

where $\omega$, $\varphi$ are learnable. This mitigates spectral bias and recovers sharp subgrid-scale features, improving PSNR over baseline SIREN and ViT models (Zeraatkar et al., 18 Feb 2025, Zeraatkar et al., 10 Feb 2025). ViFOR further supports multi-image fusion via self-attention over temporally or spatially stacked observations, enabling superior global and local reconstruction accuracy.
- Physically-Constrained Super-Resolution: Field-Space Transformers provide principled multiscale super-resolution, with conservation laws enforceable at each scale, while achieving state-of-the-art error metrics (e.g., $0.888$ K RMSE vs. $0.904$–$1.032$ K for ViT/U-Net on global temperature upscaling) at a fraction of the parameter count (Witte et al., 23 Dec 2025).
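A minimal sketch of a sinusoidal implicit-representation layer in the spirit of SIREN/FOREN: a dense layer followed by a sine activation with a frequency scale that would be trained along with the weights. The class name, initialization scale, and output head are illustrative assumptions; the training loop is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

class SinusoidalLayer:
    """Dense layer with sinusoidal activation sin(omega * (Wx + b)).

    Illustrative of SIREN-style implicit networks; omega is the (learnable)
    frequency scale that lets the network capture high-frequency detail.
    """
    def __init__(self, d_in, d_out, omega=30.0):
        self.W = rng.normal(scale=1.0 / d_in, size=(d_out, d_in))
        self.b = np.zeros(d_out)
        self.omega = omega

    def __call__(self, x):
        return np.sin(self.omega * (self.W @ x + self.b))

# Map normalized (lat, lon) coordinates to a reconstructed field value:
layer = SinusoidalLayer(2, 64)
head = rng.normal(size=64) / 8.0             # linear readout
coords = np.array([0.25, -0.4])
value = head @ layer(coords)                 # scalar field value at this point
print(np.isfinite(value))  # True
```

Because the sine activations span many frequencies, such a network can fit sharp subgrid-scale structure that ReLU-style networks, biased toward low frequencies, tend to smooth out.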
6. Evaluation, Performance, and Applications
Empirical benchmarks consistently show the superiority of Earth system transformers relative to conventional methods:
- Forecasting Skill: EPT-2 substantially outperforms operational numerical weather prediction models (ECMWF HRES, ENS mean), improving 10 m wind RMSE at 240 h by 12% and 2 m temperature RMSE by 12% relative to ensemble means, despite a much smaller ensemble size and reduced inference cost (Molinaro et al., 13 Jul 2025).
- Earth System Predictability: ORBIT demonstrates both strong-scaling efficiency up to 49,152 AMD GPUs and improved anomaly correlation skill out to 10 days, with larger models achieving lower wRMSE and higher wACC (Wang et al., 2024).
- Subsurface Property Reconstruction: Transparent Earth delivers >3× reduction in mean absolute error for stress angle prediction given multimodal input context, and the architecture generalizes smoothly with growing data and increased model capacity (Mazumder et al., 2 Sep 2025).
Applications span long-term Earth observation forecasting (EO time series), deterministic and probabilistic weather prediction, super-resolution downscaling, in-situ to global multimodal field reconstruction, and uncertainty-aware ensemble post-processing.
7. Interpretability, Scientific Integration, and Future Perspectives
Earth System Transformers increasingly emphasize physical interpretability and scientific utility:
- Interpretability: Field-Space Transformers output physically meaningful latent fields at every layer, allowing direct analysis of multiscale anomalies and context integration. In ensemble transformers, post-processed members preserve spatial correlation patterns closely matching physically simulated ensembles (Witte et al., 23 Dec 2025, Finn, 2021).
- Physics Integration: Newer architectural designs explicitly preserve or enable enforcement of field-level conservation, divergence-free structure, or energy norms, promoting scientific robustness and reliability.
- Scaling Prospects: The data supply for transformer-based Earth system modeling is, in principle, unbounded—remotely sensed EO and climate model simulation can provide quadrillion-plus-token corpora, unlocking “Large Observation Models” at unprecedented parameter scales (Smith et al., 2023).
- Continuing Research: Active research targets include the development of Earth-system scaling laws, seamless integration with physical process constraints, generative field modeling (e.g., diffusion on the sphere), improved depth and vertical encoding, and automated adaptation to arbitrary new spatial or modality inputs.
These advances collectively position Earth System Transformers as a principal computational abstraction and operational paradigm for data-driven, interpretable, and physically integrated Earth system science across the next decade.