
Shifted Earth Transformer (Searth)

Updated 16 January 2026
  • Shifted Earth Transformer is a physics-informed model that embeds Earth’s geophysical priors, such as zonal periodicity and meridional boundaries, to enhance global weather forecasting.
  • It introduces an innovative cyclic shift and asymmetric masking mechanism in its multi-head self-attention to account for Earth's natural geometric constraints.
  • The Relay Autoregressive fine-tuning strategy enables long-horizon forecasting with drastically reduced memory and compute demands compared to traditional methods.

The Shifted Earth Transformer (Searth Transformer) is a physics-informed transformer architecture designed for global medium-range weather forecasting. It directly encodes the Earth's geospheric physical priors into the model architecture, enabling physically consistent global information exchange through the incorporation of zonal periodicity and meridional boundaries. The model is coupled with a novel Relay Autoregressive (RAR) fine-tuning scheme that enables efficient long-horizon learning under constrained memory budgets. The YanTian forecasting model—built using these innovations—demonstrates state-of-the-art performance at one-degree resolution with drastically reduced computational costs compared to conventional approaches (Li et al., 14 Jan 2026).

1. Model Architecture and Core Mechanisms

YanTian operates as an encoder–core–decoder stack, employing repeated Searth Transformer blocks as its computational backbone. Each block is composed of two serial self-attention sub-blocks:

  • Earth-aware Multi-head Self-Attention (E-MSA): Functionally identical to Swin's window-based multi-head self-attention (W-MSA), E-MSA computes standard self-attention within non-overlapping local windows on a spatial feature map $X \in \mathbb{R}^{H \times W \times C}$.
  • Shifted Earth Multi-head Self-Attention (SE-MSA): SE-MSA introduces a cyclic shift of the feature map—periodic in longitude (east-west) and non-periodic with masking in latitude (north-south)—prior to window partitioning. Attention masking is asymmetric: wrap-around east-west pairs are unmasked to encode Earth's zonal periodicity, but links crossing the poles are permanently masked.

Attention within a window $i$ uses:

$$\text{Attention}_i(Q, K, V) = \text{Softmax}\left(\frac{Q_i K_i^T}{\sqrt{d}} + M_i\right) V_i$$

where $M_i$ is the masking matrix. For SE-MSA, the feature map undergoes a spatial roll:

$$X' = \mathrm{roll}\left(X,\ \text{shift} = (\lfloor w_h/2 \rfloor,\ \lfloor w_w/2 \rfloor),\ \text{axis} = (H, W)\right)$$

Here, the roll is periodic in longitude ($W$) but zero-padded or masked in latitude ($H$).
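The roll above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: the function name, the zero-fill choice for vacated polar rows, and the (H, W, C) layout are our assumptions.

```python
import numpy as np

def searth_roll(x, wh, ww):
    """Shift a (H, W, C) feature map before window partitioning, SE-MSA style:
    cyclic along longitude (axis 1), non-cyclic along latitude (axis 0)."""
    # Periodic east-west: columns wrap around the dateline.
    x = np.roll(x, shift=ww // 2, axis=1)
    # Non-periodic north-south: rows shift without wrapping; vacated polar
    # rows are zero-filled (their attention links are masked anyway).
    s = wh // 2
    out = np.zeros_like(x)
    out[s:] = x[: x.shape[0] - s]
    return out
```

In a full Swin-style implementation the latitude shift could equally be a plain roll plus masking; zero-filling here just makes the non-periodicity explicit.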

The masking $M_{pq}^{SE}$ for tokens $p$, $q$ after this shift is defined as:

  • $0$ if $p$ and $q$ lie within the same shifted window or are adjacent across the east–west (longitude) boundary,
  • $-\infty$ if $p$ and $q$ cross meridional (latitude) boundaries at the poles.
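A minimal NumPy rendering of the windowed attention formula with such an additive mask (shapes and names are illustrative):

```python
import numpy as np

def masked_window_attention(Q, K, V, M):
    """Scaled dot-product attention within one window; M is the additive
    mask (0 = attend, -inf = blocked). Q, K, V: (n, d); M: (n, n)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + M
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)            # -inf entries get weight 0
    return w @ V
```

Blocked pairs receive exactly zero attention weight, so no information flows across a masked (polar) boundary.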

No explicit spherical positional embeddings are built into YanTian; rather, the shift-and-mask structure itself encodes the positional prior on the sphere. A plausible implication is that shift-and-mask acts as an implicit positional module, removing the need for explicit latitude–longitude embeddings.

2. Integration of Geophysical Priors

The design recognizes two critical geophysical constraints:

  • Zonal Periodicity: By implementing a cyclic roll in the longitude axis, windows that straddle the dateline (0°/360°) are seamlessly merged and unmasked. This ensures east–west continuity in global attention patterns and information flow, without additional learned parameters.
  • Meridional Boundaries (Poles): The absence of cyclic rolling in latitude, and strict masking of attention links that cross the poles, enforces the Earth's physical boundary conditions. Attention computation at the poles strictly prohibits wrap-around, thus preserving the natural structure of atmospheric flows constrained by geography.

The formal definition for SE-MSA masking between tokens $p = (i_p, j_p)$ and $q = (i_q, j_q)$ is:

$$\Delta\phi = |i_p - i_q|, \qquad \Delta\lambda = |j_p - j_q| \pmod{W}$$

with

$$M_{pq}^{SE} = \begin{cases} 0 & \text{if } \lfloor \Delta\phi/w_h \rfloor = 0 \,\wedge\, \lfloor \Delta\lambda/w_w \rfloor = 0 \text{ (same shifted window)} \\ & \quad \text{or } \left(\Delta\lambda \geq W - w_w \,\wedge\, \text{same window index latitudinally}\right) \\ -\infty & \text{otherwise, in particular if } \Delta\phi \geq w_h \text{ (polar violation)} \end{cases}$$
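The piecewise rule can be transcribed almost directly. This is a sketch: the exact window-index bookkeeping after the shift is simplified, and the helper name is ours, not the paper's.

```python
import numpy as np

def se_mask_entry(p, q, wh, ww, W):
    """Mask entry M^SE_{pq} for tokens p = (i_p, j_p), q = (i_q, j_q) on a
    grid with window size (wh, ww) and W longitude columns."""
    (ip, jp), (iq, jq) = p, q
    dphi = abs(ip - iq)            # meridional (latitude) separation
    dlam = abs(jp - jq) % W        # zonal separation, modulo the dateline
    same_window = (dphi // wh == 0) and (dlam // ww == 0)
    dateline_pair = (dlam >= W - ww) and (ip // wh == iq // wh)
    return 0.0 if (same_window or dateline_pair) else -np.inf
```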

3. Relay Autoregressive Fine-Tuning Strategy

The RAR scheme is introduced to overcome the prohibitive memory and compute demands of classical autoregressive fine-tuning, which requires storing all $T$ intermediate states for backpropagation. RAR decomposes the forecast horizon ($T = M \cdot k$ for $M$ stages of length $k$) and fine-tunes each sub-stage independently. After each sub-stage, the hidden state is relayed into the next with the computational graph detached.

RAR Pseudocode

for each sample sequence {X_0, X_1, ..., X_T}:
    X_in = (X_0, X_1)
    for s in range(M):                       # M relay sub-stages
        L_s = 0
        X_prevprev, X_prev = X_in
        for t in range(k):                   # k autoregressive steps
            X_pred = f_theta(X_prevprev, X_prev)
            L_s += loss(X_pred, X_{s*k + t + 2})   # target: next true state
            X_prevprev, X_prev = X_prev, X_pred
        L_s.backward()                       # backprop over k steps only
        theta -= eta * grad_theta(L_s)
        X_in = (detach(X_prevprev), detach(X_prev))   # cut graph, relay state
RAR fine-tuning requires memory $O(k)$, independent of the total horizon $T$. By detaching the graph and relaying only the hidden state, RAR enables training over very long forecast sequences (e.g., 15 days) on commodity hardware. This suggests that scalable training for other chaotic spatiotemporal systems is feasible under RAR.
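The $O(k)$ property can be seen in closed form on a toy autoregressive model. Below, $x_{t+1} = a \cdot x_t$ stands in for f_theta, and resetting the state sensitivity dx_da at each relay is the analytic analogue of graph detachment; everything here is an illustrative sketch, not the paper's training code.

```python
def rar_finetune(a, targets, k, lr=0.01):
    """RAR loop on the scalar AR model x_{t+1} = a * x_t. Gradients are
    accumulated only within each k-step sub-stage; at the relay the state
    carries over but its sensitivity to a is reset (the 'detach'), so the
    bookkeeping is O(k) regardless of the total horizon T = M * k."""
    M = len(targets) // k
    x = 1.0                          # relayed state (X_in)
    for s in range(M):
        dx_da, grad = 0.0, 0.0       # sensitivity reset = graph detachment
        for t in range(k):
            x_next = a * x
            dx_da = x + a * dx_da    # chain rule, confined to this sub-stage
            grad += 2.0 * (x_next - targets[s * k + t]) * dx_da
            x = x_next
        a -= lr * grad               # parameter update after each sub-stage
    return a
```

Fitting a trajectory generated with a = 1.1, starting from a = 1.0, nudges the parameter toward the true growth rate one sub-stage at a time.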

4. Empirical Performance and Computational Efficiency

YanTian, equipped with Searth blocks and RAR fine-tuning, achieves notable improvements in mid-range global weather forecast metrics:

  • Skillful Forecast Lead Time (Z500): Whereas ECMWF HRES reaches an anomaly correlation coefficient (ACC) of 0.6 at approximately 9 days, YanTian extends this to 10.3 days. Competing AI models (GraphCast, FuXi, PanGu) exhibit lead times in the 9.5–10.2 day range.
  • Multi-variable Forecast Accuracy: At one-degree resolution, YanTian matches or slightly exceeds contemporary AI baselines in upper-air and surface metrics (U₁₀, V₁₀, T₂M, MSL, Z₅₀₀, T₅₀₀, U₅₀₀, V₅₀₀), maintaining ACC > 0.6 up to ~9–10.3 days and demonstrating RMSE curves comparable or superior to state-of-the-art.
  • Resource Utilization: Classical AR5d training (5 days, 20 steps) demands ≈80 GB of GPU memory and ~200 hours of wall-clock time. RAR5d reduces this to <25 GB and ~3 hours for an equivalent forecast horizon. Overall, RAR achieves a ~200× lower memory × time product than standard AR fine-tuning, with only minor short-horizon trade-offs.
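The ~200× figure follows directly from the quoted numbers (treating <25 GB as an upper bound, so the true ratio is at least this large):

```python
ar_mem_gb, ar_hours = 80, 200    # classical AR5d: ~80 GB, ~200 h
rar_mem_gb, rar_hours = 25, 3    # RAR5d: <25 GB, ~3 h
ratio = (ar_mem_gb * ar_hours) / (rar_mem_gb * rar_hours)
print(round(ratio))              # ≈ 213, consistent with the ~200x claim
```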

5. Generalization and Algorithmic Implications

The asymmetric shift-and-mask paradigm of Searth Transformer generalizes beyond atmospheric science. Its principles apply wherever the modeled manifold possesses one periodic dimension (e.g., longitude, circumpolar ocean flows) and one bounded dimension (e.g., latitude, vertical depth). RAR fine-tuning decouples training sequence length from memory, allowing tractable learning of long-horizon dynamics for other auto-regressive, chaotic spatiotemporal systems such as ocean waves, sea ice, and solar wind forecasting.

This suggests broader applicability of Searth and RAR to predictive modeling across geophysical, planetary, and climate-related domains.

6. Context Within Weather and Climate Modeling

Searth Transformer represents an evolution in physics-informed neural architectures for structured geophysical data. By encoding Earth's geospheric priors (zonal continuity, meridional boundaries) directly into transformer attention mechanics, the model achieves physically consistent global exchange without ad hoc heuristic masking or expensive learned positional parameters. Its adoption of RAR fine-tuning facilitates scalable model training for medium to long-range forecasting—central challenges in operational meteorology and climate modeling.

YanTian, as implemented with Searth and RAR, sets new efficiency and accuracy benchmarks for global mid-range numerical weather prediction (NWP) at comparable or lower computational costs than traditional NWP systems and contemporary AI approaches (Li et al., 14 Jan 2026).
