Non-Stationary Transformers
- Non-stationary Transformers are models designed to handle data with shifting statistical properties, addressing mean-drift, heteroscedasticity, and regime changes.
- They integrate techniques like de-stationary attention, wavelet-based multi-scale decomposition, and continuous-time mechanisms to adapt to evolving temporal and spatial patterns.
- Empirical benchmarks show these models achieve state-of-the-art performance in time-series forecasting, reinforcement learning, and spatial prediction, while maintaining efficient computational scaling.
Non-stationary Transformers are a class of Transformer architectures and training frameworks explicitly adapted to handle data whose statistical properties—mean, variance, distributional support, structural dependencies—change over time or space. These models address limitations of standard self-attention when deployed in dynamic environments, on real-world time series with evolving patterns, or on spatial fields exhibiting location-dependent behavior. Non-stationary variants span modular stationarization pipelines, wavelet decomposition, meta-learning over shifting priors, hierarchically stochastic generative blocks, convolution-guided attention, and explicit continuous-time mechanisms, establishing state-of-the-art performance in time-series forecasting, reinforcement learning, and spatial prediction.
1. Mathematical Foundations of Non-stationarity in Sequence Modeling
A process $X_t$ is (weakly) stationary if its first two moments are time-invariant: $\mathbb{E}[X_t] = \mu$, $\mathrm{Var}(X_t) = \sigma^2$, and $\mathrm{Cov}(X_t, X_{t+h}) = \gamma(h)$ for all $t$. Real-world data often violate these conditions through mean-drift, heteroscedasticity, discontinuities, or temporally/spatially localized regime shifts. In non-stationary reinforcement learning, dynamics and reward functions change over time, and performance is evaluated via dynamic regret $\mathrm{Reg}_T = \sum_{t=1}^{T}(r_t^* - r_t)$, where $r_t^*$ is the per-step oracle reward (Chen et al., 22 Aug 2025).
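As a concrete illustration, the dynamic-regret criterion can be computed directly from a reward trace. The helper name `dynamic_regret` and the toy two-regime reward stream below are illustrative choices, not drawn from the cited work:

```python
import numpy as np

def dynamic_regret(oracle_rewards, achieved_rewards):
    """Dynamic regret: cumulative gap to the per-step oracle reward r_t*.

    Unlike static regret, the comparator is the best achievable reward at
    *each* step, so the measure stays meaningful when the reward function
    drifts or switches over time.
    """
    oracle = np.asarray(oracle_rewards, dtype=float)
    achieved = np.asarray(achieved_rewards, dtype=float)
    return float(np.sum(oracle - achieved))

# Two-regime environment: the oracle reward jumps from 1.0 to 2.0 at t = 5,
# while a stale policy keeps earning 1.0 throughout the second regime.
oracle = [1.0] * 5 + [2.0] * 5
stale = [1.0] * 10
```

A policy that never adapts accumulates regret linearly in the length of the post-change regime, which is exactly the behavior restart and windowing mechanisms (Section 3) are designed to avoid.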
In time series, piecewise-stationary sources are modeled as mixtures over unknown temporal partitions: within each segment the data follow a stationary law with segment-specific parameter $\theta$, and segment boundaries are drawn from a prior over partitions (Genewein et al., 2023). These are sampled hierarchically—Bernoulli segments with per-segment parameters, or multivariate trends decomposed by wavelet transforms (Sasal et al., 2022, Hu et al., 2024). In spatial learning, non-stationarity is typified either by scale-driven effects (correlation length exceeding the domain size) or by deterministic additive trends superimposed on a stationary residual, with the ratio of trend variance to total variance quantifying trend strength (Liu et al., 2022).
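A minimal sketch of such a hierarchical sampling scheme, assuming Bernoulli segments whose parameter is resampled at random switch points (the geometric switching prior and uniform parameter prior here are illustrative assumptions):

```python
import numpy as np

def sample_piecewise_bernoulli(total_len, switch_prob, rng):
    """Sample a binary sequence from a piecewise-stationary source.

    At each step a new segment begins with probability `switch_prob`; each
    segment draws its own Bernoulli parameter theta ~ Uniform(0, 1) and
    emits i.i.d. bits with that parameter until the next switch.
    """
    theta = rng.uniform()
    out = np.empty(total_len, dtype=np.int64)
    for t in range(total_len):
        if t > 0 and rng.uniform() < switch_prob:
            theta = rng.uniform()          # regime change: resample parameter
        out[t] = rng.uniform() < theta     # Bernoulli(theta) emission
    return out

rng = np.random.default_rng(0)
seq = sample_piecewise_bernoulli(1000, 0.01, rng)
```

Sequences drawn from this kind of prior are what the meta-learning results of Section 3 assume as training data: the learner never observes the segment boundaries or parameters, only the emitted bits.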
2. Architectural Strategies for Non-stationary Data
Non-stationary Transformers modify canonical attention architectures to preserve, re-inject, or adapt to dynamic behaviors.
- Series Stationarization and De-stationary Attention: Inputs are normalized per series, producing $x' = (x - \mu)/\sigma$; de-normalization restores the original scale at the output. To avoid "over-stationarization," de-stationary attention layers learn a multiplicative scale $\tau$ and additive offsets $\Delta$ via small MLPs applied to the removed statistics, yielding attention of the form $\mathrm{softmax}\big((Q'K'^\top \tau + \mathbf{1}\Delta^\top)/\sqrt{d_k}\big)$ and thereby restoring distinguishability and diversity in the softmax attention maps (Liu et al., 2022).
- Wavelet-based Multi-scale Decomposition: The shift-invariant MODWT decomposes the univariate input into $J$ detail components $W_1, \dots, W_J$ and a final smooth component $V_J$, without downsampling. Separate Transformer encoders model each band, and their outputs are summed for perfect reconstruction. This approach robustly captures local non-stationarities, multi-frequency trends, and boundary effects (Sasal et al., 2022).
- Hierarchical Probabilistic Generative Modules: HTV-Trans (Wang et al., 2024) integrates a multi-level VAE generative model atop the stationarized input, learning hierarchical latent vectors per scale and injecting their upsampled interpolations into the Transformer encoder input. This lets the attention mechanism directly access recovered non-stationary signals at multiple temporal resolutions, with a tunable weighting parameter balancing the injected information.
- Wavelet Convolution, Period-Aware Attention, and Channel-Temporal Mixed MLP: TwinS (Hu et al., 2024) utilizes variable-width convolutional kernels (powers of two) to extract nested periodic components; attention scores are modulated by per-head period relevance matrices computed via depthwise separable convnets, enabling detection of aperiodic or missing cycles. Final representations are mixed joint-channel and joint-time through a single MLP to handle hysteresis across series.
- Continuous-Time Attention and Neural ODE Lifting: ContiFormer (Chen et al., 2024) generalizes self-attention via continuous inner products between query and key trajectories solved by Neural ODEs, aligning with arbitrary query times on irregularly sampled data. Attention scores are obtained by integrating these inner products over the continuous time domain, and value aggregation proceeds via numerical quadrature, strictly subsuming prior kernelized or time-aware attention classes.
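The stationarize / attend / de-normalize pipeline of the first bullet can be sketched in NumPy as follows. In the actual model the multiplicative scale (tau) and additive offsets (delta) are predicted by small MLPs from the removed statistics; this sketch, as a simplifying assumption, passes them in as plain arguments:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def stationarize(x):
    """Per-series normalization; statistics are kept for de-normalization."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True) + 1e-8
    return (x - mu) / sigma, mu, sigma

def destationary_attention(q, k, v, tau, delta):
    """Attention over stationarized inputs with re-injected statistics.

    `tau` (a positive scalar) and `delta` (a length-L vector broadcast
    across rows) stand in for the MLP-predicted scale and shift that
    compensate for the statistics removed by normalization.
    """
    d = q.shape[-1]
    logits = (q @ k.T * tau + delta[None, :]) / np.sqrt(d)
    return softmax(logits, axis=-1) @ v

rng = np.random.default_rng(1)
L, d = 8, 4
x = np.cumsum(rng.normal(size=(L, d)), axis=0)   # mean-drifting toy series
x_norm, mu, sigma = stationarize(x)
out = destationary_attention(x_norm, x_norm, x_norm, tau=1.5,
                             delta=rng.normal(size=L))
y = out * sigma + mu                              # de-normalize the output
```

Without the tau/delta correction, attention over the normalized inputs would be computed as if every series shared the same scale, which is the "over-stationarization" failure mode described above.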
3. Memory-Based and Meta-Learning Approaches for Non-stationarity
Non-stationarity often requires adaptive or meta-learned models able to track latent regime shifts.
- Piecewise-Stationary Meta-Learning: Transformers trained on sequences sampled from hierarchical priors over partitions and segment parameters can approximate Bayes-optimal prediction, implementing implicit amortized Bayesian inference over latent switches (Genewein et al., 2023). The self-attention mechanism acts as sufficient memory to track segment posteriors, provided model capacity and positional encoding are adequate.
- Zero-shot Meta-RL for Nonstationary Agents: BeTrans (Mon-Williams et al., 2023) operates by inferring a discrete or Gaussian latent variable at each timestep, summarizing the human agent's behavior by applying the Transformer to a recent window of observations. This context-conditional latent is concatenated to the robot's state, ensuring rapid adaptation and robust performance under mid-episode switches or noisy observations.
- In-context Scheduling and Restart Mechanisms: In reinforcement learning, Transformers can emulate classical dynamic adaptation algorithms (MASTER, window scheduler, statistical test-restart) purely via self-attention and internal MLPs. Pretrained transformers match state-of-the-art dynamic regret bounds up to minimax-optimal scaling in non-stationary environments (Chen et al., 22 Aug 2025).
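A toy version of the statistical test-restart scheduling that such pretrained models are shown to emulate in-context can be written as follows. The window/threshold rule here is a deliberate simplification for illustration, not the MASTER algorithm itself:

```python
import numpy as np

def restart_on_shift(rewards, window, threshold):
    """Restart estimation when the recent window's mean deviates from the
    running (post-restart) mean by more than `threshold`.

    Returns the list of timesteps at which a restart fired; `start` marks
    where the currently trusted history begins.
    """
    restarts, start = [], 0
    for t in range(window, len(rewards)):
        hist_mean = np.mean(rewards[start:t])          # running estimate
        recent_mean = np.mean(rewards[t - window:t])   # recent evidence
        if abs(recent_mean - hist_mean) > threshold:
            restarts.append(t)
            start = t                                  # discard stale history
    return restarts

# Reward mean jumps from 0 to 5 at t = 50; one restart should fire shortly after.
rewards = np.concatenate([np.zeros(50), np.full(50, 5.0)])
points = restart_on_shift(rewards, window=10, threshold=2.0)
```

The detection delay (the gap between the change at t = 50 and the restart) is the price paid for statistical confidence, and tuning the window/threshold trade-off is exactly what the scheduling algorithms cited above optimize.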
4. Empirical Benchmarks and Performance Analyses
The efficacy of non-stationary Transformer variants has been established over a broad spectrum of temporal, spatial, and RL tasks.
- Forecasting Tasks: Non-stationary Transformers (Liu et al., 2022) reduce MSE by 49.43% (Transformer), 47.34% (Informer), and 46.89% (Reformer) compared to vanilla baselines, with consistent improvements over Autoformer, FEDformer, and ETSformer. HTV-Trans (Wang et al., 2024) achieves best-in-class error rates for long-horizon MTS prediction (e.g. ETTh1, 720-step: MSE=0.489 vs. Autoformer’s 0.599).
- Wavelet and Multi-scale Models: W-Transformers (Sasal et al., 2022) outperform ARIMA, SETAR, DeepAR, standard Transformers, and other baselines across RMSE, MAE, and sMAPE in website traffic, stock, sunspot, and small-sample dengue datasets, with especially pronounced gains on long-term horizons.
- TwinS and Multivariate Time Series: TwinS (Hu et al., 2024) matches or surpasses PatchTST, MICN, TimesNet, and FEDformer, reducing MSE by up to 25.8% across eight major datasets and long-term horizons.
- Spatial Nonstationarity: Vision Transformers and SwinT (Liu et al., 2022) cut relative prediction errors on geostatistical porosity fields by 1.5–6.6pp (scale-driven) and 4.0pp (trend-residual) versus CNNs, with gains increasing with severity of nonstationarity or trend strength.
- Continuous-time/Irregular Time Series: ContiFormer (Chen et al., 2024) achieves 60–75% lower RMSE/MAE versus vanilla Transformer and Latent ODE models in synthetic spiral and real-world irregular sequence forecasting and classification.
- Reinforcement Learning: Pretrained transformers with in-context adaptation match the dynamic regret and suboptimality of classical non-stationary algorithms in bandit and RL domains, remaining robust under moderate-to-severe regime changes (Chen et al., 22 Aug 2025).
5. Practical Guidelines, Limitations, and Extensions
Correct handling of non-stationarity is fundamental to Transformer deployment in real-world settings.
- Algorithmic Integration: Non-stationary modifications are lightweight, adding only minimal parameter overhead (small MLP projectors for the de-stationary factors), and do not alter worst-case computational scaling ($\mathcal{O}(L^2)$ for full attention; $\mathcal{O}(L \log L)$ for efficient variants) (Liu et al., 2022).
- Configuration: Shared de-stationary parameters across layers, sliding-window normalization, and meta-training data drawn from realistic priors provide optimal stability and accuracy (Genewein et al., 2023, Wang et al., 2024).
- Limitations: Series stationarization and de-stationary attention primarily correct for mean/variance shifts; seasonality, time-varying higher moments, and richer nonstationarity phenomena may require hierarchical latent-variable models, wavelet/patch embeddings, or continuous-time dynamics (Sasal et al., 2022, Hu et al., 2024, Chen et al., 2024).
- Extensions: Hierarchical mixture-of-experts architectures for unknown or hierarchical switching, continuous latent-state models (e.g., HMMs integrated with memory-based attention), and meta-learning for changepoint detection are open directions (Genewein et al., 2023, Wang et al., 2024).
- Spatial Context: Full self-attention architectures efficiently span large domains and mitigate trend-driven nonstationarity; windowed variants (SwinT) preserve quasi-linear complexity for even larger spatial fields (Liu et al., 2022).
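Sliding-window normalization, mentioned above as a configuration choice, can be sketched as follows (a naive O(L·w) loop for clarity; the window length is an assumed hyperparameter):

```python
import numpy as np

def sliding_window_normalize(x, window):
    """Normalize each point by the mean/std of its trailing window, so the
    statistics track local drift instead of being fixed over the whole series."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for t in range(len(x)):
        lo = max(0, t - window + 1)
        w = x[lo:t + 1]                       # trailing window ending at t
        out[t] = (x[t] - w.mean()) / (w.std() + 1e-8)
    return out

# A pure linear trend: global normalization leaves the drift in place,
# while windowed normalization settles at a constant level once the
# window is full, since every window sees the same local geometry.
trend = np.arange(100, dtype=float)
z = sliding_window_normalize(trend, window=20)
```

Global normalization of the same trend would produce values that still grow monotonically across the series; the windowed variant removes the drift at the cost of discarding the global level, which the de-normalization step must restore.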
6. Theoretical Insights and Universal Approximation Results
Non-stationary Transformers possess provable expressive capabilities:
- Bayesian Optimality: Transformer models minimize sequential log-loss to converge (in expectation) to Bayesian mixture predictors when trained over a distribution of nonstationary tasks (Genewein et al., 2023).
- Universal Continuous-time Attention Representation: ContiFormer strictly generalizes all discrete and kernelized attentional frameworks on irregular sequence data, achieving universal approximation of desired attention matrices via well-posed vector-field solutions to ODEs (Chen et al., 2024).
- Dynamic Regret: Transformer models, with adequate pretraining and substitution of scheduling/restart mechanisms, achieve minimax dynamic regret bounds, matching expert nonstationary RL algorithms up to log factors (Chen et al., 22 Aug 2025).
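The Bayesian-optimality claim admits a standard formulation (notation here is generic, not taken verbatim from the cited paper): a predictor minimizing expected sequential log-loss over tasks $\theta$ drawn from a prior $w$ is driven toward the Bayesian mixture

```latex
\xi(x_{1:T}) \;=\; \sum_{\theta \in \Theta} w(\theta)\, p_\theta(x_{1:T}),
\qquad
\frac{1}{T}\,\log \frac{p_\theta(x_{1:T})}{\xi(x_{1:T})}
\;\le\; \frac{-\log w(\theta)}{T}
\;\xrightarrow{\;T \to \infty\;}\; 0,
```

since $\xi \ge w(\theta)\, p_\theta$ pointwise for any $\theta$ with positive prior mass. The per-step excess log-loss of the mixture relative to any component thus vanishes, which is the sense in which a sufficiently capable sequence model trained on such a task distribution approximates the Bayes-optimal predictor.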
7. Future Research Directions
Further advances in non-stationary Transformers are expected in several areas:
- Modeling Richer Priors: Extending from hand-designed to learned or hierarchical priors over switching (changepoint) processes and latent environmental dynamics.
- Integration with Latent State Models: Incorporating continuous-state graphical models (HMMs, latent ODEs) into the attention framework for principled reasoning under severe or abrupt distribution shifts.
- Task-Specific Embedding Strategies: Developing custom positional, temporal, or spatial encodings that adaptively modulate attention to extract nonstationary dependencies at multiple scales.
- Generalization and Robustness: Addressing extrapolation beyond trained context lengths, mitigating inductive-bias sensitivity, and systematically benchmarking across high-dimensional multivariate and irregular time series domains.
Non-stationary Transformers thus represent a critical evolution in sequential modeling, bridging meta-learning, stochastic hierarchies, and dynamic temporal-spatial reasoning, and yielding consistent gains in predictive accuracy and theoretical optimality.