De-stationary Attention Methods
- De-stationary attention is a family of methods that restores crucial statistical properties, such as mean and variance, lost during standard normalization in sequence models.
- It utilizes mechanisms like MLP-based scaling, frequency decomposition, and recurrent elements to adjust attention computations for non-stationary signals.
- Empirical results reveal significant reductions in forecasting errors and improved sensitivity to bursty, variable dynamics compared to traditional attention approaches.
De-stationary attention refers to a family of methodologies that explicitly address the limitations of standard attention mechanisms when modeling non-stationary or distribution-shifting data, particularly in time series forecasting and sequence modeling. These approaches modify the formulation or operational context of attention to recover or re-inject non-stationary information—such as trends, scale, or temporal bursts—lost during pre-processing or absent due to architectural bottlenecks. Both statistical insight and domain-decomposition motivate de-stationary attention, enabling models to better capture variable dynamics in non-stationary environments and to avoid the collapse of attention maps to indistinguishable forms.
1. Conceptual Motivation: Stationarity, Over-Stationarization, and Attention Bottlenecks
The effectiveness of attention-based models such as Transformers relies fundamentally on the ability of self-attention to differentiate among heterogeneous segments—sequences, modal contexts, or temporal windows. Many deep learning pipelines, especially for time series, apply series stationarization (z-score normalization per chunk) to stabilize learning, enforce predictive homogeneity, and combat distributional drift. However, this process removes chunk-specific statistics such as mean and variance, leading to “over-stationarization.” As a result, distinct temporal patterns, bursts, and amplitude shifts are erased, causing self-attention modules to generate nearly identical attention maps regardless of original chunk characteristics (Liu et al., 2022).
This uniformity violates the expressiveness needed for real-world non-stationary environments, rendering models insensitive to eventful or bursty phenomena. Addressing this tension—retaining both the predictability of stationary inputs and the expressiveness of non-stationary dependencies—forms the crux of de-stationary attention methods.
2. Mathematical Formulation and Mechanisms
De-stationary attention restores chunk‐specific information or adjusts attention operations so that models regain sensitivity to non-stationary phenomena. Three principal mechanisms appear in the literature.
2.1 De-Stationary Attention via Statistical Reinjection
“Non-stationary Transformers” formalize de-stationary attention by deriving the relationship between attention logits computed on the raw and the normalized series. Let $Q, K, V$ denote the queries, keys, and values obtained from the raw series and $Q', K', V'$ their counterparts from the normalized series. For a series normalized per chunk with mean $\mu_x$ and standard deviation $\sigma_x$, and assuming a linear embedding, the relation is:

$$Q = \sigma_x Q' + \mathbf{1}\mu_Q^\top, \qquad K = \sigma_x K' + \mathbf{1}\mu_K^\top,$$

yielding pre-softmax attention logits:

$$QK^\top = \sigma_x^2\, Q'K'^\top + \mathbf{1}\mu_Q^\top K^\top + \sigma_x Q'\mu_K \mathbf{1}^\top.$$

Softmax invariance to constant shifts within each row allows the rank-one correction term $\sigma_x Q'\mu_K \mathbf{1}^\top$ to collapse, exposing that raw-series attention is equivalent to:

$$\mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \mathrm{Softmax}\!\left(\frac{\tau\, Q'K'^\top + \mathbf{1}\Delta^\top}{\sqrt{d_k}}\right),$$

where the de-stationary scaling and shifting factors are defined as $\tau = \sigma_x^2$ and $\Delta = K\mu_Q$. Because tracking $\tau$ and $\Delta$ analytically through nonlinear embedding layers is impractical, these factors are approximated by two MLPs fed directly with the unnormalized inputs and their embedded statistics:

$$\log\tau = \mathrm{MLP}_\tau(\sigma_x, x), \qquad \Delta = \mathrm{MLP}_\Delta(\mu_x, x).$$

In practice, self-attention is replaced with the De-Stationary Attention operator:

$$\mathrm{Attn}(Q', K', V') = \mathrm{Softmax}\!\left(\frac{\tau\, Q'K'^\top + \mathbf{1}\Delta^\top}{\sqrt{d_k}}\right)V'.$$

This “resurrects” the scale and level biases erased by normalization, promoting diversity of attention scores commensurate with raw-series variance (Liu et al., 2022).
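The operator can be sketched in a few lines of NumPy. For demonstration the MLP-learned factors are replaced by their analytic values $\tau = \sigma_x^2$ and $\Delta = K\mu_Q$ (an assumption valid only for a linear embedding), which makes the equivalence to raw-series attention directly checkable:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def destationary_attention(Qp, Kp, Vp, tau, delta):
    """Softmax((tau * Q'K'^T + 1 delta^T) / sqrt(d_k)) V'."""
    d_k = Qp.shape[-1]
    logits = (tau * Qp @ Kp.T + delta[None, :]) / np.sqrt(d_k)
    return softmax(logits) @ Vp

# Sanity check: with tau = sigma^2 and delta = K mu_Q, attention over
# normalized projections matches attention over the raw projections,
# because the remaining correction term is constant within each row.
rng = np.random.default_rng(0)
L, d_k = 8, 4
Qp, Kp = rng.standard_normal((L, d_k)), rng.standard_normal((L, d_k))
Vp = rng.standard_normal((L, d_k))
sigma = 2.5
mu_Q, mu_K = rng.standard_normal(d_k), rng.standard_normal(d_k)
Q = sigma * Qp + mu_Q          # raw-series projections (linear embedding)
K = sigma * Kp + mu_K
raw = softmax(Q @ K.T / np.sqrt(d_k)) @ Vp
destat = destationary_attention(Qp, Kp, Vp, tau=sigma**2, delta=K @ mu_Q)
print(np.allclose(raw, destat))  # → True
```

In a trained model, `tau` and `delta` would come from the two small MLPs rather than these closed forms.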
2.2 De-Stationary Cross-Attention through Frequency Decomposition
AEFIN (Xiong et al., 11 May 2025) implements de-stationary attention by spectrally splitting each input into stable and unstable (non-stationary) components using frequency analysis:
- Compute the DFT per channel and retain the top-$K$ frequencies by amplitude.
- The inverse DFT of the retained frequencies reconstructs the stable component $x_s$ (high-amplitude content); the residual $x_{ns} = x - x_s$ carries the non-stationary content.
- De-stationary cross-attention routes information from the unstable to the stable parts: stable embeddings act as queries over keys and values derived from the unstable component.
- The cross-attention output provides enriched stable embeddings.
This fusion enables stable predictors (e.g., Informer, DLinear, SCINet variants) to benefit from non-stationary episodes, thus maintaining distinctiveness in their temporal representations (Xiong et al., 11 May 2025).
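The stable/unstable split can be sketched as follows, assuming the stable component is taken to be the inverse DFT of the $K$ highest-amplitude frequency bins; the function name and exact selection rule are illustrative, not AEFIN's published implementation:

```python
import numpy as np

def frequency_split(x, top_k):
    """Split a 1-D series into a 'stable' part built from its top_k
    highest-amplitude DFT bins and an 'unstable' residual."""
    spec = np.fft.rfft(x)
    # Keep only the top_k bins by amplitude; zero out the rest.
    keep = np.argsort(np.abs(spec))[-top_k:]
    mask = np.zeros_like(spec)
    mask[keep] = spec[keep]
    stable = np.fft.irfft(mask, n=len(x))
    return stable, x - stable

t = np.linspace(0.0, 1.0, 256, endpoint=False)
x = np.sin(2 * np.pi * 3 * t) \
    + 0.1 * np.random.default_rng(1).standard_normal(256)
stable, unstable = frequency_split(x, top_k=4)
print(np.allclose(stable + unstable, x))  # → True (decomposition is exact)
```

The residual definition guarantees an exact reconstruction, so no information is lost by the split; only its routing through the two branches differs.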
2.3 Element-Wise Recurrent De-Stationary Attention
In sequence models, “Breaking the Attention Bottleneck” replaces $O(n^2)$ softmax attention with a parameter-free, element-wise recurrence that combines each token's state with that of its predecessor, with context augmentation supplied by an inexpensive aggregate (e.g., a running mean) of previous hidden states.
This approach sidesteps the tendency toward static attention patterns in overparametrized, decoder-only models by refining local transitions (stationarity breaking) while cheaply incorporating global context (Hilsenbek, 2024).
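The exact update rule is specific to the cited work; the sketch below is only one plausible instantiation of the idea, assuming an element-wise sigmoid gate between adjacent token states plus a running-mean context term (all names and the gating choice are hypothetical):

```python
import numpy as np

def recurrent_destationary(x):
    """Element-wise recurrence over adjacent token states with a cheap
    global-context term (running mean of previous states).
    O(n*d) time, no learned attention parameters."""
    n, d = x.shape
    out = np.zeros_like(x)
    context = np.zeros(d)   # running mean of past states
    prev = np.zeros(d)      # previous token state
    for t in range(n):
        gate = 1.0 / (1.0 + np.exp(-x[t]))        # element-wise sigmoid gate
        out[t] = gate * prev + (1.0 - gate) * context
        context = (context * t + x[t]) / (t + 1)  # update running mean
        prev = x[t]
    return out

x = np.random.default_rng(2).standard_normal((16, 8))
y = recurrent_destationary(x)
print(y.shape)  # → (16, 8)
```

Note the update is causal: position `t` sees only states with index below `t`, which is what makes the scheme usable in decoder-only generation.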
3. Algorithmic Workflows and Pseudocode
The operational structure of de-stationary attention modules typically involves three phases: (1) data normalization/partition, (2) modified attention computation, and (3) output recomposition.
3.1 Series Normalization and Pre-Processing
For chunk-wise stationarization:
- Compute per-window mean $\mu_x$ and standard deviation $\sigma_x$.
- Apply $x' = (x - \mu_x)/\sigma_x$.

For the frequency split (AEFIN): compute the per-channel DFT, retain the top-$K$ frequencies, and set $x_s = \mathrm{IDFT}(\text{top-}K)$, $x_{ns} = x - x_s$.
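Chunk-wise stationarization and its inverse can be sketched as follows (a minimal NumPy version; the `eps` guard and axis conventions are illustrative):

```python
import numpy as np

def stationarize(x, eps=1e-8):
    """Per-window z-score normalization; returns the statistics needed
    to de-normalize the forecast later."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True) + eps
    return (x - mu) / sigma, mu, sigma

def destationarize(y_norm, mu, sigma):
    """Re-inject the stored per-window statistics into the output."""
    return y_norm * sigma + mu

# Round trip: a window with nonzero mean and large variance.
x = np.random.default_rng(3).standard_normal((96, 7)) * 5.0 + 2.0
x_norm, mu, sigma = stationarize(x)
print(np.allclose(destationarize(x_norm, mu, sigma), x))  # → True
```

The stored `mu` and `sigma` are exactly the statistics that the de-stationary factors $\tau$ and $\Delta$ later consume inside the attention block.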
3.2 De-Stationary Attention Forward Pass
Replacement in Transformer-style block:
- Compute $Q', K', V'$ from the normalized input.
- Evaluate $\tau$ and $\Delta$ via small MLPs (shared across layers).
- Calculate attention logits: $(\tau\, Q'K'^\top + \mathbf{1}\Delta^\top)/\sqrt{d_k}$.
- Apply row-wise softmax, multiply by $V'$, and proceed with downstream modeling.
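The projector step can be sketched as two tiny MLPs fed with the raw window and its statistics, following $\log\tau = \mathrm{MLP}(\sigma_x, x)$ and $\Delta = \mathrm{MLP}(\mu_x, x)$; the layer sizes, shared hidden weights, and initialization here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)

def mlp(inp, w1, w2):
    # Two-layer perceptron with a ReLU hidden layer.
    return np.maximum(inp @ w1, 0.0) @ w2

L, d_h = 96, 16
x = rng.standard_normal(L) * 3.0 + 1.0        # raw (unnormalized) window
mu, sigma = x.mean(), x.std()

# Projector inputs: the raw window concatenated with one statistic each.
inp_tau = np.concatenate([x, [sigma]])
inp_delta = np.concatenate([x, [mu]])

w1 = rng.standard_normal((L + 1, d_h)) * 0.1
w2_tau = rng.standard_normal((d_h, 1)) * 0.1    # scalar log tau
w2_delta = rng.standard_normal((d_h, L)) * 0.1  # per-position shift Delta

tau = np.exp(mlp(inp_tau, w1, w2_tau))          # exp ensures tau > 0
delta = mlp(inp_delta, w1, w2_delta)
print(tau.shape, delta.shape)  # → (1,) (96,)
```

Predicting $\log\tau$ and exponentiating keeps the scaling factor positive, mirroring its analytic role as a variance ($\sigma_x^2 > 0$).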
For AEFIN:
- Form keys and values from the unstable embeddings and queries from the stable embeddings, then perform cross-attention.
- Optionally, extend to two-way flow between stable and unstable channels.
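A minimal sketch of the cross-attention routing, assuming queries come from the stable embeddings and keys/values from the unstable ones (the single-head projection setup is illustrative, not AEFIN's exact architecture):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(stable, unstable, Wq, Wk, Wv):
    """Stable tokens query the unstable stream: the output is one
    enriched embedding per stable token."""
    Q, K, V = stable @ Wq, unstable @ Wk, unstable @ Wv
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(logits) @ V

rng = np.random.default_rng(5)
L, d = 32, 8
stable, unstable = rng.standard_normal((L, d)), rng.standard_normal((L, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.3 for _ in range(3))
enriched = cross_attention(stable, unstable, Wq, Wk, Wv)
print(enriched.shape)  # → (32, 8)
```

The two-way extension mentioned above would simply add a second call with the roles of `stable` and `unstable` swapped.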
3.3 Output De-normalization / Synthesis
For prediction:
- De-normalize forecasts with the stored statistics: $\hat{y} = \sigma_x\, \hat{y}' + \mu_x$.
- For split models: recombine the stable and unstable branch predictions via a weighted sum.
3.4 Implementation Table
| Step | De-Stationary Attention (Liu et al., 2022) | AEFIN Cross-Attention (Xiong et al., 11 May 2025) |
|---|---|---|
| Data Split | Per-chunk normalization | Fourier stable/unstable split |
| Attention Mechanism | Scaling/shifting via MLP | Cross-attention (unstable → stable) |
| Output Merge | Standard de-normalization | Weighted sum of stable/unstable |
4. Empirical Results and Benchmarking
De-stationary attention consistently demonstrates large improvements over stationarized baseline models across diverse datasets and backbones.
4.1 Non-stationary Transformers (Liu et al., 2022)
- On six benchmarks (Electricity, Traffic, Exchange, ILI, Weather, ETT), vanilla Transformer augmented with de-stationary attention achieves a 49.4% reduction in MSE versus baseline (Table 5).
- Informer, Reformer, and Autoformer architectures augmented analogously exhibit 47–49% MSE reduction.
- Ablation confirms that while normalization alone helps, adding de-stationary attention closes the remaining MSE gap by an additional ~10–20% (Table 6).
- Augmented models recover ground-truth non-stationarity distributions (ADF tests), whereas normalized-only outputs become unrealistically stationary.
4.2 AEFIN (Xiong et al., 11 May 2025)
- On ExchangeRate, coupling AEFIN to Informer or SCINet reduces MSE and MAE by 60–80%.
- Consistent improvements are observed across ETTh1, ETTh2, ETTm1, ETTm2, Electricity, ExchangeRate, Traffic, and Weather.
- Gains arise particularly in non-stationary or bursty scenarios, aligning with the intended functional role of de-stationary attention.
4.3 Recurrent De-Stationary Function (Hilsenbek, 2024)
- On character-level Shakespeare (nanoGPT small: 0.8M→0.6M params), recurrence+context models outperform standard attention (val-loss 1.555–1.557 vs. 1.692).
- All settings confer ≈25% parameter reduction, lower loss, and a transition from $O(n^2)$ to $O(n)$ time.
5. Architectural and Computational Considerations
De-stationary attention can be deployed as a plug-in to virtually any Transformer-style backbone with modest additional complexity.
- The de-stationary scaling and shifting projectors (two small MLPs) add only a modest number of parameters (Liu et al., 2022).
- Runtime overhead is small in practice; MLP evaluation is amortized across the entire window.
- AEFIN’s cross-attention and Fourier layers add a larger computational footprint (parameter growth of roughly $2\times$ or more).
- Partitioning between stable/unstable flows (AEFIN) or recurrence vs. average context (causal generation) offers a spectrum between computational frugality and representational flexibility.
- Sharing scale/shift factors across layers is a standard practice for efficiency.
- Maintaining dynamic normalization (e.g., RevIN) or exploring multi-scale features may further enhance flexibility (Xiong et al., 11 May 2025).
6. Comparative Discussion and Limitations
The principal advantage of de-stationary attention frameworks lies in their ability to reconcile the statistical homogeneity gained from normalization with the heterogeneity required for accurate modeling of real-world, non-stationary signals.
However, there are trade-offs and sensitivities:
- Increased parameter count and inference time (cross-attention + Fourier in AEFIN).
- Need for hyperparameter tuning (number of dominant frequencies $K$, MLP capacity).
- In decoders where attention shapes change (future masking), Δ-shifting may be optional.
- While de-stationary factors are often shared across layers for simplicity, scenarios with extreme variance may benefit from layer-specific adaptation.
- In small data or linear regimes, ultra-light recurrent de-stationary modules may suffice or outperform full attention (Hilsenbek, 2024), but this may not generalize to large-scale or highly multi-variate contexts.
- Extensions may include dynamic or learnable frequency partitioning, sparse attention, or integration with other normalization paradigms.
7. Impact, Best Practices, and Future Directions
De-stationary attention reestablishes discriminability in the attention landscape for normalized or decomposed inputs, producing state-of-the-art forecasting results across benchmarks. Best practices include always pairing normalization with a mechanism for in-model recovery of distinguishing statistics, favoring small but expressive MLP projectors for de-stationary scaling and bias, and tuning model decomposition granularity to the degree of signal non-stationarity encountered.
Potential research directions involve:
- Sparse and low-rank de-stationary modules to control parameter scaling.
- Multi-scale and frequency-adaptive de-stationary attention.
- Dynamic or gated routing between stable and unstable flows.
- Investigation of de-stationary mechanisms in domains beyond time series, such as video understanding or molecular sequences.
De-stationary attention thus bridges the gap between statistical robustness and context-sensitive modeling, underpinning recent advances in both theoretical understanding and empirical capability in sequence modeling and time series forecasting (Liu et al., 2022, Xiong et al., 11 May 2025, Hilsenbek, 2024).