Fredformer Model: Frequency-Aware Forecasting
- Fredformer is a Transformer-based framework that addresses frequency bias in time series forecasting by equalizing attention across low and high-frequency signal components.
- It employs a frequency-domain processing pipeline with normalization and localized channel-wise attention to preserve critical signal details, achieving lower MSE on diverse benchmarks.
- Its lightweight Nyström approximation variant enhances scalability in highly multivariate settings while reducing computational complexity with minimal accuracy loss.
Fredformer is a Transformer-based framework specifically designed to mitigate the frequency bias exhibited by mainstream Transformers in the context of time series forecasting. Standard Transformer models, despite their empirical success, consistently favor low-frequency, high-amplitude features while suppressing high-frequency, low-amplitude signal components. Fredformer introduces a frequency-domain processing pipeline involving normalization and localized channel-wise attention, equalizing the model's focus across frequency bands and preserving critical high-frequency information. A lightweight Nyström-approximated variant further enables scalability to highly multivariate settings while maintaining computational efficiency and competitive predictive accuracy (Piao et al., 2024).
1. Frequency Bias in Transformer-Based Time Series Forecasting
Extensive analyses demonstrate that classic Transformers trained on a time series $x(t)$ (and its multivariate extensions) tend to reconstruct the low-frequency structure but consistently fail to recover mid- to high-frequency details in the forecast $\hat{x}(t)$ with respect to $x(t)$. This phenomenon is formalized through synthetic experiments wherein signals are composed of $K$ "key" frequency components with ordered amplitudes $a_1 > a_2 > \cdots > a_K$. The per-frequency relative error,

$$\epsilon_k = \frac{|\hat{a}_k - a_k|}{a_k},$$

is shown to be inversely correlated with the amplitude proportion,

$$p_k = \frac{a_k}{\sum_{j=1}^{K} a_j};$$

specifically, a larger $p_k$ predicts a smaller $\epsilon_k$, demonstrating that the Transformer's loss landscape prioritizes features with greater frequency-domain energy. This outcome persists regardless of frequency ordering, verifying that the bias hinges solely on amplitude magnitudes, not spectral location.
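This diagnostic can be reproduced in miniature. The sketch below builds a synthetic signal from three key frequencies with ordered amplitudes and applies a toy "biased predictor" that damps high-frequency bins (a stand-in for the behaviour the paper attributes to standard Transformers; the specific bins, amplitudes, and damping factor are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Synthetic signal with K = 3 key frequency bins and ordered amplitudes
# a_1 > a_2 > a_3 (all values here are illustrative assumptions).
L = 512
t = np.arange(L)
freqs = np.array([3, 17, 55])          # key frequency bins
amps = np.array([10.0, 3.0, 1.0])      # ordered amplitudes
signal = sum(a * np.sin(2 * np.pi * f * t / L) for a, f in zip(amps, freqs))

# Toy "biased predictor": keep low-frequency content, damp everything above
# bin 30, mimicking the frequency bias described in the text.
spec = np.fft.rfft(signal)
damped = spec.copy()
damped[30:] *= 0.3
pred = np.fft.irfft(damped, n=L)

# Per-frequency relative error eps_k = |a_hat_k - a_k| / a_k and
# amplitude proportion p_k = a_k / sum_j a_j.
true_amp = np.abs(np.fft.rfft(signal)) * 2 / L
pred_amp = np.abs(np.fft.rfft(pred)) * 2 / L
eps = np.abs(pred_amp[freqs] - true_amp[freqs]) / true_amp[freqs]
p = amps / amps.sum()
```

The high-amplitude bin is reproduced almost exactly while the low-amplitude bin incurs a large relative error, illustrating the inverse relation between $p_k$ and $\epsilon_k$.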
2. Architectural Components of Fredformer
Fredformer inserts a structured frequency-domain pipeline into the Transformer, comprising four principal stages:
(i) DFT-to-IDFT Backbone:
The model first converts the input $X \in \mathbb{R}^{N \times L}$ ($N$ channels, length $L$) into the frequency domain using the Discrete Fourier Transform (DFT), $\bar{X} = \mathrm{DFT}(X)$. Frequency-domain processing ensues, and the output spectrum is transformed back via the inverse DFT (IDFT).
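The backbone is a round trip through the spectrum. A minimal sketch (with the frequency-domain stages replaced by the identity, so the round trip should reconstruct the input exactly up to floating-point error):

```python
import numpy as np

# Multivariate input X of shape (N channels, L time steps); values are random
# stand-ins for a real series.
rng = np.random.default_rng(1)
N, L = 4, 96
X = rng.standard_normal((N, L))

# DFT along the time axis; a real input yields L//2 + 1 complex bins.
X_freq = np.fft.rfft(X, axis=-1)

# Placeholder for Fredformer's frequency-domain stages (refinement,
# channel-wise attention). Here it is the identity.
X_freq_processed = X_freq

# IDFT back to the time domain.
X_rec = np.fft.irfft(X_freq_processed, n=L, axis=-1)
err = np.max(np.abs(X_rec - X))
```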
(ii) Frequency Refinement and Normalization:
The frequency axis of the spectrum is partitioned into $M$ local patches of equal width. Amplitudes within each patch are then normalized so that no band dominates by magnitude alone, removing the amplitude disparities across bands and debiasing subsequent attention computations.
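A hedged sketch of this refinement step, assuming per-patch standardization as the normalizer (the paper's exact normalization may differ; the point illustrated is that energy disparities across bands are equalized before attention):

```python
import numpy as np

# Amplitude spectrum for N channels over F frequency bins, with a strong
# built-in low-frequency energy bias (illustrative values).
rng = np.random.default_rng(2)
N, F, M = 4, 48, 6                      # channels, frequency bins, patches
amp = np.abs(rng.standard_normal((N, F))) * np.linspace(10, 0.1, F)

# Partition the frequency axis into M patches of width F // M.
patches = amp.reshape(N, M, F // M)

# Standardize each patch across channels and bins, so every band ends up
# on the same scale regardless of its original energy.
mu = patches.mean(axis=(0, 2), keepdims=True)
sd = patches.std(axis=(0, 2), keepdims=True) + 1e-8
normed = (patches - mu) / sd

per_patch_std = normed.std(axis=(0, 2))  # ~1 for every band after refinement
```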
(iii) Frequency-local Independent Modeling:
Each normalized patch is processed by a local Transformer encoder, treating each channel (row) as a token. Channel-wise (not temporal or cross-band) attention is applied,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where the query, key, and value matrices are formed over the $N$ channel tokens of the patch. This isolates channel interactions at each frequency, ensuring that a component's reconstruction error is no longer determined by its amplitude proportion.
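The distinguishing detail is the token axis: within a patch, attention is an $N \times N$ matrix over channels, not a length-by-length matrix over time steps. A minimal sketch with random stand-in projection weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One frequency patch, embedded per channel: N channel tokens of dim d.
rng = np.random.default_rng(3)
N, d = 8, 16
tokens = rng.standard_normal((N, d))

# Random stand-ins for learned projection weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Channel-wise attention: an N x N matrix of channel-channel interactions.
A = softmax(Q @ K.T / np.sqrt(d))
out = A @ V                      # refined channel representations

row_sums = A.sum(axis=1)         # each row is a probability distribution
```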
(iv) Frequency-wise Summarization and IDFT:
Refined outputs from all patches are concatenated or projected to reconstruct the complete frequency matrix, which is then returned to the time domain via the IDFT.
3. Lightweight Variant: Nyström-Fredformer
To address the per-patch $O(N^2)$ cost of self-attention with large channel counts $N$, Fredformer integrates a Nyström-based approximation. Here, $m$ landmark rows/columns are selected within each patch to define downsampled queries and keys $\tilde{Q}, \tilde{K}$. The original attention matrix

$$S = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)$$

is approximated by

$$\hat{S} = F_1 A^{+} F_2,$$

where $A^{+}$ is the Moore–Penrose inverse of $A = \mathrm{softmax}(\tilde{Q}\tilde{K}^{\top}/\sqrt{d})$, and $F_1 = \mathrm{softmax}(Q\tilde{K}^{\top}/\sqrt{d})$ and $F_2 = \mathrm{softmax}(\tilde{Q}K^{\top}/\sqrt{d})$ are computed from the softmax between full and downsampled queries/keys. This innovation reduces the per-patch complexity from $O(N^2)$ to $O(Nm)$, enabling deployment on datasets with hundreds of channels while yielding negligible accuracy loss.
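The factorization above can be sketched directly, following the standard Nyström attention recipe with landmarks taken as segment means (one common choice; the landmark-selection scheme here is an assumption, and the inputs are random stand-ins for channel tokens):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# N channel tokens, m landmarks, embedding dim d (illustrative sizes).
rng = np.random.default_rng(4)
N, m, d = 64, 8, 16
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

# Landmark queries/keys via segment means over N // m tokens each.
Qm = Q.reshape(m, N // m, d).mean(axis=1)
Km = K.reshape(m, N // m, d).mean(axis=1)

# The three small softmax factors of the Nystrom approximation.
F1 = softmax(Q @ Km.T / np.sqrt(d))    # (N, m)
A = softmax(Qm @ Km.T / np.sqrt(d))    # (m, m)
F2 = softmax(Qm @ K.T / np.sqrt(d))    # (m, N)

# S_hat = F1 @ pinv(A) @ F2 approximates the full N x N attention matrix
# while only ever materializing O(N * m) softmax entries.
S_hat = F1 @ np.linalg.pinv(A) @ F2
```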
4. Forecasting Benchmarks and Experimental Validation
Fredformer and Nyström-Fredformer are validated on eight standard multivariate benchmarks: Weather (21 channels), Electricity (321), ETTh1/ETTh2/ETTm1/ETTm2 (7 each), Solar-Energy (137), and Traffic (862), with a fixed lookback window and prediction horizons of $\{96, 192, 336, 720\}$ steps. Comparisons include 11 baselines such as Autoformer, FEDformer, Pyraformer, Crossformer, PatchTST, Stationary, iTransformer, DLinear, RLinear, TiDE, and TimesNet.
Averaged across $8$ datasets and $4$ horizons ($32$ settings), Fredformer achieves the lowest mean squared error (MSE) in 60 out of 80 cases. Typical results:
| Dataset | Fredformer (MSE/MAE) | 2nd Best | Next Best |
|---|---|---|---|
| ETTh1 | 0.435 / 0.426 | 0.446 / 0.434 (PatchTST) | 0.454 / 0.447 (iTrans) |
| ETTh2 | 0.365 / 0.393 | 0.374 / 0.398 (iTrans) | 0.383 / 0.407 (PatchTST) |
| ECL | 0.175 / 0.269 | 0.178 / 0.270 (iTrans) | 0.192 / 0.296 (Autoformer) |
5. Ablation, Sensitivity, and Computational Analysis
Ablation experiments on ETTh1 and Weather datasets identify two critical components: (1) channel-wise attention (CW) and (2) frequency refinement/patching (FR). Removal of either results in significant degradation:
| Setting | ETTh1 (MSE / MAE) | Weather (MSE / MAE) |
|---|---|---|
| Full | 0.384 / 0.396 | 0.246 / 0.273 |
| w/o CW | 0.418 / 0.419 | 0.262 / 0.290 |
| w/o FR | 0.539 / 0.485 | 0.293 / 0.322 |
Finer patch divisions correlate with better MSE on ETTh1. Computationally, Fredformer requires approximately 408 MB of GPU memory on ETTh1 and 4.3 GB on Electricity; Nyström-Fredformer reduces the latter to 3.3 GB (a roughly 25% reduction) while maintaining equivalent MSE (0.212 vs. 0.213).
6. Significance and Implications
Fredformer overcomes Transformer frequency bias by modeling directly in the frequency domain, explicitly normalizing amplitudes per band, and applying channel-wise attention. This equalizes the treatment of high-frequency, low-energy components that are overlooked by standard architectures. The lightweight Nyström-Fredformer extends the applicability to domains featuring hundreds of channels, with minimal computational or predictive penalty. A plausible implication is that frequency-domain architectural correction could be adapted to other forecasting or generative tasks exhibiting energy-driven feature suppression. Extensive multi-benchmark evaluation demonstrates robustness and consistent superiority to prior state-of-the-art paradigms across both accuracy and scalability dimensions (Piao et al., 2024).