
Fredformer Model: Frequency-Aware Forecasting

Updated 12 February 2026
  • Fredformer is a Transformer-based framework that addresses frequency bias in time series forecasting by equalizing attention across low and high-frequency signal components.
  • It employs a frequency-domain processing pipeline with normalization and localized channel-wise attention to preserve critical signal details, achieving lower MSE on diverse benchmarks.
  • Its lightweight Nyström approximation variant enhances scalability in highly multivariate settings while reducing computational complexity with minimal accuracy loss.

Fredformer is a Transformer-based framework specifically designed to mitigate the frequency bias exhibited by mainstream Transformers in the context of time series forecasting. Standard Transformer models, despite their empirical success, consistently favor low-frequency, high-amplitude features while suppressing high-frequency, low-amplitude signal components. Fredformer introduces a frequency-domain processing pipeline involving normalization and localized channel-wise attention, equalizing the model's focus across frequency bands and preserving critical high-frequency information. A lightweight Nyström-approximated variant further enables scalability to highly multivariate settings while maintaining computational efficiency and competitive predictive accuracy (Piao et al., 2024).

1. Frequency Bias in Transformer-Based Time Series Forecasting

Extensive analyses demonstrate that classic Transformers trained on time series $X \in \mathbb{R}^L$ (and multivariate extensions $X \in \mathbb{R}^{C \times L}$) tend to reconstruct the low-frequency structure but consistently fail to recover mid- to high-frequency details of $\hat{Y}$ with respect to $Y_{\text{true}}$. This phenomenon is formalized through synthetic experiments in which signals are composed of $N$ "key" frequency components with ordered amplitudes $|\tilde{A}_1| \geq |\tilde{A}_2| \geq \dots$. The per-frequency relative error,

$$\Delta_k = \frac{|A'_k - \hat{A}_k|}{|\hat{A}_k|},$$

is shown to be inversely correlated with the amplitude proportion,

$$P(\tilde{A}_k) = \frac{|\tilde{A}_k|}{\sum_{n=1}^N |\tilde{A}_n|};$$

specifically, larger $P(\tilde{A}_k)$ predicts smaller $\Delta_k$, demonstrating that the Transformer's loss landscape prioritizes features with greater frequency-domain energy. This outcome persists regardless of frequency ordering, verifying that the bias hinges solely on amplitude magnitudes, not spectral location.
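The amplitude-proportion probe above can be reproduced in a few lines. The construction below is a hypothetical stand-in for the paper's synthetic setup: the amplitudes, frequency bins, and signal length are illustrative choices, not values from the paper.

```python
import numpy as np

# Build a signal from N "key" frequencies with ordered amplitudes
# |A~_1| >= |A~_2| >= ..., then measure each component's amplitude
# proportion P(A~_k). (Hypothetical values for illustration.)
L = 256
amps = np.array([4.0, 2.0, 1.0, 0.5])   # ordered amplitudes
freqs = np.array([3, 11, 29, 57])       # spectral location is irrelevant to the bias
t = np.arange(L)
x = sum(a * np.sin(2 * np.pi * f * t / L) for a, f in zip(amps, freqs))

spectrum = np.abs(np.fft.rfft(x)) / L
key_amps = spectrum[freqs]
P = key_amps / key_amps.sum()           # amplitude proportion per key frequency
# Larger P(A~_k) predicts a smaller relative error Delta_k for a trained Transformer.
```

The proportions come out in the same descending order as the amplitudes, which is exactly the quantity the synthetic experiments correlate against the per-frequency error.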

2. Architectural Components of Fredformer

Fredformer inserts a structured frequency-domain pipeline into the Transformer, comprising four principal stages:

(i) DFT-to-IDFT Backbone:

The model first converts the input $X \in \mathbb{R}^{C \times L}$ into the frequency domain via the Discrete Fourier Transform (DFT),

$$A_k = \frac{1}{L} \sum_{l=1}^{L} X_l e^{-i 2\pi k l / L}.$$

Frequency-domain processing ensues, and the output spectrum $A' \in \mathbb{C}^{C \times L}$ is transformed back via the inverse DFT (IDFT).
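A minimal sketch of this DFT-to-IDFT round trip, with a placeholder `refine` function standing in for Fredformer's learned frequency-domain stages; the function name and shapes are assumptions for illustration, not the authors' code.

```python
import numpy as np

def dft_backbone(X, refine=lambda A: A):
    """DFT -> frequency-domain processing -> IDFT (stage i sketch)."""
    # X: (C, L) multichannel series; A_k = (1/L) * sum_l X_l * exp(-i 2*pi*k*l/L)
    L = X.shape[-1]
    A = np.fft.fft(X, axis=-1) / L
    A_refined = refine(A)                  # learned stages would go here
    return (np.fft.ifft(A_refined, axis=-1) * L).real

X = np.random.randn(7, 96)                 # e.g. ETT-style input: C=7, L=96
Y = dft_backbone(X)                        # identity refinement reconstructs X
```

With the identity `refine`, the round trip recovers the input exactly, confirming the backbone itself is lossless.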

(ii) Frequency Refinement and Normalization:

The frequency axis of $A$ is partitioned into $N$ patches $W_1, \ldots, W_N \in \mathbb{C}^{C \times S}$, each of width $S = L/N$. Amplitudes within each patch are normalized so that

$$W^*_n = \sigma(W_n), \qquad \text{with} \ \max(|W^*_n|) = 1 \ \forall n,$$

removing the amplitude disparities across bands and debiasing subsequent attention computations.
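The patching-and-normalization step can be sketched as follows. Dividing each patch by its peak magnitude is an illustrative stand-in for the normalization $\sigma$; the paper's exact normalization may differ.

```python
import numpy as np

def normalize_patches(A, N):
    """Stage (ii) sketch: split the frequency axis into N patches of width
    S = L // N and rescale each patch so its peak magnitude is 1."""
    C, L = A.shape
    S = L // N
    patches = A.reshape(C, N, S)                             # (C, N, S)
    peak = np.abs(patches).max(axis=(0, 2), keepdims=True)   # per-band peak
    return patches / np.maximum(peak, 1e-12)                 # max |W*_n| = 1

A = np.fft.fft(np.random.randn(7, 96), axis=-1) / 96
W = normalize_patches(A, N=12)   # 12 patches of width S = 8
```

After this step every band has unit peak magnitude, so a low-energy high-frequency band carries the same scale into the attention stage as a dominant low-frequency band.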

(iii) Frequency-local Independent Modeling:

Each normalized patch $W^*_n$ is processed by a local Transformer encoder that treats each channel (row) as a token. Channel-wise (not temporal or cross-band) attention is applied:

$$\text{Attention}_n = \text{Softmax}\left( \frac{W^*_n W^q_n (W^*_n W^k_n)^{T}}{\sqrt{d}} \right) W^*_n W^v_n.$$

This isolates channel interactions at each frequency, ensuring that $\Delta_k$ is no longer determined by $P(\tilde{A}_k)$.
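A toy version of this channel-wise attention for a single patch. Random matrices stand in for the learned projections $W^q_n, W^k_n, W^v_n$, and stacking real and imaginary parts is one plausible way to feed a complex-valued patch into real-valued attention; both are assumptions, not details from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
C, S, d = 7, 8, 16
W_star = rng.standard_normal((C, 2 * S))           # one patch; re/im stacked per channel
Wq, Wk, Wv = (rng.standard_normal((2 * S, d)) for _ in range(3))

# Each of the C channels is a token; attention is C x C, per patch.
Q, K, V = W_star @ Wq, W_star @ Wk, W_star @ Wv    # (C, d) each
weights = softmax(Q @ K.T / np.sqrt(d))            # (C, C) channel-to-channel weights
attn = weights @ V                                 # (C, d) refined patch features
```

Note the attention matrix is over channels only, so its cost scales with $C$, not with the sequence length; this is what the Nyström variant below targets when $C$ is large.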

(iv) Frequency-wise Summarization and IDFT:

Refined outputs from all patches are concatenated or projected to reconstruct the complete frequency matrix $A'$, which is then returned to the time domain via the IDFT.

3. Lightweight Variant: Nyström-Fredformer

To address the $O(C^2)$ per-patch complexity of self-attention with large channel counts, Fredformer integrates a Nyström-based approximation. Here, $m$ landmark rows/columns are selected within each patch to define $A_S \in \mathbb{R}^{m \times m}$. The original attention matrix

$$\text{Softmax}(QK^T)$$

is approximated by

$$F_S A_S^+ B_S,$$

where $A_S^+$ is the Moore–Penrose pseudoinverse of $A_S$, and $F_S$ and $B_S$ are computed from the softmax between full and downsampled queries/keys. This reduces the per-patch complexity from $O(C^2 L / P)$ to $O(C L / P)$, enabling deployment on datasets with hundreds of channels at negligible accuracy loss.
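A sketch of the Nyström construction on one patch. Uniform subsampling of channels is used here as a simple landmark-selection heuristic, not necessarily the paper's scheme, and the dimensions are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
C, d, m = 100, 16, 10                                # C channels, m landmarks
Q, K = rng.standard_normal((C, d)), rng.standard_normal((C, d))
idx = np.linspace(0, C - 1, m).astype(int)           # landmark selection (heuristic)
Qm, Km = Q[idx], K[idx]

F_S = softmax(Q @ Km.T / np.sqrt(d))                 # (C, m): full queries vs landmarks
A_S = softmax(Qm @ Km.T / np.sqrt(d))                # (m, m): landmark block
B_S = softmax(Qm @ K.T / np.sqrt(d))                 # (m, C): landmarks vs full keys
approx = F_S @ np.linalg.pinv(A_S) @ B_S             # ~ Softmax(Q K^T / sqrt(d))
```

Only $m \times m$, $C \times m$, and $m \times C$ softmax blocks are materialized, so the cost is linear rather than quadratic in the channel count $C$.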

4. Forecasting Benchmarks and Experimental Validation

Fredformer and Nyström-Fredformer are validated on eight standard multivariate benchmarks: Weather ($C=21$), Electricity ($C=321$), ETTh1/ETTh2/ETTm1/ETTm2 ($C=7$), Solar-Energy ($C=137$), and Traffic ($C=862$), with lookback $L=96$ and prediction horizons $H \in \{96, 192, 336, 720\}$. Comparisons cover 11 baselines: Autoformer, FEDformer, Pyraformer, Crossformer, PatchTST, Stationary, iTransformer, DLinear, RLinear, TiDE, and TimesNet.

Averaged across 8 datasets and 4 horizons (32 settings), Fredformer achieves the lowest mean squared error (MSE) in 60 out of 80 cases. Typical results:

| Dataset | Fredformer (MSE / MAE) | 2nd Best | Next Best |
|---|---|---|---|
| ETTh1 | 0.435 / 0.426 | 0.454 / 0.447 (iTransformer) | 0.446 / 0.434 (PatchTST) |
| ETTh2 | 0.365 / 0.393 | 0.383 / 0.407 (PatchTST) | 0.374 / 0.398 (iTransformer) |
| ECL | 0.175 / 0.269 | 0.178 / 0.270 (iTransformer) | 0.192 / 0.296 (Autoformer) |

5. Ablation, Sensitivity, and Computational Analysis

Ablation experiments on ETTh1 and Weather datasets identify two critical components: (1) channel-wise attention (CW) and (2) frequency refinement/patching (FR). Removal of either results in significant degradation:

| Setting | ETTh1 (MSE / MAE) | Weather (MSE / MAE) |
|---|---|---|
| Full | 0.384 / 0.396 | 0.246 / 0.273 |
| No CW | 0.418 / 0.419 | 0.262 / 0.290 |
| No FR | 0.539 / 0.485 | 0.293 / 0.322 |

Finer patch divisions correlate with better MSE (on ETTh1, patch widths $S \in \{8, 16, 32, L\}$ yield $\text{MSE} = \{0.417, 0.425, 0.440, 0.449\}$). Computationally, Fredformer requires approximately 408 MB of GPU memory on ETTh1 ($C=7$) and 4.3 GB on Electricity ($C=321$); Nyström-Fredformer reduces the latter to 3.3 GB (a 25% reduction) while maintaining equivalent MSE (0.212 vs. 0.213).

6. Significance and Implications

Fredformer overcomes Transformer frequency bias by modeling directly in the frequency domain, explicitly normalizing amplitudes per band, and applying channel-wise attention. This equalizes the treatment of high-frequency, low-energy components that are overlooked by standard architectures. The lightweight Nyström-Fredformer extends the applicability to domains featuring hundreds of channels, with minimal computational or predictive penalty. A plausible implication is that frequency-domain architectural correction could be adapted to other forecasting or generative tasks exhibiting energy-driven feature suppression. Extensive multi-benchmark evaluation demonstrates robustness and consistent superiority to prior state-of-the-art paradigms across both accuracy and scalability dimensions (Piao et al., 2024).

References (1)
