Fredformer Model: Frequency-Aware Forecasting
- Fredformer is a Transformer-based framework that addresses frequency bias in time series forecasting by equalizing attention across low and high-frequency signal components.
- It employs a frequency-domain processing pipeline with normalization and localized channel-wise attention to preserve critical signal details, achieving lower MSE on diverse benchmarks.
- Its lightweight Nyström approximation variant enhances scalability in highly multivariate settings while reducing computational complexity with minimal accuracy loss.
Fredformer is a Transformer-based framework specifically designed to mitigate the frequency bias exhibited by mainstream Transformers in the context of time series forecasting. Standard Transformer models, despite their empirical success, consistently favor low-frequency, high-amplitude features while suppressing high-frequency, low-amplitude signal components. Fredformer introduces a frequency-domain processing pipeline involving normalization and localized channel-wise attention, equalizing the model's focus across frequency bands and preserving critical high-frequency information. A lightweight Nyström-approximated variant further enables scalability to highly multivariate settings while maintaining computational efficiency and competitive predictive accuracy (Piao et al., 2024).
1. Frequency Bias in Transformer-Based Time Series Forecasting
Extensive analyses demonstrate that classic Transformers trained on a time series $x(t)$ (and its multivariate extensions) tend to reconstruct the low-frequency structure but consistently fail to recover mid- to high-frequency details in the forecast $\hat{x}(t)$ with respect to $x(t)$. This phenomenon is formalized through synthetic experiments wherein signals are composed of $K$ "key" frequency components with ordered amplitudes $a_1 > a_2 > \cdots > a_K$. The per-frequency relative error,

$$\epsilon_k = \frac{|\hat{a}_k - a_k|}{a_k},$$

is shown to be inversely correlated with the amplitude proportion,

$$p_k = \frac{a_k}{\sum_{j=1}^{K} a_j};$$

specifically, a larger $p_k$ predicts a smaller $\epsilon_k$, demonstrating that the Transformer's loss landscape prioritizes features with greater frequency-domain energy. This outcome persists regardless of frequency ordering, verifying that the bias hinges solely on amplitude magnitudes, not spectral location.
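This diagnostic can be reproduced in miniature. The sketch below builds a synthetic signal from three key frequencies with ordered amplitudes and applies a toy "biased predictor" that damps high-frequency bins (a stand-in for the behaviour the paper attributes to standard Transformers; the specific bins, amplitudes, and damping factor are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Synthetic signal with K = 3 key frequency bins and ordered amplitudes
# a_1 > a_2 > a_3 (all values here are illustrative assumptions).
L = 512
t = np.arange(L)
freqs = np.array([3, 17, 55])          # key frequency bins
amps = np.array([10.0, 3.0, 1.0])      # ordered amplitudes
signal = sum(a * np.sin(2 * np.pi * f * t / L) for a, f in zip(amps, freqs))

# Toy "biased predictor": keep low-frequency content, damp everything above
# bin 30, mimicking the frequency bias described in the text.
spec = np.fft.rfft(signal)
damped = spec.copy()
damped[30:] *= 0.3
pred = np.fft.irfft(damped, n=L)

# Per-frequency relative error eps_k = |a_hat_k - a_k| / a_k and
# amplitude proportion p_k = a_k / sum_j a_j.
true_amp = np.abs(np.fft.rfft(signal)) * 2 / L
pred_amp = np.abs(np.fft.rfft(pred)) * 2 / L
eps = np.abs(pred_amp[freqs] - true_amp[freqs]) / true_amp[freqs]
p = amps / amps.sum()
```

The high-amplitude bin is reproduced almost exactly while the low-amplitude bin incurs a large relative error, illustrating the inverse relation between $p_k$ and $\epsilon_k$.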
2. Architectural Components of Fredformer
Fredformer inserts a structured frequency-domain pipeline into the Transformer, comprising four principal stages:
(i) DFT-to-IDFT Backbone:
The model first converts the input $X \in \mathbb{R}^{N \times L}$ ($N$ channels, length $L$) into the frequency domain using the Discrete Fourier Transform (DFT), $\bar{X} = \mathrm{DFT}(X)$. Frequency-domain processing ensues, and the output spectrum is transformed back via the inverse DFT (IDFT).
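The backbone is a round trip through the spectrum. A minimal sketch (with the frequency-domain stages replaced by the identity, so the round trip should reconstruct the input exactly up to floating-point error):

```python
import numpy as np

# Multivariate input X of shape (N channels, L time steps); values are random
# stand-ins for a real series.
rng = np.random.default_rng(1)
N, L = 4, 96
X = rng.standard_normal((N, L))

# DFT along the time axis; a real input yields L//2 + 1 complex bins.
X_freq = np.fft.rfft(X, axis=-1)

# Placeholder for Fredformer's frequency-domain stages (refinement,
# channel-wise attention). Here it is the identity.
X_freq_processed = X_freq

# IDFT back to the time domain.
X_rec = np.fft.irfft(X_freq_processed, n=L, axis=-1)
err = np.max(np.abs(X_rec - X))
```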
(ii) Frequency Refinement and Normalization:
The frequency axis of the spectrum is partitioned into $M$ local patches of equal width. Amplitudes within each patch are then normalized so that no band dominates by magnitude alone, removing the amplitude disparities across bands and debiasing subsequent attention computations.
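A hedged sketch of this refinement step, assuming per-patch standardization as the normalizer (the paper's exact normalization may differ; the point illustrated is that energy disparities across bands are equalized before attention):

```python
import numpy as np

# Amplitude spectrum for N channels over F frequency bins, with a strong
# built-in low-frequency energy bias (illustrative values).
rng = np.random.default_rng(2)
N, F, M = 4, 48, 6                      # channels, frequency bins, patches
amp = np.abs(rng.standard_normal((N, F))) * np.linspace(10, 0.1, F)

# Partition the frequency axis into M patches of width F // M.
patches = amp.reshape(N, M, F // M)

# Standardize each patch across channels and bins, so every band ends up
# on the same scale regardless of its original energy.
mu = patches.mean(axis=(0, 2), keepdims=True)
sd = patches.std(axis=(0, 2), keepdims=True) + 1e-8
normed = (patches - mu) / sd

per_patch_std = normed.std(axis=(0, 2))  # ~1 for every band after refinement
```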
(iii) Frequency-local Independent Modeling:
Each normalized patch is processed by a local Transformer encoder, treating each channel (row) as a token. Channel-wise (not temporal or cross-band) attention is applied,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where the query, key, and value matrices are formed over the $N$ channel tokens of the patch. This isolates channel interactions at each frequency, ensuring that a component's reconstruction error is no longer determined by its amplitude proportion.
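The distinguishing detail is the token axis: within a patch, attention is an $N \times N$ matrix over channels, not a length-by-length matrix over time steps. A minimal sketch with random stand-in projection weights:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One frequency patch, embedded per channel: N channel tokens of dim d.
rng = np.random.default_rng(3)
N, d = 8, 16
tokens = rng.standard_normal((N, d))

# Random stand-ins for learned projection weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Channel-wise attention: an N x N matrix of channel-channel interactions.
A = softmax(Q @ K.T / np.sqrt(d))
out = A @ V                      # refined channel representations

row_sums = A.sum(axis=1)         # each row is a probability distribution
```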
(iv) Frequency-wise Summarization and IDFT:
Refined outputs from all patches are concatenated or projected to reconstruct the complete frequency matrix, which is then returned to the time domain via the IDFT.
3. Lightweight Variant: Nyström-Fredformer
To address the per-patch $O(N^2)$ cost of self-attention with large channel counts $N$, Fredformer integrates a Nyström-based approximation. Here, $m$ landmark rows/columns are selected within each patch to define downsampled queries and keys $\tilde{Q}, \tilde{K}$. The original attention matrix

$$S = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)$$

is approximated by

$$\hat{S} = F_1 A^{+} F_2,$$

where $A^{+}$ is the Moore–Penrose inverse of $A = \mathrm{softmax}(\tilde{Q}\tilde{K}^{\top}/\sqrt{d})$, and $F_1 = \mathrm{softmax}(Q\tilde{K}^{\top}/\sqrt{d})$ and $F_2 = \mathrm{softmax}(\tilde{Q}K^{\top}/\sqrt{d})$ are computed from the softmax between full and downsampled queries/keys. This innovation reduces the per-patch complexity from $O(N^2)$ to $O(Nm)$, enabling deployment on datasets with hundreds of channels while yielding negligible accuracy loss.
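The factorization above can be sketched directly, following the standard Nyström attention recipe with landmarks taken as segment means (one common choice; the landmark-selection scheme here is an assumption, and the inputs are random stand-ins for channel tokens):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# N channel tokens, m landmarks, embedding dim d (illustrative sizes).
rng = np.random.default_rng(4)
N, m, d = 64, 8, 16
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))

# Landmark queries/keys via segment means over N // m tokens each.
Qm = Q.reshape(m, N // m, d).mean(axis=1)
Km = K.reshape(m, N // m, d).mean(axis=1)

# The three small softmax factors of the Nystrom approximation.
F1 = softmax(Q @ Km.T / np.sqrt(d))    # (N, m)
A = softmax(Qm @ Km.T / np.sqrt(d))    # (m, m)
F2 = softmax(Qm @ K.T / np.sqrt(d))    # (m, N)

# S_hat = F1 @ pinv(A) @ F2 approximates the full N x N attention matrix
# while only ever materializing O(N * m) softmax entries.
S_hat = F1 @ np.linalg.pinv(A) @ F2
```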
4. Forecasting Benchmarks and Experimental Validation
Fredformer and Nyström-Fredformer are validated on eight standard multivariate benchmarks: Weather (21 channels), Electricity (321), ETTh1/ETTh2/ETTm1/ETTm2 (7 each), Solar-Energy (137), and Traffic (862), with a fixed lookback window and prediction horizons of $\{96, 192, 336, 720\}$ steps. Comparisons include 11 baselines such as Autoformer, FEDformer, Pyraformer, Crossformer, PatchTST, Stationary, iTransformer, DLinear, RLinear, TiDE, and TimesNet.
Averaged across $8$ datasets and $4$ horizons ($32$ settings), Fredformer achieves the lowest mean squared error (MSE) in 60 out of 80 cases. Typical results:
| Dataset | Fredformer (MSE/MAE) | 2nd Best | Next Best |
|---|---|---|---|
| ETTh1 | 0.435 / 0.426 | 0.446 / 0.434 (PatchTST) | 0.454 / 0.447 (iTrans) |
| ETTh2 | 0.365 / 0.393 | 0.374 / 0.398 (iTrans) | 0.383 / 0.407 (PatchTST) |
| ECL | 0.175 / 0.269 | 0.178 / 0.270 (iTrans) | 0.192 / 0.296 (Autoformer) |
5. Ablation, Sensitivity, and Computational Analysis
Ablation experiments on ETTh1 and Weather datasets identify two critical components: (1) channel-wise attention (CW) and (2) frequency refinement/patching (FR). Removal of either results in significant degradation:
| Setting | ETTh1 (MSE / MAE) | Weather (MSE / MAE) |
|---|---|---|
| Full | 0.384 / 0.396 | 0.246 / 0.273 |
| w/o CW | 0.418 / 0.419 | 0.262 / 0.290 |
| w/o FR | 0.539 / 0.485 | 0.293 / 0.322 |
Finer patch divisions correlate with better MSE on ETTh1. Computationally, Fredformer requires approximately 408 MB of GPU memory on ETTh1 and 4.3 GB on Electricity; Nyström-Fredformer reduces the latter to 3.3 GB (a roughly 25% reduction) while maintaining equivalent MSE (0.212 vs. 0.213).
6. Significance and Implications
Fredformer overcomes Transformer frequency bias by modeling directly in the frequency domain, explicitly normalizing amplitudes per band, and applying channel-wise attention. This equalizes the treatment of high-frequency, low-energy components that are overlooked by standard architectures. The lightweight Nyström-Fredformer extends the applicability to domains featuring hundreds of channels, with minimal computational or predictive penalty. A plausible implication is that frequency-domain architectural correction could be adapted to other forecasting or generative tasks exhibiting energy-driven feature suppression. Extensive multi-benchmark evaluation demonstrates robustness and consistent superiority to prior state-of-the-art paradigms across both accuracy and scalability dimensions (Piao et al., 2024).