Frequency Debiased Transformer
- Frequency Debiased Transformer is a model architecture that counteracts standard Transformers’ low-pass bias by equalizing attention across frequency bands.
- It employs explicit frequency decomposition, patch-wise normalization, and localized self-attention to enhance high-frequency signal capture and overall forecasting accuracy.
- Empirical evaluations on benchmarks such as ETTh1, Weather, and CIFAR-100 demonstrate superior performance, with reduced MSE and higher classification accuracy.
A Frequency Debiased Transformer (FDT) is a Transformer-based architecture designed to counteract the empirical tendency of standard self-attention models to concentrate model capacity and learning almost exclusively on large-energy (typically low-frequency) spectral components in sequential data, which degrades modeling and predictive performance on high-frequency or low-amplitude components. This challenge, termed "frequency bias", is especially deleterious in time series forecasting, vision, and spiking neural architectures, where accurate reconstruction or prediction often depends on fine-grained temporal or spatial variations that manifest predominantly in the high-frequency spectrum. Frequency Debiased Transformers employ structural, often spectral, modifications of input representation, normalization, self-attention, and information fusion to ensure equal (or adaptively balanced) prioritization across distinct frequency bands, mitigating low-pass bias and enhancing generalization to signals with heterogeneous or rapidly fluctuating dynamics (Piao et al., 2024, Fang et al., 24 May 2025, Bai et al., 22 Jan 2026).
1. Theoretical Foundation: Frequency Bias in Transformers
The standard Transformer, when applied to sequential or time-series data, manifests a pronounced bias towards learning features with naturally high spectral energy, typically dominated by low-frequency, slowly varying components. Formally, for an input series $x = (x_1, \dots, x_T)$, the Discrete Fourier Transform (DFT) yields spectral components $X_k = \sum_{t=1}^{T} x_t\, e^{-i 2\pi k (t-1)/T}$, with the "key components" being those of maximal amplitude $|X_k|$. Empirically, the relative spectral reconstruction error per frequency, $e_k = |X_k - \hat{X}_k| / |X_k|$ (with $\hat{X}_k$ the spectrum of the model's output), is minimized for large-amplitude components early in training and remains large for low-energy components, producing an error profile in which $e_k$ grows as $p_k$ shrinks, where $p_k = |X_k| / \sum_j |X_j|$ denotes the normalized amplitude proportion of component $k$ (Piao et al., 2024). This spectral bias is reinforced by both the energy distribution of real-world signals (Parseval's theorem) and standard attention mechanisms aggregating dominant trends. As a result, critical high-frequency dynamics are systematically overlooked.
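The amplitude imbalance described above is easy to reproduce numerically. The following sketch (an illustration, not code from the cited papers) computes the amplitude proportion $p_k$ for a synthetic trend-plus-detail series and shows how heavily the spectrum is skewed toward the low-frequency component:

```python
import numpy as np

# Illustration: for a series dominated by a slow trend, the normalized
# amplitude proportion p_k concentrates almost entirely on low-frequency bins,
# so an amplitude-driven learner has little incentive to fit the fast detail.
T = 256
t = np.arange(T)
x = (np.sin(2 * np.pi * 2 * t / T)            # strong slow trend (bin k=2)
     + 0.1 * np.sin(2 * np.pi * 40 * t / T))  # weak fast detail (bin k=40)

amp = np.abs(np.fft.rfft(x))
p = amp / amp.sum()                           # amplitude proportion p_k

print(f"p_2  (trend)  = {p[2]:.3f}")
print(f"p_40 (detail) = {p[40]:.3f}")
```

Here roughly 90% of the amplitude mass sits in the single trend bin, mirroring the low-pass pull on optimization that debiasing architectures are built to counter.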
2. Structural Solutions: Frequency Debiasing Architectures
To mitigate frequency bias, Frequency Debiased Transformers adopt explicit architectural segmentation in the frequency domain. The Fredformer implements a three-stage debiasing design (Piao et al., 2024):
- Frequency Decomposition and Patch-wise Normalization: After the DFT, the spectral representation $X \in \mathbb{C}^{F}$ is partitioned into $N$ contiguous bands $\{X^{(1)}, \dots, X^{(N)}\}$, each a local frequency patch, which are then normalized independently such that $\max_k |X^{(i)}_k| = 1$ for every band $i$ (per-band normalization). This removes amplitude variation between bands, guaranteeing equal maximum energy allocation among all spectral regions.
- Frequency-Local, Channel-wise Self-Attention: Each normalized band undergoes standard Transformer encoder processing independently, with self-attention restricted to within-band operations. Attention queries, keys, and values are constructed channel-wise per band, precluding dominant bands from overshadowing others in representation learning.
- Frequency-Wise Recombination: Outputs from all bands are concatenated and projected back to the original spectral shape, followed by inverse DFT reconstruction for time-domain prediction. The entire process is integrated with standard time-domain MSE loss, ensuring any residual spectral imbalance affects optimization directly.
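The three stages can be sketched end to end in NumPy. The band count, the flattening of real and imaginary parts into channel-wise tokens, and the single-head dot-product attention are illustrative assumptions here, not the Fredformer reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def fredformer_sketch(x, n_bands=4):
    """x: (C, T) multivariate series -> (C, T) reconstruction."""
    C, T = x.shape
    X = np.fft.rfft(x, axis=1)                           # (C, F) spectrum
    out = np.empty_like(X)
    for idx in np.array_split(np.arange(X.shape[1]), n_bands):
        # channel-wise tokens: real/imag parts of this band, flattened
        B = np.stack([X[:, idx].real, X[:, idx].imag], -1).reshape(C, -1)
        mu, sd = B.mean(), B.std() + 1e-8
        Bn = (B - mu) / sd                               # patch-wise norm
        # self-attention over channels, restricted to this band
        A = softmax(Bn @ Bn.T / np.sqrt(Bn.shape[1]))    # (C, C)
        Bo = (A @ Bn) * sd + mu                          # de-normalize
        Bo = Bo.reshape(C, len(idx), 2)
        out[:, idx] = Bo[..., 0] + 1j * Bo[..., 1]       # write band back
    return np.fft.irfft(out, n=T, axis=1)                # time-domain output

x = rng.standard_normal((3, 64))
y = fredformer_sketch(x)
```

Because every band is normalized before attention, the dominant low-frequency patch carries no larger activations into the encoder than the weakest high-frequency patch.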
Complementing this, the Max-Former (in spiking and low-bit architectures) introduces frequency-enhancing operators—extra Max-Pooling in patch embedding and early-stage Depth-Wise Convolution—to directly augment high-frequency signal propagation and prevent low-pass attenuation inherent to LIF spiking neuron dynamics (Fang et al., 24 May 2025).
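A minimal structural sketch of the two Max-Former operators, with assumed shapes and kernel sizes (note that the depthwise kernel mixes spatially but never across channels, unlike a standard convolution):

```python
import numpy as np

def max_pool2d(x, k=2):
    """Peak-preserving downsample of a (C, H, W) feature map."""
    C, H, W = x.shape
    x = x[:, : H // k * k, : W // k * k].reshape(C, H // k, k, W // k, k)
    return x.max(axis=(2, 4))

def depthwise_conv2d(x, kernels):
    """kernels: (C, kh, kw), one small kernel per channel (no channel mixing)."""
    C, H, W = x.shape
    kh, kw = kernels.shape[1:]
    out = np.zeros((C, H - kh + 1, W - kw + 1))
    for c in range(C):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = (x[c, i:i + kh, j:j + kw] * kernels[c]).sum()
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 32))      # (channels, H, W) feature map
x = max_pool2d(x, 2)                      # Max-Pooling inside patch embedding
dwc = rng.standard_normal((8, 3, 3)) * 0.1
y = depthwise_conv2d(x, dwc)              # early-stage DWC, kernel size assumed
```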
3. Layer-wise and Adaptive Frequency Allocation
Beyond fixed-band partitioning, newer architectures such as Dualformer introduce layer-wise hierarchical frequency allocation and dual-domain learning (Bai et al., 22 Jan 2026). At each encoder layer $l$, only a specific band of the input frequency spectrum is processed:
- Hierarchical Frequency Sampling (HFS) dynamically shifts the sampled frequency interval per layer so that shallow layers handle mostly high-frequency bands (capturing local or rapid fluctuations), while deeper layers gravitate towards low-frequency (trend) components.
- This pipeline is dual-branch: a time-domain branch operates on the inverse-transformed, frequency-filtered input, while a frequency-domain branch employs autocorrelation-based attention. Outputs are fused via periodicity-aware weighting that adapts fusion weights according to the harmonic energy ratio of the current input, estimated as $w_f = E_h / E$ and $w_t = 1 - E_h / E$ (where $E_h$ is the harmonic and $E$ the total spectral energy).
This design enables distributive capacity allocation along the depth of the architecture and adaptive integration of time- and frequency-domain features, ensuring neither fine-grained nor global periodic components dominate convergence unduly.
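The periodicity-aware fusion weighting can be sketched as below. The harmonic-energy heuristic (dominant bin plus its integer multiples) is an assumption standing in for Dualformer's actual estimator of the harmonic energy:

```python
import numpy as np

def fusion_weights(x, n_harmonics=3):
    """Return (w_f, w_t) from the harmonic-to-total spectral energy ratio."""
    A = np.abs(np.fft.rfft(x - x.mean()))
    E = (A ** 2).sum()                              # total spectral energy
    k0 = int(A.argmax())                            # dominant frequency bin
    harm = [k0 * m for m in range(1, n_harmonics + 1) if 0 < k0 * m < len(A)]
    E_h = (A[harm] ** 2).sum()                      # harmonic energy
    w_f = E_h / (E + 1e-12)                         # frequency-branch weight
    return w_f, 1.0 - w_f                           # time branch gets the rest

t = np.arange(256)
w_periodic = fusion_weights(np.sin(2 * np.pi * 8 * t / 256))
w_noise = fusion_weights(np.random.default_rng(0).standard_normal(256))
```

A strongly periodic input pushes nearly all weight onto the frequency branch, while white noise leaves most weight on the time branch, which is the adaptive behavior the text describes.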
4. Efficient Variants and Computational Considerations
Frequency Debiased Transformers must also contend with the computational overhead of spectral transforms and frequency-local attention. In Fredformer, the main bottleneck is the $O(C^2)$ cost of channel-wise attention within each frequency patch, for $C$ channels. To alleviate this, a Nyström method approximates full attention using a small subset of $m$ "landmark" channels, reducing complexity to $O(NCm)$ per layer (with $N$ the number of bands), while maintaining performance parity with the full model (Piao et al., 2024). The $O(T \log T)$ cost of the DFT/IDFT transforms persists as overhead per sample, and patch size is selected by cross-validation to balance localization and bandwidth requirements.
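A compact sketch of Nyström-approximated attention over channel tokens, in the style of Nyströmformer; the three-factor form, landmark count, and strided landmark selection are assumptions here, not Fredformer's exact implementation:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(Q, K, V, m=16):
    """Approximate softmax(QK^T/sqrt(d)) @ V using m landmark rows."""
    C, d = Q.shape
    idx = np.linspace(0, C - 1, m).astype(int)   # strided landmark channels
    Qm, Km = Q[idx], K[idx]
    s = 1.0 / np.sqrt(d)
    F = softmax(Q @ Km.T * s)                    # (C, m)
    A = softmax(Qm @ Km.T * s)                   # (m, m) landmark kernel
    B = softmax(Qm @ K.T * s)                    # (m, C)
    # never materializes the (C, C) attention matrix: O(C*m) memory
    return F @ np.linalg.pinv(A) @ (B @ V)

rng = np.random.default_rng(0)
C, d = 64, 16
Q, K, V = (rng.standard_normal((C, d)) for _ in range(3))
full = softmax(Q @ K.T / np.sqrt(d)) @ V         # exact O(C^2) attention
approx = nystrom_attention(Q, K, V, m=16)
rel_err = np.abs(full - approx).mean() / np.abs(full).mean()
print(f"relative error vs full attention: {rel_err:.3f}")
```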
Max-Former achieves parameter and energy efficiency by replacing some early self-attention modules with small-kernel Depth-Wise Convolution and inserting Max-Pooling within patch embeddings, both of which enhance high-frequency retention at negligible additional cost (Fang et al., 24 May 2025).
5. Empirical Performance and Ablations
Frequency Debiased Transformers have demonstrated significant empirical advantages over conventional architectures across a range of long-term forecasting benchmarks. Notable results (Piao et al., 2024, Bai et al., 22 Jan 2026):
| Dataset | Model | Avg. MSE / Top-1 Acc. | Relative Gain |
|---|---|---|---|
| ETTh1 | Fredformer | 0.435 | Best of 8 methods |
| ECL | Fredformer | 0.175 | Best overall |
| Weather | Dualformer | 0.220 | –15% vs next-best (MSE) |
| CIFAR-100 | Max-Former | 82.65% Top-1 Acc | +4.44% vs Spikformer |
| ImageNet | Max-Former | 82.39% Top-1 Acc | +7.58% vs baseline |
Ablation studies corroborate that both patch-wise normalization and channel-local attention are indispensable; notably, removing frequency normalization in Fredformer deteriorates Weather MSE from 0.246 to 0.293 (Piao et al., 2024). Correspondingly, in Max-Former, each additional Max-Pooling stage recovers greater high-frequency content and boosts accuracy (Fang et al., 24 May 2025). Dualformer’s omission of periodicity-aware weighting increases Weather MSE by +0.021, directly quantifying the role of spectral adaptivity (Bai et al., 22 Jan 2026).
6. Extensions and Future Directions
Ongoing directions in Frequency Debiased Transformer research emphasize:
- Adaptive Band Selection: Dynamic learning of frequency patch sizes or locations (analogous to learnable spectral filters or STFT/wavelet bases).
- Joint Time-Frequency Modeling: Exploiting multi-resolution, nonuniform, or learned transforms to support more complex or nonstationary patterns.
- Generalization to Other Modalities: Application of frequency debiasing strategies in vision Transformers, speech processing, and spiking neural models leverages analogous spectral principles (Fang et al., 24 May 2025).
- Resource-Constrained Deployment: Low-rank or structured approximations (e.g., Nyström, DWC) to retain debiasing effectiveness in edge or low-power contexts.
7. Significance and Broader Impact
Frequency Debiased Transformer architectures directly address a fundamental limitation of standard self-attention in sequential data modeling: the endogenous low-pass characteristic that suppresses high-frequency, transient, or noisy features. By structurally allocating model capacity across the full frequency spectrum—either equally (Fredformer), adaptively (Dualformer), or with explicit spectral enhancement (Max-Former)—these architectures achieve robustness and superior predictive accuracy, particularly in long-horizon, heterogeneous, or detail-critical forecasting tasks. Their strong empirical performance across standard benchmarks, together with their extensibility to resource-efficient and neuromorphic computing, indicates a compelling direction for Transformer research in time series, vision, and spike-based learning (Piao et al., 2024, Fang et al., 24 May 2025, Bai et al., 22 Jan 2026).