
SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts

Published 25 Feb 2025 in cs.LG (arXiv:2502.18394v7)

Abstract: Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications, from multi-turn dialogue to high-resolution vision, require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^2)$ to $\mathcal{O}(L \log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to 7$\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.

Summary

  • The paper introduces FFTNet, which leverages FFT to replace quadratic self-attention with an O(n log n) global token mixing mechanism.
  • The methodology features a four-step process: Fourier transform, adaptive spectral filtering, modReLU activation, and inverse FFT, ensuring effective token interactions.
  • Experimental results on Long Range Arena and ImageNet illustrate competitive accuracy and reduced latency compared to standard self-attention models.

Introduction

The paper presents FFTNet, a spectral filtering paradigm that addresses the scalability challenges of conventional self-attention mechanisms in neural networks. The authors leverage the Fast Fourier Transform (FFT), offering a reduction in complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$. This computational efficiency is achieved by transforming sequence inputs into the frequency domain, where orthogonality and energy preservation are maintained through Parseval's theorem. The result is an adaptive global token mixing framework that efficiently captures long-range dependencies.

Methodology

The FFTNet framework comprises four main steps:

  1. Fourier Transform: The discrete Fourier transform decomposes the sequence into orthogonal frequency components. This step encapsulates global interactions without pairwise computations, laying the groundwork for subsequent filtering.
  2. Adaptive Spectral Filtering: The authors introduce a learnable spectral filter using a context vector derived from the sequence mean. A modulation tensor is computed through an MLP, allowing dynamic spectral filter adjustment. This adaptability emphasizes salient frequencies crucial for intricate patterns.
  3. Nonlinear Activation (modReLU): To enhance representation capabilities beyond linear transformations, modReLU is applied to complex components, conditioning on input magnitude while retaining phase information.
  4. Inverse Fourier Transform: The filtered frequency components are transformed back into the token domain, yielding a globally mixed representation that incorporates adaptive filter dynamics and nonlinear spectral processing.
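The four steps above can be sketched in NumPy. The exact gate parameterization is not specified here, so the two-layer MLP shape, the `1 + gate` residual-style modulation, and all variable names below are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def modrelu(z, b):
    """modReLU: threshold the magnitude of a complex value, keep its phase."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / (mag + 1e-8)
    return z * scale

def spectral_mix(x, W1, W2, b):
    """One spectral-mixing step on a (seq_len, d_model) input.

    1. real FFT along the sequence axis;
    2. adaptive spectral filter from the sequence mean (tiny 2-layer MLP);
    3. modReLU nonlinearity on the complex coefficients;
    4. inverse real FFT back to the token domain.
    """
    L = x.shape[0]
    X = np.fft.rfft(x, axis=0)            # (L//2 + 1, d_model), complex
    ctx = x.mean(axis=0)                  # context vector from the sequence mean
    gate = W2 @ np.tanh(W1 @ ctx)         # (L//2 + 1,) per-frequency modulation
    X = X * (1.0 + gate[:, None])         # adaptive spectral filtering
    X = modrelu(X, b)                     # nonlinear activation in frequency space
    return np.fft.irfft(X, n=L, axis=0)   # back to tokens, shape preserved

# Toy usage
rng = np.random.default_rng(0)
L, d, h = 16, 8, 4
x = rng.standard_normal((L, d))
W1 = rng.standard_normal((h, d)) * 0.1
W2 = rng.standard_normal((L // 2 + 1, h)) * 0.1
y = spectral_mix(x, W1, W2, b=-0.05)
assert y.shape == x.shape
```

Note that no pairwise token-by-token computation appears anywhere: all global mixing happens through the FFT, and the learned components (`W1`, `W2`, `b`) only add cost linear in sequence length.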

Computational Complexity

The FFT-based approach substantially reduces computational cost, offering:

  • Efficient Global Interactions: The FFT's $\mathcal{O}(n \log n)$ complexity affords scalable handling of long sequences, contrasting sharply with the quadratic demands of self-attention.
  • Minimal Overhead: Adaptive filtering and modReLU introduce only linear overhead, preserving the efficient computational baseline established by the FFT.

    Figure 1: Latency comparison of FFTNetViT vs. standard ViT for varying batch sizes on ImageNet. FFTNetViT shows lower latency and more favorable scaling than standard self-attention.
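To put the asymptotics in concrete terms, a back-of-the-envelope comparison at a 128k-token context (ignoring constant factors, head counts, and hidden dimensions, so this is an operation-count ratio rather than a wall-clock prediction):

```python
import math

L = 131_072  # a 128k-token context

attention_ops = L * L        # pairwise score matrix: ~L^2 interactions per head
fft_ops = L * math.log2(L)   # FFT butterfly operations: ~L * log2(L)

ratio = attention_ops / fft_ops
print(f"~{ratio:,.0f}x fewer mixing operations at L = 128k")  # -> ~7,710x
```

The measured 7x speedup over FlashAttention-2 reported in the abstract is far smaller than this raw ratio, as expected: highly optimized attention kernels, memory bandwidth, and the per-layer constant factors all narrow the gap in practice.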

Experiments

The authors validate their approach on both the Long Range Arena (LRA) benchmarks and the ImageNet classification tasks, showcasing favorable results:

  • Long Range Arena: FFTNet outperformed both standard Transformers and other Fourier-based models such as FNet on several tasks (e.g., ListOps, Text), achieving a higher average accuracy across the benchmark suite.
  • ImageNet Classification: FFTNet variants demonstrated competitive Top-1 and Top-5 accuracy at comparable or lower parameter and FLOP budgets than standard ViT architectures.

Theoretical Guarantees

The authors provide robust theoretical underpinnings supporting FFTNet's design, emphasizing:

  • Energy Preservation: Parseval's theorem guarantees the transformation's stability, preserving signal norms across transformations.
  • Expressivity & Approximation: FFTNet's configuration allows approximation of self-attention dynamics, with complex convolution providing scalable, adaptive interaction modeling.
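The energy-preservation claim is easy to verify numerically. The sketch below uses NumPy's orthonormal DFT scaling (`norm="ortho"`), under which Parseval's identity holds without a $1/N$ correction factor:

```python
import numpy as np

# Parseval's theorem: the DFT (with orthonormal scaling) preserves the
# squared norm of a signal, so spectral filtering operates on a
# representation carrying the same "energy" as the token-domain input.
rng = np.random.default_rng(42)
x = rng.standard_normal(1024)

X = np.fft.fft(x, norm="ortho")          # orthonormal DFT
token_energy = np.sum(np.abs(x) ** 2)
freq_energy = np.sum(np.abs(X) ** 2)
assert np.isclose(token_energy, freq_energy)
```

This is what makes frequency-domain filtering well-conditioned: scaling a spectral coefficient changes the output norm in a controlled, predictable way, with no hidden amplification from the transform itself.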

Conclusion

FFTNet emerges as a practical and effective alternative to self-attention, merging the FFT's solid theoretical foundation with adaptive filtering capabilities. The proposed framework achieves competitive accuracy and improved computational efficiency, charting a path toward scalable sequence modeling without the quadratic cost of conventional self-attention. This work underscores the potential of combining spectral processing with adaptive learning strategies, setting a precedent for future developments in efficient model architectures.
