- The paper’s main contribution is introducing FoPE, which models dimensions as Fourier series and zeroes out undertrained frequencies to correct spectral damage.
- It applies Discrete Signal Processing (DSP) theory to show that RoPE implicitly performs a Non-Uniform Discrete Fourier Transform, and that linear layers and activation functions outside attention cause spectral leakage and distortion that hinder periodic attention.
- Experimental results demonstrate that FoPE achieves lower perplexity and higher accuracy on long-context tasks compared to previous methods.
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
Introduction
"Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization" (2412.17739) addresses the challenge of extending context length in language models (LMs) through position embeddings. Specifically, the paper critiques Rotary Position Embedding (RoPE) within attention mechanisms, revealing how it performs a Non-Uniform Discrete Fourier Transform (NUDFT) that enables periodic attention but suffers from spectral damage caused by linear layers, activation functions, and inadequately trained frequency components. The authors propose Fourier Position Embedding (FoPE) to mitigate these issues by modeling each dimension as a Fourier series and zeroing out destructive frequency components, thereby improving both periodic extension and length generalization.
Fourier Position Embedding (FoPE) Method
The paper introduces FoPE as an advancement over RoPE, focusing on frequency-domain properties to improve length generalization and attention robustness. The authors propose two key enhancements:
- Modeling Dimensions as Fourier Series: RoPE treats each dimension as a single-frequency function, an idealization that spectral leakage from surrounding layers breaks in practice. FoPE instead expresses each dimension as a Fourier series, separating the information carried at different wavelengths and compensating for the spectral damage.
- Zeroing Out Undertrained Frequencies: FoPE replaces inadequately trained frequency components within attention, those never observed over a full period during training, with zero-frequency components. This substitution preserves the long-wavelength information crucial for periodic extension without harming model fitting.
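The two ideas above can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: `fope_embedding` is a hypothetical helper in which each dimension mixes its own RoPE frequency with a few extra frequencies via coefficients (random here, standing in for trained weights), realizing the Fourier-series view of a dimension.

```python
import numpy as np

def fope_embedding(pos, head_dim, num_freqs=4, base=10000.0, seed=0):
    """Sketch of FoPE's Fourier-series view: each dimension pair carries
    its own RoPE frequency plus a small learned mixture of other
    frequencies, instead of a single pure frequency as in RoPE."""
    rng = np.random.default_rng(seed)
    half = head_dim // 2
    # RoPE's base frequencies, one per dimension pair
    rope_freqs = base ** (-2.0 * np.arange(half) / head_dim)
    # Mixing coefficients and extra frequencies per dimension
    # (random stand-ins for trained parameters)
    coeffs = rng.normal(scale=0.1, size=(half, num_freqs))
    extra = rng.choice(rope_freqs, size=(half, num_freqs))
    # h_j(pos) = exp(i*w_j*pos) + sum_f a_{j,f} * exp(i*w_{j,f}*pos)
    phase = np.exp(1j * rope_freqs * pos)
    mix = (coeffs * np.exp(1j * extra * pos)).sum(axis=-1)
    return phase + mix
```

With `num_freqs=0` and zero coefficients this reduces to plain RoPE rotations, which makes the relationship between the two schemes explicit.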
Theoretical Insights
Using Discrete Signal Processing (DSP) theory, the authors show that RoPE implicitly performs NUDFT, which facilitates periodic attention but is undermined by spectral damage from LM components outside the attention module: linear layers cause spectral leakage, while activation functions cause spectral distortion. Additionally, time-domain truncation during training introduces noise into the frequency components, further limiting RoPE's periodic capabilities. By addressing these frequency-domain weaknesses, FoPE delivers more robust length generalization.
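As a sketch of the identity underlying this analysis (the pairing of real query/key dimensions into complex numbers $q_j$, $\bar{k}_j$ is standard RoPE convention, not necessarily the paper's exact notation; $b$ is the rotary base and $d$ the head dimension), the RoPE attention logit between positions $m$ and $n$ depends only on their relative distance and takes the form of a non-uniform discrete Fourier transform over that distance:

$$
a_{m,n} \;=\; \mathrm{Re}\!\left[\sum_{j=0}^{d/2-1} q_j\,\bar{k}_j\, e^{\,i\,\theta_j (m-n)}\right],
\qquad \theta_j = b^{-2j/d}.
$$

Spectral leakage from the surrounding linear layers and activations means each term is no longer confined to its single frequency $\theta_j$, which is precisely the damage FoPE targets.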
Implementation and Practical Considerations
FoPE can be implemented in practice by setting hyperparameters that control the dimensional modeling and component clipping:
- Multi-frequency Modeling: Initialize vectors to reflect multi-frequency characteristics inherent to the Fourier series framework.
- Frequency Clipping: Set a floor frequency and zero out all frequencies below it, removing undertrained components and supporting stable length generalization.
FoPE introduces negligible overhead in computations and memory compared to RoPE, thanks to its efficient implementation using weight matrices to map frequency coefficients.
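The clipping step can be sketched as follows; the name `clip_undertrained` and the specific floor choice (one full period within the training context) are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def clip_undertrained(freqs, train_len):
    """Zero out frequencies whose period exceeds the training context.

    Such components never complete a full cycle during training and are
    therefore undertrained; replacing them with the zero frequency keeps
    long-wavelength information stable under extrapolation."""
    floor = 2 * np.pi / train_len  # lowest frequency seen for a full period
    return np.where(freqs >= floor, freqs, 0.0)

# Example: RoPE-style frequencies for a 64-dim head, trained at length 1024
freqs = 10000.0 ** (-2.0 * np.arange(32) / 64)
clipped = clip_undertrained(freqs, train_len=1024)
```

The zero-frequency component is constant in position, so it extrapolates trivially to any context length, which is why it is a safe substitute for the clipped components.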
Experimental Evaluation
Experiments demonstrate FoPE's effectiveness through:
- Pre-Training Evaluations: Across various model scales and datasets, FoPE maintains lower perplexity and higher accuracy on tasks requiring long-context awareness than RoPE and ALiBi, underscoring its ability to extrapolate to longer contexts across diverse corpora and settings.
- Ablation Studies: Ablations over FoPE's distinct components and configurations showcase its superior handling of spectral damage and learning dynamics. Adjusting the frequency distribution and coefficient variance further refines its length generalization.
Limitations and Future Work
While FoPE enhances length generalization through frequency-domain manipulations, extending the approach to tasks such as KV-cache compression and collaborative frequency-domain embeddings requires further exploration. The paper focuses on immediate spectral improvements within attention, and it highlights areas ripe for innovation in aligning DSP techniques with modern LM operations.
Conclusion
FoPE presents a significant advancement in position-embedding techniques for LMs, offering a frequency-domain approach that bolsters periodic extension and length generalization. By modeling dimensions as Fourier series and addressing spectral damage, FoPE achieves greater consistency and accuracy on long-context tasks. Future work on frequency-domain methodologies promises to further enhance LLM scalability and applicability.