
Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models

Published 10 May 2025 in cs.CL and cs.LG | (2505.06633v1)

Abstract: Decoder-only transformer networks have become incredibly popular for language modeling tasks. State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). In this paper, we examine the importance of the FFN during the model pre-training process through a series of experiments, confirming that the FFN is important to model performance. Furthermore, we show that models built from fewer transformer blocks, each with a three-layer FFN, outperform the standard two-layer configuration, delivering lower training loss with fewer total parameters in less time.

Summary

  • The paper empirically investigates the importance of Feedforward Networks (FFNs) in decoder-only transformers, demonstrating that increasing FFN depth can improve performance and parameter efficiency.
  • Experiments on language modeling tasks show that models with three-layer FFNs achieve lower training loss with fewer total parameters and reduced computational time compared to conventional two-layer setups.
  • The results suggest that deeper FFNs significantly contribute to the transformation capabilities of decoder-only models, highlighting the importance of FFN architecture for parameter-efficient designs and future model improvements.

The paper "Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models" presents an empirical investigation into the significance of Feedforward Networks (FFNs) within decoder-only transformer architectures. These architectures, popularized by models like GPT, typically comprise multiple transformer blocks, each containing a multi-head attention (MHA) mechanism followed by a two-layer FFN. Given the substantial share of the parameter budget allocated to FFNs within these blocks, the paper challenges the conventional emphasis on MHA optimization by demonstrating the critical role of FFNs and the improvements attainable by modifying them.
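As a concrete reference point, the per-token FFN path can be sketched with the layer count as a parameter. This is a minimal illustration, not the paper's implementation: the GELU activation, the initialization scale, and the dimension sizes below are assumptions not specified in this summary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, a common FFN activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def make_ffn(d_model, d_hidden, n_layers, rng):
    """Build n_layers linear layers: d_model -> d_hidden -> ... -> d_model."""
    dims = [d_model] + [d_hidden] * (n_layers - 1) + [d_model]
    return [(rng.standard_normal((dims[i], dims[i + 1])) * 0.02,
             np.zeros(dims[i + 1]))
            for i in range(n_layers)]

def ffn_forward(x, layers):
    """Apply the linear stack with GELU between layers (none after the last).

    A 2-element `layers` list is the standard two-layer FFN; a 3-element
    list is the three-layer variant the paper studies.
    """
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = gelu(x)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))              # (tokens, d_model)
ffn3 = make_ffn(512, 2048, n_layers=3, rng=rng)
y = ffn_forward(x, ffn3)
print(y.shape)                                  # (4, 512): output matches input width
```

The zero- and one-layer configurations the study ablates correspond to an empty or single-element layer list in this sketch.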

Experimental Findings

The study evaluated various FFN configurations within transformer blocks through language modeling experiments on the Booksum and Wikitext datasets. The architectures tested included FFNs with zero, one, two, and three linear layers, each assessed for model performance while balancing the trainable-parameter budget across model depth and dimension size.

The results provide compelling evidence for the positive influence of increasing the FFN complexity within the blocks. Models with three-layer FFNs, despite utilizing fewer transformer blocks, demonstrate superior performance compared to conventional two-layer counterparts. Notably, the configuration with three-layer FFNs and only ten blocks attains lower training loss with fewer total parameters and reduced computational time compared to the baseline setup.
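The parameter arithmetic behind this trade-off can be illustrated with a simple count. The dimensions below are hypothetical (GPT-2-small-like); the summary does not give the paper's actual sizes, so this only shows how a deeper-but-narrower FFN stack in fewer blocks can come in under a shallower baseline's budget.

```python
def ffn_params(d_model, d_hidden, n_layers):
    """Weights + biases for a d_model -> d_hidden -> ... -> d_model linear stack."""
    dims = [d_model] + [d_hidden] * (n_layers - 1) + [d_model]
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(n_layers))

d_model = 768  # hypothetical model width

# 12 blocks with the standard two-layer FFN (4x hidden width)
baseline = 12 * ffn_params(d_model, 4 * d_model, n_layers=2)

# 10 blocks with a three-layer FFN at half the hidden width
deeper = 10 * ffn_params(d_model, 2 * d_model, n_layers=3)

print(baseline, deeper, deeper < baseline)
```

With these illustrative widths the per-block FFN cost is nearly identical, so dropping from twelve blocks to ten yields fewer total FFN parameters overall, mirroring the paper's reported direction of the trade-off.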

Implications of FFN Design

The significance of the study lies not only in confirming the critical role of FFNs but also in suggesting that deeper FFNs could enhance the transformation capabilities of decoder-only models. The demonstrated improvements hint at FFNs' role in efficiently approximating functions and capturing the intricate patterns needed for effective language representation, a capability that goes beyond the universal function approximation guarantee attributed to networks with a single hidden layer.

From a practical standpoint, integrating more complex FFNs within transformer blocks could lead to more parameter-efficient models. This is inherently valuable in contexts constrained by computational resources, enabling faster pre-training processes while maintaining or enhancing model performance.

Future Directions

While this paper lays the groundwork for reconsidering the architecture of FFNs, several avenues invite further exploration. Future research could investigate even deeper FFNs or experiment with alternative activation functions and dropout configurations to refine layer outputs further. There is scope for studying more diverse datasets, relaxing assumptions about fixed MHA mechanisms, or exploring architectures combining increased depth of FFNs with novel attention mechanisms like FlashAttention or Star Attention.

Additionally, investigating the scalability of these findings across larger models could address the generalizability of results observed on mid-sized architectures. Optimizing hyperparameters specific to enhanced FFN configurations and evaluating model performance using more specialized downstream tasks and metrics beyond cross-entropy loss could also provide insightful benchmarks.

The implications of incorporating more complex FFNs extend to the field of theoretical understanding, reinforcing the importance of architectural decisions in realizing the potential of transformers. This paper thus serves as a critical stepping stone toward a more nuanced comprehension of how FFNs contribute to the expressiveness and overall capabilities of LLMs.
