- The paper introduces a Depth-Weighted Average module that fuses outputs from previous Transformer blocks, enhancing data flow and mitigating diminishing returns.
- The paper demonstrates that DenseFormer achieves lower perplexity and faster inference using similar or smaller model sizes compared to standard Transformers.
- The paper's analysis reveals stable inter-block connectivity patterns, validating the architecture's efficiency and paving the way for future research.
Introduction
Transformers have become the foundation for advances in domains such as natural language processing and computer vision. A critical challenge in scaling Transformers is the diminishing return observed as model depth increases. This paper introduces DenseFormer, an architectural modification that addresses this problem by inserting a Depth-Weighted Average (DWA) module after each Transformer block, enhancing information flow across layers.
Methodology
The core innovation of the DenseFormer architecture is the Depth-Weighted Average (DWA) module. After each block, this module computes a learned weighted average of the current block's output and the outputs of all previous blocks, giving later layers direct access to earlier representations without routing them through every intervening layer. The mechanism is inspired by the DenseNet architecture from computer vision and alleviates diminishing returns by fostering a richer, more direct flow of information through the model.
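The weighted-average mechanism can be sketched as a small PyTorch module. This is a minimal illustration, not the paper's implementation: the class name, the identity-style initialization (full weight on the current block's output), and the convention that the history list starts with the initial embeddings are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Hypothetical sketch of a DWA module: a learned weighted average
    over the current block's output and all earlier representations."""

    def __init__(self, num_inputs: int):
        super().__init__()
        # One learnable scalar weight per representation in the history.
        # Assumed initialization: weight 1 on the current block's output,
        # 0 elsewhere, so the module starts as the identity mapping.
        init = torch.zeros(num_inputs)
        init[-1] = 1.0
        self.alpha = nn.Parameter(init)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history = [embeddings, block_1_out, ..., current_block_out],
        # each of shape (batch, seq_len, dim).
        stacked = torch.stack(history, dim=0)          # (num_inputs, B, S, D)
        w = self.alpha.view(-1, 1, 1, 1)               # broadcast per entry
        return (w * stacked).sum(dim=0)
```

A forward pass through the model would append each block's output to the history and replace it with the DWA result, so the per-layer overhead is one weighted sum over cached activations.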
Implementation Details
Implementing the DenseFormer requires minimal adjustments to the standard Transformer architecture, with the main addition being the DWA module. The paper details efficient implementation strategies for the DWA to minimize computational and memory overhead. Furthermore, Dilated DenseFormer is introduced as a variant that reduces the computational cost by applying a dilation factor to the DWA calculations, selectively using outputs from every k-th block, thereby offering a tunable trade-off between model performance and efficiency.
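The dilation idea can be sketched by restricting which history entries enter the average. This is a hedged illustration: the exact indexing convention (keeping entry j when the gap to the current block is a multiple of the dilation factor k) is an assumption, and the class and parameter names are invented for the sketch.

```python
import torch
import torch.nn as nn

class DilatedDWA(nn.Module):
    """Hypothetical sketch of a dilated DWA module: only every k-th
    past representation contributes to the weighted average."""

    def __init__(self, block_index: int, dilation: int):
        super().__init__()
        # Assumed sparsity pattern: keep history entry j only when the
        # distance (block_index - j) is a multiple of `dilation`.
        self.indices = [j for j in range(block_index + 1)
                        if (block_index - j) % dilation == 0]
        init = torch.zeros(len(self.indices))
        init[-1] = 1.0  # identity init on the current block's output
        self.alpha = nn.Parameter(init)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # Gather only the retained entries, then take the weighted sum.
        selected = torch.stack([history[j] for j in self.indices], dim=0)
        w = self.alpha.view(-1, 1, 1, 1)
        return (w * selected).sum(dim=0)
```

With dilation k, each module stores and averages roughly 1/k as many activations, which is the source of the tunable performance/efficiency trade-off described above.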
Experimental Results
Extensive experiments across various datasets, including OpenWebText2 and PG-19, demonstrate that DenseFormer models achieve better performance metrics (e.g., perplexity) than their Transformer counterparts with similar or smaller model sizes. Notably, DenseFormer models exhibit superior data efficiency, reaching the same perplexity levels as much deeper Transformer models while being faster and less memory-intensive. The paper also explores the impact of different dilation factors and shows that small dilation values can significantly improve inference speed without considerably degrading performance.
The paper provides an in-depth analysis of the learned DWA weights, uncovering stable patterns that reveal the strategic reuse of activations from distant layers. This analysis suggests that even though many of the DWA weights are small, they play a critical role in the DenseFormer's improved performance. Furthermore, experiments that modified the sparsity patterns during training confirmed the importance of the DenseFormer’s unique inter-block connectivity pattern for achieving optimal results.
Future Directions and Conclusion
The DenseFormer presents a compelling case for its adoption in large-scale models, offering improvements in efficiency and performance. The paper suggests avenues for future research, including the exploration of more efficient implementations and the investigation of alternative sparsity patterns that could further enhance model performance. DenseFormer's success in leveraging depth-weighted averaging hints at a promising direction for overcoming the limitations of current Transformer architectures, potentially paving the way for the development of more efficient and powerful models in various domains.