
Bridging the Divide: Reconsidering Softmax and Linear Attention

Published 9 Dec 2024 in cs.CV (arXiv:2412.06590v1)

Abstract: Widely adopted in modern Vision Transformer designs, Softmax attention can effectively capture long-range visual information; however, it incurs excessive computational cost when dealing with high-resolution inputs. In contrast, linear attention naturally enjoys linear complexity and has great potential to scale up to higher-resolution images. Nonetheless, the unsatisfactory performance of linear attention greatly limits its practical application in various scenarios. In this paper, we take a step toward closing the gap between linear and Softmax attention with novel theoretical analyses, which demystify the core factors behind their performance deviations. Specifically, we present two key perspectives to understand and alleviate the limitations of linear attention: the injective property and the local modeling ability. First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors and thus causing severe semantic confusion, since different queries then correspond to the same outputs. Second, we confirm that effective local modeling is essential for the success of Softmax attention, and that linear attention falls short in this respect. These two fundamental differences account for much of the disparity between the two attention paradigms, as demonstrated by our substantial empirical validation. Further experimental results indicate that linear attention, once endowed with these two properties, can outperform Softmax attention across various tasks while maintaining lower computational complexity. Code is available at https://github.com/LeapLabTHU/InLine.

Citations (2)

Summary

  • The paper demonstrates that linear attention’s lack of injectivity causes semantic confusion, reducing expressiveness compared to Softmax.
  • It introduces two key modifications—switching to subtractive normalization and adding an MLP residual—to boost local modeling.
  • Experimental results on benchmarks like ImageNet and COCO show that the modified linear attention can rival or outperform Softmax with lower complexity.

This paper addresses the performance and complexity trade-offs between the Softmax and linear attention mechanisms, specifically within the context of Vision Transformers. The authors develop a detailed theoretical framework that targets the core issues that have historically diminished the efficacy of linear attention in vision tasks. The analysis centers on two pivotal properties: injectivity and local modeling capability.

In contemporary Vision Transformer implementations, Softmax attention is widely recognized for its ability to model long-range dependencies, producing state-of-the-art results across various computer vision applications. However, this proficiency comes at the cost of quadratic complexity in the number of tokens, which poses significant computational challenges in high-resolution scenarios. Linear attention, while offering a reduced computational complexity of O(N) as opposed to Softmax's O(N²), typically falls short in expressivity and practical utility due to inherent limitations in its formulation.
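The complexity contrast can be sketched with a small NumPy toy. This is an illustrative sketch, not the paper's formulation: the feature map `phi` (a shifted ReLU to keep scores positive) and all shapes are assumptions. The key point is that linear attention reorders the multiplication as Q(KᵀV), so the N×N score matrix is never formed:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard Softmax attention: forms the full N x N score matrix,
    # so cost grows quadratically with sequence length N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (N, d)

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Linear attention: applies a positive feature map phi and reorders
    # the product as Q (K^T V), avoiding the N x N matrix entirely.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                       # (d, d), independent of N
    z = Qp @ Kp.sum(axis=0, keepdims=True).T            # (N, 1) normalizer
    return (Qp @ kv) / z                                # (N, d)

N, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (8, 4) (8, 4)
```

Per token, Softmax attention costs O(Nd) (hence O(N²d) total), while linear attention amortizes a one-time O(Nd²) cost, which is why the latter scales gracefully to high-resolution inputs.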

The authors begin with a critical analysis of these limitations, focusing on the injectivity of the attention function. They mathematically demonstrate that linear attention lacks the injective property: different queries are often mapped to the same attention weights. This non-injectivity is a source of semantic confusion which, in turn, diminishes the expressiveness of linear attention. Conversely, Softmax attention is proven to be injective under reasonable assumptions, which helps avoid such confusion. The notion of injectivity is articulated through both theoretical proofs and experimental observations, with the empirical data highlighting scenarios where real-world linear attention models suffer from exactly this confusion.
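One simple instance of the non-injectivity can be checked numerically. In this toy (an assumption-laden sketch, using an identity feature map and positive keys rather than the paper's general setting), scaling a query leaves division-normalized linear attention weights unchanged, whereas Softmax weights do change:

```python
import numpy as np

def linear_weights(q, K):
    # Division-normalized linear attention weights for one query
    # (feature map assumed already applied; entries kept positive).
    s = K @ q
    return s / s.sum()

def softmax_weights(q, K):
    # Softmax attention weights for the same query.
    s = K @ q
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K = rng.uniform(0.1, 1.0, size=(5, 4))  # positive keys
q = rng.uniform(0.1, 1.0, size=4)

# q and 3q are distinct queries, yet linear attention assigns them
# identical weights: division normalization cancels the scale factor.
print(np.allclose(linear_weights(q, K), linear_weights(3 * q, K)))    # True
print(np.allclose(softmax_weights(q, K), softmax_weights(3 * q, K)))  # False
```

Two different queries thus collapse to the same linear attention output, which is one concrete face of the semantic confusion the paper analyzes.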

In addressing local modeling capability, the paper underscores the need for effective local attention to complement robust long-range modeling. Besides inherently maintaining a large receptive field, Softmax attention is also proficient in local modeling, a capability that linear attention lacks. The empirical analysis reveals that the performance gap between Softmax and linear attention is largely attributable to this difference in local modeling ability.

To mitigate these issues, the authors propose two modifications to linear attention: replacing the division-based normalization with a subtractive one to ensure injectivity, and incorporating an MLP-based local residual term that strengthens the local attention bias. These modifications were validated on Swin Transformer architectures, showing that the resulting linear attention matches or even exceeds Softmax performance across various benchmarks, including ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
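A minimal sketch of the two ingredients might look as follows. Both the exact subtractive normalization (here w_i = s_i − mean(s) + 1/N, chosen so each weight row still sums to 1) and the shape of the local residual (here an MLP over a window of 3 neighboring tokens in 1D) are assumptions for illustration; the actual formulation is in the paper and the InLine repository:

```python
import numpy as np

def subtractive_attention(Q, K, V):
    # Illustrative subtractive normalization: w_i = s_i - mean(s) + 1/N,
    # so each row of weights sums to 1 without any division.
    N = K.shape[0]
    S = Q @ K.T  # dense (N, N) form for clarity; the subtraction admits
                 # a linear-time reordering, unlike Softmax normalization
    W = S - S.mean(axis=-1, keepdims=True) + 1.0 / N
    return W @ V

def local_mlp_residual(X, W1, b1, W2, b2):
    # Hypothetical local term: an MLP over each token and its two
    # neighbors (window of 3), standing in for the paper's local residual.
    Xpad = np.pad(X, ((1, 1), (0, 0)))
    windows = np.concatenate([Xpad[:-2], Xpad[1:-1], Xpad[2:]], axis=-1)  # (N, 3d)
    return np.maximum(windows @ W1 + b1, 0.0) @ W2 + b2

N, d, h = 6, 4, 8
rng = np.random.default_rng(2)
X = rng.normal(size=(N, d))
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
W1, b1 = rng.normal(size=(3 * d, h)) * 0.1, np.zeros(h)
W2, b2 = rng.normal(size=(h, d)) * 0.1, np.zeros(d)

out = subtractive_attention(Q, K, V) + local_mlp_residual(X, W1, b1, W2, b2)
print(out.shape)  # (6, 4)
```

Note that subtracting the mean does not cancel under query scaling the way division does, which is the intuition for why the subtractive form can restore injectivity.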

The paper asserts that with the introduction of injective properties and enhanced local modeling, linear attention can not only rival Softmax in efficacy but also outperform it in computationally intensive settings due to its reduced complexity. This has far-reaching implications for deploying Vision Transformers in high-resolution environments, where computational resources are a limiting factor.

Future research prompted by these findings may involve the exploration of even more efficient attention mechanisms that integrate the injective properties and local modeling capabilities. It could entail leveraging advanced kernel functions, integrating with existing state-of-the-art attention configurations, or expanding these concepts into multimodal applications.

In summary, this paper provides a significant analytical perspective on the core characteristics limiting the performance of linear attention. It proposes a novel approach that allows linear attention to leverage its computational advantages without sacrificing expressiveness, thus bridging the performance gap with Softmax in a meaningful way.
