On the Integration of Self-Attention and Convolution

Published 29 Nov 2021 in cs.CV (arXiv:2111.14556v2)

Abstract: Convolution and self-attention are two powerful techniques for representation learning, and they are usually considered peer approaches that are distinct from each other. In this paper, we show that there exists a strong underlying relation between them, in the sense that the bulk of computation in these two paradigms is in fact done with the same operation. Specifically, we first show that a traditional convolution with kernel size k x k can be decomposed into k^2 individual 1x1 convolutions, followed by shift and summation operations. Then, we interpret the projections of queries, keys, and values in the self-attention module as multiple 1x1 convolutions, followed by the computation of attention weights and aggregation of the values. Therefore, the first stage of both modules comprises the same operation. More importantly, the first stage accounts for the dominant computational complexity (quadratic in the channel size) compared to the second stage. This observation naturally leads to an elegant integration of these two seemingly distinct paradigms: a mixed model that enjoys the benefits of both Self-Attention and Convolution (ACmix), while incurring minimal computational overhead compared to a pure convolution or self-attention counterpart. Extensive experiments show that our model achieves consistently improved results over competitive baselines on image recognition and downstream tasks. Code and pre-trained models will be released at https://github.com/LeapLabTHU/ACmix and https://gitee.com/mindspore/models.

Citations (241)

Summary

  • The paper introduces ACmix, a novel model that leverages shared 1x1 convolution operations to integrate self-attention and convolution efficiently.
  • Extensive experiments demonstrate that ACmix achieves improved accuracy on image recognition tasks while reducing computational overhead.
  • The study offers new theoretical insights into unified model architectures and paves the way for efficient implementations in resource-constrained environments.

The paper "On the Integration of Self-Attention and Convolution" explores the convergence of two fundamental paradigms in representation learning: self-attention and convolution. These techniques are pivotal in contemporary AI, particularly in tasks involving image and feature processing. The authors reveal that while traditionally considered distinct, convolution and self-attention share a core computational operation, which can be leveraged to create a mixed model with reduced computational cost.

Core Contributions

  1. Relationship Between Convolution and Self-Attention:
    • The paper shows that the fundamental operations in these two paradigms reduce to the same primitive: 1x1 convolutions. A k x k convolution can be decomposed into k^2 individual 1x1 convolutions, followed by shift and summation operations. Similarly, self-attention uses 1x1 convolutions to project queries, keys, and values before computing attention weights and aggregating the values.
  2. ACmix:
    • Based on the relationship between convolution and self-attention, the authors propose a hybrid model named ACmix. This model integrates the strengths of both paradigms with minimal computational overhead compared to using either of these methods in isolation.
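The decomposition in point 1 can be checked directly. Below is a minimal NumPy sketch, not the authors' code: the shapes, the zero-padded "same" convolution, and all function names are our assumptions. It confirms that a k x k convolution equals the sum of k^2 spatially shifted 1x1 convolutions.

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same' 2D convolution (cross-correlation) with zero padding.
    x: (C_in, H, W) feature map; w: (C_out, C_in, k, k) kernel."""
    C_out, C_in, k, _ = w.shape
    _, H, W = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C_out, H, W))
    for i in range(H):
        for j in range(W):
            # contract the (C_in, k, k) patch with every output filter
            out[:, i, j] = np.tensordot(w, xp[:, i:i + k, j:j + k], axes=3)
    return out

def shift(f, di, dj):
    """Spatial shift with zero fill: out[:, i, j] = f[:, i + di, j + dj]."""
    out = np.zeros_like(f)
    _, H, W = f.shape
    dst_i = slice(max(-di, 0), H - max(di, 0))
    src_i = slice(max(di, 0), H - max(-di, 0))
    dst_j = slice(max(-dj, 0), W - max(dj, 0))
    src_j = slice(max(dj, 0), W - max(-dj, 0))
    out[:, dst_i, dst_j] = f[:, src_i, src_j]
    return out

def conv_as_shifted_1x1(x, w):
    """Stage I: k^2 separate 1x1 convolutions (pure channel projections).
    Stage II: shift each projected map by its kernel offset, then sum."""
    C_out, C_in, k, _ = w.shape
    pad = k // 2
    out = np.zeros((C_out,) + x.shape[1:])
    for p in range(k):
        for q in range(k):
            proj = np.tensordot(w[:, :, p, q], x, axes=1)  # 1x1 conv: (C_out, H, W)
            out += shift(proj, p - pad, q - pad)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))     # C_in = 4, 8x8 map
w = rng.standard_normal((5, 4, 3, 3))  # C_out = 5, k = 3
assert np.allclose(conv2d(x, w), conv_as_shifted_1x1(x, w))
```

Stage I here is exactly the kind of channel projection that self-attention performs for queries, keys, and values, which is what makes sharing it across the two paths in ACmix possible.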

Numerical and Empirical Results

  • ACmix reduces computational overhead by sharing the expensive 1x1 feature-projection stage between the convolution and self-attention paths, rather than computing it separately for each.
  • Extensive experiments on image recognition tasks demonstrate consistent improvements over existing baseline models, exhibiting higher accuracy with comparable or reduced complexity.
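The overhead claim can be made concrete with rough per-layer multiply-accumulate counts. This is illustrative arithmetic under assumed sizes (resolution, channel count, attention window), not figures reported in the paper:

```python
# Two-stage view of cost: Stage I (1x1 projections) scales with C^2,
# Stage II (aggregation) only with C * k^2, so Stage I dominates when C >> k^2.
H = W = 56   # assumed feature-map resolution
C = 256      # assumed channel count
k = 3        # kernel size / local attention window (assumed)

stage1_conv = H * W * (k * k) * C * C   # k^2 separate 1x1 convolutions
stage1_attn = H * W * 3 * C * C         # query / key / value projections
stage2_conv = H * W * C * k * k         # shift and summation
stage2_attn = 2 * H * W * C * k * k     # attention weights + weighted aggregation

# The convolution-path ratio is exactly C for these formulas.
print(stage1_conv // stage2_conv)       # 256
print(stage1_attn // stage2_attn)       # 42
```

Since Stage I dominates on both paths, sharing a single set of projections between them (as ACmix does) costs roughly as much as either path alone, rather than the sum of the two.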

Implications and Future Directions

The integration of self-attention and convolution offers both theoretical and practical implications. Theoretically, it provides a new lens to view the underlying operations of these paradigms, suggesting unified architectures for future AI models. Practically, it reduces computational demands, making it feasible to deploy efficient models in resource-constrained environments.

Future developments could explore further optimizations in combining these paradigms, possibly incorporating additional operations or adaptations for specific tasks. Additionally, it would be worthwhile to examine how these insights might apply beyond vision tasks, potentially influencing model architectures in NLP or other domains.

In conclusion, the paper makes a significant contribution to understanding and combining two dominant paradigms in AI, fostering innovation in model architecture design and computational efficiency.
