
Centered Self-Attention Layers

Published 2 Jun 2023 in cs.LG (arXiv:2306.01610v1)

Abstract: The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied within deep learning architectures. We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers for different tokens in transformers and different nodes in graph neural networks. Based on our analysis, we present a correction term to the aggregating operator of these mechanisms. Empirically, this simple term eliminates much of the oversmoothing problem in visual transformers, obtaining performance in weakly supervised segmentation that surpasses elaborate baseline methods that introduce multiple auxiliary networks and training phases. In graph neural networks, the correction term enables the training of very deep architectures more effectively than many recent solutions to the same problem.

Citations (4)

Summary

  • The paper demonstrates that adding a centering term prevents oversmoothing by shifting the softmax attention weights to sum to zero, leading to more distinct feature representations.
  • It employs simulations and real-world experiments in visual transformers and GNNs to show improvements in weakly supervised semantic segmentation and other tasks.
  • The method simplifies model architecture, outperforming advanced normalization techniques while enhancing interpretability and efficiency.

Introduction to Centered Self-Attention Layers

Transformers, known for their self-attention mechanisms, have become increasingly prevalent in NLP and computer vision. This article examines a phenomenon known as "oversmoothing" that occurs in these models: as data passes through multiple layers of a neural network, the representations of different parts of the input become increasingly similar and lose their distinguishing characteristics. The study proposes a simple adjustment to the self-attention layers that markedly alleviates this issue, particularly in visual transformers and graph neural networks (GNNs).

The Oversmoothing Challenge

The crux of the problem lies in the repeated application of attention. Because each attention matrix is row-stochastic (its rows are softmax-normalized and sum to one), the product of many such matrices converges to a rank-1 matrix as network depth increases. Representations therefore become overly similar, or "smooth", in deep architectures that stack attention layers, such as Transformers and GNNs. This effect is especially problematic in tasks like weakly supervised semantic segmentation (WSSS), where the model's ability to distinguish between different parts of an image is crucial.
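The collapse described here can be seen in a few lines of numpy: repeatedly applying random row-stochastic (softmax-normalized) attention matrices drives all token representations toward a common value. This is an illustrative sketch, not the paper's code; the random matrices, depth, and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, depth = 8, 100

def softmax_rows(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = rng.normal(size=(n_tokens, 4))  # initial token representations
for _ in range(depth):
    # a fresh random attention matrix: rows are softmax-normalized, so each sums to 1
    attn = softmax_rows(rng.normal(size=(n_tokens, n_tokens)))
    x = attn @ x  # pure aggregation step, no correction term

# all rows of x are now nearly identical: the oversmoothed regime
spread = np.abs(x - x.mean(axis=0)).max()
```

Each multiplication by a row-stochastic matrix is a weighted averaging step, so the product contracts toward a rank-1 averaging operator and `spread` shrinks toward zero as depth grows.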

The Proposed Solution

In response to this challenge, the researchers introduce a centering term within the attention mechanism, which shifts the softmax attention weights so that each row sums to zero instead of one. Through a series of simulations and experiments, the study shows that this relatively simple change prevents the oversmoothing effect, yielding more expressive and diverse representations. For WSSS in particular, the approach achieved superior performance with a far less complex framework than competing sophisticated methods.
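One minimal way to realize such a correction, assuming it amounts to subtracting the uniform weight 1/n from each softmax row so that rows sum to zero rather than one, is sketched below. The function names and shapes are illustrative; this is not the authors' implementation.

```python
import numpy as np

def softmax_rows(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def centered_attention(q, k, v):
    """Self-attention whose softmax rows are shifted to sum to 0 instead of 1."""
    n, d = q.shape
    attn = softmax_rows(q @ k.T / np.sqrt(d))  # standard scaled dot-product weights
    attn_centered = attn - 1.0 / n             # the centering correction term
    return attn_centered @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
out = centered_attention(q, k, v)
```

Because each centered row sums to zero, the constant (averaging) direction is annihilated rather than amplified when layers are stacked, which is what blocks convergence to the rank-1 regime.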

Practical Implications and Results

To validate the approach, experiments were conducted on both synthetic simulations and real-world applications in WSSS and GNNs. The results were promising: the new method outperformed existing solutions that rely on more complex strategies or additional network components. For GNNs, adding the correction term to a conventional model improved performance, surpassing even recent advanced normalization techniques. These findings highlight the practicality and efficiency of centered self-attention layers for mitigating oversmoothing, simplifying the architecture while enhancing performance.
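The same idea carries over to message passing in GNNs. The sketch below (again illustrative numpy under the same assumption about the correction term, not the paper's code) stacks many residual mean-aggregation layers on a ring graph: the plain propagation matrix collapses node features toward a common value, while the centered variant keeps them distinct.

```python
import numpy as np

n_nodes, depth = 8, 80
rng = np.random.default_rng(1)

# ring graph with self-loops, so the graph is connected
adj = np.eye(n_nodes)
for i in range(n_nodes):
    adj[i, (i + 1) % n_nodes] = adj[(i + 1) % n_nodes, i] = 1.0
prop = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic propagation matrix
prop_centered = prop - 1.0 / n_nodes         # rows sum to 0: the correction term

def relative_spread(h):
    """How distinct the node features are, relative to their overall scale."""
    return np.abs(h - h.mean(axis=0)).max() / np.abs(h).max()

h0 = rng.normal(size=(n_nodes, 4))
h_plain, h_cent = h0.copy(), h0.copy()
for _ in range(depth):
    h_plain = h_plain + prop @ h_plain          # deep residual GNN: oversmooths
    h_cent = h_cent + prop_centered @ h_cent    # centered variant: stays diverse
```

In the plain stack the constant direction grows fastest, so node features become nearly identical relative to their scale; in the centered stack the constant direction receives no amplification, so differences between nodes survive arbitrary depth.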

Future Prospects and Summary

The implications of this novel approach extend beyond just improving current Transformer and GNN models. This study opens up new avenues for future research aimed at understanding the full effects of centering terms on a network's inductive biases and training dynamics. Essentially, this breakthrough simplifies the architecture of neural networks, focusing on a foundational improvement that boosts the interpretability and expressiveness of deep learning models in tasks requiring fine-grained attention to detail.


