Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks

Published 26 Oct 2023 in cs.LG (arXiv:2310.17683v1)

Abstract: As one of the most popular neural network modules, Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing. The effectiveness of the Transformer is often attributed to its multi-head attention (MHA) mechanism. In this study, we discuss the limitations of MHA, including the high computational complexity due to its "query-key-value" architecture and the numerical issue caused by its softmax operation. Considering the above problems and the recent development tendency of the attention layer, we propose an effective and efficient surrogate of the Transformer, called Sliceformer. Our Sliceformer replaces the classic MHA mechanism with an extremely simple "slicing-sorting" operation, i.e., projecting inputs linearly to a latent space and sorting them along different feature dimensions (or equivalently, called channels). For each feature dimension, the sorting operation implicitly generates an attention map with sparse, full-rank, and doubly-stochastic structures. We consider different implementations of the slicing-sorting operation and analyze their impacts on the Sliceformer. We test the Sliceformer in the Long-Range Arena benchmark, image classification, text classification, and molecular property prediction, demonstrating its advantage in computational complexity and universal effectiveness in discriminative tasks. Our Sliceformer achieves comparable or better performance with lower memory cost and faster speed than the Transformer and its variants. Moreover, the experimental results reveal that applying our Sliceformer can empirically suppress the risk of mode collapse when representing data. The code is available at \url{https://github.com/SDS-Lab/sliceformer}.

Summary

  • The paper introduces Sliceformer, replacing multi-head attention with a slicing-sorting method that cuts complexity from O(DN²) to O(DN log N) and improves numerical stability.
  • It demonstrates improved performance across benchmarks such as LRA, image classification, text classification, and molecular property prediction with reduced runtime and memory usage.
  • The approach not only achieves competitive accuracy but also reduces model size, making it a compelling choice for discriminative tasks in diverse applications.

Introduction

The study titled "Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks" presents an innovative approach to improving Transformer-based models by introducing a new mechanism called Sliceformer. This surrogate model aims to replace the traditional Multi-head Attention (MHA) mechanism with a simple "slicing-sorting" operation, addressing key limitations in computational complexity and numerical instability associated with transformers. The Sliceformer showcases its effectiveness across various tasks, including the Long-Range Arena (LRA) benchmark, image classification, text classification, and molecular property prediction.

Limitations of Multi-head Attention and the Slicing-Sorting Solution

The traditional MHA mechanism in transformers is well-known for its effectiveness in capturing dependencies within sequences, but it suffers from significant drawbacks:

  1. High Computational Complexity: The quadratic complexity arises from the "query-key-value" (QKV) architecture, which poses a bottleneck in processing long sequences.
  2. Numerical Instability: The softmax operation at the core of the mechanism can produce overly smooth, near-uniform attention maps, especially when handling long sequences.
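The quadratic cost is visible directly in a minimal implementation of scaled dot-product attention. The sketch below is the standard formulation, not the paper's code; it materializes the full N × N score matrix that the QKV architecture requires.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Scaled dot-product attention. The full N x N score matrix
    is what makes the cost quadratic in sequence length N."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # standard stability shift
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each row sums to 1
    return A @ V                                  # (N, d) weighted values
```

Note that when the scores are nearly equal, every row of A approaches the uniform distribution, which is the over-smoothing behavior described above.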

In the Sliceformer, these issues are mitigated by replacing the MHA with a lightweight slicing-sorting operation:

  • Slicing-Sorting Operation: Inputs are linearly projected into a latent space and then sorted along each feature dimension, implicitly generating sparse, full-rank, doubly stochastic attention maps. This reduces the computational complexity from the O(DN²) of MHA to O(DN log N), where N is the sequence length and D is the number of feature dimensions.

    Figure 1: Comparison of Sliceformer and various Transformer variants on the LRA benchmark, showing training efficiency (steps per second).
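The slicing-sorting step itself is tiny. A minimal NumPy sketch follows; the projection matrix `W` here is a stand-in for the learned linear ("slicing") layer, and the function names are illustrative rather than the repository's API.

```python
import numpy as np

def slicing_sorting(X, W):
    """Slicing-sorting surrogate for MHA (sketch).

    X: (N, d) sequence of N token embeddings.
    W: (d, D) projection ("slicing") matrix.
    Returns (N, D): the projected sequence with every feature
    dimension (channel) sorted independently.
    """
    V = X @ W                  # linear projection to the latent space
    return np.sort(V, axis=0)  # per-channel sort: O(D N log N)
```

Each output channel is a sorted copy of the corresponding projected channel, so the operation is a channel-wise permutation of the values rather than a weighted mixture of them.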

The slicing-sorting operation balances computational efficiency and model performance, avoiding numerical instability even with long sequences. This makes Sliceformer a compelling choice for discriminative tasks where such sequence and performance constraints are pivotal.
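For a single channel, the sort is equivalent to multiplying by a permutation matrix, which is exactly the sparse, full-rank, doubly stochastic structure of the implicit attention map. A small sketch (the function name is illustrative):

```python
import numpy as np

def implicit_attention_map(v):
    """Permutation matrix P with P @ v == np.sort(v).
    P has exactly one 1 per row and per column, so it is
    sparse, full-rank, and doubly stochastic."""
    N = len(v)
    P = np.zeros((N, N))
    P[np.arange(N), np.argsort(v)] = 1.0
    return P
```

Because each row and each column of P sums to one, P is doubly stochastic; because it is a permutation, it is also full-rank and maximally sparse, in contrast to the dense, typically low-rank maps produced by softmax attention.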

Experimental Evaluation and Applications

Long-Range Arena (LRA) Benchmark

The Sliceformer was evaluated on the LRA benchmark, specifically designed for assessing models on long-sequence tasks. The results highlight its superior computational efficiency and competitive performance compared to contemporary Transformer variants. On tasks like CIFAR-10 and Pathfinder, Sliceformer demonstrated markedly improved accuracy with reduced runtime and memory footprint.

Figure 2: Visual representation of LRA benchmark task efficiencies with varying sequence lengths.

Image and Text Classification

For image classification, the Sliceformer was adapted to vision tasks by replacing ViT's MHA layers with slicing-sorting. This adaptation not only reduced model size significantly but also improved top-1 classification accuracy on datasets such as CIFAR-10 and CIFAR-100.
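As an illustration of the drop-in replacement, a hypothetical encoder block with the attention sublayer swapped for slicing-sorting might look like the sketch below. The weight names, post-norm placement, and ReLU feed-forward are assumptions for the sake of a self-contained example, not the paper's exact architecture.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token embedding to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sliceformer_block(X, W_proj, W1, W2):
    """Encoder block with MHA replaced by slicing-sorting (sketch).
    W_proj must be (d, d) so the residual connection lines up."""
    H = layer_norm(X + np.sort(X @ W_proj, axis=0))  # sorting sublayer
    F = np.maximum(H @ W1, 0.0) @ W2                 # ReLU feed-forward
    return layer_norm(H + F)                         # residual + norm
```

The rest of the block (residuals, layer norms, feed-forward network) is untouched, which is what makes the substitution a one-line change per layer in a standard Transformer encoder.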

Similarly, in text classification with the IMDB dataset, the Sliceformer achieved higher accuracy with a smaller model footprint than traditional transformers, reinforcing its adaptability and efficiency in diverse applications.

Molecular Property Prediction

When the slicing-sorting mechanism is introduced into graph-based tasks, the Sliceformer shows promise on molecular datasets, achieving performance comparable to Graphormer while reducing model size by approximately 30%.

Conclusion and Future Directions

The Sliceformer presents a significant step toward efficient transformer models by addressing MHA complexities with a simple yet effective slicing-sorting operation. While its current implementation excels in discriminative tasks, future research could explore extending its application to generative tasks, enhancing its model capacity, and validating its potential in broader AI contexts. The development of differentiable slicing-sorting operations and their applications in AI for drug discovery and structured data modeling remains an exciting direction for future exploration.
