Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

Published 23 Jun 2021 in cs.CV (arXiv:2106.12368v1)

Abstract: In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without the dependence on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaling up to 88M, it attains 83.2% top-1 accuracy. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. Code is available at https://github.com/Andrew-Qibin/VisionPermutator.

Citations (189)

Summary

  • The paper presents a novel architecture that encodes spatial information by projecting features separately along height and width.
  • It achieves 81.5% top-1 accuracy on ImageNet with only 25M parameters and scales to 83.2% top-1 accuracy at 88M parameters, without extra large-scale training data.
  • This work challenges traditional convolution and attention methods, opening new avenues for efficient visual recognition and broader computer vision applications.

Overview of "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition"

The paper "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition" proposes an innovative design for multi-layer perceptron (MLP)-like architectures for visual recognition tasks. Recent explorations into pure MLP networks have demonstrated competitive performance on image classification, especially when trained on large datasets. This work advances the domain with the Vision Permutator, a model that efficiently encodes spatial information without the spatial convolutions or attention mechanisms used in conventional CNNs and Vision Transformers.

Key Contributions

The Vision Permutator introduces a novel mechanism for encoding spatial information by maintaining the inherent two-dimensional structure of input data. This is in contrast to most existing MLP-based models that flatten these dimensions, thereby losing valuable positional information. In the Vision Permutator architecture, feature representations are encoded separately along the height and width dimensions through linear projections. This architectural choice enables the model to capture long-range dependencies efficiently in one spatial direction while preserving precise positional information in the other.
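The permute-and-project idea described above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the authors' implementation (which uses PyTorch and includes normalization, residual connections, and weighted branch fusion); for brevity it assumes square feature maps (H == W) and a channel dimension divisible by H, so each of the S channel segments has size N = H.

```python
import numpy as np

def permute_mlp(x, w_h, w_w, w_c, w_p):
    """Sketch of a Permute-MLP block on a single feature map x of shape (H, W, C)."""
    H, W, C = x.shape
    S = C // H  # number of channel segments; each segment has dim N = H (assumes H == W)
    # Height branch: swap the height axis with the per-segment channel axis,
    # so a channel-wise linear projection mixes information along the height.
    xh = x.reshape(H, W, S, H)                 # split channels into S segments of size H
    xh = xh.transpose(3, 1, 2, 0)              # exchange height and segment-channel axes
    xh = xh.reshape(H, W, C) @ w_h             # linear projection over the permuted channels
    xh = xh.reshape(H, W, S, H).transpose(3, 1, 2, 0).reshape(H, W, C)  # permute back
    # Width branch: the same trick along the width dimension.
    xw = x.reshape(H, W, S, W)
    xw = xw.transpose(0, 3, 2, 1)
    xw = xw.reshape(H, W, C) @ w_w
    xw = xw.reshape(H, W, S, W).transpose(0, 3, 2, 1).reshape(H, W, C)
    # Channel branch: a plain channel MLP.
    xc = x @ w_c
    # Aggregate the position-sensitive branches and apply a final projection.
    return (xh + xw + xc) @ w_p
```

Because each branch only permutes axes before an ordinary linear layer, the height branch preserves exact positional information along the width (and vice versa), which is the complementary behavior the paper exploits.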

Empirically, the Vision Permutator achieves 81.5% top-1 accuracy on ImageNet using only 25 million parameters and no extra large-scale training data, surpassing many CNN and Vision Transformer models of comparable size. When scaled to 88 million parameters, it reaches 83.2% top-1 accuracy, demonstrating that the architecture scales effectively for improved accuracy.

Implications and Future Directions

The implications of the Vision Permutator are twofold. Practically, the architecture provides a more computationally efficient alternative to CNNs and Vision Transformers for visual recognition tasks by eliminating the need for spatial convolutions and attention mechanisms. Theoretically, this work challenges existing paradigms related to spatial information encoding, encouraging further exploration into new methodologies for neural network design in computer vision.

The Vision Permutator opens several avenues for future research. The architecture's effectiveness in handling various downstream tasks such as object detection and semantic segmentation is of particular interest. Furthermore, addressing the challenge of processing input images of arbitrary dimensions remains an open task and an exciting direction for future improvements in MLP-like model designs.

In summary, the Vision Permutator stands as a robust model that effectively marries the principles of MLP architectures with spatial information encoding, offering a compelling case for continued investigation and development within this domain of visual computing.
