
Vision GNN: An Image is Worth Graph of Nodes

Published 1 Jun 2022 in cs.CV | (2206.00272v3)

Abstract: Network architecture plays a key role in the deep learning-based computer vision system. The widely-used convolutional neural network and transformer treat the image as a grid or sequence structure, which is not flexible to capture irregular and complex objects. In this paper, we propose to represent the image as a graph structure and introduce a new Vision GNN (ViG) architecture to extract graph-level feature for visual tasks. We first split the image to a number of patches which are viewed as nodes, and construct a graph by connecting the nearest neighbors. Based on the graph representation of images, we build our ViG model to transform and exchange information among all the nodes. ViG consists of two basic modules: Grapher module with graph convolution for aggregating and updating graph information, and FFN module with two linear layers for node feature transformation. Both isotropic and pyramid architectures of ViG are built with different model sizes. Extensive experiments on image recognition and object detection tasks demonstrate the superiority of our ViG architecture. We hope this pioneering study of GNN on general visual tasks will provide useful inspiration and experience for future research. The PyTorch code is available at https://github.com/huawei-noah/Efficient-AI-Backbones and the MindSpore code is available at https://gitee.com/mindspore/models.

Citations (291)

Summary

  • The paper proposes a novel approach that models images as graphs of interconnected patches, offering flexibility beyond traditional grid-based models.
  • It utilizes a Grapher module for inter-node communication and a feed-forward network to maintain feature diversity across scales.
  • Experiments on ImageNet and COCO demonstrate that the pyramid ViG-S model achieves 82.1% top-1 accuracy, surpassing several state-of-the-art methods.

Vision GNN: An Image is Worth Graph of Nodes

The paper "Vision GNN: An Image is Worth Graph of Nodes" presents a novel approach to computer vision, leveraging Graph Neural Networks (GNNs) to represent and process image data more effectively. It addresses a limitation of traditional convolutional neural networks (CNNs) and transformer architectures, which treat images as rigid grids or sequences. By proposing a graph-based representation, the authors introduce a flexible, adaptive framework that better captures irregular and complex objects in visual tasks.

Core Concept and Architecture

The central idea of the Vision GNN (ViG) is to represent an image as a graph where each node corresponds to a patch of the image. These nodes are interconnected based on similarity, forming a graph structure. This approach allows the model to naturally account for the non-uniform shapes and complex structures of objects within images.
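The graph construction step described above can be sketched as a k-nearest-neighbor search over patch features. The snippet below is a minimal illustrative implementation (the paper's dilated-KNN variant and learned feature transforms are omitted); the function name and the choice of plain numpy are assumptions for clarity.

```python
import numpy as np

def build_knn_graph(node_features, k=9):
    """Connect each image patch (node) to its k nearest neighbors in
    feature space, as in ViG's graph construction.

    node_features: (N, D) array, one row per patch embedding.
    Returns an (N, k) index array: row i lists the neighbors of node i.
    """
    # Pairwise squared Euclidean distances between all node features.
    sq_norms = (node_features ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * node_features @ node_features.T
    np.fill_diagonal(dists, np.inf)  # exclude self-loops
    # Indices of the k smallest distances per row.
    return np.argsort(dists, axis=1)[:, :k]

# Example: 16 patches with 8-dim features, 4 neighbors each.
feats = np.random.default_rng(0).normal(size=(16, 8))
edges = build_knn_graph(feats, k=4)
print(edges.shape)  # (16, 4)
```

Because neighbors are chosen by feature similarity rather than spatial adjacency, distant patches belonging to the same object can be connected directly.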

The ViG architecture consists of two primary modules:

  1. Grapher Module: Utilizes graph convolution to aggregate and update node information. It facilitates inter-node communication and is enhanced by multi-head update operations to transform node features across multiple subspaces.
  2. FFN Module (Feed-Forward Network): Aids in feature transformation and counteracts the over-smoothing problem often encountered in conventional GNNs, thereby maintaining feature diversity among nodes.
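A single ViG block combining the two modules can be sketched as follows. The max-relative graph convolution shown here is one of the aggregation variants the paper evaluates; the weight shapes, ReLU in place of the paper's GELU, and the omission of normalization layers are simplifications assumed for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_relative_conv(x, neighbors, W):
    """Simplified max-relative graph convolution: for each node, take the
    element-wise max over neighbor differences, concatenate with the node's
    own feature, and project back to D dimensions.
    x: (N, D); neighbors: (N, k) index array; W: (2D, D) projection."""
    diff = x[neighbors] - x[:, None, :]          # (N, k, D) relative features
    agg = diff.max(axis=1)                       # (N, D) max aggregation
    return np.concatenate([x, agg], axis=1) @ W  # (N, D)

def ffn(x, W1, W2):
    """Two-layer feed-forward network applied per node; in ViG this
    counteracts over-smoothing by re-diversifying node features."""
    return np.maximum(x @ W1, 0.0) @ W2

N, D, k = 16, 8, 4
x = rng.normal(size=(N, D))
neighbors = rng.integers(0, N, size=(N, k))
W = rng.normal(size=(2 * D, D)) * 0.1
W1 = rng.normal(size=(D, 4 * D)) * 0.1
W2 = rng.normal(size=(4 * D, D)) * 0.1

# One ViG block: Grapher (with residual) followed by FFN (with residual).
h = x + max_relative_conv(x, neighbors, W)
out = h + ffn(h, W1, W2)
print(out.shape)  # (16, 8)
```

The residual connections around both modules mirror the transformer-style block structure the paper adopts.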

Two architectures are proposed for ViG: isotropic, where feature dimensions remain constant throughout, and pyramid, which progressively reduces spatial dimensions to capture multi-scale features. Both architectures demonstrate the flexibility of ViG across diverse model sizes.
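The contrast between the two designs can be illustrated with hypothetical stage layouts (the numbers below are illustrative only; the exact channel counts, depths, and resolutions differ per ViG model size):

```python
# Isotropic: constant width and token count at every block,
# like ViT; Pyramid: width grows as spatial resolution shrinks, like CNNs.
isotropic = {
    "channels": [192] * 12,            # same feature dimension throughout
    "tokens":   [196] * 12,            # e.g. 14x14 patches at every block
}
pyramid = {
    "channels": [64, 128, 256, 512],   # width doubles per stage
    "tokens":   [3136, 784, 196, 49],  # 56^2 -> 28^2 -> 14^2 -> 7^2 patches
}

for name, cfg in (("isotropic", isotropic), ("pyramid", pyramid)):
    print(name, "final tokens:", cfg["tokens"][-1])
```

The pyramid variant's shrinking token count yields multi-scale features suited to dense prediction tasks such as detection.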

Experimental Validation

The efficacy of the ViG architecture is evaluated on standard benchmarks: ImageNet for image classification and COCO for object detection. The results show that ViG achieves performance competitive with state-of-the-art CNN, transformer, and MLP-based models, and surpasses them in certain configurations.

Notably, the pyramid ViG-S model achieves a top-1 accuracy of 82.1% on ImageNet classification, surpassing models such as ResNet, CycleMLP, and Swin Transformer at similar computational cost. This indicates the potential of graph-based models to outperform conventional architectures on large-scale vision tasks.

Theoretical and Practical Implications

By incorporating graph structures, this work theoretically expands the versatility of neural networks in processing data beyond regular grids or sequences. Practically, it underscores the potential applications of GNNs in visual recognition tasks, offering a promising direction for future research.

The graph-based approach is particularly advantageous for applications involving irregular data structures, where traditional methods might suffer from redundancies and inflexibility. Furthermore, this research paves the way for potential extensions into other domains where data can benefit from a graph representation.

Conclusion and Future Directions

The Vision GNN framework introduced in this paper highlights the profound capabilities of using graph representations for image data, showcasing an innovative alternative to longstanding vision models. The promising results set a foundation for further exploration into more sophisticated graph-based architectures that can harness the full potential of image data. Future research could explore optimizations of graph construction techniques, enhancements in graph convolution operations, and applications across various visual tasks, potentially extending beyond structured datasets.
