ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Abstract: Although the Vision Transformer (ViT) has achieved significant success in computer vision, it performs poorly on dense prediction tasks due to its lack of inner-patch information interaction and its limited diversity of feature scales. Most existing studies design vision-specific transformers to address these problems, which introduces additional pre-training costs. We therefore present a plain, pre-training-free, feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer features. Compared to the state of the art, ViT-CoMer has the following advantages: (1) We inject spatial-pyramid, multi-receptive-field convolutional features into the ViT architecture, which effectively alleviates ViT's limited local information interaction and single-scale feature representation. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which benefits dense prediction tasks. (3) We evaluate ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training strategies. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks and facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.
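The two core ideas in the abstract, aggregating convolutional features over several receptive fields and exchanging information in both directions between the CNN branch and the ViT branch, can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: the function names, the 1-D token sequence standing in for a 2-D feature map, and the plain averaging standing in for learned convolutions are illustrative assumptions, not the paper's implementation.

```python
def local_average(tokens, k):
    """Average each token with its k-neighborhood (a toy stand-in for a k x k conv)."""
    n, r, out = len(tokens), k // 2, []
    for i in range(n):
        window = tokens[max(0, i - r): min(n, i + r + 1)]
        out.append(sum(window) / len(window))
    return out

def pyramid_features(tokens, kernel_sizes=(3, 5, 7)):
    """Spatial-pyramid-style branch: combine several receptive fields by averaging."""
    branches = [local_average(tokens, k) for k in kernel_sizes]
    return [sum(vals) / len(branches) for vals in zip(*branches)]

def bidirectional_fuse(vit_feat, cnn_feat, alpha=0.5):
    """Simplified two-way exchange: each stream absorbs a fraction of the other."""
    vit_out = [v + alpha * c for v, c in zip(vit_feat, cnn_feat)]
    cnn_out = [c + alpha * v for v, c in zip(vit_feat, cnn_feat)]
    return vit_out, cnn_out

# Toy 1-D "feature map" of five token activations.
tokens = [0.0, 1.0, 2.0, 3.0, 4.0]
cnn_branch = pyramid_features(tokens)              # multi-receptive-field features
vit_fused, cnn_fused = bidirectional_fuse(tokens, cnn_branch)
```

In the actual architecture these operations would be learned 2-D convolutions and attention-based fusion applied at multiple feature scales; the sketch only shows the data flow: a pyramid of receptive fields feeding a fusion step that updates both streams.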