ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Abstract: Although the Vision Transformer (ViT) has achieved significant success in computer vision, it performs poorly on dense prediction tasks due to its lack of inner-patch information interaction and its limited diversity of feature scales. Most existing studies design vision-specific transformers to address these problems, which introduces additional pre-training costs. We therefore present a plain, pre-training-free, feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer features. Compared to the state of the art, ViT-CoMer has the following advantages: (1) We inject spatial-pyramid, multi-receptive-field convolutional features into the ViT architecture, which effectively alleviates ViT's limited local information interaction and single-scale feature representation. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which benefits dense prediction tasks. (3) We evaluate ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training strategies. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks and facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.
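The two core ideas in the abstract, aggregating convolutional features over several receptive fields and exchanging information in both directions between the CNN branch and the ViT branch, can be illustrated with a deliberately simplified sketch. Everything below is hypothetical: the function names, the 1-D token sequence standing in for a 2-D feature map, and the plain averaging standing in for learned convolutions are illustrative assumptions, not the paper's implementation.

```python
def local_average(tokens, k):
    """Average each token with its k-neighborhood (a toy stand-in for a k x k conv)."""
    n, r, out = len(tokens), k // 2, []
    for i in range(n):
        window = tokens[max(0, i - r): min(n, i + r + 1)]
        out.append(sum(window) / len(window))
    return out

def pyramid_features(tokens, kernel_sizes=(3, 5, 7)):
    """Spatial-pyramid-style branch: combine several receptive fields by averaging."""
    branches = [local_average(tokens, k) for k in kernel_sizes]
    return [sum(vals) / len(branches) for vals in zip(*branches)]

def bidirectional_fuse(vit_feat, cnn_feat, alpha=0.5):
    """Simplified two-way exchange: each stream absorbs a fraction of the other."""
    vit_out = [v + alpha * c for v, c in zip(vit_feat, cnn_feat)]
    cnn_out = [c + alpha * v for v, c in zip(vit_feat, cnn_feat)]
    return vit_out, cnn_out

# Toy 1-D "feature map" of five token activations.
tokens = [0.0, 1.0, 2.0, 3.0, 4.0]
cnn_branch = pyramid_features(tokens)              # multi-receptive-field features
vit_fused, cnn_fused = bidirectional_fuse(tokens, cnn_branch)
```

In the actual architecture these operations would be learned 2-D convolutions and attention-based fusion applied at multiple feature scales; the sketch only shows the data flow: a pyramid of receptive fields feeding a fusion step that updates both streams.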