ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation
Abstract: Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler-and-efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from image to pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple-but-effective designs and figure out that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP achieves a speedup of about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
- Exploiting a joint embedding space for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9536–9545, 2021.
- Language models are few-shot learners. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
- Zero-shot semantic segmentation. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 32, 2019.
- Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
- Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2017.
- Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
- Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2613–2622, 2021.
- Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1290–1299, 2022.
- Per-pixel classification is not all you need for semantic segmentation. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 34:17864–17875, 2021.
- SIGN: Spatial-information incorporated generative network for generalized zero-shot semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9556–9566, 2021.
- MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
- Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11583–11592, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Zero-shot out-of-distribution detection based on the pretrained model clip. In Proceedings of the AAAI conference on artificial intelligence (AAAI), 2022.
- Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9716–9725, 2021.
- Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12094–12103, 2022.
- Context-aware feature generation for zero-shot semantic segmentation. In Proceedings of the ACM International Conference on Multimedia (ACMMM), pages 1921–1929, 2020.
- On pre-trained image features and synthetic images for deep learning. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
- Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4904–4916, 2021.
- Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
- Finetuning pretrained vision-language models with correlation information bottleneck for robust visual question answering. arXiv preprint arXiv:2209.06954, 2022.
- Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
- Dice loss for data-imbalanced nlp tasks. arXiv preprint arXiv:1911.02855, 2019.
- Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
- Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 82–92, 2019.
- P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the Association for Computational Linguistics (ACL), pages 61–68, 2022.
- Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2125–2134, 2021.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
- Video object segmentation with episodic graph memory networks. In Proceedings of the IEEE conference on European Conference on Computer Vision (ECCV), pages 661–679. Springer, 2020.
- Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8364–8375, 2022.
- V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
- Segmentation in style: Unsupervised semantic image segmentation with stylegan and clip. arXiv preprint arXiv:2107.12518, 2021.
- A closer look at self-training for zero-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2693–2702, 2021.
- Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021.
- Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763, 2021.
- Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18082–18091, 2022.
- U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241, 2015.
- Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7262–7272, 2021.
- Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
- Unsupervised semantic segmentation by contrasting object mask proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10052–10062, 2021.
- Bridging pre-trained models and downstream tasks for source code understanding. In Proceedings of the International Conference on Software Engineering (ICSE), pages 287–298, 2022.
- Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11686–11695, 2022.
- Semantic projection network for zero-and few-label semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8256–8265, 2019.
- Segformer: Simple and efficient design for semantic segmentation with transformers. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 34:12077–12090, 2021.
- Scale-aware graph neural network for few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5475–5484, 2021.
- Class-aware visual prompt tuning for vision-language pre-trained model. arXiv preprint arXiv:2208.08340, 2022.
- Leveraging auxiliary tasks with affinity learning for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6984–6993, 2021.
- A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model. arXiv preprint arXiv:2112.14757, 2021.
- CPT: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
- Segvit: Semantic segmentation with plain vision transformers. arXiv preprint arXiv:2210.05844, 2022.
- Context encoding for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7151–7160, 2018.
- Topformer: Token pyramid transformer for mobile semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12083–12093, 2022.
- Rethinking dice loss for medical image segmentation. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 851–860. IEEE, 2020.
- Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6881–6890, 2021.
- Extract free dense labels from clip. In Proceedings of the IEEE conference on European Conference on Computer Vision (ECCV), 2022.
- Learning to prompt for vision-language models. International Journal of Computer Vision (IJCV), 130(9):2337–2348, 2022.
- A unified efficient pyramid transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2667–2677, 2021.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.