Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
Abstract: Owing to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we observe that adversarial perturbations induce shifts in text-guided attention. Building on this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module. Our goal is to preserve the generalization of the CLIP model while enhancing its adversarial robustness. The Attention Refinement module aligns the text-guided attention obtained from the target model on adversarial examples with the text-guided attention obtained from the original model on clean examples; this alignment enhances the model's robustness. The Attention-based Model Constraint module, in turn, obtains text-guided attention from both the target and original models on clean examples; its objective is to maintain performance on clean samples while enhancing overall robustness. Experiments validate that our method yields a 9.58% improvement in zero-shot robust accuracy over current state-of-the-art techniques across 16 datasets. Our code is available at https://github.com/zhyblue424/TGA-ZSR.
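The two alignment objectives described in the abstract can be sketched as follows. This is a minimal illustration under assumptions: each "text-guided attention" map is taken to be a tensor of shape (batch, H, W) produced by a model for an image/text pair, and the alignment is written as a mean-squared distance; the paper's exact attention extraction and loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def attention_refinement_loss(attn_target_adv, attn_orig_clean):
    """Align the target model's attention on adversarial examples with the
    frozen original model's attention on clean examples (robustness term)."""
    return F.mse_loss(attn_target_adv, attn_orig_clean)

def attention_constraint_loss(attn_target_clean, attn_orig_clean):
    """Keep the target model's attention on clean examples close to the
    original model's, preserving clean-sample performance."""
    return F.mse_loss(attn_target_clean, attn_orig_clean)

# Toy usage with random stand-in attention maps (batch of 4, 7x7 grids).
b, h, w = 4, 7, 7
attn_target_adv = torch.rand(b, h, w)
attn_target_clean = torch.rand(b, h, w)
attn_orig_clean = torch.rand(b, h, w)

total_loss = (attention_refinement_loss(attn_target_adv, attn_orig_clean)
              + attention_constraint_loss(attn_target_clean, attn_orig_clean))
```

In practice these terms would be added to the standard adversarial fine-tuning objective, with the original CLIP model kept frozen as the reference.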