
MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models

Published 20 Jun 2023 in cs.CV and cs.CL | (2306.11400v2)

Abstract: Prompt tuning methods such as CoOp have recently shown promising visual recognition and transfer learning ability on various downstream tasks, following the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since a uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin, thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.
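The bi-directional prompt fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that the "transformative network" is a pair of linear projections per layer, that fusion is additive, and that the prompt counts and widths are small toy values. The names `init_linear` and `fuse_prompts` are hypothetical.

```python
import numpy as np


def init_linear(rng, d_in, d_out):
    """Return (weight, bias) for a hypothetical linear projection."""
    weight = rng.normal(0.0, 0.02, size=(d_in, d_out))
    bias = np.zeros(d_out)
    return weight, bias


def fuse_prompts(text_prompts, visual_prompts, t2v, v2t):
    """Assumed additive fusion: each modality's prompts receive a
    projection of the other modality's prompts."""
    w_tv, b_tv = t2v  # text -> vision projection
    w_vt, b_vt = v2t  # vision -> text projection
    fused_visual = visual_prompts + text_prompts @ w_tv + b_tv
    fused_text = text_prompts + visual_prompts @ w_vt + b_vt
    return fused_text, fused_visual


# Toy sizes, chosen only for illustration.
n_prompts, d_text, d_vis, depth = 4, 8, 12, 3
rng = np.random.default_rng(0)

# "Deep hierarchical" is read here as: one prompt pair and one
# projection pair for each of the first `depth` encoder layers.
fused_per_layer = []
for _ in range(depth):
    text_prompts = rng.normal(size=(n_prompts, d_text))
    visual_prompts = rng.normal(size=(n_prompts, d_vis))
    t2v = init_linear(rng, d_text, d_vis)
    v2t = init_linear(rng, d_vis, d_text)
    fused_per_layer.append(fuse_prompts(text_prompts, visual_prompts, t2v, v2t))

for layer, (ft, fv) in enumerate(fused_per_layer):
    print(layer, ft.shape, fv.shape)
```

In a real setup the prompts and projections would be trainable parameters, and the fused prompts would be injected into the frozen text and image encoders at the corresponding layers; the sketch only shows the shape bookkeeping of exchanging information in both directions.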

References (5)
  1. Lei Ba J., Swersky K. and Fidler S., “Predicting deep zero-shot convolutional neural networks using textual descriptions,” IEEE International Conference on Computer Vision. 2015, pp. 4247-4255.
  2. Wang Z., Yu J., Yu A. W., Dai Z., Tsvetkov Y. and Cao Y., “Simple visual language model pretraining with weak supervision,” International Conference on Learning Representations. 2022.
  3. Song H., Dong L., Zhang W., Liu T. and Wei F., “Clip models are few-shot learners: empirical studies on vqa and visual entailment,” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022, pp. 6088-6100.
  4. Zhou K., Yang J., Loy C. C. and Liu Z., “Learning to prompt for vision-language models,” International Journal of Computer Vision. 2022, 130(9) pp. 2337-2348.
  5. Zhou K., Yang J., Loy C. C. and Liu Z., “Conditional prompt learning for vision-language models,” IEEE Conference on Computer Vision and Pattern Recognition. 2022, pp. 16816-16825.