PromptKD: Unsupervised Prompt Distillation for Vision-Language Models
Abstract: Prompt learning has emerged as a valuable technique for enhancing vision-language models (VLMs) such as CLIP on downstream tasks in specific domains. Existing work mainly focuses on designing various forms of prompt learning, neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework, which transfers the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. Specifically, our framework consists of two distinct stages. In the initial stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we exploit CLIP's decoupled-modality architecture by computing and storing the text features as class vectors only once through the teacher text encoder. In the subsequent stage, these stored class vectors are shared by the teacher and student image encoders for computing the predicted logits. We then align the logits of the teacher and student models via KL divergence, encouraging the student image encoder to produce probability distributions similar to the teacher's through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data, enabling the algorithm to leverage a vast amount of unlabeled images within the domain. Finally, the well-trained student image encoder and pre-stored text features (class vectors) are utilized for inference. To the best of our knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical mechanism for pre-storing text features as class vectors shared between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
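The core training signal described above can be sketched in a few lines: both encoders score images against the same pre-stored class vectors, and the student is trained to match the teacher's softened class distribution via KL divergence. This is a minimal numpy sketch under assumed shapes and names (`prompt_distill_loss`, the temperature `tau`, and the cosine-logit formulation are illustrative, not the authors' exact implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_distill_loss(student_feats, teacher_feats, class_vectors, tau=4.0):
    """KL(teacher || student) over class distributions, where both sides
    score images against the same pre-stored text class vectors.

    student_feats: (B, D) image features from the student encoder
    teacher_feats: (B, D) image features from the teacher encoder
    class_vectors: (C, D) text features, computed once by the teacher text encoder
    tau:           softening temperature (illustrative value)
    """
    def l2norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    # CLIP-style cosine-similarity logits against the shared class vectors.
    s_logits = l2norm(student_feats) @ l2norm(class_vectors).T
    t_logits = l2norm(teacher_feats) @ l2norm(class_vectors).T

    p_t = softmax(t_logits / tau)
    p_s = softmax(s_logits / tau)

    # KL divergence of student from teacher, averaged over the batch.
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))
```

Because the class vectors are fixed after stage one, the loss requires no labels: only the student image encoder's prompts receive gradients, and inference reuses the same stored class vectors with the trained student encoder.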