Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models
Abstract: Vision-Language Models (VLMs), such as CLIP, play a foundational role in various cross-modal applications. To fully leverage VLMs' potential in adapting to downstream tasks, context optimization methods like Prompt Tuning are essential. However, one key limitation is the lack of diversity in prompt templates, whether they are hand-crafted or learned through additional modules. This limitation restricts the capabilities of pretrained VLMs and can result in incorrect predictions on downstream tasks. To address this challenge, we propose Context Optimization with Multi-Knowledge Representation (CoKnow), a framework that enhances Prompt Learning for VLMs with rich contextual knowledge. To facilitate CoKnow during inference, we trained lightweight semantic knowledge mappers, which generate Multi-Knowledge Representations for an input image without requiring additional priors. We conducted extensive experiments on 11 publicly available datasets, demonstrating that CoKnow outperforms a series of previous methods. We will make all resources open-source: https://github.com/EMZucas/CoKnow.
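The sketch below illustrates the general shape of the idea described in the abstract: a CoOp-style learnable prompt context augmented with a lightweight "semantic knowledge mapper" that turns an image embedding into several extra context vectors before CLIP-style similarity scoring. This is a hypothetical, simplified illustration under stated assumptions (stand-in frozen encoders, placeholder dimensions, mean-pooled prompts), not the authors' released implementation; see the linked repository for the actual code.

```python
# Minimal sketch (assumptions, not the authors' implementation): a CoOp-style
# learnable prompt plus a lightweight semantic knowledge mapper that converts
# an image embedding into K extra context vectors, loosely mirroring the
# Multi-Knowledge Representation idea. The "frozen encoders" here are random
# linear stand-ins for CLIP's image/text towers so the example runs end to end.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, CTX_DIM, N_CTX, N_KNOW, N_CLASSES = 512, 512, 16, 4, 10


class SemanticKnowledgeMapper(nn.Module):
    """Hypothetical lightweight MLP: image feature -> K knowledge context vectors."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(FEAT_DIM, FEAT_DIM // 2),
            nn.ReLU(inplace=True),
            nn.Linear(FEAT_DIM // 2, N_KNOW * CTX_DIM),
        )

    def forward(self, img_feat):                              # (B, FEAT_DIM)
        return self.mlp(img_feat).view(-1, N_KNOW, CTX_DIM)   # (B, K, CTX_DIM)


class CoKnowStyleClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen stand-ins for CLIP's encoders (placeholders, not real CLIP).
        self.image_encoder = nn.Linear(3 * 224 * 224, FEAT_DIM).requires_grad_(False)
        self.text_encoder = nn.Linear(CTX_DIM, FEAT_DIM).requires_grad_(False)
        # Learnable shared context tokens (CoOp-style) and class-token stubs.
        self.ctx = nn.Parameter(torch.randn(N_CTX, CTX_DIM) * 0.02)
        self.cls_emb = nn.Parameter(torch.randn(N_CLASSES, CTX_DIM) * 0.02)
        self.mapper = SemanticKnowledgeMapper()

    def forward(self, images):                                # (B, 3, 224, 224)
        img_feat = self.image_encoder(images.flatten(1))      # (B, FEAT_DIM)
        know_ctx = self.mapper(img_feat)                      # (B, K, CTX_DIM)
        # Pool learned context and image-conditioned knowledge context into one
        # prompt vector per class (a simplification of feeding token sequences
        # through a transformer text encoder).
        prompt = self.ctx.mean(0) + know_ctx.mean(1)          # (B, CTX_DIM)
        text_in = prompt.unsqueeze(1) + self.cls_emb.unsqueeze(0)  # (B, C, CTX_DIM)
        txt_feat = self.text_encoder(text_in)                 # (B, C, FEAT_DIM)
        # CLIP-style cosine-similarity logits.
        img_feat = F.normalize(img_feat, dim=-1)
        txt_feat = F.normalize(txt_feat, dim=-1)
        return 100.0 * torch.einsum("bd,bcd->bc", img_feat, txt_feat)


if __name__ == "__main__":
    model = CoKnowStyleClassifier()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 10])
```

Only the mapper, the context tokens, and the class stubs carry gradients here, which reflects the abstract's emphasis on lightweight, trainable components on top of a frozen VLM.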