VladVA: Discriminative Fine-tuning of LVLMs

Published 5 Dec 2024 in cs.CV and cs.AI | arXiv:2412.04378v3

Abstract: Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include (1) a carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components; (2) a parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters; (3) significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

Summary

  • The paper presents a novel framework that transforms generative LVLMs into discriminative models by integrating contrastive and next-token prediction losses.
  • The methodology employs parameter-efficient adaptation via soft prompting and LoRA adapters, trained on image-text pairs of varying length and granularity.
  • Results show absolute gains of 4.7%–7.0% on standard image-text retrieval benchmarks and up to 15% on compositionality benchmarks.

Discriminative Fine-tuning of LVLMs: Enhancing Vision-Language Capabilities

Summary

The paper "Discriminative Fine-tuning of LVLMs" explores the limitations of current Vision-LLMs (VLMs) like CLIP, which, despite exhibiting robust zero-shot abilities, often suffer from a "bag of words" behavior, lacking advanced language comprehension and compositional understanding. Furthermore, when these VLMs are integrated with LLMs to form LVLMs, their autoregressive nature renders them less efficient for discriminative tasks. This research proposes a novel methodology that transforms a generative LVLM into a more capable discriminative model, maintaining its rich compositional capabilities.

Methodology

The paper introduces a carefully designed training and optimization framework that combines the strengths of contrastive and autoregressive methods. The proposed framework, dubbed VladVA (Vision-Language Adaptation for Discriminative Visual Assistant), trains on image-text pairs that vary in length and granularity, optimizing both a contrastive loss and a next-token prediction loss (sketched below). Fine-tuning is made parameter-efficient through a combination of soft prompting and LoRA adapters.
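
To make the two-loss setup concrete, here is a minimal PyTorch-style sketch of how such a joint objective might be assembled, based only on the description above. The `lvlm` interface, the pooling into single embeddings, and the weighting term `lambda_gen` are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def combined_loss(lvlm, images, captions, lambda_gen=1.0, temperature=0.07):
    """Sketch of a joint contrastive + next-token objective.

    Assumes `lvlm` exposes these hypothetical helpers (not a real API):
      - encode_image(images)        -> (B, D) pooled image embeddings
      - encode_text(captions)       -> (B, D) pooled text embeddings
      - lm_logits(images, captions) -> (B, T, V) next-token logits
      - target_ids(captions)        -> (B, T) caption token ids
    """
    # --- Contrastive (InfoNCE-style) term over pooled embeddings ---
    img = F.normalize(lvlm.encode_image(images), dim=-1)    # (B, D)
    txt = F.normalize(lvlm.encode_text(captions), dim=-1)   # (B, D)
    logits = img @ txt.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # matched pairs on the diagonal
    loss_con = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2   # symmetric image<->text loss

    # --- Next-token prediction term on the caption tokens ---
    lm_logits = lvlm.lm_logits(images, captions)            # (B, T, V)
    ids = lvlm.target_ids(captions)                         # (B, T)
    # Shift so position t predicts token t+1, as in standard causal LM training.
    loss_gen = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )

    return loss_con + lambda_gen * loss_gen
```

How the variable-length pairs are routed to each loss, and how the two terms are weighted, are details of the paper not reproduced in this sketch.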

Key Contributions:

  • Data Utilization: The framework uses both short and long captions to build a diversified training set, mitigating the respective weaknesses of purely contrastive and purely autoregressive training.
  • Training Strategy: Combining a contrastive loss with a next-token prediction loss yields fine-tuning that retains the strengths of the original LVLM while adding discriminative capability.
  • Parameter-Efficient Adaptation: Soft prompting and LoRA adapters keep the approach efficient and scalable, which is crucial for large LVLMs (see the sketch after this list).
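
As a rough illustration of the adaptation recipe, the sketch below combines learnable soft-prompt embeddings with LoRA adapters via the Hugging Face `peft` library. The checkpoint name, target modules, prompt length, and learning rate are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; the paper's actual base LVLM may differ.
base = AutoModelForCausalLM.from_pretrained("example/base-lvlm")

# LoRA adapters on the attention projections (target modules are an assumption).
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg)  # freezes base weights, adds trainable adapters

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the input embeddings."""
    def __init__(self, n_tokens: int, hidden_size: int):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(n_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (B, T, H) -> (B, n_tokens + T, H)
        batch = input_embeds.size(0)
        prompt = self.embeddings.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

soft_prompt = SoftPrompt(n_tokens=8, hidden_size=base.config.hidden_size)

# Only the LoRA weights and soft-prompt vectors receive gradients;
# the base LVLM stays frozen, keeping the adaptation parameter-efficient.
trainable = [p for p in model.parameters() if p.requires_grad]
trainable += list(soft_prompt.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Training would then proceed with a combined objective like the one sketched earlier, passing the prompt-augmented embeddings through the adapted model.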

Results and Implications

The proposed method demonstrates significant improvements over state-of-the-art CLIP-like models on standard image-text retrieval benchmarks, with absolute gains ranging from 4.7% to 7.0%. It also shows notable advances on compositionality benchmarks, improving on competing models by up to 15%. The research further shows that, contrary to some recent findings, contrastive image-text fine-tuning can be beneficial when appropriately incorporated into the training regime of LVLMs.

The theoretical implications suggest a successful blend of generative and discriminative capabilities within a single unified LVLM architecture. Practically, this approach could herald improvements in applications requiring nuanced vision-language understanding, such as complex image retrieval and question-answering systems in multi-modal contexts.
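
As a usage-level illustration of the discriminative mode in a retrieval setting, the snippet below ranks a gallery of images against a text query by cosine similarity. It reuses the same hypothetical `encode_image`/`encode_text` helpers assumed in the earlier sketch; a fine-tuned discriminative LVLM would expose something equivalent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(lvlm, query_text, gallery_images, top_k=5):
    """Rank gallery images by cosine similarity to a text query."""
    q = F.normalize(lvlm.encode_text([query_text]), dim=-1)     # (1, D)
    g = F.normalize(lvlm.encode_image(gallery_images), dim=-1)  # (N, D)
    scores = (g @ q.t()).squeeze(-1)                            # (N,) cosine similarities
    return scores.topk(min(top_k, scores.numel()))              # top values + indices
```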

Future Directions

The paper leaves room for extending the methodology to larger and more varied datasets, incorporating more advanced forms of representation learning, and scaling model size efficiently. There is also potential to further refine the balance between parameter efficiency and model capacity, ensuring the adaptation techniques remain practical as model architectures evolve.

In summary, this work represents a significant step towards addressing the limitations of current LVLMs by enhancing their discriminative performance without sacrificing the compositional strengths provided by their generative underpinnings. The proposed framework could serve as a blueprint for future advancements in the integration of LLMs with vision tasks.
