
Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Published 24 May 2023 in cs.CV (arXiv:2305.15023v3)

Abstract: Recently, there has been growing interest in extending the multimodal capabilities of LLMs, e.g., to vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive: they not only optimize an excessive number of parameters, but also require another large-scale pre-training stage before VL instruction tuning. In this paper, we propose a novel and affordable solution for effective VL adaptation of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and the LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables joint optimization of the image encoder and the LLM. Meanwhile, MMA is equipped with a routing algorithm that helps the LLM shift automatically between single- and multi-modal instructions without compromising its natural language understanding. To validate MMA, we apply it to a recent LLM, LLaMA, and term the resulting large vision-language instructed model LaVIN. We conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate the competitive performance and superior training efficiency of LaVIN compared with existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual cost of LaVIN is extremely low, e.g., only 1.4 training hours with 3.8M trainable parameters, further confirming the effectiveness of MMA. Our project is released at https://luogen1996.github.io/lavin.


Summary

  • The paper introduces the Mixture-of-Modality Adaptation (MMA) that reduces parameters while enabling efficient vision-language instruction tuning.
  • It employs dynamic modality routing to effectively handle both unimodal and multimodal tasks by leveraging lightweight adapter modules.
  • Applying MMA to LLaMA creates the LaVIN model, which achieves 90.50% average accuracy on ScienceQA with significantly reduced training resources.

Mixture-of-Modality Adaptation for Vision-Language Instruction Tuning

The paper "Cheap and Quick: Efficient Vision-Language Instruction Tuning for LLMs" introduces an improved methodology for augmenting LLMs with multimodal capabilities, specifically focusing on vision-language (VL) tasks. The proposed approach, termed Mixture-of-Modality Adaptation (MMA), seeks to enhance training efficiency while maintaining NLP capabilities.

Core Innovations

  1. Mixture-of-Modality Adaptation (MMA): Unlike traditional methods that rely on large-scale VL pre-training, MMA employs lightweight adapter modules to bridge image encoders and LLMs. This reduces the parameter count significantly, facilitating a cost-efficient training process.
  2. Dynamic Adaptation via Modality Routing: A key feature of MMA is its routing mechanism, which dynamically selects pathways based on input modality. This ensures effective handling of both unimodal and multimodal instructions, preserving the inherent strengths of LLMs in NLP tasks.
  3. LaVIN Model: By applying MMA to LLaMA, the authors construct LaVIN, a model demonstrating competitive performance against existing multimodal LLMs. The architecture respects parameter-efficient tuning paradigms while achieving notable reasoning capabilities in diverse instruction-following tasks.
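The mechanism described above, lightweight modality-specific adapters combined with a soft router, can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions; the layer sizes, routing signal, and ReLU activation are placeholders for illustration, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(d_in, d_out):
    """Random weight/bias pair standing in for a trainable linear layer."""
    return rng.normal(scale=0.02, size=(d_in, d_out)), np.zeros(d_out)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, bottleneck = 64, 8                    # assumed toy dimensions
W_text, b_text = linear(d_model, bottleneck)   # unimodal (text) down-projection
W_vl, b_vl = linear(d_model, bottleneck)       # multimodal (VL) down-projection
W_up, b_up = linear(bottleneck, d_model)       # shared up-projection
W_route, b_route = linear(d_model, 2)          # router over the two paths

def mma_adapter(x):
    """x: (seq, d_model). The router emits soft weights deciding how much
    each modality-specific path contributes; the residual connection keeps
    the frozen LLM activations intact when the adapter contributes little."""
    w = softmax(x.mean(axis=0) @ W_route + b_route)    # (2,) routing weights
    h = w[0] * (x @ W_text + b_text) + w[1] * (x @ W_vl + b_vl)
    h = np.maximum(h, 0.0)                             # nonlinearity
    return x + h @ W_up + b_up                         # residual output

x = rng.normal(size=(16, d_model))                     # 16 token embeddings
y = mma_adapter(x)                                     # same shape as input
```

In the full model, such adapters would sit inside each frozen transformer block, with only the adapter and router weights updated during instruction tuning, which is what keeps the trainable parameter count so small.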

Experimental Validation

The paper provides comprehensive empirical evidence of LaVIN's efficiency and effectiveness:

  • ScienceQA Performance: LaVIN achieves strong accuracy on the ScienceQA benchmark, with measurable savings in training time and storage. For instance, LaVIN-13B reaches an average accuracy of 90.50%, rivaling advanced models such as LLaVA at substantially lower computational cost.
  • Zero-shot and Fine-tuning Results: On benchmarks such as TruthfulQA and MME, LaVIN's zero-shot performance demonstrates robust generalization. Its use of a pre-trained vision encoder yields improvements across tasks, including image captioning on COCO, without requiring extensive pre-training.

Implications and Future Directions

The MMA approach has significant implications for the development of resource-efficient multimodal LLMs. By avoiding large updates to the pre-trained models, MMA not only economizes training but also retains NLP proficiency. This positions it as a viable roadmap for scalable AI deployment in varied real-world applications.
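To see concretely why avoiding updates to the pre-trained backbone economizes training, it helps to count parameters. The figures below are illustrative assumptions (a 7B-class frozen backbone with 32 adapted layers and the toy adapter shape sketched earlier), chosen only to show how a few million trainable parameters amount to a tiny fraction of the model:

```python
# Back-of-the-envelope parameter budget: freeze the backbone, train adapters.
# All counts here are illustrative assumptions, not the paper's exact design.
d_model, bottleneck, n_layers = 4096, 8, 32

backbone_params = 7_000_000_000          # e.g. a 7B-parameter frozen LLM
per_adapter = (2 * d_model * bottleneck  # two modality-specific down-projections
               + bottleneck * d_model    # shared up-projection
               + d_model * 2)            # router
trainable = n_layers * per_adapter

print(f"trainable: {trainable / 1e6:.2f}M "
      f"({100 * trainable / backbone_params:.4f}% of backbone)")
# → trainable: 3.41M (0.0487% of backbone)
```

This rough count lands near the 3.8M trainable parameters reported for LaVIN, illustrating the scale of the savings when only adapters and routers are optimized.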

Future research could expand on dynamically adaptive architectures like MMA to encompass broader modalities and explore integration in more complex systems. Additionally, the paper hints at the potential for further reducing the model footprint without compromising performance, an area ripe for exploration given increasing demands for sustainable AI.

Conclusion

The paper successfully demonstrates that, through targeted architectural modifications, LLMs can be adapted to multimodal tasks in a resource-conscious manner. The proposed MMA methodology sets a benchmark for efficient model design, enabling LLMs to enter domains previously constrained by computational and financial limits. LaVIN itself serves as a testament to the method's effectiveness and its potential to influence future AI advances.
