Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

Published 21 Oct 2024 in cs.CV | (2410.16261v3)

Abstract: Multimodal LLMs (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs. Code is available at https://github.com/OpenGVLab/InternVL.

Abstract PDF HTML Upgrade to Chat

Citations (5)

View on Semantic Scholar

Summary

The paper introduces a miniature multimodal model that achieves 90% of large-model performance using just 5% of the parameters.
It employs a compact vision encoder distilled from a larger model and integrates with pre-trained LLMs via an MLP projector for visual instruction tuning.
The unified adaptation framework reformulates diverse visual tasks into a VQA format, enabling efficient domain transfer and specialized applications.

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

The paper "Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance" introduces a series of lightweight Multimodal LLMs (MLLMs) designated as Mini-InternVL. These models promise significant efficiency improvements, achieving approximately 90% of the performance of much larger counterparts using only 5% of the parameters. This breakthrough allows for enhanced application in various specialized domains including autonomous driving, medical imaging, and remote sensing.

Architecture of Mini-InternVL

Mini-InternVL incorporates an innovative architecture which prominently features InternViT-300M as a compact vision encoder. This vision encoder, distilled from a larger InternViT-6B model, acquires world knowledge through a unique knowledge distillation process.

Figure 1: Training method and architecture of Mini-InternVL. Left: We employ InternViT-6B as the teacher model to perform knowledge distillation on the student model.

Three versions of the model exist, scaling from 1 billion to 4 billion parameters. They integrate the vision encoder with various pre-trained LLMs, relying on an MLP projector for seamless integration. Training involves two main stages: language-image alignment, followed by visual instruction tuning, to endow Mini-InternVL with robust multimodal understanding capabilities.

Unified Adaptation Framework

To facilitate easy domain transfer, the paper introduces a streamlined adaptation framework. This framework reformulates diverse visual tasks into a Visual Question Answering (VQA) format, simplifying model adjustments across various tasks such as image classification, region perception, and video-related tasks.

Figure 2: The data format of our adaptation framework. We formulate other visual tasks (i.e., image classification, region perception, multi-view image tasks, video-related tasks, and visual grounding) into VQA format in our framework.

The adaptation framework standardizes model architecture, data format, and training schedules, thus allowing swift domain adaptation with low marginal costs.

Transfer Learning and Domain Adaptation

The proposed transfer learning paradigm leverages the strengths of Mini-InternVL for specialized domains. By using a blend of domain-specific and general training data, these models exhibit remarkable performance and versatility across a vast array of tasks. The framework integrates data from multiple publicly available datasets, promoting robust model training and domain adaptation.

Figure 3: Qualitative Results of . In the upper left corner is a multi-view image from DriveLM-nuScenes version-1.1 after data processing. The model's outputs align with human driving behavior.

Performance Evaluation

Extensive testing across various benchmarks demonstrates that Mini-InternVL achieves impressive performance metrics, particularly excelling in tasks involving OCR and specialized image analysis. The study also conducts a thorough examination of different training mechanisms, evaluating their impacts on computational efficiency and model performance.

Figure 4: The impact of the proportion of general data.

For example, in autonomous driving scenarios, after domain adaptation, models exhibited enhanced perception, prediction, and planning capabilities. Additionally, the effects of different adaptation techniques and training sample sizes on end performance are analyzed, illustrating the nuanced trade-offs between performance and computational resource allocation.

Conclusion

The Mini-InternVL series addresses the increasing demand for efficient multimodal models capable of operating effectively within constrained computational resources. By unifying model adaptation processes and streamlining domain transfer, Mini-InternVL stands as a significant contribution to the practical deployment of MLLMs. The paper invites future exploration into expanding the capabilities and applications of such efficient models within other demanding domains.