Imp: Highly Capable Large Multimodal Models for Mobile Devices

Published 20 May 2024 in cs.CV and cs.CL | (2405.12107v2)

Abstract: By harnessing the capabilities of LLMs, recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.

Summary

The paper introduces Imp, a family of lightweight LMMs that achieve high performance with reduced computational demands for mobile devices.
It details innovative use of smaller LLMs, advanced visual encoders like SigLIP, and efficient LoRA finetuning to optimize model design.
Results demonstrate that Imp models excel in multilingual tasks, OCR challenges, and real-time mobile deployment, offering practical AI solutions.

Journey Towards Lightweight Large Multimodal Models (LMMs)

Introduction to Lightweight LMMs

Size and compute requirements are often the main obstacles to deploying complex AI systems. Large language models (LLMs) like GPT-4 and Gemini-1.5 have pushed the boundaries of AI capabilities, yet they are computationally intensive. Large multimodal models (LMMs), which build on LLMs to handle several types of data together (like text and images), inherit these heavy computational demands.

The paper introduces a new family of LMMs called Imp, designed to be both effective and lightweight. These models aim to strike a balance between maintaining high performance and reducing computational overhead, making them feasible for deployment on everyday devices like mobile phones.

Key Design Choices

The key to building these lightweight models lies in careful design choices across model architecture, training strategy, and training data. Here's a breakdown of how these choices come together.

Model Architecture

Choice of LLM:

The study starts from smaller but capable LLMs, comparing Phi-2 (2.7B parameters) with MobileLLaMA (2.7B parameters).
Phi-2 outperformed MobileLLaMA significantly at the same parameter count, which is attributed primarily to its high-quality training dataset.

Choice of Visual Encoder:

Most LMMs use visual encoders based on models like CLIP. The authors experimented with several visual encoders, and SigLIP performed best, which is attributed to its extensive training on image-text pairs.
With the SigLIP visual encoder, Imp models achieve superior performance at a smaller computational scale than their counterparts.

Training Strategy

Finetuning Mechanism:

The researchers found that LoRA finetuning outperformed traditional full-parameter finetuning. Specifically, a LoRA rank of 256 offered the best balance between model capability and resource efficiency.
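
To make the rank trade-off concrete, here is a minimal pure-Python sketch of a LoRA update and of how many trainable parameters it adds to one weight matrix. It is an illustration, not the authors' training code: the helper names (`lora_param_count`, `lora_forward`, `matvec`), the toy dimensions, and the hidden size of 2560 are assumptions chosen for the example; only the rank of 256 comes from the text above.

```python
import random

def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # A has shape (rank x d_in), B has shape (d_out x rank).
    return rank * (d_in + d_out)

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W stays frozen; only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Toy example with hypothetical sizes.
random.seed(0)
d, r = 4, 2
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # B starts at zero, so the adapter is a no-op at first
x = [1.0, 2.0, 3.0, 4.0]
assert lora_forward(W, A, B, x, alpha=2 * r, rank=r) == matvec(W, x)

# Parameter budget for one square projection at rank 256.
hidden = 2560  # assumed hidden size, for illustration only
full = hidden * hidden
lora = lora_param_count(hidden, hidden, 256)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2f}")
```

Under these assumed sizes, a rank-256 adapter trains about a fifth of the parameters of the matrix it adapts, which is one way to read the "balance between model capability and resource efficiency" described above.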

Training Epochs:

Training for just one epoch often left the model under-optimized. Instead, training for two epochs provided a notable boost in performance without a significant increase in computational requirements.

Enhanced Training Data

OCR and Chart Understanding:

Adding data from datasets like DVQA and ChartQA, which focus on OCR (Optical Character Recognition) and chart understanding, markedly improved the model's ability to handle tasks that require recognizing text within images.

GPT-4V Annotated Data:

Incorporating GPT-4V annotated datasets improved the model's ability to follow instructions and engage in conversations, significantly bolstering its overall performance.

Results and Comparisons

The paper presents three Imp models (Imp-2B, Imp-3B, and Imp-4B). Some notable results:

- The Imp-3B model outperformed many existing 7B and even 13B parameter models across several benchmarks.
- Imp-2B particularly excelled in multilingual understanding, showing robust performance on Chinese text despite being trained primarily on English data.
- The Imp-4B model combined all the improvements and delivered strong results across a wide range of benchmarks, demonstrating the viability of small yet capable LMMs.

Deployment on Mobile Devices

One of the major advantages of these lightweight Imp models is that they can be deployed on mobile devices. Using low-bit quantization and reduced image resolution, the researchers optimized Imp-3B to run on a Qualcomm Snapdragon 8Gen3 mobile chip.
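
As a rough illustration of why low-bit quantization matters for on-device deployment, the sketch below quantizes a few weights to 4 bits and estimates weight memory at different bit widths. The summary does not say which quantization scheme was used, so the symmetric round-to-nearest scheme, the helper names, the sample weights, and the nominal 3 billion parameter count are all assumptions made for the example.

```python
def quantize_symmetric(weights, bits=4):
    """Symmetric per-tensor quantization: map floats to integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

def weight_memory_gb(n_params, bits):
    """Memory needed to store the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits / 8 / 1e9

# Hypothetical weights, chosen only to show the rounding error.
weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(q, scale)
print("codes:", q)
print("max error:", max(abs(w - d) for w, d in zip(weights, restored)))

# Weight storage for a nominal 3-billion-parameter model (assumed round figure).
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(3e9, bits):.1f} GB")
```

The rounding error per weight is bounded by half the scale, while weight storage shrinks in proportion to the bit width; real deployments usually quantize per group or per channel rather than per tensor to keep that error small.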

Performance and Speed:
- On the Snapdragon 8Gen3, the model reached an inference speed of about 13 tokens/s, making real-time applications plausible.
- Reducing the image resolution did not significantly impact the overall performance, ensuring a good balance between latency and model capability.
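
The latency effect of resolution reduction can be sketched with simple arithmetic. The example below assumes a ViT-style visual encoder that splits a square image into fixed-size patches and emits one token per patch; the patch size of 14 and the two resolutions are illustrative values, not figures taken from this summary. The generation speed of about 13 tokens/s is the one reported in the abstract.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Tokens produced by a patch-based encoder for a square image."""
    per_side = image_size // patch_size
    return per_side * per_side

def generation_seconds(n_tokens: int, tokens_per_second: float = 13.0) -> float:
    """Time to generate n_tokens at a fixed decoding speed."""
    return n_tokens / tokens_per_second

# Illustrative resolutions (assumed): fewer patches means fewer visual tokens
# for the language model to process before it can start answering.
high = visual_token_count(384, 14)  # 27 * 27 = 729
low = visual_token_count(196, 14)   # 14 * 14 = 196
print(f"visual tokens: {high} -> {low}")

# Decoding time for a 100-token answer at about 13 tokens/s.
print(f"100 tokens take about {generation_seconds(100):.1f} s")
```

Because the visual tokens are part of the language model's input, cutting their number shortens the time before the first output token, which is the latency side of the trade-off described in the list above.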

Practical Implications and Future Work

The Imp models lay down a promising path for deploying high-performance AI in resource-constrained environments such as mobile devices and edge computing. This makes advanced AI accessible to a broader range of applications, including personal assistants, real-time translation services, and more.

Looking Forward

Future improvements could involve:

- Introducing more diverse and high-quality datasets to further refine model capabilities.
- Implementing advanced training strategies like knowledge distillation.
- Exploring more efficient model compression techniques.
- Extending support for additional input modalities such as audio and 3D data.

The researchers are also focusing on practical deployments and have developed ImpChat, a multi-platform assistant built on these lightweight models, so that a capable AI assistant can run across various devices without extensive resources.

As we move forward, continued efforts to refine these lightweight yet powerful models could lead to a broader, more inclusive application of AI technologies.