MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Abstract: We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of mobile-oriented architectural designs and techniques, comprising a set of LLMs at the 1.4B and 2.7B parameter scale trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks, where our models perform on par with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, obtaining state-of-the-art speeds of 21.5 tokens per second and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
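To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the three-part pipeline: a CLIP-fashion vision encoder, an efficient projector that bridges visual features into the LLM embedding space, and a decoder-only LLM. This is a sketch under stated assumptions, not the paper's implementation; all class and function names (`EfficientProjector`, `MobileVLMSketch`, `tokens_per_second`) and the projector's MLP design are hypothetical placeholders, since the abstract does not specify these details.

```python
# Illustrative MobileVLM-style pipeline (hypothetical names, not the official code):
# vision encoder -> efficient projector -> visual tokens prepended to text tokens -> LLM.
import time
import torch
import torch.nn as nn


class EfficientProjector(nn.Module):
    """Maps vision features (B, N, d_vis) into the LLM embedding space (B, N, d_llm).

    A placeholder two-layer MLP; the paper's actual projector is designed for
    mobile efficiency and may also reduce the number of visual tokens.
    """

    def __init__(self, d_vis: int, d_llm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vis, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vis_feats)


class MobileVLMSketch(nn.Module):
    """Wires together the three components named in the abstract."""

    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # CLIP-style ViT; assumed to return (B, N, d_vis)
        self.projector = projector            # cross-modality bridge
        self.llm = llm                         # decoder-only LLM (1.4B / 2.7B scale in the paper)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis_feats = self.vision_encoder(image)                 # (B, N, d_vis)
        vis_tokens = self.projector(vis_feats)                 # (B, N, d_llm)
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)   # prepend visual tokens to text
        return self.llm(inputs)                                # next-token logits


def tokens_per_second(generate_fn, prompt, max_new_tokens: int = 128) -> float:
    """Crude decode-throughput estimate (tokens/s), in the spirit of the on-device
    measurements reported in the abstract. `generate_fn` is any callable that
    returns the generated token list for a prompt."""
    start = time.time()
    out_tokens = generate_fn(prompt, max_new_tokens)
    return len(out_tokens) / (time.time() - start)
```

In this sketch the visual tokens are simply concatenated in front of the text embeddings before decoding; the reported 21.5 and 65.3 tokens per second would be obtained by timing the deployed model's decode loop on the respective devices, analogous to `tokens_per_second` above.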