
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Published 16 Nov 2024 in cs.CV (arXiv:2411.10669v1)

Abstract: As the research of Multimodal LLMs (MLLMs) becomes popular, an advancing MLLM model is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously for real-world applications. However, due to the significant differences in representation and distribution among data from various tasks, simply mixing data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across various tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLM, which acquires the multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is devised as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple latest benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our project page: https://github.com/MetabrainAGI/Awaker.

Summary

  • The paper presents a novel MoE architecture with LoRA adaptations that effectively mitigates multi-task conflict in multimodal LLMs.
  • It employs a unique instance-level routing strategy and freezes the base model to drastically reduce training and inference costs.
  • Experimental results demonstrate state-of-the-art performance on benchmarks, excelling in both visual perception and reasoning tasks.

Awaker2.5-VL: Addressing Multi-Task Conflict in Multimodal LLMs through a Mixture of Experts Architecture

The paper presents Awaker2.5-VL, a Multimodal LLM (MLLM) that employs a Mixture of Experts (MoE) architecture to tackle the challenges associated with handling diverse textual and visual tasks. As a response to the prevalent "multi-task conflict" issue—wherein performance degrades across various tasks due to the heterogeneity in data representation and distribution—the authors propose a method to enhance the task-specific capabilities of MLLMs.

Methodology and Model Architecture

Awaker2.5-VL utilizes a Mixture of Experts (MoE) architecture characterized by multiple sparsely activated experts. This design aims to provide task-specific abilities to the model by dynamically activating and deactivating experts through a gating network. Importantly, a global expert remains active throughout, ensuring the model retains its general-purpose versatility. Each expert is constructed as a low-rank adaptation (LoRA) module, which, combined with the sparse activation of the MoE, keeps both training and inference costs low.
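To make the layer structure concrete, here is a minimal NumPy sketch of a LoRA-based MoE layer with an always-active global expert. This is an illustrative assumption, not the authors' implementation: the dimensions, the softmax gate, and the top-k selection are placeholders chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

D, R, N_EXPERTS, TOP_K = 16, 4, 4, 1  # hidden dim, LoRA rank, expert count, active experts

# Frozen base projection (stands in for one frozen linear layer of the base MLLM).
W_base = rng.standard_normal((D, D)) * 0.02

# Each expert is a LoRA pair (A: D x R, B: R x D), far smaller than W_base.
# B is initialized to zero, the standard LoRA convention.
experts = [(rng.standard_normal((D, R)) * 0.02, np.zeros((R, D)))
           for _ in range(N_EXPERTS)]
global_A = rng.standard_normal((D, R)) * 0.02   # the always-active global expert
global_B = np.zeros((R, D))
W_gate = rng.standard_normal((D, N_EXPERTS)) * 0.02  # gating network

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_lora_forward(x):
    """x: (D,) input; returns frozen base output + global LoRA + top-k expert LoRA."""
    gate = softmax(x @ W_gate)              # expert scores for this input
    top = np.argsort(gate)[-TOP_K:]         # sparse activation: keep only top-k experts
    out = x @ W_base + (x @ global_A) @ global_B   # base path + global expert
    for i in top:
        A, B = experts[i]
        out = out + gate[i] * ((x @ A) @ B)        # weighted sparse expert delta
    return out

y = moe_lora_forward(rng.standard_normal(D))
print(y.shape)  # (16,)
```

Note the efficiency argument this sketch exposes: each expert adds only `2*D*R` parameters versus `D*D` for the base weight, and only `TOP_K + 1` of the LoRA paths are evaluated per input.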

The training strategy is built around freezing the base model during the learning of MoE and LoRA modules, a measure that significantly reduces training costs. The authors employ a novel routing strategy that simplifies typical MoE structures in LLMs by utilizing instance-level activation rather than token-level activation.
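The contrast between the two routing granularities can be sketched as follows; the mean-pooling used to summarize the instance is an illustrative assumption, not necessarily the paper's prompt representation.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, N_EXPERTS = 8, 16, 4   # tokens per instance, hidden dim, experts

W_gate = rng.standard_normal((D, N_EXPERTS))
tokens = rng.standard_normal((T, D))   # one instance's token embeddings

# Token-level routing (typical LLM MoE): a separate gating decision per token,
# so T routing decisions per instance.
token_choice = (tokens @ W_gate).argmax(axis=1)   # shape (T,)

# Instance-level routing (as described for Awaker2.5-VL): pool the instance
# once, make ONE decision, and apply the same expert to every token.
pooled = tokens.mean(axis=0)
instance_choice = int((pooled @ W_gate).argmax())

print(token_choice.shape, instance_choice)
```

The practical payoff is that the gate runs once per instance instead of once per token, and all tokens of an instance share one expert's weights, which simplifies batching at inference time.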

Experimental Evaluation and Results

The experimental results are compelling, with Awaker2.5-VL exhibiting state-of-the-art performance across several recent benchmarks. Specifically, it achieves superior results on the MME-Realworld and MMBench benchmarks, demonstrating significant improvements over other models, including its base model, Qwen2-VL-7B-Instruct. Notably, in Chinese-language benchmarks such as MME-Realworld-CN, Awaker2.5-VL outperforms other models in both perception and reasoning tasks. This underscores the effectiveness of the proposed MoE strategy in managing multimodal tasks.

Implications and Future Work

The introduction of Awaker2.5-VL marks an incremental advancement in the domain of MLLMs by addressing the multi-task conflict through an MoE architecture. The authors highlight potential improvements in the routing process, suggesting advancements in prompt representation for enhanced performance. Further exploration into integrating the MoE architecture within the Vision Transformer (ViT) aspect of the model is also envisaged.

These improvements may offer insights into the broader implications for AI development, particularly in optimizing performance across diverse datasets through efficient parameter utilization. As research evolves, the methods proposed could serve as a foundational strategy for developing cost-effective, scalable models addressing heterogeneous task requirements in multimodal AI systems.
