
AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs

Published 27 Feb 2025 in cs.CV (arXiv:2502.20035v1)

Abstract: Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal LLM (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts -- stemming from modality-specific optimization objectives -- and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks. Code: https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md

Summary

AsymLoRA is a parameter-efficient tuning framework for MLLM instruction fine-tuning, designed to address both the data conflicts and the latent commonalities inherent in diverse image-text datasets.

  • The core design is an asymmetric LoRA architecture: task-specific B matrices preserve distinct adaptation pathways for conflicting objectives, while a shared A matrix consolidates cross-modal commonalities (see the sketch after this list). With this split, AsymLoRA outperforms both vanilla LoRA and LoRA-MoE in MLLM fine-tuning.
  • On single-domain conversation tasks, AsymLoRA reaches a TextVQA score of 55.51% and improved MME scores (Perception: 1327.93, Cognition: 287.14), and it attains the highest GQA accuracy (59.60%) while keeping distribution shift low (1.50), indicating stronger multimodal reasoning and feature extraction.
  • In multi-domain settings, AsymLoRA scores 54.25% on TextVQA and a 38.10% VizWiz average, integrating textual and visual cues, handling diverse multimodal challenges, and adapting across domains while preserving effective knowledge transfer.
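To make the asymmetric design concrete, here is a minimal PyTorch sketch of such a layer. This is an illustration, not the authors' implementation: the class name, the rank, the scaling factor, and the explicit task_id routing argument are all assumptions layered on the standard LoRA update W·x + (α/r)·B(A·x), specialized so that A is shared and B is per-task as the abstract describes.

```python
import torch
import torch.nn as nn

class AsymLoRALinear(nn.Module):
    """Illustrative asymmetric LoRA layer (a sketch, not the paper's code).

    A single down-projection A is shared across tasks to consolidate
    cross-modal commonalities; each task gets its own up-projection B,
    preserving a separate adaptation pathway for conflicting objectives.
    """

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8,
                 num_tasks: int = 3, alpha: float = 16.0):
        super().__init__()
        # Frozen base projection standing in for a pretrained weight.
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.scaling = alpha / rank
        # Shared A: one trainable copy, small random init (standard LoRA practice).
        self.lora_A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        # Task-specific Bs: zero-initialized, so training starts at the base model.
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_dim, rank)) for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Shared down-projection, then the task's own up-projection.
        delta = (x @ self.lora_A.T) @ self.lora_B[task_id].T
        return self.base(x) + self.scaling * delta

# Usage: route a batch through task pathway 1. Task selection is an explicit
# index here; the paper's actual routing/merging strategy is not reproduced.
layer = AsymLoRALinear(in_dim=4096, out_dim=4096, rank=8, num_tasks=3)
out = layer(torch.randn(2, 4096), task_id=1)
```

Under this framing, vanilla LoRA corresponds to num_tasks = 1 (a single shared A and B, capturing only commonalities), while a LoRA-MoE-style design would replicate both A and B per expert (focusing on conflicts); AsymLoRA's claim is that sharing only A captures what transfers across tasks while the split Bs absorb what conflicts.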
