OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Published 25 Feb 2025 in cs.CV | (2502.18411v2)

Abstract: Recent advancements in open-source multi-modal LLMs (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.

Summary

  • The paper introduces OmniAlign-V, a dataset with over 200K samples enhancing MLLM alignment with human preferences through diverse image-question pairs.
  • It presents MM-AlignBench, a benchmark specifically designed to evaluate MLLMs' alignment capabilities with human preferences using human-annotated samples.
  • Empirical findings show fine-tuning MLLMs with OmniAlign-V significantly enhances human preference alignment while maintaining performance on other tasks.

Overview of "OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference"

The research paper entitled "OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference" examines the disparity between the foundational capabilities of open-source Multi-Modal LLMs (MLLMs) and their alignment with human preferences. The study introduces OmniAlign-V, a dataset specifically designed to bridge this gap by improving the alignment of MLLMs with human values and preferences. Additionally, the paper presents MM-AlignBench, a benchmark for rigorously evaluating how well MLLMs align with human values.

The authors underscore a vital observation: while MLLMs have achieved performance parity with proprietary models concerning objective tasks such as object recognition and OCR, they fall short in human preference alignment. This shortcoming significantly impairs the user experience during multi-modal conversational interactions. To address this, the paper introduces OmniAlign-V, a dataset containing over 200K samples of curated images paired with open-ended and comprehensive question-answer pairs. This dataset aids in refining the human alignment aspect of MLLMs without compromising their intrinsic capabilities measured on standard Visual Question Answering (VQA) benchmarks.

Key Contributions

The paper presents several contributions that are noteworthy:

  1. Comprehensive Dataset Creation: OmniAlign-V, a dataset comprising over 200,000 samples, enhances MLLMs' alignment with human preferences by incorporating diverse images paired with complex questions and responses. This dataset is characterized by open-ended questions, diverse topics, and varied response formats.
  2. Development of a Benchmark: MM-AlignBench is designed to evaluate MLLMs' alignment capabilities specifically with respect to human preferences. Comprising high-quality, human-annotated samples, it measures whether models can understand and align with human values rather than only solve objective tasks.
  3. Empirical Findings on Human Alignment Performance: The study establishes that fine-tuning MLLMs using either Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) with OmniAlign-V significantly enhances alignment with human preferences. This is validated without adverse effects on other performance metrics.
  4. In-depth Examination of Current Shortcomings: By conducting a preliminary study, the authors identify the critical degradation of alignment capabilities in MLLMs when compared to traditional LLMs, postulate reasons for this, and explore potential remedies through specialized datasets.
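To make the DPO option in contribution 3 concrete, the sketch below implements the standard pairwise DPO objective on one preference pair. This is a generic illustration of the technique, not code from the paper; the function name and its log-probability inputs (summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model) are assumptions for illustration.

```python
import math

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Pairwise DPO loss, -log sigmoid(beta * margin), for one preference pair.

    Inputs are summed token log-probabilities of the chosen/rejected
    responses under the trainable policy and the frozen reference model.
    """
    # Margin of the policy's preference over the reference's preference.
    logits = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy assigns the chosen response a larger log-probability gain over the reference than it does the rejected one, the margin is positive and the loss drops below log 2; a tied pair gives exactly log 2.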

Implications and Future Directions

The implications of enhancing MLLM alignment are both practical and theoretical. On a practical level, improved human alignment means better user interaction, leading to more effective deployment of MLLMs in real-world applications where human communication style is crucial. Theoretically, it opens new avenues in AI research focused on increasing the contextual understanding and empathetic response generation of AI systems.

Furthermore, the proposed methods and insights lay groundwork for future research into improving multi-modal models. By addressing gaps in data preparation and fine-tuning processes, future work might scale OmniAlign-V or similar datasets to achieve even broader human-AI interaction alignment. Algorithms that integrate multi-modal and language-specific data streams are another promising direction for helping AI systems assimilate and align with diverse sets of human values and preferences.

In summary, "OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference" delineates a clear pathway towards the goal of achieving better human preference alignment in MLLMs. The introduction of specialized datasets and benchmarks marks significant progress, potentially leading to more effective and human-centric AI systems in the future.
