
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Published 14 Feb 2025 in cs.CL and cs.CV | (2502.10391v1)

Abstract: Despite notable advancements in Multimodal LLMs (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a $\mathbf{19.5}$% increase in conversational abilities and a $\mathbf{60}$% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
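The abstract describes Dynamic Reward Scaling as adjusting each sample's loss weight according to the reward signal. The paper's exact formulation is not given here, so the following is only an illustrative sketch: a DPO-style per-pair loss is down- or up-weighted by a sigmoid of the reward margin, so that higher-confidence comparison pairs contribute more to the update. The function names, the sigmoid weighting form, and the `k` temperature are assumptions for illustration, not the authors' method.

```python
import math

def dpo_loss(pi_logratio, ref_logratio, beta=0.1):
    """Standard DPO-style loss for one preference pair.

    pi_logratio / ref_logratio: (chosen - rejected) log-prob difference
    under the policy and the frozen reference model, respectively.
    """
    margin = beta * (pi_logratio - ref_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def dynamic_reward_scaling(reward_margin, k=1.0):
    """Illustrative per-sample weight: pairs where the reward model is more
    confident (larger reward margin) get a weight closer to 1.
    The sigmoid form and k are assumptions, not the paper's formula."""
    return 1.0 / (1.0 + math.exp(-k * reward_margin))

def weighted_loss(pi_logratio, ref_logratio, reward_margin, beta=0.1):
    """Per-pair loss scaled by the reward-derived weight."""
    return dynamic_reward_scaling(reward_margin) * dpo_loss(pi_logratio, ref_logratio, beta)
```

In this sketch, a pair with a near-zero reward margin (the reward model cannot separate chosen from rejected) is weighted at roughly 0.5, while clearly separated pairs approach full weight.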

Summary

  • The paper presents a novel benchmark that evaluates multimodal LLMs using high-resolution, complex real-world scenarios.
  • It curates an extensive dataset of 13,366 high-resolution images and 29,429 annotated question-answer pairs across 43 subtasks to ensure rigorous testing.
  • The findings reveal that even advanced models struggle, scoring below 60% accuracy, underscoring the need for improved model architecture and training techniques.

Overview of "MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?"

The paper introduces MME-RealWorld, a comprehensive benchmark designed to address several limitations in the evaluation of Multimodal LLMs (MLLMs). The authors have identified key areas where existing benchmarks fall short, particularly in reflecting the real-world challenges faced by MLLMs in critical applications. This benchmark emphasizes high-resolution scenarios that are difficult for humans, thereby presenting a formidable test for current MLLM capabilities.

Objectives and Methodology

The study primarily aims to create a more robust benchmark for evaluating MLLMs, focusing on:

  1. Data Scale: The authors address the issue of performance variance due to limited data by curating an extensive dataset. Over 300,000 images were sourced, with 13,366 high-resolution images selected for annotation. This resulted in 29,429 question-answer pairs across 43 subtasks in five real-world scenarios.
  2. Annotation Quality: Challenges with model-based annotations are mitigated by employing professional annotators and domain experts, ensuring high-quality questions that remain challenging even for humans.
  3. Task Difficulty: To truly assess model capability, the authors introduce tasks with high-resolution images and complex scenarios. The benchmark includes various domains like autonomous driving, remote sensing, and video surveillance.
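To make the dataset's organization concrete, the curation described above can be pictured as a collection of annotated samples grouped into 43 subtasks across five scenarios. The field names below are hypothetical (the paper's actual schema is not given in this summary); the sketch only shows how such samples and their per-subtask counts could be represented.

```python
from dataclasses import dataclass, field

@dataclass
class RealWorldQA:
    """One annotated benchmark sample; field names are illustrative, not the paper's schema."""
    image_path: str        # high-resolution source image
    question: str
    choices: list          # candidate answers (multiple-choice format assumed)
    answer: str            # ground-truth choice
    scenario: str          # one of the five real-world scenarios, e.g. "autonomous driving"
    subtask: str           # one of the 43 subtasks

def subtask_counts(samples):
    """Tally samples per (scenario, subtask) pair, as one would for dataset statistics."""
    counts = {}
    for s in samples:
        key = (s.scenario, s.subtask)
        counts[key] = counts.get(key, 0) + 1
    return counts
```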

The paper also presents a Chinese counterpart, MME-RealWorld-CN, recognizing the importance of native language contexts in global AI applications.

Evaluation and Results

The paper assesses 28 prominent MLLMs, including GPT-4o and Claude 3.5 Sonnet. The results are telling: even the most advanced models fail to exceed 60% accuracy on the benchmark, underscoring the difficulty of understanding high-resolution images and complex real-world scenarios.

The study reveals that models focusing on high-resolution input, such as Mini-Gemini-HD and SliME, tend to outperform conventional models that do not accommodate such granularity. Interestingly, proprietary models like GPT-4o and Claude 3.5 Sonnet, while performing well in OCR tasks, struggle significantly with tasks involving nuanced and complex real-world interpretation.
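The headline accuracy figures above can be reproduced with a simple aggregation: score each multiple-choice prediction against the ground truth, then average. Whether MME-RealWorld reports a per-sample or a per-subtask (macro) average is not stated in this summary, so both variants are sketched here as assumptions.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the ground-truth choice."""
    assert len(predictions) == len(answers) and predictions
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def macro_accuracy(per_subtask):
    """Unweighted mean of per-subtask accuracies (macro average).
    Using a macro rather than per-sample average is an assumption here."""
    return sum(per_subtask.values()) / len(per_subtask)
```

A model scoring, say, 0.8 on an OCR subtask but 0.4 on a counting subtask would land at 0.6 macro accuracy, right at the sub-60% boundary the paper reports for the strongest models.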

Implications and Future Directions

Practically, the findings emphasize the need for further innovations in model architectures and training techniques that accommodate high-resolution and complex real-world data. Theoretically, the paper suggests a considerable gap in perceptual and reasoning abilities compared to human intelligence, highlighting areas for future AI development.

The introduction of MME-RealWorld and its Chinese version marks a critical step for future research in MLLM evaluation, pushing the boundaries of how these models understand and process real-world information. As advancements in AI continue, benchmarks like MME-RealWorld will be pivotal in evaluating and shaping the capabilities of future MLLMs. The study paves the way for more sophisticated methods that not only address current limitations but also anticipate the requirements of future AI systems in diverse, challenging environments.
