HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Published 17 Feb 2025 in cs.CV (arXiv:2502.12148v1)

Abstract: The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal LLMs (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow

Summary

Analyzing HermesFlow: Enhancing Multimodal Large Language Models

This paper introduces HermesFlow, a framework designed to enhance Multimodal Large Language Models (MLLMs) by bridging the existing gap between their understanding and generation capabilities. Existing MLLMs like Show-o, Transfusion, and Emu3, while highly proficient in understanding tasks, often show comparatively weak performance in generation tasks. The authors of the paper propose HermesFlow to address this imbalance and enhance the overall capabilities of MLLMs.

Insightful Observations

The authors identify a consistent phenomenon across several models, including VILA-U, Janus, and Show-o: the understanding capabilities of MLLMs outperform their generation abilities. This understanding-generation gap has impeded the balanced functioning of these models. Importantly, the issue is not resolved simply by increasing the quantity of training data; more sophisticated alignment strategies are required. The paper therefore argues for a structured approach that aligns understanding and generation using homologous preference data.

Approach and Methodology

HermesFlow innovatively implements a Pair-DPO (Direct Preference Optimization) framework that employs homologous input data, capturing both understanding and generation preferences. The framework advances through several rounds of self-play iterative optimization, progressively refining an MLLM’s performance until the gap between understanding and generation is significantly reduced.
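To make the optimization step concrete, the sketch below shows how a Pair-DPO objective might combine a standard DPO loss over an understanding preference pair with one over a generation preference pair built from the same homologous input. The weighting `alpha`, the temperature `beta`, and the exact combination are assumptions for illustration, not the paper's verified formulation:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probs of the
    chosen (w) and rejected (l) outputs under the policy and a frozen
    reference model: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pair_dpo_loss(und_pair, gen_pair, alpha=0.5, beta=0.1):
    """Hypothetical Pair-DPO: jointly optimize an understanding pair and a
    generation pair derived from the same (homologous) input, so one update
    pulls both capabilities toward the preferred behavior."""
    l_und = dpo_loss(*und_pair, beta=beta)
    l_gen = dpo_loss(*gen_pair, beta=beta)
    return alpha * l_und + (1.0 - alpha) * l_gen
```

In the self-play loop described above, each round would regenerate candidate captions and images with the current model, rebuild preference pairs, and minimize this loss again, so the preference data tightens as the model improves.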

The curation of homologous preference data begins with generating potential captions and images from an MLLM, followed by selecting preferred outcomes based on predefined criteria, such as BERT similarity scores and self-VQA (Visual Question Answering) scores. This process ensures that both understanding and generation are optimized concurrently, leveraging insights from each domain to benefit the other.
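A minimal sketch of the selection step: given several candidate outputs for one homologous input, rank them with a scoring function and keep the best and worst as the (chosen, rejected) pair. The `toy_similarity` function below is a stand-in for the BERT similarity (captions) or self-VQA scoring (images) mentioned above; it is a hypothetical token-overlap score used only to make the example self-contained:

```python
def curate_preference_pair(candidates, score_fn):
    """Pick (chosen, rejected) = highest- and lowest-scoring candidate.
    score_fn stands in for BERT similarity or a self-VQA score."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[0], ranked[-1]

def toy_similarity(caption, reference="a dog running on grass"):
    """Hypothetical stand-in scorer: Jaccard overlap of word sets with a
    reference caption (a real pipeline would use BERT embeddings)."""
    ref_tokens = set(reference.split())
    tokens = set(caption.split())
    return len(tokens & ref_tokens) / max(len(tokens | ref_tokens), 1)

captions = ["a dog running on grass", "a cat on a sofa", "a dog on grass"]
chosen, rejected = curate_preference_pair(captions, toy_similarity)
```

Running the curation over both modalities for the same input yields the homologous understanding and generation pairs that Pair-DPO consumes.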

Empirical Evidence

The evidence supporting the efficacy of HermesFlow is detailed through comprehensive experiments. The approach improves on prior systems across various metrics, demonstrating proficiency on understanding benchmarks like POPE and MME and on generation benchmarks such as GenEval. Quantitative comparisons indicate HermesFlow's superiority, showing a significant reduction in the understanding-generation gap relative to models like Show-o: while the gap in Show-o was measured at 0.087, HermesFlow reduces it to 0.036.
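One plausible way to compute such a gap figure, assuming it is the difference between mean normalized understanding and generation benchmark scores (the paper's exact normalization is not specified here), is:

```python
def mean_normalized(scores, score_maxes):
    """Average of per-benchmark scores normalized to [0, 1]."""
    return sum(s / m for s, m in zip(scores, score_maxes)) / len(scores)

def understanding_generation_gap(und_scores, und_maxes, gen_scores, gen_maxes):
    """Hypothetical gap metric: mean normalized understanding score minus
    mean normalized generation score. Positive values mean understanding
    outpaces generation, as the paper observes for current MLLMs."""
    return mean_normalized(und_scores, und_maxes) - mean_normalized(gen_scores, gen_maxes)
```

Under this reading, shrinking the gap from 0.087 to 0.036 means generation scores moved markedly closer to understanding scores after alignment.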

Implications and Future Work

HermesFlow holds promise not only as a framework for enhancing MLLMs but potentially as a foundational alignment strategy for future multimodal models, addressing the current limitations of isolated task improvements. By maintaining balance and fostering synergy between understanding and generation, HermesFlow could play a crucial role in the development of more holistic MLLMs.

The authors also acknowledge existing limitations, such as the need for broader application across different backbone models, suggesting areas for future research. Expanding the range of MLLMs that benefit from HermesFlow might incorporate more diverse data types and problem formulations, potentially boosting the framework's generality and effectiveness.

In summary, HermesFlow presents an effective alignment framework that promises to alleviate the imbalance between multimodal understanding and generation, offering substantial contributions to the field of AI and multimodal technologies.
