
EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models

Published 16 May 2025 in cs.CV and cs.CL | (2505.11405v1)

Abstract: Emotion understanding is a critical yet challenging task. Recent advances in Multimodal LLMs (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at https://github.com/xxtars/EmotionHallucer.

Summary

Evaluating Emotion Hallucinations in Multimodal Large Language Models

The paper "EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models" makes a significant contribution to the study of the reliability of generative AI in emotion perception tasks. Drawing attention to the phenomenon of "hallucinations," where models generate implausible or incoherent content contrary to expected outputs, the research evaluates and categorizes these hallucinations specifically within the domain of emotion understanding. As an area of significant theoretical and practical importance, how Multimodal Large Language Models (MLLMs) interpret emotion is increasingly critical for AI application development across numerous fields.

Despite the capabilities of MLLMs in integrating multimodal inputs and producing coherent outputs, the paper documents their propensity for hallucination, particularly in emotion-related contexts. The authors address these concerns by introducing EmotionHallucer, the first benchmark designed explicitly to evaluate emotion hallucinations. The benchmark probes two primary aspects: hallucinations of emotion psychology knowledge and hallucinations in real-world multimodal emotion perception. Evaluation rests on an adversarial binary question-answer (QA) framework that pairs carefully crafted basic questions with hallucinated counterparts.
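The full pairing and scoring protocol is part of the paper's released resources, so the following is only a minimal sketch of how such an adversarial basic/hallucinated pair might be scored; the `QAPair` fields, the yes/no ground-truth convention, and the pair-level accuracy metric are illustrative assumptions, not the benchmark's exact definitions.

```python
# Minimal sketch of scoring adversarial binary QA pairs, assuming each item
# bundles a "basic" question (ground truth: yes) with a "hallucinated"
# counterpart (ground truth: no). Field names and the pair-level metric are
# illustrative assumptions, not taken from the EmotionHallucer release.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    context: str          # text or clip description shown to the model
    basic_q: str          # question consistent with the ground truth (answer: yes)
    hallucinated_q: str   # question asserting a fabricated emotion (answer: no)


def pair_accuracy(pairs: List[QAPair], ask: Callable[[str, str], str]) -> float:
    """A pair counts as correct only if BOTH questions are answered correctly."""
    if not pairs:
        return 0.0
    correct = 0
    for p in pairs:
        yes_ok = ask(p.context, p.basic_q).strip().lower().startswith("yes")
        no_ok = ask(p.context, p.hallucinated_q).strip().lower().startswith("no")
        correct += int(yes_ok and no_ok)
    return correct / len(pairs)
```

Scoring at the pair level rather than per question penalizes models that answer "yes" (or "no") indiscriminately, which is the usual rationale for adversarial basic/hallucinated pairings.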

Evaluating 38 LLMs and MLLMs on EmotionHallucer reveals clear inadequacies in contemporary models. Most current models suffer substantially from emotion hallucinations. Closed-source models tend to outperform open-source ones in detecting them, and reasoning capability provides a further advantage. Conversely, existing models generally perform better on emotion psychology knowledge than on multimodal emotion perception tasks.

To address the identified shortcomings, the paper introduces the PEP-MEK framework (Predict-Explain-Predict with Modality and Emotion Knowledge), a plug-and-play modification shown to improve emotion hallucination detection by an average of 9.90% across selected models. The method enriches the integration of modality-specific cues and emotion-psychology knowledge, and its structured, multi-stage reasoning points toward more accurate emotion hallucination detection.
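Since the summary describes PEP-MEK only at a high level, the sketch below illustrates one plausible predict-explain-predict loop with injected emotion and modality knowledge; the prompt wording, stage ordering, and the `pep_mek_answer` helper are hypothetical, not the authors' released implementation.

```python
# Hypothetical sketch of a predict-explain-predict loop with injected modality
# and emotion-psychology knowledge. Prompts and staging are assumptions for
# illustration; they are not the authors' released PEP-MEK implementation.
from typing import Callable


def pep_mek_answer(
    question: str,
    emotion_knowledge: str,        # e.g., a short emotion-psychology primer
    modality_hint: str,            # e.g., "attend to facial cues and prosody"
    llm: Callable[[str], str],     # any text-in / text-out model interface
) -> str:
    # Stage 1: initial prediction without extra context.
    initial = llm(f"Answer yes or no.\nQuestion: {question}")

    # Stage 2: explanation grounded in the injected knowledge and modality hint.
    explanation = llm(
        "Using the emotion-psychology notes and modality hint below, explain "
        "whether the initial answer is well supported.\n"
        f"Notes: {emotion_knowledge}\n"
        f"Hint: {modality_hint}\n"
        f"Question: {question}\n"
        f"Initial answer: {initial}"
    )

    # Stage 3: final prediction conditioned on the explanation.
    return llm(
        f"Question: {question}\n"
        f"Explanation: {explanation}\n"
        "Give the final answer, yes or no only."
    )
```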

The implications of this research are noteworthy for both theoretical advances in AI and practical applications in fields where human emotion understanding is pertinent. Models capable of reliably interpreting emotional expressions could benefit domains such as mental health diagnosis systems, user-centric service applications, and emotion-sensitive AI communication interfaces. Furthermore, understanding the mechanics and missteps of current systems offers vital insights for future development, underscoring the need for emotionally intelligent and culturally calibrated models.

In conclusion, this paper foregrounds the urgency of refining MLLMs to mitigate hallucinations in emotion recognition tasks. Moving forward, developing robust evaluation metrics and frameworks for improving MLLM emotional reasoning could yield AI models capable of nuanced interpretations of human emotion, aiding in their incorporation into complex human-centered systems and applications.
