Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment

Published 30 Dec 2024 in eess.AS, cs.CL, and cs.SD | (2412.20821v1)

Abstract: Multimodal emotion recognition (MER), leveraging speech and text, has emerged as a pivotal domain within human-computer interaction, demanding sophisticated methods for effective multimodal integration. The challenge of aligning features across these modalities is significant, with most existing approaches adopting a singular alignment strategy. Such a narrow focus not only limits model performance but also fails to address the complexity and ambiguity inherent in emotional expressions. In response, this paper introduces a Multi-Granularity Cross-Modal Alignment (MGCMA) framework, distinguished by its comprehensive approach encompassing distribution-based, instance-based, and token-based alignment modules. This framework enables a multi-level perception of emotional information across modalities. Our experiments on IEMOCAP demonstrate that our proposed method outperforms current state-of-the-art techniques.


Summary

The paper introduces a Multi-Granularity Cross-Modal Alignment framework that leverages distribution, token, and instance alignment to improve multimodal emotion recognition.
It utilizes Wav2Vec 2.0 and BERT for feature extraction and achieves state-of-the-art performance on the IEMOCAP dataset with WA of 78.87% and UA of 80.24%.
Ablation studies indicate that each alignment module makes a distinct contribution to handling emotional ambiguity.

Introduction

Multimodal Emotion Recognition (MER) integrates speech and text modalities to improve human-computer interaction. Traditional emotion recognition systems are unimodal, but capturing the full spectrum of emotional cues requires a multimodal strategy. Existing multimodal methods typically align features from the two modalities with a single-level strategy, which limits performance. This paper introduces the Multi-Granularity Cross-Modal Alignment (MGCMA) framework, designed to address these challenges by employing distribution-based, instance-based, and token-based alignment modules.

The MGCMA framework enhances multimodal alignment by integrating multiple alignment strategies that operate at different levels of granularity. The framework consists of three key modules: distribution-based alignment, token-based alignment, and instance-based alignment, alongside a feature extractor.

Figure 1: Overview of the proposed Multi-Granularity Cross-Modal Alignment (MGCMA) framework, which comprises distribution-based, token-based, and instance-based alignment modules and a feature extractor.

Feature Extraction

High-level representations of the inputs are obtained using Wav2Vec 2.0 for speech and BERT for text. These encodings are the inputs on which all subsequent alignment modules operate.

Distribution-Based Alignment

Distribution-based alignment uses a constructor that maps feature representations to multivariate Gaussian distributions, enabling a coarse-grained alignment in a higher-dimensional space. By applying contrastive learning at the distribution level, the module improves robustness to the inherent ambiguity of emotional expressions.

Figure 2: The structure of Distribution Constructor. Activation layers and normalization layers are omitted in the diagram.
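A distribution-level contrastive objective can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes diagonal covariances and uses the squared 2-Wasserstein distance as the dissimilarity, neither of which this summary specifies. The function names, the `scale` parameter, and the toy inputs are hypothetical, and only the speech-to-text direction is shown.

```python
import numpy as np

def wasserstein2_sq(mu_a, sigma_a, mu_b, sigma_b):
    """Pairwise squared 2-Wasserstein distance between diagonal Gaussians.

    mu_*, sigma_*: (B, d) means and standard deviations.
    Returns a (B, B) matrix whose entry [i, j] compares a_i with b_j.
    """
    mean_term = ((mu_a[:, None, :] - mu_b[None, :, :]) ** 2).sum(-1)
    std_term = ((sigma_a[:, None, :] - sigma_b[None, :, :]) ** 2).sum(-1)
    return mean_term + std_term

def distribution_contrastive_loss(mu_s, sigma_s, mu_t, sigma_t, scale=1.0):
    """Speech-to-text contrastive loss; paired items share a batch index."""
    # Smaller distance means higher similarity, so negate it (assumption).
    logits = -scale * wasserstein2_sq(mu_s, sigma_s, mu_t, sigma_t)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(mu_s))
    return -log_prob[idx, idx].mean()

# Toy distributions (illustrative only): 4 utterances, 8 dimensions.
rng = np.random.default_rng(0)
mu_s = rng.normal(size=(4, 8))
sigma_s = np.exp(0.1 * rng.normal(size=(4, 8)))
mu_t = mu_s + 0.01 * rng.normal(size=(4, 8))
sigma_t = sigma_s.copy()

loss_aligned = distribution_contrastive_loss(mu_s, sigma_s, mu_t, sigma_t)
loss_shuffled = distribution_contrastive_loss(
    mu_s, sigma_s, np.roll(mu_t, 1, axis=0), np.roll(sigma_t, 1, axis=0)
)
```

In this toy setup the loss is lower when each speech distribution is paired with its own text distribution than when the pairs are shuffled, which is the behavior a distribution-level contrastive objective is meant to reward.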

Token-Based Alignment

Token-based alignment focuses on achieving fine-grained alignment through self-attention and cross-attention mechanisms, which allow for effective local interaction between modalities. This granularity enhances the exchange of emotional cues detected within speech and text.
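The cross-attention step can be illustrated with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections, and the token counts and feature width are hypothetical; the paper's actual layer configuration is not given in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    queries:     (T_q, d) tokens of one modality
    keys_values: (T_k, d) tokens of the other modality
    Returns the (T_q, d) attended features and the (T_q, T_k) weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

rng = np.random.default_rng(0)
speech_tokens = rng.normal(size=(50, 16))  # hypothetical: 50 speech frames
text_tokens = rng.normal(size=(12, 16))    # hypothetical: 12 word pieces

# Each modality queries the other, giving token-level interaction both ways.
speech_attended, w_s2t = cross_attention(speech_tokens, text_tokens)
text_attended, w_t2s = cross_attention(text_tokens, speech_tokens)
```

Each row of `w_s2t` is a probability distribution over text tokens, so every speech frame receives a weighted mixture of the text features, and `w_t2s` does the same in the opposite direction.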

Instance-Based Alignment

Instance-based alignment is implemented using contrastive learning to refine the correlation between specific speech-text pairs. The module strengthens the mapping between paired utterances after distribution-based and token-based alignment have been applied, ensuring a holistic understanding across modalities.
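A common way to realize instance-level contrastive learning is a symmetric InfoNCE loss over a batch of paired embeddings. The sketch below assumes that formulation, cosine similarity, and a temperature of 0.07; all three are illustrative choices rather than details confirmed by this summary, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def instance_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each input is a true pair."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (B, B) scaled cosine similarities
    idx = np.arange(len(s))
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # speech -> text
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> speech
    return 0.5 * (loss_s2t + loss_t2s)

# Toy utterance-level embeddings (illustrative only): batch of 8, width 32.
rng = np.random.default_rng(0)
speech = rng.normal(size=(8, 32))
text_matched = speech + 0.05 * rng.normal(size=(8, 32))
text_shuffled = np.roll(text_matched, 1, axis=0)

loss_matched = instance_contrastive_loss(speech, text_matched)
loss_shuffled = instance_contrastive_loss(speech, text_shuffled)
```

The loss is small when each speech embedding sits closest to its own text embedding and grows when the pairing is broken, which is how the objective pulls matching speech-text pairs together and pushes mismatched ones apart.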

Experimental Evaluation

The MGCMA framework was evaluated on the IEMOCAP dataset. Performance metrics include Weighted Accuracy (WA) and Unweighted Accuracy (UA).
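This summary does not define the two metrics. Under the convention usually followed in the IEMOCAP literature, WA is the overall accuracy across all samples and UA is the mean of the per-class recalls, so UA gives every emotion class equal weight regardless of how many samples it has. The sketch below assumes those definitions; the labels are toy values, not IEMOCAP data.

```python
from collections import Counter

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy labels (illustrative only)
y_true = ["neu", "neu", "neu", "neu", "hap", "sad"]
y_pred = ["neu", "neu", "neu", "hap", "hap", "ang"]

wa = weighted_accuracy(y_true, y_pred)    # 4 of 6 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 0.75, 1.0, 0.0
```

Because the toy data is imbalanced, the two numbers differ: the majority class dominates WA, while the completely missed "sad" class pulls UA down.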

Comparison with SOTA Methods

The proposed framework outperformed existing state-of-the-art MER methods. MGCMA achieved WA of 78.87% and UA of 80.24%, evidencing the effectiveness of the multi-level alignment approach.

Ablation Studies

A series of ablation studies highlights the individual contribution of each alignment module. The results indicate that distribution-based alignment contributes significantly to overall performance by supporting the token-based and instance-based alignment strategies.

Future Implications

The framework's ability to handle multimodal emotional ambiguity while maintaining high accuracy demonstrates its potential for practical applications in human-computer interaction systems. Future work could focus on optimizing computational efficiency without sacrificing the nuanced emotional insights provided by the multi-granularity alignment strategies.

Conclusion

The proposed MGCMA framework integrates distribution-based, token-based, and instance-based alignment modules to address the complexity of multimodal emotional expression. Its state-of-the-art performance on the IEMOCAP dataset (WA 78.87%, UA 80.24%) supports the value of multi-granularity alignment and may motivate further research into adaptive, scalable emotion recognition systems.
