Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Published 28 Oct 2022 in cs.MM and cs.AI | (2210.15824v3)

Abstract: Modality representation learning is an important problem for multimodal sentiment analysis (MSA), since the highly distinguishable representations can contribute to improving the analysis effect. Previous works of MSA have usually focused on multimodal fusion strategies, and the deep study of modal representation learning was given less attention. Recently, contrastive learning has been confirmed effective at endowing the learned representation with stronger discriminate ability. Inspired by this, we explore the improvement approaches of modality representation with contrastive learning in this study. To this end, we devise a three-stages framework with multi-view contrastive learning to refine representations for the specific objectives. At the first stage, for the improvement of unimodal representations, we employ the supervised contrastive learning to pull samples within the same class together while the other samples are pushed apart. At the second stage, a self-supervised contrastive learning is designed for the improvement of the distilled unimodal representations after cross-modal interaction. At last, we leverage again the supervised contrastive learning to enhance the fused multimodal representation. After all the contrast trainings, we next achieve the classification task based on frozen representations. We conduct experiments on three open datasets, and results show the advance of our model.

Abstract PDF Upgrade to Chat

Citations (3)

View on Semantic Scholar

Summary

The paper introduces a multi-view contrastive learning framework for refining modality-specific representations in multimodal sentiment analysis.
It combines supervised and self-supervised contrastive methods to enhance feature clustering and multimodal fusion efficiency.
Experiments on CMU-MOSI, MOSEI, and CH-SIMS datasets show significant improvements in sentiment classification accuracy and correlation metrics.

Improving Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

The paper explores modality representation learning in multimodal sentiment analysis (MSA) using a multi-view contrastive learning framework. This approach aims to enhance the discriminative power of modality-specific representations, addressing the limitations associated with conventional multimodal fusion methods.

Introduction to Multimodal Sentiment Analysis

Sentiment analysis is a crucial task in NLP, focusing on understanding emotional states through data. Traditional approaches centered on textual data have been extended to multimodal sentiment analysis, leveraging richer datasets that include spoken words, visual attributes, and acoustic features. Multimodal data offers a comprehensive understanding of emotions, especially when textual information is ambiguous.

Existing MSA methods largely focus on multimodal fusion. Early-fusion techniques combine features from various modalities immediately, while late-fusion strategies integrate results post independent modality training. Despite their efficacy, these strategies have yet to fully exploit cross-sample relations that could significantly boost representation distinguishability.

Multi-View Contrastive Learning Framework

Contrastive learning has notably improved representation quality by distinguishing similar and dissimilar samples in a shared embedding space. This paper introduces a three-stage framework to refine modality-specific representations:

Supervised Contrastive Learning (SCL-1): Initial features are extracted using pre-trained models; these features are refined by pulling samples together within the same class and pushing others apart, enhancing their semantic cohesion.
Self-Supervised Contrastive Learning (SSCL): Utilizing cross-modal Transformers, modality-specific representations are further refined. Self-supervised pairs are formed from different refinement orders to minimize representation gaps, thus enhancing modality interaction.
Supervised Contrastive Learning (SCL-2): Final multimodal representations are achieved by self-attention mechanisms, ensuring a robust discriminative representation for sentiment classification tasks.
Figure 1: The overview of our proposed method.

Methodology

The framework combines supervised and self-supervised contrastive learning to enhance modality-specific and multimodal representations. The process involves extracting initial modality features, applying Transformers for temporal and spatial information, and implementing contrastive learning strategies to refine and fuse representations effectively.

Feature Extraction

Video: Spatial and temporal features extracted using MTCNN and I3D.
Text: BERT embeddings filtered by part-of-speech tags and enhanced using TF-IDF.
Acoustic: Features derived using LibROSA and wav2vec2.0, capturing intricate audio nuances.

Contrastive Learning Application

SCL-1 facilitates unimodal representation clustering, utilizing a batch-based contrastive loss function. SSCL refines modality-aimed cross-modal representations via a novel cross-modal Transformer, enhancing feature interaction without requiring additional augmentations. SCL-2 synthesizes these refined multimodal representations, ensuring feature integrity and discriminative accuracy during multiclass predictions.

Experimental Evaluation

The framework's efficacy was tested on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Results demonstrate superior performance across multiple metrics such as F1 score, accuracy, MAE, and Pearson correlation when compared to existing fusion-based methods like TFN, LMF, MulT, and MISA. This shows that the proposed contrastive learning significantly improves modality representation quality.

Dataset and Results

MOSI Dataset: Enhanced performance in most metrics, indicating effective representation refinement but showing slight limitation in accuracy.
MOSEI Dataset: Overall improvement highlighting refined modality-invariant features.
CH-SIMS Dataset: Demonstrated the model's adaptability across modality-specific tasks, maintaining high accuracy and correlation scores.

Conclusion

The multi-view contrastive learning framework successfully advances MSA by producing highly discriminative representations. This approach paves the way for improved sentiment analysis accuracy, emphasizing enhanced feature differentiation and interaction through contrastive methodologies. Future research may explore broader applications of contrastive learning across modalities and its integration with other feature extraction techniques.