- The paper introduces one-step and two-step DCCA approaches that fuse text, audio, and video features for sentiment classification.
- DCCA uses neural networks to learn non-linear correlations between modalities, outperforming baselines such as GCCA.
- Experimental results on datasets such as CMU-MOSI show improved accuracy and F-scores, underscoring the value of multi-modal embeddings.
Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis
Introduction
The paper "Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis" explores the integration of multi-modal data—text, audio, and video—to enhance sentiment classification tasks. The authors propose leveraging Deep Canonical Correlation Analysis (DCCA) to create correlated multi-modal embeddings that capture the nuances present across various data views. Two primary approaches are examined: One-Step DCCA and Two-Step DCCA. These methods aim to address the inherent question of quantifying the significance of each modality in sentiment analysis, especially when multi-modal data offers a richer understanding of human discourse than unimodal data.
Multi-modal Embedding Framework
The proposed framework uses DCCA to build multi-modal embeddings from text, audio, and video inputs. DCCA extends traditional Canonical Correlation Analysis (CCA) by training neural networks to learn non-linear correlations between views; the objective is given below. In One-Step DCCA, audio and video features are concatenated and then correlated with text features using DCCA. In contrast, Two-Step DCCA applies two stages of DCCA to iteratively combine and correlate pairs of modalities. A code sketch of the one-step variant follows the encoding summary below.
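For reference, the standard DCCA objective (Andrew et al., 2013), on which DCCA-based methods like this one build, maximizes the total correlation between the two projected views produced by networks f and g:

```latex
\max_{\theta_1,\theta_2}\ \operatorname{corr}\bigl(f(X_1;\theta_1),\, g(X_2;\theta_2)\bigr)
  = \lVert T \rVert_{\mathrm{tr}} = \operatorname{tr}\!\left(T^{\top} T\right)^{1/2},
\qquad
T = \hat{\Sigma}_{11}^{-1/2}\, \hat{\Sigma}_{12}\, \hat{\Sigma}_{22}^{-1/2}
```

Here the Sigma terms are the regularized within-view covariances and the cross-covariance of the two network outputs, and the trace (nuclear) norm of T equals the sum of the canonical correlations between the views.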
- Text Encoding: Uses pre-trained BERT, yielding a 768-dimensional vector that represents the linguistic content.
- Audio Encoding: Uses COVAREP for feature extraction; the final embedding is the average of the frame-level features, a 74-dimensional vector.
- Video Encoding: Uses methods such as FACET, yielding a 35-dimensional vector representing the video content.
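To make the one-step variant concrete, here is a minimal PyTorch sketch, not the authors' implementation: the network sizes (hidden width 128, 50-dimensional shared space) and the regularization constant are assumptions, and the random tensors stand in for real BERT/COVAREP/FACET features.

```python
import torch
import torch.nn as nn

class ProjectionNet(nn.Module):
    """Small MLP that maps one view into the shared correlation space."""
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

def cca_loss(h1, h2, eps=1e-4):
    """Negative total correlation between two projected views (the DCCA objective)."""
    m = h1.size(0)
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    s11 = (h1.t() @ h1) / (m - 1) + eps * torch.eye(h1.size(1))
    s22 = (h2.t() @ h2) / (m - 1) + eps * torch.eye(h2.size(1))
    s12 = (h1.t() @ h2) / (m - 1)
    def inv_sqrt(s):
        # Matrix inverse square root via eigendecomposition.
        w, v = torch.linalg.eigh(s)
        return v @ torch.diag(w.clamp_min(eps).rsqrt()) @ v.t()
    # Singular values of T are the canonical correlations between the views.
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    return -torch.linalg.matrix_norm(t, ord='nuc')  # maximize their sum

# One-step DCCA: correlate text with the concatenated audio+video view.
text_net = ProjectionNet(768, 50)        # BERT sentence vectors
av_net   = ProjectionNet(74 + 35, 50)    # COVAREP (74-d) + FACET (35-d), concatenated

text = torch.randn(32, 768)              # stand-in batch; real features come from the encoders
audio_video = torch.randn(32, 109)
loss = cca_loss(text_net(text), av_net(audio_video))
loss.backward()
```

In practice the learned projections (or their concatenation with the original features) are then fed to a downstream sentiment classifier, as sketched in the experimental setup below.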
Experimental Setup
Experiments were conducted on three datasets (CMU-MOSI, CMU-MOSEI, and the Debate Emotion dataset), comparing the sentiment classification performance of the proposed DCCA approaches against baselines such as GCCA and the Graph Memory Fusion Network. A sketch of the downstream evaluation follows the dataset list below.
Datasets
- CMU-MOSI: Consists of segmented product review videos annotated for sentiment.
- CMU-MOSEI: Similar to MOSI but larger, with detailed sentiment annotations.
- Debate Emotion: Focuses on aggression detection in the 2016 U.S. presidential debates.
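The paper evaluates embeddings through a downstream classifier and reports accuracy and F-score. A minimal sketch of that protocol, assuming the DCCA embeddings have already been extracted: the random arrays are stand-ins for real features and labels, and logistic regression is one plausible classifier choice, not necessarily the authors'.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Stand-in embeddings: in practice these would be the DCCA projections
# (e.g., text view concatenated with the audio/video view) per utterance.
rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(800, 100)), rng.normal(size=(200, 100))
train_y, test_y = rng.integers(0, 2, 800), rng.integers(0, 2, 200)

clf = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
pred = clf.predict(test_emb)
print("accuracy:", accuracy_score(test_y, pred))
print("f-score :", f1_score(test_y, pred))
```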
Results
The One-Step DCCA outperformed baseline methods, including uni-modal and bi-modal approaches across all datasets. Multi-modal embeddings reliably enhanced classification accuracy, showcasing the strength of correlated feature spaces in capturing sentiment.
- Accuracy and F-Score: The One-Step DCCA consistently achieved higher accuracy and F-scores compared to single or dual modality approaches, demonstrating the benefits of integrating multi-modal data in sentiment analysis tasks.
- Comparison with Baselines: It also outperformed state-of-the-art methods such as bc-LSTM combined with visual and audio encoding techniques.
Two-Step DCCA explored various combinations of input views but did not consistently surpass One-Step DCCA, suggesting that the choice of embedding dimensionality and correlation strategy remains crucial for strong results. A sketch of the two-step pipeline follows.
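Building on the earlier one-step sketch (reusing ProjectionNet, cca_loss, text_net, and the stand-in text tensor), a two-step pipeline might look as follows. The pairing order shown here, audio with video first and then the fused projection with text, is one assumed ordering; the paper explores several combinations.

```python
# Step 1: DCCA between the audio and video views.
audio_net = ProjectionNet(74, 50)
video_net = ProjectionNet(35, 50)
audio, video = torch.randn(32, 74), torch.randn(32, 35)
step1_loss = cca_loss(audio_net(audio), video_net(video))
# ... train audio_net/video_net on step1_loss, then freeze them ...

# Step 2: DCCA between text and the concatenated step-1 projections.
with torch.no_grad():
    av_fused = torch.cat([audio_net(audio), video_net(video)], dim=1)  # 100-d
fuse_net = ProjectionNet(100, 50)
step2_loss = cca_loss(text_net(text), fuse_net(av_fused))
```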
Discussion and Implications
The research highlights the potential of integrating multi-modal data with DCCA to improve sentiment analysis. While text features currently dominate performance, largely because text encoders such as BERT are far more mature, the audio and video views still contribute meaningfully to sentiment understanding when appropriately fused. Future work should improve audio/video feature extraction, optimize integration strategies, and extend evaluation to tasks beyond sentiment analysis.
Conclusions
The paper demonstrates that combining multiple modalities via DCCA enhances sentiment classification, though challenges remain because feature extraction is far more mature for text than for audio and video. Continued research is needed to refine feature extraction and integration in order to unlock the full potential of multi-modal sentiment analysis. This study provides a substantial step towards more comprehensive and accurate sentiment analysis systems built on multi-modal inputs.