- The paper introduces one-step and two-step DCCA approaches that fuse text, audio, and video features for sentiment classification.
- DCCA uses neural networks to learn non-linear correlations between modalities, outperforming baselines such as GCCA.
- Experimental results on datasets such as CMU-MOSI show improved accuracy and F-scores, underscoring the value of multi-modal embeddings.
Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis
Introduction
The paper "Multi-modal Sentiment Analysis using Deep Canonical Correlation Analysis" explores the integration of multi-modal data—text, audio, and video—to enhance sentiment classification tasks. The authors propose leveraging Deep Canonical Correlation Analysis (DCCA) to create correlated multi-modal embeddings that capture the nuances present across various data views. Two primary approaches are examined: One-Step DCCA and Two-Step DCCA. These methods aim to address the inherent question of quantifying the significance of each modality in sentiment analysis, especially when multi-modal data offers a richer understanding of human discourse than unimodal data.
Multi-modal Embedding Framework
The proposed framework uses DCCA to build multi-modal embeddings from text, audio, and video inputs. DCCA extends traditional Canonical Correlation Analysis (CCA) by training neural networks to learn non-linear correlations between views; the objective is given below. In One-Step DCCA, audio and video features are concatenated and then correlated with text features using DCCA. In contrast, Two-Step DCCA applies two stages of DCCA to iteratively combine and correlate pairs of modalities. A code sketch of the one-step variant follows the encoding summary below.
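For reference, the standard DCCA objective (Andrew et al., 2013), on which DCCA-based methods like this one build, maximizes the total correlation between the two projected views produced by networks f and g:

```latex
\max_{\theta_1,\theta_2}\ \operatorname{corr}\bigl(f(X_1;\theta_1),\, g(X_2;\theta_2)\bigr)
  = \lVert T \rVert_{\mathrm{tr}} = \operatorname{tr}\!\left(T^{\top} T\right)^{1/2},
\qquad
T = \hat{\Sigma}_{11}^{-1/2}\, \hat{\Sigma}_{12}\, \hat{\Sigma}_{22}^{-1/2}
```

Here the Sigma terms are the regularized within-view covariances and the cross-covariance of the two network outputs, and the trace (nuclear) norm of T equals the sum of the canonical correlations between the views.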
- Text Encoding: Uses pre-trained BERT, yielding a 768-dimensional vector that represents the linguistic content.
- Audio Encoding: Uses COVAREP for feature extraction; the final embedding is the average of the frame-level features, a 74-dimensional vector.
- Video Encoding: Uses methods such as FACET, yielding a 35-dimensional vector representing the video content.
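To make the one-step variant concrete, here is a minimal PyTorch sketch, not the authors' implementation: the network sizes (hidden width 128, 50-dimensional shared space) and the regularization constant are assumptions, and the random tensors stand in for real BERT/COVAREP/FACET features.

```python
import torch
import torch.nn as nn

class ProjectionNet(nn.Module):
    """Small MLP that maps one view into the shared correlation space."""
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
    def forward(self, x):
        return self.net(x)

def cca_loss(h1, h2, eps=1e-4):
    """Negative total correlation between two projected views (the DCCA objective)."""
    m = h1.size(0)
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    s11 = (h1.t() @ h1) / (m - 1) + eps * torch.eye(h1.size(1))
    s22 = (h2.t() @ h2) / (m - 1) + eps * torch.eye(h2.size(1))
    s12 = (h1.t() @ h2) / (m - 1)
    def inv_sqrt(s):
        # Matrix inverse square root via eigendecomposition.
        w, v = torch.linalg.eigh(s)
        return v @ torch.diag(w.clamp_min(eps).rsqrt()) @ v.t()
    # Singular values of T are the canonical correlations between the views.
    t = inv_sqrt(s11) @ s12 @ inv_sqrt(s22)
    return -torch.linalg.matrix_norm(t, ord='nuc')  # maximize their sum

# One-step DCCA: correlate text with the concatenated audio+video view.
text_net = ProjectionNet(768, 50)        # BERT sentence vectors
av_net   = ProjectionNet(74 + 35, 50)    # COVAREP (74-d) + FACET (35-d), concatenated

text = torch.randn(32, 768)              # stand-in batch; real features come from the encoders
audio_video = torch.randn(32, 109)
loss = cca_loss(text_net(text), av_net(audio_video))
loss.backward()
```

In practice the learned projections (or their concatenation with the original features) are then fed to a downstream sentiment classifier, as sketched in the experimental setup below.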
Experimental Setup
Experiments were conducted on three datasets (CMU-MOSI, CMU-MOSEI, and the Debate Emotion dataset), comparing the sentiment classification performance of the proposed DCCA approaches against baselines such as GCCA and the Graph Memory Fusion Network. A sketch of the downstream evaluation follows the dataset list below.
Datasets
- CMU-MOSI: Consists of segmented product review videos annotated for sentiment.
- CMU-MOSEI: Similar to MOSI but larger, with detailed sentiment annotations.
- Debate Emotion: Focuses on aggression detection in the 2016 U.S. presidential debates.
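The paper evaluates embeddings through a downstream classifier and reports accuracy and F-score. A minimal sketch of that protocol, assuming the DCCA embeddings have already been extracted: the random arrays are stand-ins for real features and labels, and logistic regression is one plausible classifier choice, not necessarily the authors'.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Stand-in embeddings: in practice these would be the DCCA projections
# (e.g., text view concatenated with the audio/video view) per utterance.
rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(800, 100)), rng.normal(size=(200, 100))
train_y, test_y = rng.integers(0, 2, 800), rng.integers(0, 2, 200)

clf = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
pred = clf.predict(test_emb)
print("accuracy:", accuracy_score(test_y, pred))
print("f-score :", f1_score(test_y, pred))
```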
Results
The One-Step DCCA outperformed baseline methods, including uni-modal and bi-modal approaches across all datasets. Multi-modal embeddings reliably enhanced classification accuracy, showcasing the strength of correlated feature spaces in capturing sentiment.
- Accuracy and F-Score: The One-Step DCCA consistently achieved higher accuracy and F-scores compared to single or dual modality approaches, demonstrating the benefits of integrating multi-modal data in sentiment analysis tasks.
- Comparison with Baselines: It also outperformed state-of-the-art methods such as bc-LSTM combined with visual and audio encoding techniques.
Two-Step DCCA explored various combinations of input views but did not consistently surpass One-Step DCCA, suggesting that the choice of embedding dimensionality and correlation strategy remains crucial for strong results. A sketch of the two-step pipeline follows.
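Building on the earlier one-step sketch (reusing ProjectionNet, cca_loss, text_net, and the stand-in text tensor), a two-step pipeline might look as follows. The pairing order shown here, audio with video first and then the fused projection with text, is one assumed ordering; the paper explores several combinations.

```python
# Step 1: DCCA between the audio and video views.
audio_net = ProjectionNet(74, 50)
video_net = ProjectionNet(35, 50)
audio, video = torch.randn(32, 74), torch.randn(32, 35)
step1_loss = cca_loss(audio_net(audio), video_net(video))
# ... train audio_net/video_net on step1_loss, then freeze them ...

# Step 2: DCCA between text and the concatenated step-1 projections.
with torch.no_grad():
    av_fused = torch.cat([audio_net(audio), video_net(video)], dim=1)  # 100-d
fuse_net = ProjectionNet(100, 50)
step2_loss = cca_loss(text_net(text), fuse_net(av_fused))
```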
Discussion and Implications
The research highlights the potential of integrating multi-modal data with DCCA to improve sentiment analysis. While text features currently dominate performance, largely because text encoders such as BERT are far more mature, the audio and video views still contribute meaningfully to sentiment understanding when appropriately fused. Future work should improve audio/video feature extraction, optimize integration strategies, and extend evaluation to tasks beyond sentiment analysis.
Conclusions
The paper demonstrates that combining multiple modalities via DCCA enhances sentiment classification, though challenges remain because feature extraction is far more mature for text than for audio and video. Continued research is needed to refine feature extraction and integration in order to unlock the full potential of multi-modal sentiment analysis. This study provides a substantial step towards more comprehensive and accurate sentiment analysis systems built on multi-modal inputs.