Fusion approaches for emotion recognition from speech using acoustic and text-based features

Published 27 Mar 2024 in cs.LG, cs.SD, and eess.AS (arXiv:2403.18635v1)

Abstract: In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimate of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.
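
The abstract outlines three technical ingredients: contextualized text embeddings from BERT, fusion of the text and acoustic representations, and the criterion used to build cross-validation folds. As a rough illustration only, the Python sketch below shows one plausible way to implement each step. The checkpoint name (bert-base-uncased), mean pooling over the last hidden layer, plain concatenation as the fusion strategy, and the use of GroupKFold are all assumptions made for this sketch; the paper compares several fusion approaches, and its exact configuration is not given in the abstract.

import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from sklearn.model_selection import GroupKFold

# Assumption: bert-base-uncased; the abstract does not name the checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def text_embedding(transcription: str) -> np.ndarray:
    # Mean-pool the last BERT layer over tokens -> one 768-dim vector.
    # Pooling strategy and layer choice are illustrative assumptions.
    inputs = tokenizer(transcription, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def fuse(acoustic_feats: np.ndarray, transcription: str) -> np.ndarray:
    # Feature-level fusion by concatenation; the paper evaluates several
    # fusion strategies, and this is only one simple instance.
    return np.concatenate([acoustic_feats, text_embedding(transcription)])

def leakage_free_folds(X, y, groups, n_splits=5):
    # GroupKFold keeps all samples that share a group label in the same
    # fold. Which attribute to group on (speaker, session, script, ...)
    # is exactly the fold-definition criterion whose effect on IEMOCAP
    # results the paper studies.
    return list(GroupKFold(n_splits=n_splits).split(X, y, groups=groups))

For example, fuse(acoustic_vector, "I can't believe you did that") returns a single vector a downstream classifier can consume, and folds from leakage_free_folds give a more conservative performance estimate than folds whose groups leak shared material between train and test.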
