GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

Published 13 Jun 2023 in cs.CL, cs.MM, cs.SD, and eess.AS | arXiv:2306.07848v10

Abstract: Contrastive cross-modality pretraining has recently exhibited impressive success in diverse fields, yet there has been limited research on its merits for speech emotion recognition (SER). In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER using pre-trained text and audio encoders. Second, given the significance of gender information in SER, we further propose two novel models that incorporate the gender information of speech signals to form more reasonable objectives: a multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and a soft label based GEmo-CLAP (SL-GEmo-CLAP). Experiments on IEMOCAP show that both proposed GEmo-CLAP models consistently outperform Emo-CLAP across different pre-trained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP achieves the best WAR of 83.16%, surpassing state-of-the-art SER methods.
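The contrastive formulation described in the abstract can be made concrete with a short sketch. The code below is an illustrative approximation, not the authors' implementation: it shows a standard CLAP-style symmetric contrastive loss between audio and text embeddings, plus a hypothetical soft-label variant in which batch pairs sharing an emotion label are treated as positives and pairs that only share gender receive a small mixing weight alpha. All function names, the weight alpha, and the encoder stand-ins are assumptions made for illustration.

```python
# Minimal sketch of a CLAP-style contrastive objective for SER, plus a
# soft-label variant that mixes gender information into the targets.
# Names (audio_emb, text_emb, alpha, ...) are illustrative assumptions,
# not the authors' actual implementation.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)

def soft_label_targets(emotion_ids, gender_ids, alpha=0.1):
    """Soft targets: pairs sharing an emotion are positives; pairs that only
    share gender get a small weight alpha. Rows are renormalized."""
    emo_match = (emotion_ids[:, None] == emotion_ids[None, :]).float()
    gen_match = (gender_ids[:, None] == gender_ids[None, :]).float()
    soft = emo_match + alpha * gen_match * (1.0 - emo_match)
    return soft / soft.sum(dim=-1, keepdim=True)

def soft_label_clap_loss(audio_emb, text_emb, emotion_ids, gender_ids,
                         temperature=0.07, alpha=0.1):
    """Contrastive loss against soft targets instead of the one-hot identity."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature
    targets = soft_label_targets(emotion_ids, gender_ids, alpha)
    loss_a2t = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2a = -(targets.t() * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_a2t + loss_t2a)

if __name__ == "__main__":
    B, D = 8, 256
    audio = torch.randn(B, D)    # stand-in for WavLM-style audio embeddings
    text = torch.randn(B, D)     # stand-in for RoBERTa-style text embeddings
    emotions = torch.randint(0, 4, (B,))
    genders = torch.randint(0, 2, (B,))
    print(clap_contrastive_loss(audio, text).item())
    print(soft_label_clap_loss(audio, text, emotions, genders).item())
```

Under these assumptions, setting alpha to 0 recovers a plain emotion-supervised contrastive objective, while larger values let same-gender pairs share a small amount of probability mass in the target distribution.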
