
Self-supervised representations in speech-based depression detection

Published 20 May 2023 in cs.CL, cs.SD, and eess.AS (arXiv:2305.12263v2)

Abstract: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). An analysis of SSL representations derived from different layers of pre-trained foundation models is first presented for SDD, which provides insight into suitable indicators for depression detection. Knowledge transfer is then performed from automatic speech recognition (ASR) and emotion recognition to SDD by fine-tuning the foundation models. Results show that using oracle and ASR transcriptions yields similar SDD performance when the hidden representations of the ASR model are incorporated along with the ASR textual information. By integrating representations from multiple foundation models, state-of-the-art SDD results based on real ASR were achieved on the DAIC-WOZ dataset.


Summary

  • The paper introduces a novel SSL-based framework that leverages speech foundation models for effective depression detection.
  • It employs a block-wise analysis of Transformer layers, revealing that mid-layer (8th block) representations yield superior performance with improved F1 scores.
  • Ensemble strategies combining SSL-driven speech features with ASR transcriptions enhance diagnostic accuracy while mitigating data sparsity challenges.

Self-supervised Representations in Speech-based Depression Detection

Introduction

The task of speech-based depression detection (SDD) remains significant given the global prevalence of depression, which affects approximately 280 million individuals worldwide. Current measures for depression detection lack reliable clinical utility. Recognizing the variability in depression manifestations among individuals and the scarcity of training data, the research utilizes foundation models pre-trained with self-supervised learning (SSL) for SDD. The objective is to explore how SSL representations can mitigate data sparsity and facilitate more effective depression detection through successive model fine-tuning and analysis of layer-specific information.

Proposed Model and Methodology

The research advances a binary classification framework leveraging SSL foundation models to classify a speaker's depressive state. The architecture comprises a speech foundation model (e.g., Wav2Vec 2.0, HuBERT, WavLM) followed by a depression detection block. The foundation model, pre-trained on expansive unlabeled datasets, extracts intermediate representations from varied block layers, which are subsequently pooled and passed to the final detection block. This process illustrates how the foundation model's deeper layers encode information pivotal for SDD, particularly word meaning alongside acoustic features.

Figure 1: (a) Model structure. (b) The block-wise analysis framework.
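The detection block described above can be sketched as a time-pooled linear classifier over one block's hidden states. This is a minimal illustration, not the paper's exact head: the function name, the mean-pooling choice, and the 768-dimensional toy features are all assumptions.

```python
import numpy as np

def detect_depression(hidden_states, w, b):
    """Mean-pool one Transformer block's hidden states over time, then
    apply a linear binary classifier (sigmoid > 0.5 -> depressed).
    hidden_states: (T, D) array from a chosen block.
    w: (D,) weights, b: scalar bias (learned in practice)."""
    pooled = hidden_states.mean(axis=0)       # temporal average pooling
    logit = pooled @ w + b
    prob = 1.0 / (1.0 + np.exp(-logit))       # sigmoid
    return prob, int(prob > 0.5)

# Toy usage with random "representations" and untrained (zero) weights.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 768))           # 200 frames, 768-dim states
prob, label = detect_depression(feats, np.zeros(768), 0.0)
print(prob, label)                            # 0.5 0 for zero weights
```

With zero weights the logit is 0 and the sigmoid outputs exactly 0.5, so the sketch defaults to the non-depressed class; a trained `w` and `b` would move the probability away from this boundary.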

The foundation models underpinning this study include 12 block layers of Transformer encoders pre-trained with SSL. Notably, the speech data undergoes sub-dialogue augmentation to increase sample diversity and to balance the numbers of depressed and non-depressed training examples.
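Sub-dialogue augmentation can be sketched as carving overlapping windows out of each interview; the window and hop sizes below are illustrative, not the paper's settings.

```python
def make_sub_dialogues(frames, win, hop):
    """Split one interview's frame sequence into overlapping sub-dialogues.
    frames: sequence of frames; win: sub-dialogue length; hop: stride.
    A smaller hop yields more overlapping samples, which can be used to
    oversample the minority (depressed) class and balance the data."""
    subs = [frames[i:i + win]
            for i in range(0, max(len(frames) - win, 0) + 1, hop)]
    return subs or [frames]   # keep very short interviews as one sample

# 10 frames, windows of 4 with stride 2 -> starts at 0, 2, 4, 6
clips = make_sub_dialogues(list(range(10)), win=4, hop=2)
print(len(clips))             # 4
```

Varying `hop` per class is one simple way to generate more sub-dialogues for the under-represented depressed class than for the majority class.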

Block-wise SSL Representation Analysis

Investigation reveals that different layers within the foundation models capture distinct types of information: the initial blocks focus on acoustic properties, while the middle layers predominantly encode semantic features that are valuable for SDD. Representations from the middle layers yield the best depression detection results; specifically, the 8th block surpasses the other blocks with a noticeable F1 score improvement.


Figure 2: Trends of DAIC-WOZ F1-avg values at different blocks for the foundation models.
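The block-wise analysis amounts to scoring a fixed dev set with a classifier trained on each block's representations and keeping the best-scoring block. The sketch below assumes per-block predictions are already available and uses a hand-rolled binary F1; all names are illustrative.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 label arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_block(per_block_preds, y_true):
    """Score the dev set once per block and return the best block index
    along with the full per-block F1 curve (as plotted in Figure 2)."""
    scores = [f1_score(y_true, p) for p in per_block_preds]
    return int(np.argmax(scores)), scores

# Toy dev set: 12 blocks, where block index 7 (the 8th block) is perfect.
y = np.array([1, 0, 1, 1, 0, 0])
preds = [np.zeros(6, dtype=int)] * 12
preds[7] = y.copy()
idx, scores = best_block(preds, y)
print(idx)    # 7, i.e. the 8th block, mirroring the paper's finding
```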

Subsequent to examining pre-trained models, the research conducts a detailed analysis of models fine-tuned for automatic speech recognition (ASR) and emotion recognition (AER). The findings underscore that fine-tuning aligns model layers closer to task-specific information, boosting SDD performance, particularly when utilizing higher-order linguistic features.
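Rather than committing to a single block, a common way to exploit several layers at once (SUPERB-style probing, not necessarily this paper's exact recipe) is a learnable softmax-weighted sum over blocks, letting the downstream task emphasize the most informative layers:

```python
import numpy as np

def weighted_layer_sum(layer_reps, layer_logits):
    """Combine per-block representations with softmax weights over blocks.
    layer_reps: (L, T, D) stack of hidden states from L blocks.
    layer_logits: (L,) learnable scores (trained jointly downstream)."""
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                              # softmax over L blocks
    return np.tensordot(w, layer_reps, axes=1)   # -> (T, D)

# Block l is filled with the constant l, so uniform weights give the mean.
reps = np.stack([np.full((5, 3), float(l)) for l in range(12)])
combined = weighted_layer_sum(reps, np.zeros(12))
print(combined[0, 0])    # mean of 0..11 = 5.5
```

After fine-tuning for ASR or emotion recognition, the learned weights would be expected to shift toward the blocks carrying the transferred task information.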

ASR Transcriptions and Textual Integration

Textual content, specifically ASR-generated transcriptions, plays a critical role in SDD. Although ASR transcriptions introduce errors, they achieve diagnostic performance comparable to reference transcriptions when combined with SSL-derived hidden layers, attesting to their utility in the absence of pristine textual data.
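One simple way to realize this combination is to concatenate an utterance-level embedding of the ASR transcription with the time-pooled hidden states of the ASR model's encoder; the concatenation scheme and all dimensions below are assumptions for illustration.

```python
import numpy as np

def fuse_text_and_audio(text_emb, audio_hidden):
    """Concatenate an utterance-level ASR-text embedding with the
    time-pooled hidden representation of the ASR model's encoder.
    Keeping encoder states alongside the (possibly errorful)
    transcription is what lets ASR text rival oracle text for SDD."""
    pooled_audio = audio_hidden.mean(axis=0)     # (D_audio,)
    return np.concatenate([text_emb, pooled_audio])

text = np.ones(4)               # hypothetical 4-dim text embedding
audio = np.full((10, 6), 2.0)   # 10 frames of 6-dim encoder states
fused = fuse_text_and_audio(text, audio)
print(fused.shape)              # (10,) = 4 text dims + 6 audio dims
```

The fused vector would then feed the same depression detection block as before, so transcription errors can be compensated by the acoustic evidence retained in the encoder states.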

Model and Ensemble Combinations

The research further investigates ensemble strategies by integrating SSL speech representations with text models, enhancing the robustness of depression prediction. Such ensembles leverage the complementary strengths of individual modalities (audio and text), achieving state-of-the-art F1 scores on benchmark datasets such as DAIC-WOZ. This method establishes a competitive edge by solely relying on speech inputs, obviating the need for meticulously curated transcriptions.
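A straightforward ensemble of this kind averages the per-speaker depression probabilities from the individual systems before thresholding. This is one plausible combination rule, not necessarily the paper's exact fusion scheme:

```python
import numpy as np

def ensemble_probs(prob_lists):
    """Average depression probabilities predicted by several systems
    (e.g. different speech foundation models plus a text model), then
    threshold the mean at 0.5 to obtain binary labels."""
    avg = np.mean(np.asarray(prob_lists), axis=0)
    return avg, (avg > 0.5).astype(int)

# Three systems scoring two test speakers.
p = [[0.9, 0.2],
     [0.7, 0.4],
     [0.8, 0.3]]
avg, labels = ensemble_probs(p)
print(avg, labels)    # [0.8 0.3] [1 0]
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident system outvote two lukewarm ones, which tends to help when the component models err on different speakers.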

Conclusion

The paper underscores the capability of SSL foundation models in speech-based depression detection, with advances driven by layer-wise analysis and task-adaptive fine-tuning. Results indicate that semantically rich middle-layer SSL representations are especially useful for detecting depressive cues. Integrating multiple foundation models covering both speech and text further improves detection accuracy, and the approach remains effective even though ASR errors persist. The study thus lowers practical barriers to detecting depression with automated speech processing systems and points toward deployment across diverse clinical settings.
