
Self-supervised representations in speech-based depression detection

Published 20 May 2023 in cs.CL, cs.SD, and eess.AS (arXiv:2305.12263v2)

Abstract: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). An analysis of SSL representations derived from different layers of pre-trained foundation models is first presented for SDD, which provides insight into suitable indicators for depression detection. Knowledge transfer is then performed from automatic speech recognition (ASR) and emotion recognition to SDD by fine-tuning the foundation models. Results show that using oracle and ASR transcriptions yields similar SDD performance when the hidden representations of the ASR model are incorporated along with the ASR textual information. By integrating representations from multiple foundation models, state-of-the-art SDD results based on real ASR were achieved on the DAIC-WOZ dataset.


Summary

  • The paper introduces a novel SSL-based framework that leverages speech foundation models for effective depression detection.
  • It employs a block-wise analysis of Transformer layers, revealing that mid-layer (8th block) representations yield superior performance with improved F1 scores.
  • Ensemble strategies combining SSL-driven speech features with ASR transcriptions enhance diagnostic accuracy while mitigating data sparsity challenges.

Self-supervised Representations in Speech-based Depression Detection

Introduction

The task of speech-based depression detection (SDD) remains significant given the global prevalence of depression, which affects approximately 280 million individuals worldwide. Current measures for depression detection lack reliable clinical utility. Recognizing the variability in depression manifestations among individuals and the scarcity of training data, the research utilizes foundation models pre-trained with self-supervised learning (SSL) for SDD. The objective is to explore how SSL representations can mitigate data sparsity and facilitate more effective depression detection through successive model fine-tuning and analysis of layer-specific information.

Proposed Model and Methodology

The research advances a binary classification framework leveraging SSL foundation models to classify a speaker's depressive state. The architecture comprises a speech foundation model (e.g., Wav2Vec 2.0, HuBERT, WavLM) followed by a depression detection block. The foundation model, pre-trained on expansive unlabeled datasets, extracts intermediate representations from varied block layers, which are subsequently pooled and passed to the final detection block. This process illustrates how the foundation model's deeper layers encode information pivotal for SDD, particularly word meaning alongside acoustic features.

Figure 1: (a) Model structure. (b) The block-wise analysis framework.
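The detection block described above can be sketched as a time-pooled linear classifier over one block's hidden states. This is a minimal illustration, not the paper's exact head: the function name, the mean-pooling choice, and the 768-dimensional toy features are all assumptions.

```python
import numpy as np

def detect_depression(hidden_states, w, b):
    """Mean-pool one Transformer block's hidden states over time, then
    apply a linear binary classifier (sigmoid > 0.5 -> depressed).
    hidden_states: (T, D) array from a chosen block.
    w: (D,) weights, b: scalar bias (learned in practice)."""
    pooled = hidden_states.mean(axis=0)       # temporal average pooling
    logit = pooled @ w + b
    prob = 1.0 / (1.0 + np.exp(-logit))       # sigmoid
    return prob, int(prob > 0.5)

# Toy usage with random "representations" and untrained (zero) weights.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 768))           # 200 frames, 768-dim states
prob, label = detect_depression(feats, np.zeros(768), 0.0)
print(prob, label)                            # 0.5 0 for zero weights
```

With zero weights the logit is 0 and the sigmoid outputs exactly 0.5, so the sketch defaults to the non-depressed class; a trained `w` and `b` would move the probability away from this boundary.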

The foundation models underpinning this study include 12 block layers of Transformer encoders pre-trained with SSL. Notably, the speech data undergoes sub-dialogue augmentation to increase sample diversity and to balance the numbers of depressed and non-depressed training examples.
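Sub-dialogue augmentation can be sketched as carving overlapping windows out of each interview; the window and hop sizes below are illustrative, not the paper's settings.

```python
def make_sub_dialogues(frames, win, hop):
    """Split one interview's frame sequence into overlapping sub-dialogues.
    frames: sequence of frames; win: sub-dialogue length; hop: stride.
    A smaller hop yields more overlapping samples, which can be used to
    oversample the minority (depressed) class and balance the data."""
    subs = [frames[i:i + win]
            for i in range(0, max(len(frames) - win, 0) + 1, hop)]
    return subs or [frames]   # keep very short interviews as one sample

# 10 frames, windows of 4 with stride 2 -> starts at 0, 2, 4, 6
clips = make_sub_dialogues(list(range(10)), win=4, hop=2)
print(len(clips))             # 4
```

Varying `hop` per class is one simple way to generate more sub-dialogues for the under-represented depressed class than for the majority class.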

Block-wise SSL Representation Analysis

Investigation reveals that different layers within the foundation models capture distinct types of information: the initial blocks focus on acoustic properties, while the middle layers predominantly encode semantic features that are valuable for SDD. Representations from the middle layers yield the best depression detection results; specifically, the 8th block surpasses the other blocks with a noticeable F1 score improvement.


Figure 2: Trends of DAIC-WOZ F1-avg values at different blocks for the foundation models.
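The block-wise analysis amounts to scoring a fixed dev set with a classifier trained on each block's representations and keeping the best-scoring block. The sketch below assumes per-block predictions are already available and uses a hand-rolled binary F1; all names are illustrative.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 label arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_block(per_block_preds, y_true):
    """Score the dev set once per block and return the best block index
    along with the full per-block F1 curve (as plotted in Figure 2)."""
    scores = [f1_score(y_true, p) for p in per_block_preds]
    return int(np.argmax(scores)), scores

# Toy dev set: 12 blocks, where block index 7 (the 8th block) is perfect.
y = np.array([1, 0, 1, 1, 0, 0])
preds = [np.zeros(6, dtype=int)] * 12
preds[7] = y.copy()
idx, scores = best_block(preds, y)
print(idx)    # 7, i.e. the 8th block, mirroring the paper's finding
```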

Subsequent to examining pre-trained models, the research conducts a detailed analysis of models fine-tuned for automatic speech recognition (ASR) and emotion recognition (AER). The findings underscore that fine-tuning aligns model layers closer to task-specific information, boosting SDD performance, particularly when utilizing higher-order linguistic features.
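Rather than committing to a single block, a common way to exploit several layers at once (SUPERB-style probing, not necessarily this paper's exact recipe) is a learnable softmax-weighted sum over blocks, letting the downstream task emphasize the most informative layers:

```python
import numpy as np

def weighted_layer_sum(layer_reps, layer_logits):
    """Combine per-block representations with softmax weights over blocks.
    layer_reps: (L, T, D) stack of hidden states from L blocks.
    layer_logits: (L,) learnable scores (trained jointly downstream)."""
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()                              # softmax over L blocks
    return np.tensordot(w, layer_reps, axes=1)   # -> (T, D)

# Block l is filled with the constant l, so uniform weights give the mean.
reps = np.stack([np.full((5, 3), float(l)) for l in range(12)])
combined = weighted_layer_sum(reps, np.zeros(12))
print(combined[0, 0])    # mean of 0..11 = 5.5
```

After fine-tuning for ASR or emotion recognition, the learned weights would be expected to shift toward the blocks carrying the transferred task information.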

ASR Transcriptions and Textual Integration

Textual content, specifically ASR-generated transcriptions, plays a critical role in SDD. Although ASR transcriptions introduce errors, they achieve diagnostic performance comparable to reference transcriptions when combined with SSL-derived hidden layers, attesting to their utility in the absence of pristine textual data.
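One simple way to realize this combination is to concatenate an utterance-level embedding of the ASR transcription with the time-pooled hidden states of the ASR model's encoder; the concatenation scheme and all dimensions below are assumptions for illustration.

```python
import numpy as np

def fuse_text_and_audio(text_emb, audio_hidden):
    """Concatenate an utterance-level ASR-text embedding with the
    time-pooled hidden representation of the ASR model's encoder.
    Keeping encoder states alongside the (possibly errorful)
    transcription is what lets ASR text rival oracle text for SDD."""
    pooled_audio = audio_hidden.mean(axis=0)     # (D_audio,)
    return np.concatenate([text_emb, pooled_audio])

text = np.ones(4)               # hypothetical 4-dim text embedding
audio = np.full((10, 6), 2.0)   # 10 frames of 6-dim encoder states
fused = fuse_text_and_audio(text, audio)
print(fused.shape)              # (10,) = 4 text dims + 6 audio dims
```

The fused vector would then feed the same depression detection block as before, so transcription errors can be compensated by the acoustic evidence retained in the encoder states.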

Model and Ensemble Combinations

The research further investigates ensemble strategies by integrating SSL speech representations with text models, enhancing the robustness of depression prediction. Such ensembles leverage the complementary strengths of individual modalities (audio and text), achieving state-of-the-art F1 scores on benchmark datasets such as DAIC-WOZ. This method establishes a competitive edge by solely relying on speech inputs, obviating the need for meticulously curated transcriptions.
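A straightforward ensemble of this kind averages the per-speaker depression probabilities from the individual systems before thresholding. This is one plausible combination rule, not necessarily the paper's exact fusion scheme:

```python
import numpy as np

def ensemble_probs(prob_lists):
    """Average depression probabilities predicted by several systems
    (e.g. different speech foundation models plus a text model), then
    threshold the mean at 0.5 to obtain binary labels."""
    avg = np.mean(np.asarray(prob_lists), axis=0)
    return avg, (avg > 0.5).astype(int)

# Three systems scoring two test speakers.
p = [[0.9, 0.2],
     [0.7, 0.4],
     [0.8, 0.3]]
avg, labels = ensemble_probs(p)
print(avg, labels)    # [0.8 0.3] [1 0]
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident system outvote two lukewarm ones, which tends to help when the component models err on different speakers.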

Conclusion

The paper underscores the capability of SSL foundation models in speech-based depression detection, with advances driven by layer-wise analysis and task-adaptive fine-tuning. Results indicate that semantically rich middle-layer SSL representations are especially useful for detecting depressive cues. Integrating multiple foundation models covering both speech and text further improves detection accuracy, and the approach remains effective even though ASR errors persist. The study thus lowers practical barriers to detecting depression with automated speech processing systems and points toward deployment across diverse clinical settings.
