Self-Supervised Embeddings for Detecting Individual Symptoms of Depression

Published 25 Jun 2024 in cs.SD, cs.LG, and eess.AS | (2406.17229v1)

Abstract: Depression, a prevalent mental health disorder impacting millions globally, demands reliable assessment systems. Unlike previous studies that focus solely on either detecting depression or predicting its severity, our work identifies individual symptoms of depression while also predicting its severity using speech input. We leverage self-supervised learning (SSL)-based speech models to better utilize the small-sized datasets that are frequently encountered in this task. Our study demonstrates notable performance improvements by utilizing SSL embeddings compared to conventional speech features. We compare various types of SSL pretrained models to elucidate the type of speech information (semantic, speaker, or prosodic) that contributes the most in identifying different symptoms. Additionally, we evaluate the impact of combining multiple SSL embeddings on performance. Furthermore, we show the significance of multi-task learning for identifying depressive symptoms effectively.

Summary

The paper demonstrates that embeddings from frozen self-supervised speech models can be used to identify individual symptoms of depression from speech.
It shows that combining semantic, speaker, and prosodic information significantly outperforms traditional speech features in symptom detection.
The research highlights the benefit of multi-task learning in reducing computational loads while maintaining high detection accuracy.

Introduction

The paper "Self-Supervised Embeddings for Detecting Individual Symptoms of Depression" (2406.17229) addresses the significant challenge of accurately assessing depression through speech input. It focuses on identifying individual depressive symptoms and predicting overall depression severity using self-supervised learning (SSL)-based speech models. This research is pivotal given the global prevalence of depression and the obstacles in detecting and treating it due to the scarcity of trained clinicians. The study advocates for a clinically oriented approach to automate symptom detection, which could enhance monitoring and treatment outcomes.

SSL Models and Their Utility

This research uses a range of SSL speech models pretrained with different objective functions. The models analyzed include HuBERT, WavLM, BEATS, ContentVec, RDINO, AudioMAE, and BYOL-Audio, each encoding a different mix of semantic, speaker, and prosodic information (Figure 1). Because the pretraining objective determines which of these information types an embedding captures, comparing the models indicates which kind of speech information contributes most to identifying each depressive symptom.

Figure 1: Single SSL model.

The models are evaluated for their capacity to identify individual symptoms outlined in the Montgomery and Åsberg Depression Rating Scale (MADRS), a standardized clinical assessment tool for depression severity. By freezing the weights of pretrained models and using them to extract embeddings, the approach efficiently addresses the challenge of limited labeled data, a common issue in medical domains.
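The frozen-encoder setup described above can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the fixed random projection `FROZEN_W` merely stands in for a pretrained SSL model, mean pooling and a logistic-regression head are common choices rather than the paper's confirmed design, and the data is synthetic.

```python
import math
import random

random.seed(0)

FRAME_DIM, EMB_DIM = 4, 3  # hypothetical sizes, chosen only for illustration

# Stand-in for a pretrained SSL encoder: a fixed projection that is never updated.
FROZEN_W = [[random.uniform(-1, 1) for _ in range(FRAME_DIM)] for _ in range(EMB_DIM)]

def frozen_encode(frames):
    """Map each input frame to an embedding using the frozen weights."""
    return [[math.tanh(sum(w * x for w, x in zip(row, frame))) for row in FROZEN_W]
            for frame in frames]

def mean_pool(frame_embeddings):
    """Average frame-level embeddings into one utterance-level vector."""
    n = len(frame_embeddings)
    return [sum(col) / n for col in zip(*frame_embeddings)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head; only these weights receive updates."""
    w, b = [0.0] * EMB_DIM, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Synthetic "utterances": 6 recordings of 5 frames each, with made-up binary labels.
utterances = [[[random.uniform(-1, 1) for _ in range(FRAME_DIM)] for _ in range(5)]
              for _ in range(6)]
labels = [0, 1, 0, 1, 0, 1]

snapshot = [row[:] for row in FROZEN_W]          # copy to verify the encoder stays frozen
features = [mean_pool(frozen_encode(u)) for u in utterances]
head_w, head_b = train_head(features, labels)
```

Because the encoder is used only for feature extraction, the number of trainable parameters is limited to the small head, which is what makes this setup attractive when labeled data is scarce.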

Performance Evaluation

The paper's single-task results table compares these SSL-based models against conventional speech features such as spectrograms, COVAREP, and eGeMAPS. SSL models, particularly those encoding a combination of semantic, speaker, and prosodic information, significantly outperformed these conventional features. Each type of encoded information contributed differently: semantic information proved useful for identifying sadness and concentration difficulties, while speaker and prosodic information were critical for symptoms such as pessimistic thoughts and suicidal tendencies.
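One simple way to combine information from several SSL models is to pool each model's frame-level output and concatenate the resulting vectors. The sketch below assumes concatenation as the fusion method, which this summary does not confirm; the dictionary keys, the `ORDER` tuple, and the toy values are all hypothetical.

```python
def mean_pool(frame_embeddings):
    """Average frame-level embeddings into one utterance-level vector."""
    n = len(frame_embeddings)
    return [sum(col) / n for col in zip(*frame_embeddings)]

# Fixed concatenation order so that feature positions are stable across utterances.
ORDER = ("semantic", "speaker", "prosodic")

def combine(embeddings_by_type):
    """Concatenate the pooled embedding of each model in a fixed order."""
    combined = []
    for name in ORDER:
        combined.extend(mean_pool(embeddings_by_type[name]))
    return combined

# Toy frame-level outputs; the models may differ in dimension and frame count.
example = {
    "semantic": [[1.0, 3.0], [3.0, 5.0]],
    "speaker": [[0.5], [1.5]],
    "prosodic": [[2.0, 2.0, 2.0]],
}
fused = combine(example)
```

The fused vector has one slot per dimension of every contributing model, so a downstream classifier can weight each information type separately.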

The study also contrasts single-task and multi-task learning strategies, discovering that the latter offers efficient training while maintaining or improving detection performance across symptoms. Multi-task learning reduces computational demands by training models with shared parameters across tasks related to different symptoms.
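The saving from parameter sharing can be made concrete with a back-of-the-envelope count. The sketch below assumes a hypothetical classifier with one hidden layer on top of a 768-dimensional embedding and one binary output per symptom; the layer sizes are illustrative, and treating all ten MADRS items as separate tasks is an assumption rather than a detail taken from this summary.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

EMB_DIM = 768       # hypothetical embedding size
HIDDEN = 128        # hypothetical hidden-layer width
NUM_SYMPTOMS = 10   # the MADRS has ten items; one task per item is assumed

trunk = dense_params(EMB_DIM, HIDDEN)   # hidden layer
head = dense_params(HIDDEN, 1)          # one binary output

# Single-task: every symptom gets its own trunk and its own head.
single_task_total = NUM_SYMPTOMS * (trunk + head)

# Multi-task: one shared trunk, plus a lightweight head per symptom.
multi_task_total = trunk + NUM_SYMPTOMS * head

print(single_task_total, multi_task_total)  # 985610 99722
```

Under these assumptions the shared model trains roughly a tenth of the parameters, and a single forward pass through the trunk serves all symptom heads.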

Figure 2: Distribution of samples between class-0 (symptom absent) and class-1 (symptom present) for each symptom in the MADRS. Symptom abbreviations are listed in the paper's table of MADRS symptoms.

Implications and Future Directions

This work not only exemplifies the successful application of SSL models in detecting depressive symptoms but also guides subsequent research in incorporating diverse speech representations for enhanced performance. The findings advocate for the development of multi-faceted models that integrate semantic, speaker, and prosodic information to improve the accuracy of mental health assessments.

The implications of this research are substantial for both clinical applications and future AI developments. Automating symptom detection through SSL models could facilitate early diagnosis and intervention in mental health care, potentially alleviating the burden on healthcare systems. Future research could explore the integration of these models in real-world settings and extend their capabilities to other languages and cultural contexts.

Conclusion

In conclusion, the study demonstrates that SSL-based speech models are a potent tool for detecting individual symptoms of depression. The integration of semantic, speaker, and prosodic information is vital for effective symptom identification. This research paves the way for future advancements in automated mental health assessments, emphasizing the potential for non-invasive, scalable diagnostic tools in clinical practice.
