Explaining GPT-4's Schema of Depression Using Machine Behavior Analysis

Published 21 Nov 2024 in cs.CL (arXiv:2411.13800v1)

Abstract: Use of LLMs such as ChatGPT (GPT-4) for mental health support has grown rapidly, emerging as a promising route to assess and help people with mood disorders like depression. However, we have a limited understanding of GPT-4's schema of mental disorders, that is, how it internally associates and interprets symptoms. In this work, we leveraged contemporary measurement theory to decode how GPT-4 interrelates depressive symptoms to inform both clinical utility and theoretical understanding. We found GPT-4's assessment of depression: (a) had high overall convergent validity (r = .71 with self-report on 955 samples, and r = .81 with experts' judgments on 209 samples); (b) had moderately high internal consistency (symptom inter-correlations r = .23 to .78) that largely aligned with literature and self-report; except that GPT-4 (c) underemphasized suicidality's relationship with other symptoms and overemphasized psychomotor symptoms'; and (d) had symptom inference patterns that suggest nuanced hypotheses (e.g. sleep and fatigue are influenced by most other symptoms while feelings of worthlessness/guilt are mostly influenced by depressed mood).

Summary

  • The paper examines GPT-4's schema of depression by analyzing its symptom assessments on user texts, comparing results to PHQ-9 scores and expert judgments.
  • Key findings show GPT-4's depression assessments correlate strongly with expert judgments (r = 0.81) and with self-reports (r = 0.71), but it exhibits biases in symptom covariance.
  • The results suggest potential for using LLMs like GPT-4 in mental health screening to address access barriers, emphasizing the need to monitor and refine for biases.

The paper "Explaining GPT-4's Schema of Depression Using Machine Behavior Analysis" presents a detailed examination of how GPT-4 interprets depressive symptoms expressed in human language and assesses its internal conceptual framework, or 'schema', of depression. Using contemporary measurement theory, the researchers aim to uncover how GPT-4 relates and interprets symptoms of depression, providing insight into both its practical utility and its theoretical grounding for mental health diagnostics.

The analysis involved a comprehensive evaluation of GPT-4's ability to estimate depression severity based on user-generated texts. Specifically, the study employed a dataset of 955 individuals who described their depressive experiences in text form and compared these assessments with self-reported PHQ-9 scores and expert judgments. Key findings include:

  1. Convergent Validity: GPT-4 demonstrated high agreement with expert judgments (average r = 0.81) and with self-reported scores (r = 0.71). This indicates that GPT-4 can assess depression with a high degree of accuracy relative to these established measures.
  2. Symptom Correlations: The internal consistency of symptom correlations observed in GPT-4's assessments (r = 0.23 to 0.78) largely mirrors the correlations found in self-reports. However, GPT-4 exhibited distinctive biases, underemphasizing the covariance of suicidality with other symptoms and overemphasizing that of psychomotor symptoms.
  3. Explicit vs. Implicit Symptom Recognition: The model showed higher precision in estimating explicit symptoms mentioned in texts compared to implicit symptoms inferred by the model, highlighting its reliance on direct linguistic cues for accurate assessment.
  4. Language Analysis: The linguistic analysis revealed that GPT-4 successfully identified depressive markers in language that correlate with traditional clinical symptomatology (e.g., affective terms, somatic descriptors).
  5. Schema of Depression: The schema derived from GPT-4’s assessments suggests that it conceptualizes depression as a network of interacting symptoms, although it diverges from traditional self-reports by altering the emphasis on certain symptoms.

These results have important implications for integrating LLMs like GPT-4 into mental health care. They suggest that GPT-4 can effectively be employed as a diagnostic tool in mental health screening, potentially aiding clinicians by automating the initial stages of mental health assessments. This capability can address barriers in current mental health infrastructures, such as clinician shortages, and improve accessibility for individuals seeking mental health support.

Despite this promising utility, the paper acknowledges several limitations in GPT-4's assessments, particularly its variable precision in handling implicit symptom mentions and potential biases in symptom interpretation. These highlight the need for continuous monitoring and refinement of LLM-induced biases, ensuring responsible deployment in clinical environments.

In summary, this study marks a pivotal step toward understanding how LLMs interpret complex mental health conditions, suggesting their potential contribution to mental health diagnostics while emphasizing the need for responsible development and usage. Future research should focus on longitudinal assessments of GPT-4's diagnostic performance across diverse populations and contexts, further refining its schema for more precise and culturally adaptable mental health applications.
