Accuracy, justification quality, and human comparability of LLMs on raw first-person clinical narratives

Establish whether large language models (LLMs) can (i) maintain diagnostic accuracy when operating on raw first-person autobiographical testimonies, (ii) provide diagnosis-relevant justifications grounded in those testimonies, and (iii) achieve performance and reasoning comparable to those of mental health professionals.
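For concreteness, below is a minimal sketch of how aims (i) and (ii) might be operationalized. It is illustrative only, not the authors' protocol: the call_llm client, the Testimony record with its gold_label field, and exact-match scoring are all assumptions not specified in the source.

    from dataclasses import dataclass

    @dataclass
    class Testimony:
        text: str        # raw first-person autobiographical narrative
        gold_label: str  # clinician-assigned diagnostic category

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for any LLM API client; replace with a real one.
        raise NotImplementedError

    def diagnostic_accuracy(testimonies: list) -> float:
        # Elicit a diagnosis plus a testimony-grounded justification, then
        # score the diagnosis against the clinician label (aims i and ii).
        correct = 0
        for t in testimonies:
            prompt = (
                "Read the following first-person account. On the first line, "
                "state the most likely diagnostic category; then justify it "
                "by quoting the account.\n\nAccount:\n" + t.text
            )
            lines = call_llm(prompt).strip().splitlines()
            predicted = lines[0].strip().lower() if lines else ""
            correct += int(predicted == t.gold_label.lower())
        return correct / len(testimonies)

Aim (iii) would then compare this accuracy, along with ratings of the generated justifications, against mental health professionals assessing the same testimonies.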

Background

Prior evaluations of LLMs in mental health often use binary or multiple-choice formats and do not examine whether model justifications align with clinical reasoning. These setups differ from real-world psychiatric assessments, which rely on patients’ narrative identities and lived experiences.

The authors identify a research gap concerning LLM performance on raw first-person testimonies, the diagnostic relevance of generated justifications, and the comparability of model reasoning and outcomes to those of practicing clinicians.

References

This discrepancy highlights a critical research gap: it remains unknown whether LLMs can maintain diagnostic accuracy on raw first-person testimonies, whether they can provide diagnosis-relevant justifications, and whether their performance and reasoning are comparable to those of mental health professionals.