- The paper demonstrates that top LLMs outperform clinicians in overall accuracy (65.48% vs. 43.57%) while exhibiting systematic biases in diagnosis.
- It employs rigorously vetted Polish first-person narratives and multidimensional semantic analysis to assess diagnostic consistency and reasoning styles.
- The study highlights LLM overconfidence and categorical bias, calling for enhanced AI reliability and ethical risk mitigation in clinical settings.
Evaluation of LLMs in Diagnosing Personality Disorders from First-Person Narratives
Context and Motivation
This study addresses a gap in the evaluation of LLMs for psychiatric diagnosis: most benchmarks utilize short, structured items or informal social media posts, missing the complexity and authenticity of semi-structured, first-person autobiographical accounts that clinicians commonly encounter. The authors conduct the first direct comparison between state-of-the-art LLMs and experienced mental health professionals in diagnosing Borderline Personality Disorder (BPD) and Narcissistic Personality Disorder (NPD), using high-fidelity Polish-language patient testimonies.
The work is motivated by increasing public reliance on LLMs for self-assessment in psychiatry and by the need for trustworthy evaluation standards, especially as diagnoses of personality disorders rely heavily on nuanced, subjective life histories rather than clear clinical markers.
Experimental Design
Data Acquisition and Selection
Patient narratives were sampled from a Polish psychiatric inpatient setting; cases were meticulously vetted for diagnostic clarity, confounding factors were excluded, and levels of impairment and narrative richness were matched within each disorder. From over 200,000 transcribed words, seven cases were selected: three BPD, three NPD, and one healthy control. Each was thoroughly characterized using ICD-10 codes and psychometric scales, ensuring rigorous ground-truth standards.
Subjects and Models
Six human experts (three psychiatrists, three psychotherapists) and 16 contemporary LLMs (including Gemini Pro, Claude Opus, GPT-4/5, Llama, DeepSeek, Gemma, Qwen) participated. The human examiners were blinded and had not been involved in the original diagnoses, and both groups received identical, format-constrained Polish prompts requesting a categorical diagnosis, a severity rating, a certainty assessment, and a textual justification. Each LLM evaluated each case three times to assess reliability.
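The replication protocol is straightforward to sketch. The snippet below is a minimal illustration, not the paper's code: `query_model` is a hypothetical wrapper around whichever chat API serves each model, and the prompt text is a placeholder for the study's format-constrained Polish prompt.

```python
# Sketch of the three-replicate evaluation protocol (illustrative assumptions:
# `query_model` is a hypothetical API wrapper; the real Polish prompt is not
# reproduced here).
from collections import Counter

N_REPLICATES = 3  # each LLM evaluates each case three times

def evaluate_case(query_model, model_name: str, narrative: str) -> dict:
    """Collect replicate diagnoses for one narrative and summarize consistency."""
    prompt = (
        "Oceń poniższą narrację pacjenta...\n"  # placeholder for the study's prompt
        f"{narrative}"
    )
    diagnoses = [query_model(model_name, prompt)["diagnosis"] for _ in range(N_REPLICATES)]
    majority, count = Counter(diagnoses).most_common(1)[0]
    return {
        "model": model_name,
        "replicates": diagnoses,
        "majority_diagnosis": majority,
        "consistent": count == N_REPLICATES,  # all three replicates agree
    }
```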
Metrics
Performance was analyzed through both categorical (binary diagnosis) and dimensional (severity rating) frameworks, reflecting the DSM and ICD traditions. Diagnostic consistency across replicates was a key criterion for LLMs. Justifications were embedded via BAAI/bge-multilingual-gemma2 and analyzed through multidimensional scaling (MDS) and UMAP projections to probe semantic similarity and reasoning style; lexical divergence between justification texts was further statistically quantified using weighted log-odds.
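A minimal sketch of the embedding-and-projection step, assuming the encoder is loadable through sentence-transformers and using scikit-learn's MDS and the umap-learn package; the input texts and projection hyperparameters are illustrative, not the paper's:

```python
# Embed justification texts and project them to 2-D for semantic comparison.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import MDS
import umap  # umap-learn

# Placeholder inputs: in the study these would be all human and model
# justification texts (Polish).
justifications = [
    "Uzasadnienie 1...",
    "Uzasadnienie 2...",
    "Uzasadnienie 3...",
]

encoder = SentenceTransformer("BAAI/bge-multilingual-gemma2")
X = encoder.encode(justifications, normalize_embeddings=True)

# Two complementary 2-D projections of semantic similarity.
mds_coords = MDS(n_components=2).fit_transform(X)
umap_coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
```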
Key Findings
Diagnostic Accuracy and Biases
- Gemini Pro (2.5 and 3), the highest-performing LLMs, outperformed the human average by ~22 percentage points (65.48% vs. 43.57% overall accuracy).
- Both groups detected BPD reliably (F1 ≈ 80–83), with models showing greater categorical sensitivity but reduced precision (the sketch after this list shows how such categorical scores are computed).
- NPD was severely underdiagnosed by all models (F1 = 6.7) versus humans (F1 = 50.0), with LLMs demonstrating reluctance to assign the value-laden label “narcissism” despite higher severity recall. This indicates models can recognize symptomatic severity but avoid stigmatizing categorical terms, plausibly due to RLHF alignment toward agreeableness.
- Models showed a tendency to over-pathologize BPD and misclassify other disorders (notably Avoidant PD), with Gemma and Qwen families producing the most false positives. The GPT models, in particular, exhibited a notable “depathologizing bias,” frequently erroneously labeling cases as healthy.
- Models were more confident and uniform in their assessments; none ever selected the lowest certainty level (“guessing”), while human experts frequently expressed low confidence.
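For concreteness, the accuracy and per-label F1 figures cited above can be computed with scikit-learn as follows; the label arrays are illustrative stand-ins, not the study's data:

```python
# Categorical scoring of diagnoses (illustrative data, not the paper's).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["BPD", "BPD", "NPD", "healthy", "NPD", "BPD"]     # ground truth (illustrative)
y_pred = ["BPD", "BPD", "healthy", "healthy", "BPD", "BPD"]  # model outputs (illustrative)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["BPD", "NPD"], average=None, zero_division=0
)
print(f"accuracy={acc:.2%}, BPD F1={f1[0]:.2f}, NPD F1={f1[1]:.2f}")
```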
Reasoning Styles: Semantic and Lexical Analysis
- Human justifications were concise, cautious, and focused on the patient’s self-states and temporal experience, often conceding that the data were insufficient.
- Model justifications were consistently long, confident, and pattern-oriented, emphasizing formal diagnostic criteria and symptom persistence; their language showed a higher prevalence of severity-related adjectives and a tendency toward pattern inference and, at times, speculative interpretation when data were limited (a sketch of the weighted log-odds computation behind this lexical comparison follows this list).
- Multidimensional semantic embedding analysis showed strong clustering by model family, with poor-performing models (Qwen, Gemma, GPT-4o) being outliers. Llama 3.3 70B's embeddings were highly atypical due to poor Polish grammar, yet its diagnostic accuracy remained high, suggesting a dissociation between generative fluency and discriminative performance in multilingual models.
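The lexical-divergence statistic is, presumably, the weighted log-odds with an informative Dirichlet prior in the style of Monroe et al. (2008); the self-contained sketch below is written under that assumption, with tokenization left out as an illustrative detail:

```python
# Weighted log-odds lexical comparison (Monroe et al. 2008 variant, assumed).
import numpy as np
from collections import Counter

def weighted_log_odds(tokens_a: list[str], tokens_b: list[str], alpha0: float = 100.0):
    """z-scored log-odds ratio of word use in corpus A vs. corpus B."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = set(ca) | set(cb)
    prior = Counter(tokens_a + tokens_b)   # pooled counts as informative prior
    n_prior = sum(prior.values())
    na, nb = sum(ca.values()), sum(cb.values())
    scores = {}
    for w in vocab:
        aw = alpha0 * prior[w] / n_prior   # per-word prior mass
        la = np.log((ca[w] + aw) / (na + alpha0 - ca[w] - aw))
        lb = np.log((cb[w] + aw) / (nb + alpha0 - cb[w] - aw))
        var = 1.0 / (ca[w] + aw) + 1.0 / (cb[w] + aw)
        scores[w] = (la - lb) / np.sqrt(var)  # z-score of the log-odds difference
    return scores
```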
Theoretical and Practical Implications
Relation to Psychiatric Nosology
Model performance revealed a tension between categorical and dimensional nosologies. While LLMs scored comparably under both frameworks, they showed a preference for categorical assignment, likely echoing the prevalence of traditional labels in web-scale corpora and user queries. The underuse of stigmatized labels (NPD) is a direct consequence of alignment choices in modern LLM training.
Reliability and Safety Risks
LLMs’ overconfidence and lack of uncertainty reporting present critical reliability issues in clinical contexts. Accurate uncertainty communication is essential for responsible deployment in mental health, echoing recent calls for anthropomorphic uncertainty in AI (Ulmer et al., 2025). The mismatch between subjective certainty (higher for severity ratings) and objective performance highlights risks of miscalibration in both humans and machines.
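The miscalibration concern can be made concrete with a simple reliability check: bin responses by self-reported certainty and compare stated confidence against realized accuracy. The arrays below are illustrative assumptions, not the study's data:

```python
# Minimal calibration check: overconfidence gap per certainty bin.
import numpy as np

certainty = np.array([0.9, 0.8, 0.95, 0.6, 0.85, 0.9])  # self-reported (illustrative)
correct = np.array([1, 0, 1, 1, 0, 1])                  # diagnosis correct? (illustrative)

for lo in (0.5, 0.7, 0.9):
    mask = (certainty >= lo) & (certainty < lo + 0.2)
    if mask.any():
        gap = certainty[mask].mean() - correct[mask].mean()
        print(f"certainty [{lo:.1f}, {lo + 0.2:.1f}): overconfidence gap = {gap:+.2f}")
```

A positive gap in a bin means stated confidence exceeds observed accuracy there, which is the overconfidence pattern the paper attributes to the models.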
Reasoning Divergence
Human and machine reasoning diverge fundamentally: the clinician’s approach centers on the patient’s experience, self-aspects, and temporal narrative, with frequent deferral in ambiguous cases. LLMs employ a highly formalized, pattern-based strategy, marked by elaboration and close engagement with diagnostic criteria but little acknowledgment of uncertainty or hesitation. Crucially, poor diagnostic performance correlates with more atypical semantic profiles, and post-hoc justifications may not reliably reflect discriminative capability, particularly for multilingual models.
Future Directions
Further research should address multimodal diagnostic protocols, larger and more culturally diverse datasets, and direct integration of uncertainty quantification. Model adaptation should include explicit strategies for ethical risk minimization, particularly regarding stigma and depathologizing error. Investigation into generative-discriminative gaps in reasoning, especially in non-English LLMs, is urgently warranted.
A promising path is collaborative frameworks integrating clinical judgment and AI analytical consistency, leveraging strengths while mitigating human and model biases.
Conclusion
This study demonstrates that LLMs can competently interpret complex first-person clinical data, even surpassing experienced clinicians in constrained textual diagnostics. However, LLM diagnostic validity is undermined by systematic biases, overconfidence, and a semantic gap in patient-centered reasoning. Translating these findings into practice demands supervised, ethically grounded deployment that harnesses complementary human–AI capabilities for greater diagnostic reliability.