- The paper demonstrates that top LLMs outperform clinicians in overall accuracy (65.48% vs. 43.57%) while exhibiting systematic biases in diagnosis.
- It employs rigorously vetted Polish first-person narratives and multidimensional semantic analysis to assess diagnostic consistency and reasoning styles.
- The study highlights LLM overconfidence and categorical bias, calling for enhanced AI reliability and ethical risk mitigation in clinical settings.
Evaluation of LLMs in Diagnosing Personality Disorders from First-Person Narratives
Context and Motivation
This study addresses a gap in the evaluation of LLMs for psychiatric diagnosis: most benchmarks utilize short, structured items or informal social media posts, missing the complexity and authenticity of semi-structured, first-person autobiographical accounts that clinicians commonly encounter. The authors conduct the first direct comparison between state-of-the-art LLMs and experienced mental health professionals in diagnosing Borderline Personality Disorder (BPD) and Narcissistic Personality Disorder (NPD), using high-fidelity Polish-language patient testimonies.
The work is motivated by increasing public reliance on LLMs for self-assessment in psychiatry and by the need for trustworthy evaluation standards, especially as diagnoses of personality disorders rely heavily on nuanced, subjective life histories rather than clear clinical markers.
Experimental Design
Data Acquisition and Selection
Patient narratives were sampled from a Polish psychiatric inpatient setting; cases were meticulously vetted for diagnostic clarity, confounding factors were excluded, and levels of impairment and narrative richness were matched within each disorder. From over 200,000 transcribed words, seven cases were selected: three BPD, three NPD, and one healthy control. Each was thoroughly characterized using ICD-10 codes and psychometric scales, ensuring rigorous ground-truth standards.
Subjects and Models
Six human experts (three psychiatrists, three psychotherapists) and 16 contemporary LLMs (including Gemini Pro, Claude Opus, GPT-4/5, Llama, DeepSeek, Gemma, Qwen) participated. The human examiners were blinded and had not been involved in the original diagnoses, and both groups received identical, format-constrained Polish prompts requesting a categorical diagnosis, a severity rating, a certainty assessment, and a textual justification. Each LLM evaluated each case three times to assess reliability.
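The replication protocol is straightforward to sketch. The snippet below is a minimal illustration, not the paper's code: `query_model` is a hypothetical wrapper around whichever chat API serves each model, and the prompt text is a placeholder for the study's format-constrained Polish prompt.

```python
# Sketch of the three-replicate evaluation protocol (illustrative assumptions:
# `query_model` is a hypothetical API wrapper; the real Polish prompt is not
# reproduced here).
from collections import Counter

N_REPLICATES = 3  # each LLM evaluates each case three times

def evaluate_case(query_model, model_name: str, narrative: str) -> dict:
    """Collect replicate diagnoses for one narrative and summarize consistency."""
    prompt = (
        "Oceń poniższą narrację pacjenta...\n"  # placeholder for the study's prompt
        f"{narrative}"
    )
    diagnoses = [query_model(model_name, prompt)["diagnosis"] for _ in range(N_REPLICATES)]
    majority, count = Counter(diagnoses).most_common(1)[0]
    return {
        "model": model_name,
        "replicates": diagnoses,
        "majority_diagnosis": majority,
        "consistent": count == N_REPLICATES,  # all three replicates agree
    }
```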
Metrics
Performance was analyzed through both categorical (binary diagnosis) and dimensional (severity rating) frameworks, reflecting the DSM and ICD traditions. Diagnostic consistency across replicates was a key criterion for LLMs. Justifications were embedded via BAAI/bge-multilingual-gemma2 and analyzed through multidimensional scaling (MDS) and UMAP projections to probe semantic similarity and reasoning style; lexical divergence between justification texts was further statistically quantified using weighted log-odds.
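A minimal sketch of the embedding-and-projection step, assuming the encoder is loadable through sentence-transformers and using scikit-learn's MDS and the umap-learn package; the input texts and projection hyperparameters are illustrative, not the paper's:

```python
# Embed justification texts and project them to 2-D for semantic comparison.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import MDS
import umap  # umap-learn

# Placeholder inputs: in the study these would be all human and model
# justification texts (Polish).
justifications = [
    "Uzasadnienie 1...",
    "Uzasadnienie 2...",
    "Uzasadnienie 3...",
]

encoder = SentenceTransformer("BAAI/bge-multilingual-gemma2")
X = encoder.encode(justifications, normalize_embeddings=True)

# Two complementary 2-D projections of semantic similarity.
mds_coords = MDS(n_components=2).fit_transform(X)
umap_coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(X)
```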
Key Findings
Diagnostic Accuracy and Biases
- Gemini Pro (2.5 and 3), the highest-performing LLMs, outperformed the human average by ~22 percentage points (65.48% vs. 43.57% overall accuracy).
- Both groups detected BPD reliably (F1 ≈ 80–83), with models showing greater categorical sensitivity but reduced precision (the sketch after this list shows how such categorical scores are computed).
- NPD was severely underdiagnosed by all models (F1 = 6.7) versus humans (F1 = 50.0), with LLMs demonstrating reluctance to assign the value-laden label “narcissism” despite higher severity recall. This indicates models can recognize symptomatic severity but avoid stigmatizing categorical terms, plausibly due to RLHF alignment toward agreeableness.
- Models showed a tendency to over-pathologize BPD and misclassify other disorders (notably Avoidant PD), with Gemma and Qwen families producing the most false positives. The GPT models, in particular, exhibited a notable “depathologizing bias,” frequently erroneously labeling cases as healthy.
- Models were more confident and uniform in their assessments; none ever selected the lowest certainty level (“guessing”), while human experts frequently expressed low confidence.
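For concreteness, the accuracy and per-label F1 figures cited above can be computed with scikit-learn as follows; the label arrays are illustrative stand-ins, not the study's data:

```python
# Categorical scoring of diagnoses (illustrative data, not the paper's).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["BPD", "BPD", "NPD", "healthy", "NPD", "BPD"]     # ground truth (illustrative)
y_pred = ["BPD", "BPD", "healthy", "healthy", "BPD", "BPD"]  # model outputs (illustrative)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["BPD", "NPD"], average=None, zero_division=0
)
print(f"accuracy={acc:.2%}, BPD F1={f1[0]:.2f}, NPD F1={f1[1]:.2f}")
```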
Reasoning Styles: Semantic and Lexical Analysis
- Human justifications were concise, cautious, and focused on the patient’s self-states and temporal experience, often conceding that the data were insufficient.
- Model justifications were consistently long, confident, and pattern-oriented, emphasizing formal diagnostic criteria and symptom persistence; their language showed a higher prevalence of severity-related adjectives and a tendency toward pattern inference and, at times, speculative interpretation when data were limited (a sketch of the weighted log-odds computation behind this lexical comparison follows this list).
- Multidimensional semantic embedding analysis showed strong clustering by model family, with poor-performing models (Qwen, Gemma, GPT-4o) being outliers. Llama 3.3 70B's embeddings were highly atypical due to poor Polish grammar, yet its diagnostic accuracy remained high, suggesting a dissociation between generative fluency and discriminative performance in multilingual models.
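The lexical-divergence statistic is, presumably, the weighted log-odds with an informative Dirichlet prior in the style of Monroe et al. (2008); the self-contained sketch below is written under that assumption, with tokenization left out as an illustrative detail:

```python
# Weighted log-odds lexical comparison (Monroe et al. 2008 variant, assumed).
import numpy as np
from collections import Counter

def weighted_log_odds(tokens_a: list[str], tokens_b: list[str], alpha0: float = 100.0):
    """z-scored log-odds ratio of word use in corpus A vs. corpus B."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    vocab = set(ca) | set(cb)
    prior = Counter(tokens_a + tokens_b)   # pooled counts as informative prior
    n_prior = sum(prior.values())
    na, nb = sum(ca.values()), sum(cb.values())
    scores = {}
    for w in vocab:
        aw = alpha0 * prior[w] / n_prior   # per-word prior mass
        la = np.log((ca[w] + aw) / (na + alpha0 - ca[w] - aw))
        lb = np.log((cb[w] + aw) / (nb + alpha0 - cb[w] - aw))
        var = 1.0 / (ca[w] + aw) + 1.0 / (cb[w] + aw)
        scores[w] = (la - lb) / np.sqrt(var)  # z-score of the log-odds difference
    return scores
```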
Theoretical and Practical Implications
Relation to Psychiatric Nosology
Model performance revealed a tension between categorical and dimensional nosologies. While LLMs scored comparably under both frameworks, they showed a preference for categorical assignment, likely echoing the prevalence of traditional labels in web-scale corpora and user queries. The underuse of stigmatized labels (NPD) is a direct consequence of alignment choices in modern LLM training.
Reliability and Safety Risks
LLMs’ overconfidence and lack of uncertainty reporting present critical reliability issues in clinical contexts. Accurate uncertainty communication is essential for responsible deployment in mental health, echoing recent calls for anthropomorphic uncertainty in AI (Ulmer et al., 2025). The mismatch between subjective certainty (higher for severity ratings) and objective performance highlights risks of miscalibration in both humans and machines.
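The miscalibration concern can be made concrete with a simple reliability check: bin responses by self-reported certainty and compare stated confidence against realized accuracy. The arrays below are illustrative assumptions, not the study's data:

```python
# Minimal calibration check: overconfidence gap per certainty bin.
import numpy as np

certainty = np.array([0.9, 0.8, 0.95, 0.6, 0.85, 0.9])  # self-reported (illustrative)
correct = np.array([1, 0, 1, 1, 0, 1])                  # diagnosis correct? (illustrative)

for lo in (0.5, 0.7, 0.9):
    mask = (certainty >= lo) & (certainty < lo + 0.2)
    if mask.any():
        gap = certainty[mask].mean() - correct[mask].mean()
        print(f"certainty [{lo:.1f}, {lo + 0.2:.1f}): overconfidence gap = {gap:+.2f}")
```

A positive gap in a bin means stated confidence exceeds observed accuracy there, which is the overconfidence pattern the paper attributes to the models.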
Reasoning Divergence
Human and machine reasoning diverge fundamentally: the clinician’s approach centers on the patient’s experience, self-aspects, and temporal narrative, with frequent deferral in ambiguous cases. LLMs employ a highly formalized, pattern-based strategy, marked by elaboration and close engagement with diagnostic criteria but little acknowledgment of uncertainty or hesitation. Crucially, poor diagnostic performance correlates with more atypical semantic profiles, and post-hoc justifications may not reliably reflect discriminative capability, particularly for multilingual models.
Future Directions
Further research should address multimodal diagnostic protocols, larger and more culturally diverse datasets, and direct integration of uncertainty quantification. Model adaptation should include explicit strategies for ethical risk minimization, particularly regarding stigma and depathologizing error. Investigation into generative-discriminative gaps in reasoning, especially in non-English LLMs, is urgently warranted.
A promising path is collaborative frameworks integrating clinical judgment and AI analytical consistency, leveraging strengths while mitigating human and model biases.
Conclusion
This study demonstrates that LLMs can competently interpret complex first-person clinical data, even surpassing experienced clinicians in constrained textual diagnostics. However, LLM diagnostic validity is undermined by systematic biases, overconfidence, and a semantic gap in patient-centered reasoning. Translating these findings into practice demands supervised, ethically grounded deployment that harnesses complementary human–AI capabilities for greater diagnostic reliability.