- The paper demonstrates that no single LLM excels across all healthcare tasks, highlighting significant trade-offs between factual retrieval and semantic accuracy.
- It compares performance on standardized medical tasks: multiple-choice question answering (MedMCQA), evidence-based yes/no/maybe reasoning (PubMedQA), and clinical note summarization (Asclepius), scored with BLEU, ROUGE-L, and BERTScore.
- Findings advocate a hybrid task-routing approach, suggesting domain-specific models like ChatDoctor for safety-critical applications while using general models for administrative tasks.
Comparative Evaluation of LLMs in Healthcare
Introduction and Motivation
The expansion of LLMs into healthcare promises advanced support for clinical decision-making, patient interaction, and medical documentation, owing to their strengths in text comprehension, summarization, and generation. However, deployment in high-stakes clinical settings exposes inherent risks of inaccuracy, unreliability, and hallucination, underscoring the need for rigorous, task-specific benchmarking. Existing literature has inadequately addressed systematic comparative evaluation of LLMs for healthcare, often focusing on proprietary datasets, single domains, or coarse-grained metrics.
The study systematically benchmarks five LLMs (GPT-4o-Mini, LLaMA-3.1-8B, Grok 3 Mini, Gemini 2.5 Flash Lite, and ChatDoctor) across representative medical NLP tasks to delineate their respective strengths, limitations, and suitability for clinical integration.
Figure 1: LLMs supporting tasks such as clinical documentation, patient triage, literature search, summarization, and patient-facing chatbots in healthcare.
Task Domains, Hallucinations, and Evaluation Protocol
The selected tasks span: (1) medical multiple-choice question answering (MedMCQA), (2) research evidence-based yes/no/maybe reasoning (PubMedQA), and (3) clinical note summarization (Asclepius). This design assesses both discriminative and generative capabilities for factual recall, nuanced reasoning, and semantic text generation. Standardized prompt templates and evaluation metrics were applied across all models to ensure comparability. Medical hallucinations are categorized in the study as factual, logical, or random, aligning with known risk modes for LLMs in clinical settings.
Figure 2: Factual, logical, and random hallucination types with clinical ramifications for model-generated outputs.
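As a concrete illustration of what such a standardized protocol involves, a minimal Python sketch of fixed per-task prompt templates is given below. The wording and field names are assumptions for exposition, not the paper's actual templates.

```python
# Illustrative fixed prompt templates, one per task, so that all five models
# receive identical inputs. Wording and field names are assumptions.
MCQ_TEMPLATE = (
    "You are a medical expert. Answer the multiple-choice question with a "
    "single letter (A-D).\n\n"
    "Question: {question}\nA) {opa}\nB) {opb}\nC) {opc}\nD) {opd}\nAnswer:"
)

PUBMEDQA_TEMPLATE = (
    "Based on the abstract below, answer the research question with exactly "
    "one of: yes, no, maybe.\n\n"
    "Abstract: {context}\nQuestion: {question}\nAnswer:"
)

SUMMARY_TEMPLATE = (
    "Summarize the following clinical note concisely and factually. Do not "
    "add information that is not present in the note.\n\n"
    "Note: {note}\nSummary:"
)

def build_prompt(task: str, example: dict) -> str:
    """Render one fixed template per task so every model sees identical input."""
    templates = {
        "medmcqa": MCQ_TEMPLATE,
        "pubmedqa": PUBMEDQA_TEMPLATE,
        "asclepius": SUMMARY_TEMPLATE,
    }
    return templates[task].format(**example)
```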
Model Selection and Dataset Curation
The study incorporates both domain-specific and general-purpose LLMs:
- ChatDoctor: LLaMA-derivative, fine-tuned with medical corpora and enforced safety alignment
- Grok 3 Mini, Gemini 2.5 Flash Lite, GPT-4o-Mini: Modern, general-purpose architectures of varying parameter efficiency and inference latency
- LLaMA-3.1-8B: Representative, open-source baseline for reproducibility
Tasks are evaluated on robust, open datasets—MedMCQA (183k MCQs), PubMedQA (1k annotated questions), and Asclepius (2k synthetic clinical notes)—subjected to normalization and metadata stripping for consistent downstream analysis.
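A minimal sketch of the kind of normalization and metadata stripping described, assuming Unicode cleanup, removal of bracketed de-identification placeholders, and whitespace collapsing; the authors' actual preprocessing code is not published:

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Sketch of the described normalization and metadata stripping:
    Unicode cleanup, removal of bracketed de-identification placeholders
    (e.g. [NAME], [DATE]), and whitespace collapsing. The specific regexes
    are assumptions; the paper's pipeline is not published."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\[[A-Z][A-Z _-]*\]", " ", text)  # drop [NAME]-style placeholders
    text = re.sub(r"\s+", " ", text)                 # collapse whitespace runs
    return text.strip()

print(normalize_text("Pt  seen on [DATE].\nDx:\u00a0HTN."))  # -> "Pt seen on . Dx: HTN."
```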
Medical Question Answering (MedMCQA, PubMedQA)
In MCQ answering, all models achieved accuracy in the 0.711–0.833 range with high Macro-F1, reflecting reliable knowledge retrieval across broad medical subdomains. For PubMedQA, which demands interpretation of biomedical research and reasoning under uncertainty, accuracy ranged from 0.592 to 0.794 with markedly lower Macro-F1 (0.283–0.379), pointing to weak class balance and poor handling of the uncertain "maybe" class.
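The gap between accuracy and Macro-F1 is mechanical: Macro-F1 averages per-class F1 without weighting by class frequency, so a model that rarely predicts the minority "maybe" class is penalized even when overall accuracy looks respectable. A minimal scikit-learn sketch with toy labels (the actual model outputs are not published):

```python
from sklearn.metrics import accuracy_score, f1_score

def score_classification(y_true: list, y_pred: list) -> dict:
    """Accuracy and Macro-F1 as reported for MedMCQA and PubMedQA."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# Toy labels: a model that never predicts "maybe" keeps decent accuracy
# while Macro-F1 collapses, matching the PubMedQA pattern reported above.
truth = ["yes", "no", "maybe", "yes", "no", "maybe"]
preds = ["yes", "no", "yes", "yes", "no", "no"]
print(score_classification(truth, preds))  # accuracy ~0.67, macro_f1 ~0.53
```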

Figure 3: MCQ accuracy distribution across LLMs highlighting uniformity in baseline recall on MedMCQA.
Figure 4: PubMedQA accuracy, illustrating impaired model balance and sensitivity for nuanced yes/no/maybe responses.
Notably, general-purpose models (Grok, LLaMA) excelled in structured fact retrieval, while the domain-specific ChatDoctor showed greater semantic precision and contextual caution.
Clinical Note Summarization (Asclepius)
Summary generation was evaluated with BLEU, ROUGE-L, and BERTScore, with BLEU spanning 6.89–12.57, ROUGE-L ranging from 0.195 to 0.271, and BERTScore F1 between 0.125 and 0.222. ChatDoctor and Gemini prioritized semantic fidelity and largely avoided unsupported content, in contrast to Grok, which favored lexical overlap at the expense of deeper clinical correctness.
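These three metrics can be reproduced with common open-source libraries (sacrebleu, rouge-score, bert-score), as sketched below; the paper does not document its exact toolchain, so the library and parameter choices here are assumptions. BERTScore F1 values this low are consistent with baseline rescaling rather than raw scores, which typically sit near 0.8.

```python
import sacrebleu                             # corpus BLEU on the 0-100 scale
from rouge_score import rouge_scorer         # Google's ROUGE implementation
from bert_score import score as bert_score   # semantic similarity via BERT

def score_summaries(preds: list, refs: list) -> dict:
    """Corpus BLEU, mean ROUGE-L F1, and mean BERTScore F1 for generated
    summaries. Library and parameter choices are assumptions, not the
    paper's documented toolchain."""
    bleu = sacrebleu.corpus_bleu(preds, [refs]).score
    rl = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(rl.score(r, p)["rougeL"].fmeasure
                  for p, r in zip(preds, refs)) / len(preds)
    # rescale_with_baseline maps BERTScore onto a more discriminative range;
    # the low reported F1 values (0.125-0.222) are consistent with rescaling.
    _, _, f1 = bert_score(preds, refs, lang="en", rescale_with_baseline=True)
    return {"bleu": bleu, "rouge_l": rouge_l, "bertscore_f1": f1.mean().item()}
```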


Figure 5: BLEU score performance for Asclepius clinical summarization, sorted by model.
A multi-dimensional radar/heatmap analysis consolidated results and emphasized task-dependent performance divergence. Grok led factual structured recall, while ChatDoctor dominated clinical summarization and semantic preservation. LLaMA-3.1-8B unexpectedly performed best in PubMedQA’s uncertainty class.
Figure 6: Left: task-normalized heatmap of per-model scores. Right: radar plot indicating distinct LLM capabilities per dimension.
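One plausible reading of "task-normalized" is per-task min-max scaling, which places accuracy, BLEU, and BERTScore on a common [0, 1] range before plotting. The sketch below assumes that interpretation; the paper may scale differently.

```python
import numpy as np

def task_normalize(scores: np.ndarray) -> np.ndarray:
    """Min-max scale each task's column to [0, 1] so metrics on different
    scales (accuracy, BLEU, BERTScore) share one heatmap color range.
    Rows are models, columns are tasks."""
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid divide-by-zero on flat columns
    return (scores - lo) / span
```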
The central claim that emerges is a rejection of the "universal model" hypothesis: no single LLM leads across all healthcare-relevant axes. Instead, pronounced trade-offs exist between generative semantic precision (ChatDoctor) and factual Q&A accuracy (Grok, LLaMA).
Implications, Limitations, and Future Directions
The results indicate a necessary transition toward hybrid or “task-routing” paradigms—allocating queries to appropriate LLMs by required medical skill or safety threshold rather than defaulting to a one-size-fits-all generalist. Evaluation shows BLEU/ROUGE are insufficient for capturing clinical appropriateness; semantic and context-aware metrics (e.g., BERTScore) are essential for evaluating systems intended for high-stakes settings.
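A toy sketch of such a router is shown below; the model assignments follow the paper's qualitative findings, while the routing conditions themselves are illustrative assumptions rather than a proposed production design.

```python
def route_query(task_type: str, safety_critical: bool) -> str:
    """Toy router for the hybrid paradigm: send each query to the model the
    benchmark found strongest for that skill. Conditions and thresholds are
    illustrative assumptions, not a shipped system."""
    if safety_critical or task_type == "clinical_summarization":
        return "ChatDoctor"      # semantic fidelity, cautious language
    if task_type in {"mcq", "factual_qa"}:
        return "Grok-3-Mini"     # strongest structured factual recall
    return "GPT-4o-Mini"         # routine documentation / administrative work
```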
From a practical perspective, the cautious language and explicit uncertainty adopted by domain-specific models bolster arguments for their use in decision-support rather than automation. Conversely, generalist models' efficiency and broad coverage suit non-critical documentation and administrative applications. Adopting LLMs as augmentative rather than replacement technology is necessary to safeguard patient safety and compliance with ethical and legal standards.
This work also acknowledges substantial limitations, including prompt sensitivity effects, the risks of fixed rather than adaptive prompting, and reliance on synthetic or curated datasets that neglect real-world linguistic noise and institutional heterogeneity. There is a clear need for expert-in-the-loop clinical validation, adaptive ensemble methods, and meta-evaluation frameworks tailored to safety-critical medicine.
Conclusion
The comparative study demonstrates the non-universality of LLM effectiveness in healthcare, documenting clear, task-specific trade-offs between domain-tuned and general-purpose models. Domain-adapted models, exemplified by ChatDoctor, provide superior semantic reliability and safety for critical medical documentation, while general-purpose LLMs like Grok and LLaMA perform best in knowledge-driven Q&A. Comprehensive clinical deployment will require multi-dimensional evaluation metrics, hybrid inference strategies, and continuous human oversight to meet the nuanced requirements of medical practice and patient care.
Further advances in dataset realism, prompt engineering, and robust, semantically aligned benchmarking are needed to ensure future LLMs meet the stringent standards of the healthcare domain.