MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Published 26 May 2025 in cs.CL and cs.AI | (2505.23802v2)

Abstract: While LLMs achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open source framework to enable this.

Summary

The paper introduces MedHELM, a framework built on a clinician-validated taxonomy covering 121 tasks in 5 medical categories to assess LLM real-world performance.
It integrates 35 benchmarks (17 existing, 18 newly formulated) and an LLM-jury evaluation that agrees with clinician ratings more closely than ROUGE-L or BERTScore-F1.
Most models score highest in Clinical Note Generation and Patient Communication & Education and lowest in Clinical Decision Support and Administration & Workflow, and Claude 3.5 Sonnet comes close to the top reasoning models at about 40% lower estimated cost.

Holistic Evaluation of LLMs for Medical Tasks

The paper "MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks" addresses the gap between LLMs' strong performance on medical exams and their usefulness in real-world clinical settings. The study introduces MedHELM, an extensible evaluation framework that assesses LLM performance on medical tasks through a clinician-validated taxonomy and a comprehensive benchmark suite, and that adds a cost-performance analysis of the evaluated models.

Framework of MedHELM

The MedHELM framework encompasses three primary components:

Clinician-Validated Taxonomy: The taxonomy was developed with input from 29 clinicians and organizes 121 tasks into 5 categories and 22 subcategories, so that evaluation can cover the range of medical activities. The categories are Clinical Decision Support, Clinical Note Generation, Patient Communication & Education, Medical Research Assistance, and Administration & Workflow.
Figure 1: This figure illustrates: (a) a clinician-validated taxonomy organizing 121 medical tasks into 5 categories and 22 subcategories; (b) a suite of benchmarks that maps existing benchmarks to this taxonomy and introduces new benchmarks for complete coverage; and (c) an evaluation comparing reasoning and non-reasoning LLMs, with model rankings, LLM jury-based evaluation of open-ended benchmarks, and cost-performance analysis.
Benchmark Suite: A total of 35 benchmarks (17 existing and 18 newly formulated) cover every category and subcategory of the taxonomy. Benchmarks are divided into open-ended and closed-ended types and draw on a mix of existing, reformulated, and newly created datasets. The suite is designed to evaluate models on real-world tasks rather than only on exam questions.
Evaluation Methodology: MedHELM uses an LLM-jury to score open-ended benchmarks, meaning that model outputs are rated by other LLMs. The jury's agreement with clinician ratings (ICC = 0.47) exceeded average clinician-clinician agreement (ICC = 0.43) and the automated baselines ROUGE-L (0.36) and BERTScore-F1 (0.44).
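
The agreement with clinician ratings reported in the abstract is an intraclass correlation coefficient (ICC). This summary does not say which ICC variant the authors used or how the jury's ratings are combined, so the following is only a sketch under two assumptions: the jury score is the plain average of the judges' ratings, and agreement is measured with the two-way, consistency, single-rater form ICC(3,1). The function names and the idea of a rating table with one row per response are illustrative, not taken from the paper.

```python
from statistics import mean


def jury_score(judge_ratings):
    """Average the ratings several LLM judges gave to one response (assumed rule)."""
    return mean(judge_ratings)


def icc_3_1(ratings):
    """ICC(3,1): two-way model, consistency, single rater.

    `ratings` is a list of rows, one per rated item; each row holds one
    score per rater, and all rows have the same length.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: items, raters, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
```

To compare a jury with a clinician under these assumptions, each row would hold two values for the same response: the `jury_score` of the judges' ratings and the clinician's rating. Because ICC(3,1) measures consistency, a rater who is always one point more generous than another still yields an ICC of 1.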

Results and Performance Analysis

The evaluation of nine frontier LLMs reveals significant performance variability:

Top Performers: The reasoning models DeepSeek R1 and o3-mini led with win-rates of 66% and 64%, respectively, particularly excelling in Clinical Note Generation and Medical Research Assistance.
Cost-Efficiency: Claude 3.5 Sonnet was notably cost-effective, achieving comparable results at about 40% lower estimated computational cost than the leading models.
Figure 2: Heatmap of normalized scores (0–1) for each model (rows) across 35 benchmarks (columns). Dark green indicates high performance; dark red indicates low performance.
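Category Performance: On a normalized 0-1 scale, most models scored highest in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). The pattern is attributed to the models' strength in free-text generation, while structured reasoning tasks such as Clinical Decision Support remained harder.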
Figure 3: Mean normalized scores (0-1 scale) across the 5 categories for all evaluated models. Darker green represents higher scores.
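
The win-rates quoted above can be read as a head-to-head statistic. The sketch below assumes a HELM-style definition: on each benchmark, a model's win-rate is the fraction of other models it outscores, and the mean win-rate averages that over benchmarks. Counting ties as non-wins is an assumption, and the model names and scores are made up for illustration, not results from the paper.

```python
def mean_win_rate(scores):
    """scores: {model: [score on benchmark 0, score on benchmark 1, ...]}.

    Returns {model: mean fraction of other models it strictly outscores}.
    """
    models = list(scores)
    n_benchmarks = len(next(iter(scores.values())))
    result = {}
    for m in models:
        others = [o for o in models if o != m]
        per_benchmark = []
        for b in range(n_benchmarks):
            # Count head-to-head wins on this benchmark (ties are not wins).
            wins = sum(scores[m][b] > scores[o][b] for o in others)
            per_benchmark.append(wins / len(others))
        result[m] = sum(per_benchmark) / n_benchmarks
    return result


# Hypothetical normalized scores for three models on three benchmarks.
toy_scores = {
    "model_a": [0.9, 0.6, 0.8],
    "model_b": [0.7, 0.7, 0.5],
    "model_c": [0.5, 0.4, 0.6],
}
win_rates = mean_win_rate(toy_scores)
```

Because the statistic is relative, a model's win-rate depends on which other models are in the comparison, so a value such as 66% is meaningful only within the evaluated set.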

Cost-Performance Trade-Off

A critical consideration in MedHELM is the balance between performance and computational cost:

Figure 4: Scatter plot of mean win-rate (y-axis) versus estimated computational cost (x-axis) for each of the 9 models across 35 benchmarks.

Trade-Off Insights: High-performing reasoning models such as DeepSeek R1 incur greater computational expense, whereas Claude 3.5 Sonnet offers a middle ground, achieving comparable results at roughly 40% lower estimated cost.
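
One way to read a cost versus win-rate scatter plot is to look for the models that no other model beats on both axes at once (a Pareto frontier). The summary does not say that the authors compute this, so the following is only an illustrative sketch; the model names, costs, and win-rates are invented placeholders, not figures from the paper.

```python
def pareto_frontier(models):
    """models: {name: (cost, win_rate)}; lower cost and higher win-rate are better.

    Returns the sorted names of models that no other model dominates.
    """
    def dominates(a, b):
        # a is at least as good as b on both axes and strictly better on one.
        return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

    return sorted(
        name
        for name, point in models.items()
        if not any(
            dominates(other, point)
            for other_name, other in models.items()
            if other_name != name
        )
    )


# Hypothetical (cost, win-rate) pairs.
toy_models = {
    "cheap_weak": (10.0, 0.30),
    "mid_strong": (60.0, 0.62),
    "costly_top": (100.0, 0.66),
    "costly_weak": (90.0, 0.40),
}
frontier = pareto_frontier(toy_models)
```

In this toy data, `costly_weak` drops out because `mid_strong` is both cheaper and stronger, while the other three each represent a different trade-off between cost and win-rate.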

Implications and Future Directions

MedHELM's framework highlights the importance of diverse, real-world benchmarks for evaluating medical AI. Because the framework is open source and extensible, it can be updated as LLM capabilities evolve. Future developments could refine evaluation methods, particularly for complex, subjective tasks in Clinical Decision Support and Administration & Workflow.

Conclusion

"MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks" establishes a comprehensive framework for assessing LLMs in a clinical context, emphasizing performance on practical tasks over exam-style benchmarks. It lays a foundation for evaluating LLMs ahead of real-world deployment in healthcare and supports transparent, ongoing refinement of medical task evaluation.