
OpenAI HealthBench Overview

Updated 12 January 2026
  • OpenAI HealthBench is an open-source, physician-curated benchmark that uses multi-turn clinical dialogues and detailed rubrics to assess LLM performance in health settings.
  • It employs a fine-grained, rubric-based scoring system across five behavioral axes, including accuracy, completeness, context awareness, communication quality, and instruction following.
  • The framework features consensus and hard subsets to rigorously evaluate clinical reasoning and safety, guiding the development of reliable, AI-enabled clinical support systems.

OpenAI HealthBench is an open-source, physician-curated benchmark for the comprehensive evaluation of LLMs in medical and health-related settings. Developed to address the limitations of traditional multiple-choice and recall-based medical AI assessments, HealthBench employs realistic, multi-turn dialogues and fine-grained, rubric-based scoring to capture essential competencies such as clinical reasoning, contextual awareness, communication quality, and instruction adherence. The framework is designed to ground progress in the development of reliable, safe, and effective AI-enabled clinical support systems, with a particular focus on high-stakes, open-ended scenarios and on identifying unsolved challenges in model behavior (Arora et al., 13 May 2025, Ravichandran et al., 29 Aug 2025).

1. Benchmark Structure and Scope

HealthBench comprises approximately 5,000 multi-turn conversations between LLMs and users, who may be either laypersons seeking health advice or medical professionals engaging in structured clinical tasks. Each conversation is paired with an example-specific rubric crafted by one or more of 262 participating physicians spanning 60 countries and 26 specialties. The resulting benchmark covers a wide array of health contexts, including but not limited to:

  • Emergency referrals
  • Context-seeking (recognition/solicitation of missing information)
  • Global health (local guideline adaptation, resource constraints)
  • Health data transformation tasks (e.g., SOAP note conversion)
  • Expertise-tailored communication (clinician- vs. patient-facing responses)
  • Uncertainty handling
  • Response depth (simple vs. exhaustive recommendations)

Each rubric consists of 2–48 criteria (median 11 per example), yielding over 48,500 unique criteria in total. Each criterion receives a weight in $[-10, +10]$, permitting both positive reinforcement for correct behaviors and strong penalization of dangerous or inappropriate ones. Rubric items are assigned to one of five behavioral axes: Completeness, Accuracy, Context Awareness, Communication Quality, and Instruction Following (Arora et al., 13 May 2025).
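The structure of a rubric item can be sketched as a small data type. This is a hypothetical representation for illustration, not the published schema; field names are ours:

```python
from dataclasses import dataclass

# The five behavioral axes named in the benchmark.
AXES = ("Completeness", "Accuracy", "Context Awareness",
        "Communication Quality", "Instruction Following")

@dataclass(frozen=True)
class RubricCriterion:
    """One physician-authored rubric item (hypothetical field names)."""
    text: str    # the behavior being checked
    weight: int  # in [-10, +10]; negative weights penalize harmful behavior
    axis: str    # one of AXES

    def __post_init__(self):
        assert -10 <= self.weight <= 10, "weight must lie in [-10, +10]"
        assert self.axis in AXES, f"unknown axis: {self.axis}"
```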

Two key benchmark subsets are provided:

  • Consensus Subset: 3,671 examples containing one or more physician-validated consensus criteria (34 in total).
  • Hard Subset: 1,000 hand-picked, clinically complex examples designed to expose failure modes of state-of-the-art models, particularly for ambiguity, emergency escalation, rare conditions, and global health constraints (Ravichandran et al., 29 Aug 2025).

2. Scoring Methodology and Evaluation Rubric

HealthBench uses a rubric-driven, multi-axis scoring methodology designed to reflect both the depth and breadth of clinically relevant model behavior.

For a given example $i$ with $M_i$ rubric criteria:

  • Each criterion $j$ has a weight $p_{ij} \in [-10, +10]$ and an indicator $r_{ij} \in \{0, 1\}$ denoting whether it is met by the model's response.
  • The raw score is $\sum_{j=1}^{M_i} r_{ij} p_{ij}$.
  • The normalization denominator is $\sum_{j=1}^{M_i} \max(0, p_{ij})$.
  • The per-example score is $s_i = \frac{\sum_{j=1}^{M_i} r_{ij} p_{ij}}{\sum_{j=1}^{M_i} \max(0, p_{ij})}$, clipped to $[0, 1]$.
  • The overall HealthBench score is $S = \mathrm{clip}\left(\frac{1}{N}\sum_{i=1}^{N} s_i,\ 0,\ 1\right)$.

Axis-level and theme-level scores are computed similarly, restricting the aggregation to the relevant subset of criteria.
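The formulas above reduce to a few lines of code. A minimal sketch following them directly (function names are ours, not from the released evaluation codebase):

```python
def example_score(weights, met):
    """Per-example score s_i: weighted sum of met criteria divided by the
    sum of positive weights, clipped to [0, 1]."""
    raw = sum(w * r for w, r in zip(weights, met))
    denom = sum(max(0, w) for w in weights)
    return min(1.0, max(0.0, raw / denom)) if denom else 0.0

def healthbench_score(examples):
    """Overall score S: mean of per-example scores, clipped to [0, 1].
    `examples` is a list of (weights, met-indicators) pairs."""
    scores = [example_score(w, r) for w, r in examples]
    return min(1.0, max(0.0, sum(scores) / len(scores)))
```

Axis- and theme-level scores follow by filtering each example's criteria to the relevant subset before calling `example_score`.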

| Behavioral Axis | Example Attributes Evaluated |
| --- | --- |
| Accuracy | Factual correctness, guideline concordance, uncertainty |
| Completeness | Inclusion of all essential, clinically relevant information |
| Context Awareness | Adaptivity to user role/locale, clarifying info solicitation |
| Communication Quality | Clarity, structure, technical appropriateness |
| Instruction Following | Task format adherence, compliance with user request |

Evaluation is performed by a combination of physician graders and automated grading systems (e.g., GPT-4-based graders), with regular human spot-checks to quantify grader-model agreement. The full grading rubric and evaluation codebase are published under an open license, and all baseline model scores are publicly available (Arora et al., 13 May 2025, Mutisya et al., 31 Jul 2025).

3. Empirical Model Benchmarks

HealthBench scores model performance across its five axes, enabling granular comparison between general-purpose LLMs and specialized agentic medical assistants. As of 2025, leading models and their reported scores on the Hard Subset (N=1,000) are:

| Model | Accuracy | Communication | Instruction Following | Completeness | Context Awareness | Overall Score |
| --- | --- | --- | --- | --- | --- | --- |
| DR.INFO | 0.56 | 0.65 | 0.59 | 0.43 | 0.35 | 0.51 |
| GPT-5 (thinking mode) | n/a | n/a | n/a | n/a | n/a | 0.46 |
| o3 | n/a | n/a | n/a | n/a | n/a | 0.32 |

(n/a: axis-level scores not reported for these models.)

In direct comparisons over a 100-sample slice of HealthBench Hard, DR.INFO also surpasses OpenEvidence and Pathway.md in most axes, with a particularly notable lead in context awareness and completeness. Bootstrapped resampling demonstrated the statistical significance of these leads at the 90% confidence level, with overlapping intervals at 95% and the suggestion that larger-scale comparisons may yield even stronger separation (Ravichandran et al., 29 Aug 2025).
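The bootstrapped resampling behind these confidence claims can be sketched as a generic percentile bootstrap over per-example scores (the paper's exact procedure may differ):

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.10, seed=0):
    """Percentile bootstrap confidence interval for the mean score at
    level 1 - alpha (alpha=0.10 gives a 90% interval)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement n_boot times; sort the resampled means.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Two models are then separated at the 90% level when their `alpha=0.10` intervals do not overlap, matching the pattern reported above (separation at 90%, overlap at 95%).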

Detailed performance analyses reveal that—though recent models have more than doubled HealthBench scores in two years—the gap between average and worst-case performance remains wide; e.g., o3's average 60% score on full HealthBench drops below 40% in worst-at-16 sampling, highlighting persistent reliability and safety gaps (Arora et al., 13 May 2025).

4. Variations, Extensions, and Safety Frameworks

HealthBench introduces several distinct evaluation regimes:

  • Consensus version: Restricts analysis to 34 consensus criteria, producing a high-precision, lower-recall metric focused on robustly validated behavioral attributes.
  • Hard version: Focuses on the most challenging clinical scenarios specifically engineered or identified to stress model failure cases and ambiguity.

Safety assessment is integral: the rubric includes explicit safety-related criteria, such as the requirement for timely emergency escalation and penalization of hallucinated data in data transformation tasks. HealthBench facilitates worst-case reliability estimation (e.g., worst-at-k analysis) to quantify robustness under adversarial or stochastic deployment. The benchmark also supports introspection into cost-performance tradeoffs and model variance through repeated evaluation (Arora et al., 13 May 2025).
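Worst-at-k estimation amounts to: score k sampled responses per prompt, keep the worst, and average across prompts. A minimal sketch under that reading (the paper's sampling protocol may differ in detail):

```python
def worst_at_k(per_prompt_scores, k):
    """Mean over prompts of the worst score among the first k sampled
    responses. `per_prompt_scores` holds >= k scores per prompt."""
    worst = [min(scores[:k]) for scores in per_prompt_scores]
    return sum(worst) / len(worst)
```

Because the per-prompt minimum can only decrease as k grows, worst-at-k is non-increasing in k, which is why worst-at-16 scores fall well below average scores.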

5. Critical Appraisal: Evidence Hierarchy, Biases, and Global Applicability

Recent critical evaluations have exposed limitations in HealthBench’s reliance on physician opinion as the ground truth. The rubric’s judgments typically reflect base-of-pyramid evidence, sometimes elevating single-expert views above high-tier clinical evidence such as systematic reviews and RCTs. Penalties for single rubric violations can outweigh aggregated strong-evidence guidelines, potentially codifying individual bias, regional idiosyncrasies, and limited generalizability (Mutisya et al., 31 Jul 2025).

Coverage for neglected tropical diseases (NTDs) and global guideline diversity remains incomplete: e.g., HIV receives ≈2.8% coverage, while malaria and schistosomiasis remain rare despite high prevalence. Immunization rubrics may misalign with country-specific protocols. A further limitation is the frequent use of single-turn or brief dialogues, constraining evaluation of multi-turn memory, follow-up, or triage escalation skills.

To address these issues, proposals center on integrating version-controlled Clinical Practice Guidelines (CPGs) and GRADE-weighted rubric linkage into HealthBench. The revised scoring framework would anchor criterion weights to formal evidence hierarchies, include context-aware override logic (to account for local constraints or non-standard care), and incorporate delayed patient outcome feedback by matching benchmark dialogues with real-world EHR outcomes. This is expected to enhance equity, mitigate regional bias, and promote global clinical relevance (Mutisya et al., 31 Jul 2025).

6. Data Management, Privacy, and Synthetic Data Integration

Benchmarks in the medical domain must rigorously protect PHI and comply with local and international data protection laws (POPIA, Data Protection Act, etc.). HealthBench's architecture supports integration with synthetic, high-fidelity health datasets generated by platforms such as Health Gym, which uses WGAN-GP with correlation-alignment loss to create synthetic ICU hypotension, sepsis, and HIV datasets. These datasets have passed multiple statistical and privacy validation stages: distributions and correlations closely match original data, and re-identification risk remains well below established thresholds (Kuo et al., 2022).

HealthBench may be deployed atop such datasets for standardized, privacy-preserving RL and LLM benchmarking. Practitioners may use HealthBench dialogue and rubric frameworks while loading synthetic environments into OpenAI Gym-compatible RL agents for clinical policy evaluation, reward construction, and model comparison (Kuo et al., 2022).
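This integration pattern can be illustrated without the gym dependency itself: a minimal environment that follows the classic Gym reset/step protocol and replays a pre-generated synthetic trajectory. The class and trajectory contents here are invented for illustration, not Health Gym data:

```python
class SyntheticVitalsEnv:
    """Replays a synthetic patient trajectory through the classic OpenAI
    Gym interface: reset() -> obs, step(a) -> (obs, reward, done, info).
    A Health Gym dataset would be loaded into `trajectory` the same way."""

    def __init__(self, trajectory):
        # trajectory: list of (observation, reward) pairs, e.g. vital signs
        self.trajectory = trajectory
        self.t = 0

    def reset(self):
        self.t = 0
        return self.trajectory[0][0]

    def step(self, action):
        self.t += 1
        obs, reward = self.trajectory[self.t]
        done = self.t == len(self.trajectory) - 1
        return obs, reward, done, {"action": action}
```

An agent written against the Gym interface can then be evaluated on synthetic hypotension, sepsis, or HIV trajectories without touching real PHI.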

7. Future Directions and Ongoing Development

Planned evolutions of HealthBench emphasize the transition from static, expert-driven rubrics to living, evidence-anchored evaluation frameworks:

  • Incorporation of CPG-to-rubric linkage with versioned identifiers, automated FHIR/CQL transformation, and traceability ledgers
  • Scaling of NTD and global guideline coverage through jurisdiction-specific rubric crosswalks and local clinician authorship
  • Contextual override with algorithmic equity guardrails and audit logs
  • Release automation with continuous integration/delivery for rubric updates, guideline changes, and revision tracking
  • Integration of delayed patient outcome feedback via EHR signals and real-world post-deployment tracking
  • Ongoing transparency via open-source code, published rubric audits, and a public misgrading bug tracker

These developments are intended to maintain the benchmark’s role as an unsaturated, meaningful discriminator of safety and reliability in health LLMs while improving trustworthiness, global equity, and evidence fidelity (Mutisya et al., 31 Jul 2025, Arora et al., 13 May 2025).
