- The paper introduces NOHARM, a benchmark derived from 100 authentic eConsult cases that quantifies harm in LLM-generated medical recommendations.
- The paper's evaluation of 31 LLMs reveals that errors of omission account for 76.6% of severe harm, highlighting critical safety gaps.
- Multi-agent orchestration combining diverse LLMs significantly reduces clinical harm, presenting a scalable strategy for safe clinical decision support.
Toward Explicitly Benchmarking Clinical Safety of LLMs: The NOHARM Framework
Introduction and Motivation
The adoption of LLMs in clinical decision support (CDS) has rapidly outpaced the development of safety and harm-mitigation frameworks. With usage embedded in both routine clinical workflows and direct-to-physician interfaces, there is an urgent need to go beyond knowledge benchmarks and rigorously assess the clinical safety profile of LLMs. The work presented in "First, do NOHARM: towards clinically safe LLMs" (2512.01241) directly addresses the lack of standardized, scalable benchmarks for quantifying the potential harm of AI-generated medical recommendations, focusing on actionable patient outcomes rather than simple correctness.
Benchmark Construction: The NOHARM Dataset
NOHARM (Numerous Options Harm Assessment for Risk in Medicine) is a specialist-curated benchmark derived from 100 real-world primary-care-to-specialist eConsult cases, covering 10 major medical specialties. Unlike prior stylized or synthetic vignettes, NOHARM maintains authentic clinical uncertainty, missing information, and diverse real-world context. The dataset comprises 4,249 unique, medically plausible clinical management options, annotated with a total of 12,747 expert harm and appropriateness ratings from 29 specialists and subspecialists. The harm annotation scheme is based on a modified RAND/UCLA Appropriateness Method harmonized with WHO harm severity definitions, systematically attributing both errors of commission (harmful interventions) and errors of omission (failure to recommend necessary actions).
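The action-level annotation and commission/omission attribution described above can be pictured with a minimal data-model sketch. The class, field names, and binary `necessary` flag are illustrative assumptions for exposition, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ActionRating:
    """One expert rating for a single clinical management option in a case."""
    action_id: str       # identifier for the management option
    necessary: bool      # experts deem this action highly necessary for the case
    harm_if_taken: int   # 0 = none ... 3 = severe (WHO-style severity scale)

def attribute_errors(recommended: set[str],
                     ratings: list[ActionRating]) -> dict:
    """Split a model's errors on one case into commission vs. omission.

    Commission: the model recommended an action rated as harmful.
    Omission: the model failed to recommend a necessary action.
    """
    commission = [r for r in ratings
                  if r.action_id in recommended and r.harm_if_taken > 0]
    omission = [r for r in ratings
                if r.action_id not in recommended and r.necessary]
    return {"commission": commission, "omission": omission}
```

Scoring against explicit per-action ratings, rather than a single gold answer, is what lets omissions be counted at all: an action the model never mentioned still exists in the annotation set.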
This granular, action-level annotation establishes a robust, reproducible platform for safety benchmarking across a wide spectrum of contemporary LLMs, including both open and proprietary models, as well as retrieval-augmented generation (RAG) systems.
Main Findings: LLM Safety Performance and Harm Taxonomy
Evaluation of 31 LLMs on the NOHARM benchmark demonstrates that, although LLMs have achieved strong performance on knowledge and reasoning tasks, they continue to produce severely harmful errors at clinically significant rates. Severe harm occurs in up to 22.2% of cases for the lowest-performing models, the most competitive models yield 11.8–14.6 severe errors per 100 cases, and a "no intervention" baseline performs markedly worse. Crucially, harms of omission (failure to recommend appropriate, often necessary medical actions) account for 76.6% of severe errors, indicating that under-coverage, or excessive output restraint, is the predominant risk factor, contrary to the typical focus on overt commission errors.
Analysis revealed:
- Safety performance is only moderately correlated with standard medical knowledge and AGI benchmarks (e.g., GPQA-Diamond, LMArena); there is substantial unexplained variance, indicating that knowledge proficiency does not serve as a reliable proxy for safety.
- Certain LLMs now outperform generalist, board-certified human physicians (when restricted to non-AI resources) in the NOHARM Safety metric, with a mean performance gap of 9.7% in favor of the best LLMs. LLMs are also superior in case-level completeness (sensitivity for all highly necessary actions).
- Errors of omission are a major class of clinical risk, exceeding commission errors in both incidence and severity across nearly all models. Top models minimize harm predominantly by reducing missed diagnostic or follow-up actions.
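As a rough illustration of how per-100-case severe-error rates and the omission share reported above can be derived from per-case tallies, consider this toy aggregation; the dictionary keys and tally structure are hypothetical, not the paper's scoring code:

```python
def case_level_rates(case_errors: list[dict]) -> dict:
    """Aggregate per-case severe-error tallies into summary rates.

    Each entry is a hypothetical per-case tally:
    {"severe_commission": int, "severe_omission": int}.
    """
    n = len(case_errors)
    total_severe = sum(e["severe_commission"] + e["severe_omission"]
                       for e in case_errors)
    omissions = sum(e["severe_omission"] for e in case_errors)
    return {
        "severe_errors_per_100_cases": 100 * total_severe / n,
        "omission_share_of_severe": (omissions / total_severe
                                     if total_severe else 0.0),
    }
```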
Model- and System-Level Harm Mitigation
The study demonstrates that multi-agent orchestration, in which an "Advisor" LLM's output is reviewed and revised by one or more "Guardian" LLMs, substantially enhances safety. Heterogeneous multi-agent systems, especially those combining open-source, proprietary, and RAG models, significantly reduce clinical harm compared to single-model deployments. The most performant 3-agent system (Meta Llama 4 Scout + Google Gemini 2.5 Pro + AMBOSS LiSA 1.0) achieved an 8.0% Safety improvement over the best solo LLM. Greater diversity at the model and organizational level correlates with improved Safety, Completeness, and Restraint, offering a scalable countermeasure to single-model failure modes.
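The Advisor/Guardian pattern can be sketched minimally as a sequential review chain, assuming each agent is wrapped as a plain prompt-to-text callable; the prompt wording and function signatures here are invented for illustration, not taken from the paper:

```python
from typing import Callable

Agent = Callable[[str], str]  # wraps one LLM call: prompt in, text out

def orchestrate(case: str, advisor: Agent, guardians: list[Agent]) -> str:
    """Advisor drafts recommendations; each Guardian reviews and revises in turn."""
    draft = advisor(f"Recommend management options for this case:\n{case}")
    for guardian in guardians:
        draft = guardian(
            "Review the recommendations below for safety: remove harmful "
            "actions and add any missing necessary ones.\n"
            f"Case:\n{case}\n\nRecommendations:\n{draft}")
    return draft
```

Because each Guardian sees both the case and the current draft, a heterogeneous chain (different base models, different vendors, one RAG-backed agent) can catch both commission errors (by deleting harmful actions) and omission errors (by adding missing necessary ones).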
Additionally, the authors highlight a non-monotonic trade-off between Safety and Restraint: Safety peaks at intermediate levels of output Restraint (precision). Excessively conservative models (e.g., certain OpenAI and Gemini variants) that suppress potentially beneficial recommendations to avoid errors paradoxically incur more severe harm through omitted necessary actions, mirroring known patterns in human clinical error literature.
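The non-monotonic trade-off can be illustrated with a toy harm model in which raising a recommendation threshold (more Restraint) suppresses commission errors while inflating omissions; the functional form and coefficients below are invented purely to show the inverted-U shape, not fitted to the paper's data:

```python
def severe_harm(restraint: float,
                commission_risk: float = 0.3,
                omission_risk: float = 0.7) -> float:
    """Toy model: commission harm falls and omission harm rises with Restraint.

    Coefficients are illustrative only; omission risk is weighted higher,
    echoing the finding that omissions dominate severe errors.
    """
    commission = commission_risk * (1 - restraint) ** 2
    omission = omission_risk * restraint ** 2
    return commission + omission

# Sweep Restraint on a coarse grid: total severe harm is minimized at an
# intermediate level, not at maximal restraint.
grid = [i / 10 for i in range(11)]
best = min(grid, key=severe_harm)
```

Under these assumptions the optimum sits strictly inside (0, 1): a maximally restrained model avoids commission harm but pays more in omitted necessary actions, which is the qualitative pattern the authors report.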
Implications and Future Directions
This work underscores several critical implications for AI safety research and clinical deployment:
- Current evaluation protocols that focus on knowledge or AGI benchmarks are insufficient for capturing real-world patient safety risks. Explicit, patient-level safety metrics are essential for regulatory clearance and deployment auditing.
- Automation bias and the transition toward human-on-the-loop oversight increase the risk that undetected model errors will reach patients, especially when LLM-generated recommendations are generally accurate but occasionally dangerously incomplete.
- Diversity in AI agent composition is a robust harm mitigation strategy, supporting multi-agent frameworks for CDS deployment.
- Designing for intermediate levels of Restraint (rather than maximal specificity/precision) may optimize safety in high-stakes, multi-action clinical domains.
NOHARM provides the foundation for scalable, continuous surveillance of deployed CDS systems, aligning AI oversight with real-world harm reduction rather than abstract accuracy metrics.
Limitations
Benchmark cases are drawn from primary care–to–specialist outpatient consultations and may not generalize to inpatient settings or low-complexity encounters. While actions are treated as largely independent, real clinical logic often entails complex interdependencies. The focus is exclusively on direct patient medical harm, not on financial or broader system-level consequences.
Conclusion
"First, do NOHARM" (2512.01241) advances the field by operationalizing clinical safety as a measurable, reproducible, and actionable evaluation target for LLMs. The work demonstrates that model scale, reasoning capabilities, and knowledge benchmarks fail to account for most safety variance, and that multi-agent orchestration provides meaningful reductions in severe harm rates. The findings emphasize the necessity for explicit safety-centric benchmarks in regulatory, clinical, and research settings, and set the stage for future architecture and governance innovations in safe AI-driven clinical decision support.