Health-ORSC-Bench: Unified Health AI Benchmark
- Health-ORSC-Bench is a suite of benchmark frameworks that unify operational research and safety-critical evaluations in healthcare, covering LLM safety, OR scheduling, and causal inference.
- It employs challenge-based metrics such as over-refusal and safe completion rates, resource efficiency in scheduling, and kernel-based tests that assess causal estimates under censoring.
- The frameworks facilitate reproducible, domain-specific evaluations that drive methodological advancements and robust algorithm deployment in critical healthcare applications.
Health-ORSC-Bench denotes multiple benchmark frameworks unified by their focus on operational research and safety-critical evaluation in healthcare, with applications spanning LLM safety calibration, resource-constrained operating room scheduling, and causal inference under censoring. These frameworks enable rigorous, domain-specific, and reproducible benchmarking across distinct subfields, filling critical gaps in the evaluation of algorithmic solutions to healthcare challenges.
1. Origins and Conceptual Scope
The term “Health-ORSC-Bench” has been applied to otherwise unrelated but methodologically intensive initiatives:
- In LLM safety, Health-ORSC-Bench (Zhang et al., 25 Jan 2026) formalizes the systematic test of over-refusal and safe completion for health-related dialogue, targeting model calibration at intent boundaries.
- In hospital operations, Health-ORSC-Bench (Dodaro et al., 2021) describes a parameterized suite for evaluating Answer Set Programming (ASP) solutions to Operating Room Scheduling (ORS) with integrated bed management and rescheduling under disruptions.
- In causal inference, Health-ORSC-Bench (Demirel et al., 2024) specifies instance-wise tests for benchmarking observational studies against RCTs in right-censored time-to-event contexts.
In all cases, the purpose is to provide a reproducible, richly annotated testbed for critical healthcare decision settings where stakes are high and robustness beyond average-case performance is required.
2. LLM Safety: Over-Refusal and Safe Completion Calibration
Health-ORSC-Bench (Zhang et al., 25 Jan 2026) addresses limitations of prior LLM safety benchmarks, which mostly focus on binary compliance/refusal with overtly harmful medical prompts (e.g., MedSafetyBench, HealthBench, CARES). This suite introduces the joint measurement of:
- Over-Refusal Rate (ORR): Fraction of benign, borderline prompts for which the model issues a refusal response.
- Safe Completion Rate (SCR): Fraction of benign boundary prompts for which the model provides helpful, high-level guidance, as opposed to refusing or giving unsafe output.
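Both metrics reduce to simple fractions over per-prompt response labels. A minimal sketch, assuming a labeling scheme with the hypothetical tags "refusal", "safe_completion", and "unsafe" (the benchmark's exact label vocabulary may differ):

```python
from collections import Counter

def safety_metrics(labels):
    """Compute over-refusal and safe-completion rates from per-prompt labels.

    Each label classifies the model's response to one benign boundary
    prompt as "refusal", "safe_completion", or "unsafe".
    """
    counts = Counter(labels)
    n = len(labels)
    orr = counts["refusal"] / n          # Over-Refusal Rate
    scr = counts["safe_completion"] / n  # Safe Completion Rate
    return orr, scr

# Toy example: 10 labeled responses to benign boundary prompts
labels = ["refusal"] * 4 + ["safe_completion"] * 5 + ["unsafe"]
orr, scr = safety_metrics(labels)
print(orr, scr)  # 0.4 0.5
```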
Prompts are stratified by ambiguity (Easy, Medium, Hard), reflecting how many state-of-the-art models refuse each. The framework’s pipeline involves collecting and validating harmful seed prompts, generating 31,920 benign boundary prompts across seven categories (biological/chemical harm, drug abuse, health privacy, medical misinformation, mental abuse, self-harm, unqualified medical advice), and filtering them through ensemble moderation. Thirty leading LLMs—including GPT-5, Claude-4, Qwen-3-Next, and domain-tuned models—are evaluated with consistent hyperparameters (temperature 0.0, max_length 4096, no system prompt).
Key findings include “safety-pessimism” in frontier models (e.g., GPT-5 refuses 66.8% of Hard-1K benign prompts; GPT-OSS-120B 81.1%), low ORR but weak safety in domain-tuned models, and an observed trade-off frontier between ORR and SCR that no current model fully resolves. The authors recommend integrating context-aware confidence estimation, output-centric safety optimization, and systematic retraining using Health-ORSC-Bench for better calibration (Zhang et al., 25 Jan 2026).
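The trade-off frontier can be made concrete as a Pareto set over per-model (ORR, SCR) pairs: a model sits on the frontier if no other model achieves both lower over-refusal and higher safe completion. A sketch with hypothetical scores (the model names and numbers below are illustrative, not from the benchmark):

```python
def pareto_frontier(models):
    """Return models not dominated on (lower ORR, higher SCR).

    `models` maps a model name to an (orr, scr) pair. Model A dominates B
    if A's ORR <= B's and A's SCR >= B's, with at least one strict.
    """
    frontier = []
    for name, (orr, scr) in models.items():
        dominated = any(
            (o <= orr and s >= scr) and (o < orr or s > scr)
            for other, (o, s) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Hypothetical scores: cautious, permissive, and middling models all
# survive, because none wins on both axes simultaneously.
scores = {"model_a": (0.65, 0.90), "model_b": (0.10, 0.40), "model_c": (0.30, 0.70)}
print(pareto_frontier(scores))  # ['model_a', 'model_b', 'model_c']
```

That all three hypothetical models land on the frontier mirrors the paper's observation that no current model resolves the ORR–SCR trade-off outright.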
3. Resource-Constrained Operating Room Scheduling Benchmark
In operating room scheduling, Health-ORSC-Bench (Dodaro et al., 2021) is a public benchmark suite designed to evaluate ASP-based (Answer Set Programming) solutions to the high-stakes ORS problem, incorporating detailed resource and patient-flow constraints:
- Instance Structure: Three “5-day” hospital scenarios (A: abundant beds, B: mild shortage, C: extreme shortage), each with multiple specialties and realistic resource allocations.
- Parameterization: Exact specialty OR counts, session durations, patient registration distributions, surgery durations (sampled from specialty-specific normal distributions), priority scores (all P₁ must be scheduled, then maximize P₂/P₃), bed/ICU capacities, and length-of-stay models are drawn to reflect a typical Italian hospital.
- Rescheduling: Dedicated sub-benchmarks (scenarios I–IV) force disruption (postponement of scheduled patients) to evaluate rescheduling logic with constraints to minimize patient drops and day-shift magnitude.
- ASP Encodings: Scheduling and rescheduling problems are encoded with rules covering no-double-booking, surgery/bed requirements, daily capacity enforcement, hard and weak constraints for priority satisfaction.
- Evaluation: Metrics include number of assigned registrations by priority, operating room time efficiency, and bed occupancy efficiency, with observed scalability to 15-day planning horizons and realistic performance under stress conditions.
- Web Front-End: Enables interactive scenario specification, ASP-based solving (via clingo), and real-time visualization of assignments and utilization.
All scripts and data are open for reproducible research (Dodaro et al., 2021).
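The benchmark's actual encodings are ASP rules solved with clingo; as a language-neutral illustration of the constraint types involved (per-session OR capacity, bed occupancy over the length of stay, strict priority ordering), here is a deliberately simplified greedy sketch in Python. All names and figures are illustrative, and greedy assignment is far weaker than the ASP optimization it stands in for:

```python
def schedule(registrations, or_minutes, beds_free):
    """Greedy toy scheduler: place P1 patients first, then P2, then P3.

    registrations: list of (patient_id, priority, surgery_minutes, los_days)
    or_minutes:    {day: available OR minutes} over the planning horizon
    beds_free:     {day: free beds}
    Returns assignments {patient_id: day} and the list of dropped patients.
    """
    remaining, beds = dict(or_minutes), dict(beds_free)
    assignments, dropped = {}, []
    # Priority proxy for the ASP hard/weak constraints: lower number first
    for pid, prio, minutes, los in sorted(registrations, key=lambda r: r[1]):
        placed = False
        for day in sorted(remaining):
            stay = [d for d in range(day, day + los) if d in beds]
            if remaining[day] >= minutes and all(beds[d] > 0 for d in stay):
                remaining[day] -= minutes      # OR session capacity
                for d in stay:
                    beds[d] -= 1               # bed occupied for the stay
                assignments[pid] = day
                placed = True
                break
        if not placed:
            dropped.append(pid)                # analogous to a patient drop
    return assignments, dropped

regs = [("p1", 1, 120, 2), ("p2", 2, 200, 1), ("p3", 1, 100, 3)]
assign, drop = schedule(regs, {1: 240, 2: 240}, {1: 1, 2: 2, 3: 1})
print(assign, drop)  # {'p1': 1, 'p3': 2} ['p2']
```

Here the bed shortage forces p3 to a later day and the lower-priority p2 is dropped, the same failure modes the scenario-B/C instances are designed to stress.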
4. Causal Inference Benchmarking Under Right-Censoring
In causal benchmarking, Health-ORSC-Bench (Demirel et al., 2024) provides a methodology for testing the robustness of causal conclusions drawn from observational studies (OS) versus randomized controlled trials (RCT), specifically targeting survival outcomes subject to right-censoring:
- Estimands and Data Structure: Observed data consist of covariates $X$, treatment $A \in \{0, 1\}$, observed time $\tilde{T} = \min(T, C)$, and censoring indicator $\Delta = \mathbf{1}\{T \le C\}$; the estimand is the conditional average treatment effect (CATE), $\tau(x) = \mathbb{E}[T(1) - T(0) \mid X = x]$.
- CDR Signal: Construction of the censoring-doubly-robust (CDR) signal provides unbiased estimation of arm-specific counterfactual means under independent-censoring and can accommodate both propensity and outcome model misspecification (as long as one is correctly specified).
- Equivalence Test: Uses the maximum-moment-restriction (MMR) statistic with RKHS kernels to test $H_0: \tau_{\mathrm{OS}}(x) = \tau_{\mathrm{RCT}}(x)$ for all $x$, directly benchmarking OS against RCT on the instance level. Bootstrap methods (wild/exchangeable) provide critical values for hypothesis testing.
- Robustness: The framework remains valid under “global” dependent censoring (shared mechanism across studies), leveraging the property that any unidentifiable censoring bias cancels in the OS–RCT contrast.
- Empirics: Extensive evaluation on synthetic data and the Women’s Health Initiative demonstrates controlled Type-I error and strong power under external/internal validity violations. The approach avoids inflated Type-I error seen in naïve IPW or IPCW procedures (Demirel et al., 2024).
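The kernel test above can be sketched end to end: build per-instance moment signals (in the benchmark these would be the OS–RCT contrast of CDR signals), form the MMR statistic with a Gaussian RKHS kernel, and calibrate it with a wild bootstrap. This is a simplified illustration, not the paper's estimator; the bandwidth, bootstrap count, and the plain V-statistic form are assumptions:

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between covariate arrays X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmr_statistic(X, psi, bandwidth=1.0):
    """V-statistic form of the MMR for H0: E[psi | X] = 0.

    psi holds per-instance moment signals, e.g. the difference between
    OS and RCT doubly-robust (CDR) effect signals at matched covariates.
    """
    n = len(psi)
    K = gaussian_kernel(X, X, bandwidth)
    return float(psi @ K @ psi) / n ** 2

def wild_bootstrap_pvalue(X, psi, n_boot=500, bandwidth=1.0, seed=0):
    """Wild bootstrap: perturb signals with Rademacher noise under H0."""
    rng = np.random.default_rng(seed)
    stat = mmr_statistic(X, psi, bandwidth)
    boots = [
        mmr_statistic(X, rng.choice([-1.0, 1.0], size=len(psi)) * psi, bandwidth)
        for _ in range(n_boot)
    ]
    return float(np.mean([b >= stat for b in boots]))

# Under H0 the OS-RCT contrast is centred noise, so the p-value is
# typically large; a systematic gap would drive it toward zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
psi = rng.normal(scale=0.5, size=100)  # stand-in for a CDR-signal contrast
print(wild_bootstrap_pvalue(X, psi))
```

Because the Gaussian kernel matrix is positive semidefinite, the statistic is nonnegative, and the bootstrap supplies the null distribution without needing the unknown censoring mechanism.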
5. Cross-Benchmark Methodologies and Comparative Summary
A unifying aspect of the Health-ORSC-Bench frameworks is meticulous parameterization and scenario stratification to surface model failures and robustness limits:
| Application Area | Benchmark Focus | Core Metrics | Data/Instance Size | Key Reference |
|---|---|---|---|---|
| LLM Safety | Over-refusal, SCR | ORR, SCR, TRR | 31,920 prompts | (Zhang et al., 25 Jan 2026) |
| OR Scheduling (ASP) | Resource-constrained | OR/bed efficiency, P₁–P₃ asg. | 30–250+ instances | (Dodaro et al., 2021) |
| Causal Inference | OS–RCT gap under censoring | MMR test, CDR-signal | Synthetic, WHI datasets | (Demirel et al., 2024) |
Each framework defines precise mathematical or algorithmic structures:
- Explicit doubly-robust estimands and kernel test statistics (causal inference).
- ASP rule-based constraint satisfaction (ORS).
- Automated LLM response labeling and stratified evaluation for ambiguity boundaries (LLM safety).
Open-source code and data release are a common operational principle, ensuring full replicability and extension potential.
6. Significance and Implications
Health-ORSC-Bench frameworks collectively raise the standards for health AI and operational benchmarking. In the LLM safety domain, calibrating over-refusal vs. safe completion addresses key utility–risk tensions inadequately covered by prior work, with fine-grained intent stratification now available. In operations research, ASP-based evaluation supports robust, resource-aware scheduling under complex bed and disruption constraints, reflecting real-world hospital supply chain dynamics. In causal inference, the doubly-robust, kernel-based benchmarking corrects for censoring artifacts, providing principled falsification of observational conclusions. The unified availability of rigorous, domain-tailored benchmarks is positioned to drive methodological advancement, highlight trade-offs in real-world deployment, and enforce accountability in high-impact health AI systems.