Health-ORSC-Bench: Unified Health AI Benchmark
- Health-ORSC-Bench is a suite of benchmark frameworks that unify operational research and safety-critical evaluations in healthcare, covering LLM safety, OR scheduling, and causal inference.
- It employs challenge-based metrics such as over-refusal and safe completion rates, resource efficiency in scheduling, and kernel-based tests that assess causal estimates under censoring.
- The frameworks facilitate reproducible, domain-specific evaluations that drive methodological advancements and robust algorithm deployment in critical healthcare applications.
Health-ORSC-Bench denotes multiple benchmark frameworks unified by their focus on operational research and safety-critical evaluation in healthcare, with applications spanning LLM safety calibration, resource-constrained operating room scheduling, and causal inference under censoring. These frameworks enable rigorous, domain-specific, and reproducible benchmarking across distinct subfields, filling critical gaps in the evaluation of algorithmic solutions to healthcare challenges.
1. Origins and Conceptual Scope
The term “Health-ORSC-Bench” has been applied to otherwise unrelated but methodologically intensive initiatives:
- In LLM safety, Health-ORSC-Bench (Zhang et al., 25 Jan 2026) formalizes the systematic test of over-refusal and safe completion for health-related dialogue, targeting model calibration at intent boundaries.
- In hospital operations, Health-ORSC-Bench (Dodaro et al., 2021) describes a parameterized suite for evaluating Answer Set Programming (ASP) solutions to Operating Room Scheduling (ORS) with integrated bed management and rescheduling under disruptions.
- In causal inference, Health-ORSC-Bench (Demirel et al., 2024) specifies instance-wise tests for benchmarking observational studies against RCTs in right-censored time-to-event contexts.
In all cases, the purpose is to provide a reproducible, richly annotated testbed for critical healthcare decision settings where stakes are high and robustness beyond average-case performance is required.
2. LLM Safety: Over-Refusal and Safe Completion Calibration
Health-ORSC-Bench (Zhang et al., 25 Jan 2026) addresses limitations of prior LLM safety benchmarks, which mostly focus on binary compliance/refusal with overtly harmful medical prompts (e.g., MedSafetyBench, HealthBench, CARES). This suite introduces the joint measurement of:
- Over-Refusal Rate (ORR): Fraction of benign, borderline prompts for which the model issues a refusal response.
- Safe Completion Rate (SCR): Fraction of benign boundary prompts for which the model provides helpful, high-level guidance, as opposed to refusing or giving unsafe output.
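Both metrics reduce to simple fractions over per-prompt response labels. A minimal sketch, assuming a labeling scheme with the hypothetical tags "refusal", "safe_completion", and "unsafe" (the benchmark's exact label vocabulary may differ):

```python
from collections import Counter

def safety_metrics(labels):
    """Compute over-refusal and safe-completion rates from per-prompt labels.

    Each label classifies the model's response to one benign boundary
    prompt as "refusal", "safe_completion", or "unsafe".
    """
    counts = Counter(labels)
    n = len(labels)
    orr = counts["refusal"] / n          # Over-Refusal Rate
    scr = counts["safe_completion"] / n  # Safe Completion Rate
    return orr, scr

# Toy example: 10 labeled responses to benign boundary prompts
labels = ["refusal"] * 4 + ["safe_completion"] * 5 + ["unsafe"]
orr, scr = safety_metrics(labels)
print(orr, scr)  # 0.4 0.5
```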
Prompts are stratified by ambiguity (Easy, Medium, Hard), reflecting how many state-of-the-art models refuse each. The framework’s pipeline involves collecting and validating harmful seed prompts, generating 31,920 benign boundary prompts across seven categories (biological/chemical harm, drug abuse, health privacy, medical misinformation, mental abuse, self-harm, unqualified medical advice), and filtering them through ensemble moderation. Thirty leading LLMs—including GPT-5, Claude-4, Qwen-3-Next, and domain-tuned models—are evaluated with consistent hyperparameters (temperature 0.0, max_length 4096, no system prompt).
Key findings include “safety-pessimism” in frontier models (e.g., GPT-5 refuses 66.8% of Hard-1K benign prompts; GPT-OSS-120B 81.1%), low ORR but weak safety in domain-tuned models, and an observed trade-off frontier between ORR and SCR that no current model fully resolves. The authors recommend integrating context-aware confidence estimation, output-centric safety optimization, and systematic retraining using Health-ORSC-Bench for better calibration (Zhang et al., 25 Jan 2026).
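The trade-off frontier can be made concrete as a Pareto set over per-model (ORR, SCR) pairs: a model sits on the frontier if no other model achieves both lower over-refusal and higher safe completion. A sketch with hypothetical scores (the model names and numbers below are illustrative, not from the benchmark):

```python
def pareto_frontier(models):
    """Return models not dominated on (lower ORR, higher SCR).

    `models` maps a model name to an (orr, scr) pair. Model A dominates B
    if A's ORR <= B's and A's SCR >= B's, with at least one strict.
    """
    frontier = []
    for name, (orr, scr) in models.items():
        dominated = any(
            (o <= orr and s >= scr) and (o < orr or s > scr)
            for other, (o, s) in models.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Hypothetical scores: cautious, permissive, and middling models all
# survive, because none wins on both axes simultaneously.
scores = {"model_a": (0.65, 0.90), "model_b": (0.10, 0.40), "model_c": (0.30, 0.70)}
print(pareto_frontier(scores))  # ['model_a', 'model_b', 'model_c']
```

That all three hypothetical models land on the frontier mirrors the paper's observation that no current model resolves the ORR–SCR trade-off outright.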
3. Resource-Constrained Operating Room Scheduling Benchmark
In operating room scheduling, Health-ORSC-Bench (Dodaro et al., 2021) is a public benchmark suite designed to evaluate ASP-based (Answer Set Programming) solutions to the high-stakes ORS problem, incorporating detailed resource and patient-flow constraints:
- Instance Structure: Three “5-day” hospital scenarios (A: abundant beds, B: mild shortage, C: extreme shortage), each with multiple specialties and realistic resource allocations.
- Parameterization: Exact specialty OR counts, session durations, patient registration distributions, surgery durations (sampled from specialty-specific normal distributions), priority scores (all P₁ must be scheduled, then maximize P₂/P₃), bed/ICU capacities, and length-of-stay models are drawn to reflect a typical Italian hospital.
- Rescheduling: Dedicated sub-benchmarks (scenarios I–IV) force disruption (postponement of scheduled patients) to evaluate rescheduling logic with constraints to minimize patient drops and day-shift magnitude.
- ASP Encodings: Scheduling and rescheduling problems are encoded with rules covering no-double-booking, surgery/bed requirements, daily capacity enforcement, hard and weak constraints for priority satisfaction.
- Evaluation: Metrics include number of assigned registrations by priority, operating room time efficiency, and bed occupancy efficiency, with observed scalability to 15-day planning horizons and realistic performance under stress conditions.
- Web Front-End: Enables interactive scenario specification, ASP-based solving (via clingo), and real-time visualization of assignments and utilization.
All scripts and data are open for reproducible research (Dodaro et al., 2021).
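The benchmark's actual encodings are ASP rules solved with clingo; as a language-neutral illustration of the constraint types involved (per-session OR capacity, bed occupancy over the length of stay, strict priority ordering), here is a deliberately simplified greedy sketch in Python. All names and figures are illustrative, and greedy assignment is far weaker than the ASP optimization it stands in for:

```python
def schedule(registrations, or_minutes, beds_free):
    """Greedy toy scheduler: place P1 patients first, then P2, then P3.

    registrations: list of (patient_id, priority, surgery_minutes, los_days)
    or_minutes:    {day: available OR minutes} over the planning horizon
    beds_free:     {day: free beds}
    Returns assignments {patient_id: day} and the list of dropped patients.
    """
    remaining, beds = dict(or_minutes), dict(beds_free)
    assignments, dropped = {}, []
    # Priority proxy for the ASP hard/weak constraints: lower number first
    for pid, prio, minutes, los in sorted(registrations, key=lambda r: r[1]):
        placed = False
        for day in sorted(remaining):
            stay = [d for d in range(day, day + los) if d in beds]
            if remaining[day] >= minutes and all(beds[d] > 0 for d in stay):
                remaining[day] -= minutes      # OR session capacity
                for d in stay:
                    beds[d] -= 1               # bed occupied for the stay
                assignments[pid] = day
                placed = True
                break
        if not placed:
            dropped.append(pid)                # analogous to a patient drop
    return assignments, dropped

regs = [("p1", 1, 120, 2), ("p2", 2, 200, 1), ("p3", 1, 100, 3)]
assign, drop = schedule(regs, {1: 240, 2: 240}, {1: 1, 2: 2, 3: 1})
print(assign, drop)  # {'p1': 1, 'p3': 2} ['p2']
```

Here the bed shortage forces p3 to a later day and the lower-priority p2 is dropped, the same failure modes the scenario-B/C instances are designed to stress.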
4. Causal Inference Benchmarking Under Right-Censoring
In causal benchmarking, Health-ORSC-Bench (Demirel et al., 2024) provides a methodology for testing the robustness of causal conclusions drawn from observational studies (OS) versus randomized controlled trials (RCT), specifically targeting survival outcomes subject to right-censoring:
- Estimands and Data Structure: Observed data consist of covariates $X$, treatment $A \in \{0, 1\}$, observed time $\tilde{T} = \min(T, C)$, and censoring indicator $\Delta = \mathbf{1}\{T \le C\}$; the estimand is the conditional average treatment effect (CATE), $\tau(x) = \mathbb{E}[T(1) - T(0) \mid X = x]$.
- CDR Signal: Construction of the censoring-doubly-robust (CDR) signal provides unbiased estimation of arm-specific counterfactual means under independent-censoring and can accommodate both propensity and outcome model misspecification (as long as one is correctly specified).
- Equivalence Test: Uses the maximum-moment-restriction (MMR) statistic with RKHS kernels to test $H_0: \tau_{\mathrm{OS}}(x) = \tau_{\mathrm{RCT}}(x)$ for all $x$, directly benchmarking OS against RCT on the instance level. Bootstrap methods (wild/exchangeable) provide critical values for hypothesis testing.
- Robustness: The framework remains valid under “global” dependent censoring (shared mechanism across studies), leveraging the property that any unidentifiable censoring bias cancels in the OS–RCT contrast.
- Empirics: Extensive evaluation on synthetic data and the Women’s Health Initiative demonstrates controlled Type-I error and strong power under external/internal validity violations. The approach avoids inflated Type-I error seen in naïve IPW or IPCW procedures (Demirel et al., 2024).
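The kernel test above can be sketched end to end: build per-instance moment signals (in the benchmark these would be the OS–RCT contrast of CDR signals), form the MMR statistic with a Gaussian RKHS kernel, and calibrate it with a wild bootstrap. This is a simplified illustration, not the paper's estimator; the bandwidth, bootstrap count, and the plain V-statistic form are assumptions:

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between covariate arrays X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmr_statistic(X, psi, bandwidth=1.0):
    """V-statistic form of the MMR for H0: E[psi | X] = 0.

    psi holds per-instance moment signals, e.g. the difference between
    OS and RCT doubly-robust (CDR) effect signals at matched covariates.
    """
    n = len(psi)
    K = gaussian_kernel(X, X, bandwidth)
    return float(psi @ K @ psi) / n ** 2

def wild_bootstrap_pvalue(X, psi, n_boot=500, bandwidth=1.0, seed=0):
    """Wild bootstrap: perturb signals with Rademacher noise under H0."""
    rng = np.random.default_rng(seed)
    stat = mmr_statistic(X, psi, bandwidth)
    boots = [
        mmr_statistic(X, rng.choice([-1.0, 1.0], size=len(psi)) * psi, bandwidth)
        for _ in range(n_boot)
    ]
    return float(np.mean([b >= stat for b in boots]))

# Under H0 the OS-RCT contrast is centred noise, so the p-value is
# typically large; a systematic gap would drive it toward zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
psi = rng.normal(scale=0.5, size=100)  # stand-in for a CDR-signal contrast
print(wild_bootstrap_pvalue(X, psi))
```

Because the Gaussian kernel matrix is positive semidefinite, the statistic is nonnegative, and the bootstrap supplies the null distribution without needing the unknown censoring mechanism.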
5. Cross-Benchmark Methodologies and Comparative Summary
A unifying aspect of the Health-ORSC-Bench frameworks is meticulous parameterization and scenario stratification to surface model failures and robustness limits:
| Application Area | Benchmark Focus | Core Metrics | Data/Instance Size | Key Reference |
|---|---|---|---|---|
| LLM Safety | Over-refusal, SCR | ORR, SCR, TRR | 31,920 prompts | (Zhang et al., 25 Jan 2026) |
| OR Scheduling (ASP) | Resource-constrained | OR/bed efficiency, P₁–P₃ asg. | 30–250+ instances | (Dodaro et al., 2021) |
| Causal Inference | OS–RCT gap under censoring | MMR test, CDR-signal | Synthetic, WHI datasets | (Demirel et al., 2024) |
Each framework defines precise mathematical or algorithmic structures:
- Explicit doubly-robust estimands and kernel test statistics (causal inference).
- ASP rule-based constraint satisfaction (ORS).
- Automated LLM response labeling and stratified evaluation for ambiguity boundaries (LLM safety).
Open-source code and data release are a common operational principle, ensuring full replicability and extension potential.
6. Significance and Implications
Health-ORSC-Bench frameworks collectively raise the standards for health AI and operational benchmarking. In the LLM safety domain, calibrating over-refusal vs. safe completion addresses key utility–risk tensions inadequately covered by prior work, with fine-grained intent stratification now available. In operations research, ASP-based evaluation supports robust, resource-aware scheduling under complex bed and disruption constraints, reflecting real-world hospital supply chain dynamics. In causal inference, the doubly-robust, kernel-based benchmarking corrects for censoring artifacts, providing principled falsification of observational conclusions. The unified availability of rigorous, domain-tailored benchmarks is positioned to drive methodological advancement, highlight trade-offs in real-world deployment, and enforce accountability in high-impact health AI systems.