- The paper’s main contribution is HazardArena, a benchmark that isolates semantic risk through matched safe/unsafe twin scenarios.
- It introduces stage-wise metrics (attempt, commit, success) to reveal underlying hazardous progression despite successful task completion.
- A training-free Safety Option Layer demonstrates effective semantic gating, highlighting the need for integrated risk reasoning in VLA systems.
Semantic Safety Evaluation in Vision-Language-Action Models with HazardArena
Introduction
Vision-Language-Action (VLA) models have become central to the development of generalist robotic agents capable of interpreting visual input, understanding natural language instructions, and autonomously generating low-level control actions. Recent scaling trends have yielded VLA systems with robust manipulation and task execution abilities across heterogeneous domains, achieved via large-scale trajectory learning and pre-trained vision-language backbones. However, standard evaluation protocols overwhelmingly measure action success, neglecting the subtler but critical dimension of semantic safety—whether the policy internalizes which actions are contextually impermissible beyond mere physical capability. The paper "HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models" (2604.12447) addresses this deficiency by introducing HazardArena, a controlled evaluation benchmark that explicitly tests VLA models' semantic risk awareness and proposes a lightweight intervention for inference-time safety enforcement.
HazardArena: Design and Methodology
HazardArena is motivated by the challenge of distinguishing genuine, semantics-driven action refusal from simple motor incapability in VLA models. Conventional benchmarks fail to decouple these failure modes, occasionally labeling prudent refusal as inability and missing cases where technically competent policies undertake hazardous actions due solely to lack of semantic grounding. HazardArena leverages "safe/unsafe twin" scenario construction to resolve this ambiguity: for each hazardous setting, a matched safe counterpart is generated wherein the physical environment, object placements, and action requirements are held constant while the semantic permissibility of the task is varied. This design isolates the effect of semantic context on agent behavior.
Figure 1: HazardArena's architecture generates seed-matched safe/unsafe twins to evaluate semantic safety independently of physical interaction feasibility.
HazardArena includes over 2,000 assets and 40 risk-sensitive tasks categorically grounded in the ISO 13482:2014 standard and real-world robotics scenarios. Seven risk categories—food safety, property safety, chemical, privacy, fire, personal (human-centered), and electrical hazards—are defined, each with category-aligned assets. Scenarios are composed using structured skill templates (e.g., Pick-Place, Pour, Insert) where only semantic bindings (actor/target pairs, instruction text) distinguish safe from unsafe twins. This enables capability-aware, context-sensitive safety assessment without confounding physical feasibility.
Figure 2: Matched safe and unsafe twins for an Insert-Outlet task, holding motor requirements constant while altering only risk-critical semantic factors.
Evaluation Metrics and Experimental Protocol
Standard endpoint success rates are inadequate for diagnosing semantic risk. To address this, the paper introduces stage-wise evaluation metrics—attempt, commit, and success—which quantify, respectively, initial task engagement, irreversible hazardous progression (pre-irreversible physical effect, or pre-IPE), and final task completion. These metrics are automatically extracted using simulator-detectable predicates for each scenario, allowing for fine-grained attribution of unsafe behaviors and revealing cases where policies progress deep into hazardous configurations without observing terminal success.
Models evaluated include OpenVLA-OFT, π0​, NORA, and VLA-Adapter, all fine-tuned exclusively on safe demonstrations. Performance is measured across matched safe/unsafe twins, with particular attention to the correlation between increased safe-task capability and hazard completion in unsafe twins. Results indicate that safe-only fine-tuning improves manipulation proficiency across agents, but simultaneously amplifies the propensity to execute semantically unsafe actions in unsafe contexts—a bold claim that proficiency alone does not induce contextual safety-awareness absent explicit negative sampling or risk supervision.



Figure 3: Stage-wise event rates (attempt, commit, success) for π0​-ft-20k reveal high hazardous progression in unsafe twins despite only modest differences in success rates.
Safety Option Layer: Training-Free Semantic Gating
HazardArena further evaluates the efficacy of a Safety Option Layer (SOL), a training-free, inference-time safety filter that intercepts policy actions before execution. SOL offers two modes:
- SOL-L1: Attribute-constraint gating, using transparent, hand-coded rules over object-action attributes to block impermissible interactions (e.g., liquid source × electrical device × pour).
- SOL-L2: External vision-LLM (VLM) judge, prompted on current context and action proposals to issue binary safety decisions with risk scores.
SOL is applied prior to reaching critical (commit) configurations, enforcing semantic constraints without impacting benign action execution. Empirical results show SOL-L1 achieves near-optimal hazard suppression with negligible performance loss on safe twins, while SOL-L2's effectiveness is category-dependent—robust for physical risks like electrical or privacy hazards, but less reliable for value-based or commonsense safety violations (e.g., property loss or food contamination).

Figure 4: Application of SOL to π0​ on safe and unsafe twins demonstrates L1's high precision and L2's increased but category-dependent hazard suppression.
Implications, Limitations, and Future Directions
The findings of HazardArena establish that current VLA systems, when trained solely for task mastery, tend to overgeneralize action templates and inadequately encode semantic refusal. Endpoint success rates on hazardous twins can substantially underestimate semantic risk, as models may reach pre-IPE configurations indicative of genuine hazardous intent even when execution is brittle. Lightweight, training-free semantic gating can achieve broad hazard suppression, but policy-level integration remains necessary for robust, scalable safety in open environments.
From a practical perspective, HazardArena's framework facilitates systematic benchmarking and development of semantically risk-aware robotic agents. Theoretically, the benchmark demands a reevaluation of reward design, curriculum composition, and the efficacy of externalized versus endogenous safety interventions in embodied learning, motivating further research into context-aware RL algorithms, refusal-aware imitation, and integrated vision-language reasoning for safety. Key open questions remain regarding the transferability of attribute-rule systems, the calibration of VLM-based judges, and the extension to long-horizon, multi-agent, and compound-risk scenarios.
Conclusion
HazardArena introduces a rigorous, capability-controlled framework for diagnosing and mitigating semantic risk in Vision-Language-Action models. The analyses demonstrate that current VLA models often lack intrinsic semantic safety, with increased manipulation ability amplifying hazard completion when semantic context shifts. Stage-wise metrics and safe/unsafe twin design expose failure modes unobservable under standard evaluations. While a Safety Option Layer can intercept many unsafe actions post hoc, achieving robust, semantically grounded safe behavior will require architectural advances in policy learning and risk reasoning. HazardArena lays the necessary groundwork for the next generation of risk-aware embodied intelligence.