HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models

Published 14 Apr 2026 in cs.RO | (2604.12447v1)

Abstract: Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action execution using semantic attributes or a vision-language judge, substantially reducing unsafe behaviors with minimal impact on task performance. We hope that HazardArena highlights the need to rethink how semantic safety is evaluated and enforced in VLAs as they scale toward real-world deployment.


Authors (11)

Summary

The paper’s main contribution is HazardArena, a benchmark that isolates semantic risk through matched safe/unsafe twin scenarios.
It introduces stage-wise metrics (attempt, commit, success) that reveal hazardous progression even when the task is not completed.
A training-free Safety Option Layer demonstrates effective semantic gating, highlighting the need for integrated risk reasoning in VLA systems.

Semantic Safety Evaluation in Vision-Language-Action Models with HazardArena

Introduction

Vision-Language-Action (VLA) models have become central to the development of generalist robotic agents capable of interpreting visual input, understanding natural language instructions, and autonomously generating low-level control actions. Recent scaling trends have yielded VLA systems with robust manipulation and task execution abilities across heterogeneous domains, achieved via large-scale trajectory learning and pre-trained vision-language backbones. However, standard evaluation protocols overwhelmingly measure action success, neglecting the subtler but critical dimension of semantic safety—whether the policy internalizes which actions are contextually impermissible beyond mere physical capability. The paper "HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models" (2604.12447) addresses this deficiency by introducing HazardArena, a controlled evaluation benchmark that explicitly tests VLA models' semantic risk awareness and proposes a lightweight intervention for inference-time safety enforcement.

HazardArena: Design and Methodology

HazardArena is motivated by the challenge of distinguishing genuine, semantics-driven action refusal from simple motor incapability in VLA models. Conventional benchmarks fail to decouple these failure modes, occasionally labeling prudent refusal as inability and missing cases where technically competent policies undertake hazardous actions due solely to lack of semantic grounding. HazardArena leverages "safe/unsafe twin" scenario construction to resolve this ambiguity: for each hazardous setting, a matched safe counterpart is generated wherein the physical environment, object placements, and action requirements are held constant while the semantic permissibility of the task is varied. This design isolates the effect of semantic context on agent behavior.

Figure 1: HazardArena's architecture generates seed-matched safe/unsafe twins to evaluate semantic safety independently of physical interaction feasibility.

HazardArena includes over 2,000 assets and 40 risk-sensitive tasks grounded in the ISO 13482:2014 standard and real-world robotics scenarios. It defines seven risk categories, each with category-aligned assets: food safety, property safety, chemical, privacy, fire, personal (human-centered), and electrical hazards. Scenarios are composed using structured skill templates (e.g., Pick-Place, Pour, Insert) where only semantic bindings (actor/target pairs, instruction text) distinguish safe from unsafe twins. This enables capability-aware, context-sensitive safety assessment without confounding physical feasibility.
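The twin construction described above can be sketched as a small data structure. This is a minimal illustration, not the benchmark's actual schema: the `Scenario` fields, the `make_twin` and `physically_matched` helpers, and the example objects are all assumptions introduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; field names are illustrative only."""
    skill: str        # structured skill template, e.g. "Pour"
    seed: int         # layout seed shared by both twins
    actor: str        # object being manipulated
    target: str       # object acted upon
    instruction: str  # natural-language instruction
    unsafe: bool      # semantic label of the twin


def make_twin(safe: Scenario, target: str, instruction: str) -> Scenario:
    """Build the unsafe twin: same skill and seed, different semantic binding."""
    return replace(safe, target=target, instruction=instruction, unsafe=True)


def physically_matched(a: Scenario, b: Scenario) -> bool:
    """Twins must share the skill template and the layout seed."""
    return (a.skill, a.seed) == (b.skill, b.seed)


# Hypothetical example pair (objects and wording are assumptions).
safe = Scenario("Pour", 7, "water_cup", "plant_pot",
                "Pour the water into the plant pot.", unsafe=False)
unsafe = make_twin(safe, "laptop", "Pour the water onto the laptop.")
```

Because the skill and seed are copied rather than re-sampled, any behavioral difference between the two twins can be attributed to the changed target and instruction.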

Figure 2: Matched safe and unsafe twins for an Insert-Outlet task, holding motor requirements constant while altering only risk-critical semantic factors.

Evaluation Metrics and Experimental Protocol

Standard endpoint success rates are inadequate for diagnosing semantic risk. To address this, the paper introduces three stage-wise evaluation metrics (attempt, commit, and success), which quantify, respectively, initial task engagement, progression to the configuration immediately preceding an irreversible physical effect (pre-IPE), and final task completion. These metrics are extracted automatically from simulator-detectable predicates defined for each scenario, allowing fine-grained attribution of unsafe behaviors and revealing cases where policies progress deep into hazardous configurations without reaching terminal success.
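As a rough illustration of how stage-wise rates could be aggregated, the sketch below assumes each episode has already been reduced to the set of stage predicates that fired. The episode representation and the `stage_rates` helper are assumptions made here, and the sample episodes are invented for the example rather than taken from the paper.

```python
from __future__ import annotations

from typing import Iterable

STAGES = ("attempt", "commit", "success")


def stage_rates(episodes: Iterable[set[str]]) -> dict[str, float]:
    """Fraction of episodes in which each stage predicate fired."""
    episodes = list(episodes)
    if not episodes:
        return {stage: 0.0 for stage in STAGES}
    return {
        stage: sum(stage in ep for ep in episodes) / len(episodes)
        for stage in STAGES
    }


# Invented episodes on an unsafe twin: one completes the hazardous task,
# one stops at the commit stage, one only attempts, one never engages.
unsafe_episodes = [
    {"attempt", "commit", "success"},
    {"attempt", "commit"},
    {"attempt"},
    set(),
]
rates = stage_rates(unsafe_episodes)
# rates == {"attempt": 0.75, "commit": 0.5, "success": 0.25}
```

In this toy case the success rate alone (0.25) would hide the fact that half of the episodes reached the commit stage.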

Models evaluated include OpenVLA-OFT, $\pi_0$, NORA, and VLA-Adapter, all fine-tuned exclusively on safe demonstrations. Performance is measured across matched safe/unsafe twins, with particular attention to the relationship between safe-task capability and hazard completion in unsafe twins. Results indicate that safe-only fine-tuning improves manipulation proficiency across models but simultaneously amplifies the propensity to execute semantically unsafe actions in unsafe contexts. Proficiency alone, in other words, does not induce contextual safety awareness in the absence of explicit negative sampling or risk supervision.

Figure 3: Stage-wise event rates (attempt, commit, success) for $\pi_0$-ft-20k reveal high hazardous progression in unsafe twins despite only modest differences in success rates.

Safety Option Layer: Training-Free Semantic Gating

The paper further proposes and evaluates a Safety Option Layer (SOL), a training-free, inference-time safety filter that intercepts policy actions before execution. SOL offers two modes:

SOL-L1: Attribute-constraint gating, using transparent, hand-coded rules over object-action attributes to block impermissible interactions (e.g., liquid source × electrical device × pour).
SOL-L2: External vision-language model (VLM) judge, prompted on current context and action proposals to issue binary safety decisions with risk scores.
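The attribute-constraint mode (SOL-L1) can be pictured as a lookup over blocked attribute combinations. The sketch below is an assumption-laden illustration: the `Proposal` type, the attribute names, and the `l1_allows` function are hypothetical, and the single rule encodes only the liquid source × electrical device × pour example mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical description of a proposed interaction."""
    skill: str
    actor_attrs: frozenset
    target_attrs: frozenset


# Hand-coded (actor attribute, target attribute, skill) triples to block.
# Only the example from the text is encoded; a real rule set would be larger.
BLOCKED = {
    ("liquid_source", "electrical_device", "pour"),
}


def l1_allows(p: Proposal) -> bool:
    """Return False if any blocked triple matches the proposal."""
    return not any(
        actor in p.actor_attrs and target in p.target_attrs and skill == p.skill
        for actor, target, skill in BLOCKED
    )


pour_on_device = Proposal("pour", frozenset({"liquid_source"}),
                          frozenset({"electrical_device"}))
pour_in_pot = Proposal("pour", frozenset({"liquid_source"}),
                       frozenset({"container"}))
```

Here `l1_allows(pour_on_device)` is `False` and `l1_allows(pour_in_pot)` is `True`. Because the rules are explicit triples, every blocked action can be traced back to the rule that triggered it.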

SOL is applied before the policy reaches a critical (commit) configuration, enforcing semantic constraints without affecting benign action execution. Empirical results show that SOL-L1 achieves near-optimal hazard suppression with negligible performance loss on safe twins, while SOL-L2's effectiveness is category-dependent: it is robust for categories such as electrical or privacy hazards but less reliable for value-based or commonsense safety violations (e.g., property loss or food contamination).
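The judge-based mode (SOL-L2) can be sketched as a gate that consults an external judge before an action is released. Everything here is an assumption for illustration: the `Verdict` type, the `gated_step` function, the `max_risk` threshold, and especially `stub_judge`, which is a keyword stand-in for a real VLM call and does not reflect the paper's prompts or models.

```python
from typing import Callable, NamedTuple, Optional


class Verdict(NamedTuple):
    safe: bool    # binary safety decision
    risk: float   # risk score in [0, 1]


# A judge maps (context description, proposed action) to a verdict.
Judge = Callable[[str, str], Verdict]


def gated_step(context: str, action: str, judge: Judge,
               max_risk: float = 0.5) -> Optional[str]:
    """Release the action only if the judge deems it safe and low-risk.

    Returns None when the action is withheld. The threshold is an
    assumed parameter, not a value from the paper.
    """
    verdict = judge(context, action)
    if not verdict.safe or verdict.risk > max_risk:
        return None
    return action


def stub_judge(context: str, action: str) -> Verdict:
    """Keyword-based placeholder standing in for a VLM judge."""
    risky = "laptop" in context and action == "pour"
    return Verdict(safe=not risky, risk=0.9 if risky else 0.1)


blocked = gated_step("water cup held above a laptop", "pour", stub_judge)
allowed = gated_step("water cup held above a plant pot", "pour", stub_judge)
```

With the stub, `blocked` is `None` and `allowed` is `"pour"`. Keeping the judge behind a plain callable makes it straightforward to swap the stub for a model-backed judge without touching the gating logic.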

Figure 4: Application of SOL to $\pi_0$ on safe and unsafe twins demonstrates L1's high precision and L2's increased but category-dependent hazard suppression.

Implications, Limitations, and Future Directions

The findings of HazardArena establish that current VLA systems, when trained solely for task mastery, tend to overgeneralize action templates and inadequately encode semantic refusal. Endpoint success rates on hazardous twins can substantially underestimate semantic risk, as models may reach pre-IPE configurations indicative of genuine hazardous intent even when execution is brittle. Lightweight, training-free semantic gating can achieve broad hazard suppression, but policy-level integration remains necessary for robust, scalable safety in open environments.

From a practical perspective, HazardArena's framework facilitates systematic benchmarking and development of semantically risk-aware robotic agents. Theoretically, the benchmark demands a reevaluation of reward design, curriculum composition, and the efficacy of externalized versus endogenous safety interventions in embodied learning, motivating further research into context-aware RL algorithms, refusal-aware imitation, and integrated vision-language reasoning for safety. Key open questions remain regarding the transferability of attribute-rule systems, the calibration of VLM-based judges, and the extension to long-horizon, multi-agent, and compound-risk scenarios.

Conclusion

HazardArena introduces a rigorous, capability-controlled framework for diagnosing and mitigating semantic risk in Vision-Language-Action models. The analyses demonstrate that current VLA models often lack intrinsic semantic safety, with increased manipulation ability amplifying hazard completion when semantic context shifts. Stage-wise metrics and safe/unsafe twin design expose failure modes unobservable under standard evaluations. While a Safety Option Layer can intercept many unsafe actions at inference time, achieving robust, semantically grounded safe behavior will require architectural advances in policy learning and risk reasoning. HazardArena lays the necessary groundwork for the next generation of risk-aware embodied intelligence.