MovSafeBench: Safety Benchmark for Mobile Agents

Updated 20 January 2026
  • MovSafeBench is a vision-language benchmark that assesses safety-critical spatio-temporal reasoning abilities in autonomous systems amid hazardous scenarios.
  • It employs the HazardForge pipeline to generate realistic static, motion, intrusion, and distance hazards through sophisticated image editing and quality validation.
  • Empirical evaluations reveal significant accuracy drops in VLMs under dynamic and anomalous conditions, underscoring the need for enhanced causal motion understanding and anomaly detection.

MovSafeBench is a multiple-choice vision-language benchmark targeting safety-critical spatio-temporal reasoning for mobile agents, including autonomous vehicles and robots. It systematically evaluates the robustness of vision-language models (VLMs) under diverse hazardous scenarios characterized by complex spatial and motion dynamics, especially anomalies such as unexpected animal crossings, occlusions, and distant hazards. MovSafeBench is constructed via HazardForge, an automatic image-editing pipeline for generating high-fidelity safety scenarios, and is designed to expose fundamental limitations in current VLMs' decision-making, particularly under conditions resembling real-world safety-critical navigation (Taniguchi et al., 13 Jan 2026).

1. HazardForge Pipeline: Generation of Spatio-Temporal Scenarios

MovSafeBench is generated through the HazardForge pipeline, which automates four scenario types relevant to agent navigation: static, motion, intrusion, and distance hazards. The process consists of three tightly-coupled subsystems:

  • Layout Decision Algorithms: Input images I ∈ ℝ^{H×W×3} are split horizontally into left (Ω_L), center (Ω_C), and right (Ω_R) regions. For motion scenarios, moving objects are placed in Ω_C with trajectories conflicting with the safe action. Intrusion scenarios use horizontal outpainting to place partially visible, intruding objects at boundaries, utilizing location-specific masks. Distance scenarios exploit vanishing-point detection and variable object scaling to simulate far hazards.
  • Image Editing Models: For insertion, off-the-shelf inpainting/outpainting models M synthesize context-conforming objects, conditioned on a mask m and a prompt t ∈ 𝒯. Object classes include both normal (e.g., pedestrians, vehicles) and anomalous entities (e.g., wild animals, debris).
  • Validation Modules: Each generated image undergoes a structured quality check using a strong VLM (Qwen3-VL-30B), which labels object completeness and direction. Only samples passing checks on both axes ("complete" & correct orientation) are retained, enforcing high fidelity and label reliability (Taniguchi et al., 13 Jan 2026).
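The horizontal layout split underlying the placement rules can be sketched as follows. This is a minimal illustration, assuming equal-width thirds; HazardForge's actual region boundaries and per-scenario masks are not specified in this summary and may differ:

```python
import numpy as np

def region_masks(h: int, w: int) -> dict:
    """Split an h x w frame into left/center/right binary masks (Omega_L/C/R).

    Equal-width thirds are an assumption for illustration only.
    """
    edges = [0, w // 3, 2 * w // 3, w]
    masks = {}
    for name, x0, x1 in zip(("left", "center", "right"), edges[:-1], edges[1:]):
        m = np.zeros((h, w), dtype=bool)
        m[:, x0:x1] = True  # each region is one vertical strip of the frame
        masks[name] = m
    return masks
```

A motion hazard, for instance, would be inpainted only where the center mask is True, while an intrusion hazard would be anchored at the outer edge of the left or right strip.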

2. Dataset Composition and Statistics

MovSafeBench provides 7,254 images and corresponding QA pairs aligned across 13 object categories: 4 common (human, motorcycle, bicycle, cone) and 9 anomalous (rocks, debris, roadkill, dog, cat, deer, fox, pig, raccoon). The sampling draws on two base datasets (DriveBench and SA-Bench), each contributing 200 raw images, further diversified by editing into the four scenario types and a static baseline.

| Scenario  | Images | Object categories covered |
|-----------|--------|---------------------------|
| Static    | 3,093  | All 13 |
| Motion    | 981    | 9 (excluding cones, rocks, debris, …) |
| Intrusion | 1,999  | 9 (excluding static-only entities) |
| Distance  | 1,181  | All 13 |

Scenario allocation is constrained by object-context plausibility (e.g., cones, rocks appear only in static context). Each image-question pair presents a single ground-truth safe direction, ensuring precise MCQ-based labeling (Taniguchi et al., 13 Jan 2026).
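The per-scenario counts sum to the stated dataset size, which can be checked directly (all numbers are taken from the table above):

```python
# Per-scenario image counts reported for MovSafeBench
counts = {"static": 3093, "motion": 981, "intrusion": 1999, "distance": 1181}
total = sum(counts.values())
assert total == 7254  # matches the dataset size reported above

# Share of each scenario type, in percent
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(shares)  # static alone accounts for ~42.6% of images
```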

3. Scenario Typology and Spatio-Temporal Properties

MovSafeBench explicitly models four scenario types:

  • Static: Addition of front-facing hazards outside the ground-truth safe region; intended to require spatial exclusion reasoning.
  • Motion: Dynamic hazards in the center, moving counter to the safe action; stresses temporal and trajectory prediction.
  • Intrusion: Hazards partially emerging from image boundaries; tests recognition of occlusion and side-entry anomalies.
  • Distance: Small, far objects in the ground-truth region plus large proximal hazards elsewhere; probes depth estimation and vanishing-point reliance.

Each type is defined by mathematical layout rules, spatial masking, and editing strategies to create real-world plausible, semantically distinct challenges affecting agent navigation decisions (Taniguchi et al., 13 Jan 2026).
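The typology above amounts to a small set of declarative placement rules per scenario type. The encoding below is our own summary sketch, not the paper's implementation; the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScenarioSpec:
    """Illustrative summary of one MovSafeBench scenario type."""
    name: str
    placement: str        # hazard position relative to the ground-truth safe region
    dynamic: bool         # whether the hazard has a conflicting trajectory
    uses_outpainting: bool

SCENARIOS = (
    ScenarioSpec("static", "outside the ground-truth safe region", False, False),
    ScenarioSpec("motion", "center, moving counter to the safe action", True, False),
    ScenarioSpec("intrusion", "partially visible at an image boundary", False, True),
    ScenarioSpec("distance", "small/far in safe region, large/near elsewhere", False, False),
)

assert sum(s.dynamic for s in SCENARIOS) == 1  # only motion hazards are dynamic
```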

4. Evaluation Protocol and Metrics

The benchmark employs one multiple-choice question per scenario image: "Given this scene, which action is safe? (A) go left, (B) go straight, (C) go right." Answers are grounded in the ground-truth safe region a_{k*} ∈ {a_L, a_C, a_R}. The main evaluation metric is accuracy (percent of correct MCQ responses). Scenario-wise degradation is measured as Δ_scen = Acc_no_edit − Acc_scen, and the motion-understanding metric is the MCQ accuracy within motion scenarios. Benchmarked models include Qwen2.5-VL (7B, 32B), InternVL3.5, LLaVA-NEXT (7B, 13B), Phi4-Mini, and PaliGemma2. Human annotator performance is established as a baseline (3 annotators × 300 samples). All evaluations run on 4× NVIDIA H100 SXM5 GPUs for medium-scale, high-throughput testing (Taniguchi et al., 13 Jan 2026).
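The two metrics can be sketched in a few lines; the function names are ours, not from the paper:

```python
def mcq_accuracy(preds, golds):
    """Percent of MCQ answers matching the ground-truth safe action."""
    assert len(preds) == len(golds) and golds
    return 100.0 * sum(p == g for p, g in zip(preds, golds)) / len(golds)

def scenario_degradation(acc_no_edit: float, acc_scen: float) -> float:
    """Delta_scen = Acc_no_edit - Acc_scen (drop in percentage points)."""
    return acc_no_edit - acc_scen

# Toy example: 3 of 4 answers correct -> 75.0% accuracy
acc = mcq_accuracy(["A", "B", "C", "B"], ["A", "B", "B", "B"])

# Plugging in the paper's averages: 55.6% unedited vs. 41.4% edited
drop = round(scenario_degradation(55.6, 41.4), 1)  # 14.2 points
```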

5. Empirical Findings and Model Robustness

MovSafeBench exposes pronounced deficiencies in contemporary VLMs under hazardous, dynamic, or anomalous conditions:

  • Average accuracy drops from 55.6% on unedited images to 41.4% on edited scenarios. Motion is by far the hardest scenario (mean accuracy 24.8%), whereas intrusion (46.4%) and static (45.1%) are less challenging; distance is of intermediate difficulty (37.3%).
  • Anomalous object classes systematically degrade performance (e.g., dog: 34.8%; cat: 36.5%) relative to common objects (e.g., human: 42.9%; cone: 46.3%).
  • Human annotators achieve roughly 88% accuracy on motion scenarios; leading VLMs fail dramatically (~25%), evidencing inadequate temporal reasoning.
  • Performance degrades more in driving scenes (DriveBench, 40.5%) than in robot-navigation scenes (SA-Bench, 42.8%).

These results indicate that SOTA VLMs lack robust spatio-temporal reasoning and fail to generalize to safety-critical, anomalous contexts—a fundamental barrier to their adoption in real-world autonomous systems (Taniguchi et al., 13 Jan 2026).

6. Implications and Future Directions

MovSafeBench demonstrates that VLMs, despite high performance on canonical multimodal QA, are unsafe for deployment in real-world autonomous decision-making where rare and dynamic hazards are prevalent. The dataset's design (large-scale, systematically edited, scenario- and object-diverse) establishes stringent requirements for future model alignment and robustness in safety-critical navigation. A plausible implication is the urgent need for new training regimes and architectural interventions targeting causal motion understanding, anomaly detection, and spatial-temporal generalization under environmental uncertainty. MovSafeBench thus defines a new empirical standard for evaluating VLMs in mobile-agent safety domains (Taniguchi et al., 13 Jan 2026).
