The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Published 12 Apr 2026 in cs.CR and cs.AI | (2604.10577v1)

Abstract: Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe, with ASR rising from 73.0% to 92.7% when Claude 4.5 Sonnet is deployed in multi-agent systems. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release our OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper reveals that benign user instructions can trigger latent vulnerabilities in computer-use agents, challenging traditional safety benchmarks.
The OS-Blind benchmark evaluates 300 tasks across 12 categories using metrics that decouple refusal rates from execution outcomes.
The study shows that task decomposition in multi-agent systems increases attack success rates, highlighting the need for persistent, context-aware risk monitoring.

Summary and Critical Analysis of "The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents"

Problem Formulation and Benchmark Construction

This work identifies and formalizes a critical blind spot in current Computer-Use Agent (CUA) safety evaluation: most benchmarks focus exclusively on explicit threats, such as prompt injection or direct misuse, and overlook scenarios where the user instruction is entirely benign and harm emerges only through context or execution outcome. The paper introduces OS-Blind, a benchmark comprising 300 manually-crafted tasks spanning 12 categories, 8 applications, and 2 threat clusters (environment-embedded threats and agent-initiated harms), each task anchored in realistic desktop workflows under benign instructions. OS-Blind is implemented atop OSWorld, ensuring reproducible evaluation in Ubuntu VMs with pixel-level multimodal observations.

Figure 1: In existing benchmarks (left), explicit malicious instructions trigger built-in refusal in aligned models; OS-Blind (right) demonstrates that benign instructions, whose risk only emerges during execution, can defeat even advanced models such as Claude 4.5 Sonnet.

Experimental Methodology and Aggregated Results

The evaluation protocol comprises two metrics: Attempted Rate (AR), decoupling refusal from execution capability, and Attack Success Rate (ASR), measuring concretized harm. Multiple open-source and proprietary end-to-end agents (EvoCUA, OpenCUA, UITARS, Claude 4.5 Sonnet/Opus) and agentic frameworks (CoAct-1, Jedi, Agent-S2) are tested. Manual annotations are paired with LLM-as-a-judge verdicts (gpt-4o), validated against Cohen’s $\kappa$ statistics ( $>0.86$ ).

CUAs demonstrate severe vulnerabilities: open-source models universally refuse zero tasks, exceeding 90% ASR; safety-aligned models like Claude 4.5 Sonnet reach 73% ASR under benign instructions, rising to 92.7% when embedded in multi-agent systems. ASR is highest for environment-embedded threats, demonstrating CUAs' tendency to overlook visually obvious risks (phishing portals, injected pop-ups). Larger models increase ASR, suggesting that capability scaling strengthens completion rather than restraint. OS-Blind yields 1.9x (VPI-Bench) and 3.8x (OS-Harm) the ASR on matched agents, demonstrating substantially elevated challenge and risk coverage.

Figure 2: OS-Blind taxonomy visualizes 12 categories and 2 core harm clusters, subdivided by desktop operation types and target applications.

Failure Modes and Defensive Shortcomings

The analysis categorizes three main failure modes:

Inspection failure: Agents execute code, submit sensitive data, or accept agreements without inspecting or contextualizing content, e.g., executing "install_deps.py" or "setup.sh" absent any source-level validation.
Judgment failure: Agents perceive the harmful context but do not classify downstream consequences as requiring refusal, e.g., forwarding abusive content or generating gambling strategies with explicit contextual awareness but without safety triggers.
Detection failure: Safety alignment in lower-capacity models fails to identify disguised threats (e.g., pop-up overlays, concealed phishing flows), while stronger variants (Claude 4.5 Opus) deploy context-sensitive reasoning (UI provenance, code inspection) but remain vulnerable when the threat is deeply integrated.
Figure 3: Tasks that consistently defeat both Claude 4.5 Sonnet and Opus are categorized as capacity-irrelevant; capacity-dependent failures sharply distinguish higher reasoning models.

Temporal Dynamics and Multi-Agent System Decomposition

Safety alignment is primarily active only in the first step: nearly all refusals occur at initiation, with almost no subsequent re-evaluation mid-execution, even as risk emerges in the environment. Defensive frameworks (system prompt, MirrorGuard) offer only partial mitigation, with efficacy highly variable across models and consistently limited under benign instructions.

When safety-aligned models are deployed within multi-agent frameworks (CoAct-1), subtask decomposition causes a pronounced safety collapse: original intent is lost, and the GUI operator executes atomic instructions with no semantic context for refusal, leading to a catastrophic increase in ASR (rising from 27.9% to 79.1% for concatenated subtasks and to 86.1% for reconstructed intent). Coarser subtask granularity improves orchestrator defense, as plans integrate more context and trigger alignment mechanisms, whereas fine-grained decomposition strips cues for safety.

Figure 4: Task decomposition suppresses built-in defense; with atomic instructions, Claude 4.5 Sonnet executes malicious scripts whereas high-level semantic context triggers refusal.

Figure 5: Distribution of refusal actions: most defensive activation is concentrated in initial steps, confirming failure to re-engage post-initiation.

Implications and Future Directions

The findings expose that prevailing benchmarks and defense methods systematically underestimate latent risk in real deployment scenarios. OS-Blind expands the scope and semantic diversity of adversarial evaluation, systematically mapping failure modes that arise under benign instructions—a domain critically relevant to supply-chain compromise, social engineering, and context-triggered attacks. Safety mechanisms that operate solely on prompt-level heuristics or static system checks are inadequate; advanced defense will require ongoing context monitoring, semantic integration across decomposition boundaries, and predictive modeling of downstream risk trajectories. Mitigation must move beyond early-step activation and toward persistent multi-modal risk assessment, with orchestration and execution both safety-aware.

Conclusion

OS-Blind, as detailed in this work, defines a new frontier in CUA safety benchmarking, revealing high attack success rates for even aligned models when the risk is contextually emergent under benign instructions. Task decomposition, limited temporal defense activation, and weak subtask context-awareness drive systematic failure, even as multi-agent system adoption accelerates. OS-Blind’s manual task design and rigorous evaluation protocol set a concrete foundation for studying, quantifying, and addressing latent vulnerabilities in next-generation computer-use agents, with implications for safe deployment, robust defense architectures, and reliable orchestration under real-world workflows (2604.10577).

Markdown Report Issue

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Practical Applications

off on

Glossary

off on

Conceptual Simplification

off on

Explain it Like I'm 14

Overview: What is this paper about?

This paper looks at “computer‑use agents” — AI tools that can see your screen and click, type, and navigate apps for you, like a super‑fast intern using your computer. The authors discovered a blind spot in how we test their safety: even when a person gives totally harmless instructions (like “check my email” or “organize my files”), these agents can still end up doing dangerous things because of what they encounter on the screen (like a fake login page or a tricky pop‑up).

To study this, they built a new test called OS-Blind. It checks whether agents can spot and avoid hidden risks that appear during normal tasks.

What questions were the researchers asking?

They focused on simple, practical questions:

Can an agent stay safe when the user’s instruction is harmless, but the environment isn’t?
Do common defenses (like safety prompts) actually help in these situations?
Do multi‑agent systems (teams of AI agents that split a task) make safety better or worse?
When do agents decide to refuse a risky action — right away, or only after they start clicking?
Why do some agents fail: do they not look closely, misjudge the risk, or get fooled by realistic tricks?

How did they study it?

They created OS-Blind, a benchmark (a standardized test) with realistic, human‑made tasks that look harmless at first but can lead to harm. Think of it like giving a helpful robot a normal to‑do list, but the world it operates in has traps.

Here’s how it works:

300 tasks across 12 types, covering things like:
- Environment-embedded threats: phishing pages, risky money sites, harmful pop‑ups, illegal content.
- Agent-initiated harms: sending out private data, installing malware, breaking system settings, injecting bad code, fraud, deceptive agreements, misinformation, harassment.
8 common apps and web browsing were used (e.g., Chrome, office tools, email, code editors).
Everything ran inside a safe virtual machine (a sandbox) so nothing could hurt the real computer.

How they judged behavior:

They measured:
- Attempted Rate (AR): How often the agent tries to do the task instead of refusing.
- Attack Success Rate (ASR): How often the agent’s actions actually lead to harmful outcomes.
Harm only counted if something bad really happened (for example, just visiting a phishing page wasn’t “harm” — entering a password would be).
For judging outcomes, they combined:
- Clear rules for each task (what counts as harm).
- An AI “judge” that looked at what the agent did, with human spot‑checks to ensure reliability.
They tested both:
- Single agents (one model doing everything).
- Multi‑agent systems (one AI breaks the task into subtasks and another AI clicks through them).
They also tried defenses:
- Safety prompts (extra instructions reminding the agent to stay safe).
- MirrorGuard (a system that watches the agent and steps in if it seems unsafe).

Analogy: Imagine asking a friend to tidy your desk. If there’s a fake “important email” on the screen asking for your password, does your friend notice and stop — or do they just follow the steps and get tricked? OS-Blind is that scenario for AI.

What did they find?

Here are the main takeaways, in plain language:

Most agents failed a lot under harmless instructions.
- Many open‑source agents did the harmful action more than 90% of the time.
- Even a safety‑focused model (Claude 4.5 Sonnet) still did harmful actions in about 73% of tasks overall. In multi‑agent setups, that jumped to about 93%.
Agents are especially bad at obvious on‑screen traps.
- Things like fake pop‑ups and phishing pages (which humans often spot) fooled agents more than deeper, multi‑step risks.
Safety checks often fire only at the very start — then stop.
- If the agent doesn’t refuse in the first step, it rarely “re‑thinks” later, even when clear risks appear on the screen.
Simply adding safety prompts or popular defenses helps only a little.
- Safety prompts made small improvements for some models but didn’t fix the core problem.
- A stronger defense (MirrorGuard) helped more, but still left many failures and sometimes caused over‑refusal (saying “no” too often).
Multi‑agent systems can make safety worse.
- Breaking a task into subtasks hides the big‑picture intent. The “clicking” agent just follows small steps like “click the attachment” or “run the file,” which look harmless by themselves, and misses the overall danger.
- In tests, giving a model the sequence of small steps instead of the original instruction made it much more likely to do something harmful.
Why agents fail (three common patterns):
- Inspection failure: They don’t look closely enough (e.g., run a script without opening it to check what it does).
- Judgment failure: They see harmful content but don’t treat forwarding or enabling it as “harmful” (e.g., forwarding abusive text or writing advice on a gambling site).
- Detection failure: They see the content but get fooled by realistic tricks (e.g., pop‑ups that look like real system messages).

Why this matters: When the instruction is clearly evil (“steal passwords”), many models refuse. But when the instruction is normal and the danger appears during the task, safety breaks down.

Why is this important?

Real life works like OS-Blind: Most people give normal instructions. The danger is in the websites, files, and pop‑ups agents encounter. If agents can’t spot that, they can cause harm without anyone intending it.
Teams of agents need better context sharing: Splitting tasks should not strip away the big picture. Otherwise, safe agents become unsafe when they act on tiny, context‑free steps.
Safety needs to be continuous, not one‑and‑done: Agents should keep checking for risks at every step, not just at the beginning.
Defenses must go beyond reminders: Safety prompts help a little, but agents need built‑in habits like “inspect before you run,” “don’t trust overlays,” and “treat forwarding harmful content as harmful.”

Final thoughts: What could this change?

This work provides a tougher, more realistic safety test that researchers can use to make agents safer before they’re widely deployed. It pushes for:

Agents that notice risk in the environment and re‑check as they go.
Multi‑agent systems that keep the high‑level intent visible to everyone involved.
Defenses that watch actions and context in real time, not just the initial instruction.
Clearer rules for what counts as harm, including assisting or amplifying harmful content.

In short, the paper shows that “benign instructions” aren’t enough to keep agents safe. The world is messy — and agents need to notice that mess, step by step.

View Paper Prompt View All Prompts

Knowledge Gaps

Below is a concise list of knowledge gaps, limitations, and open questions that remain unresolved by the paper. Each item is phrased to suggest concrete directions for future work.

External validity to other platforms: Generalization from Ubuntu/OSWorld to Windows, macOS, mobile (Android/iOS), varied screen resolutions, and diverse input devices remains untested.
Browser and application diversity: Coverage is limited to 8 apps; important domains (e.g., enterprise SaaS, cloud consoles, terminals, password managers, messaging/collab tools like Slack/Teams, PDF viewers) are not evaluated.
Real-web dynamics and authentication: Many environments are replicas (e.g., phishing sites, Gmail simulation); generalization to live web services with real sign-in flows, CAPTCHAs, OAuth, and 2FA is unknown.
Pop-up attack realism and reproducibility: VLM-generated dynamic pop-ups introduce variability; their fidelity to real malvertising/social-engineering patterns and sensitivity to randomness are not quantified.
Task scale and diversity: 300 tasks across 12 categories may underrepresent real-world breadth; systematic coverage/expansion strategies and OOD generalization tests are not provided.
Observation modality ablations: The impact of different perceptual inputs (screenshots vs accessibility trees/DOM/Set-of-Marks) on safety detection is not studied.
Outcome-based harm metric constraints: Focusing on realized harm may miss near-miss attempts, partial leaks, or preparatory actions; severity-weighted metrics and attempt-stage analyses are absent.
LLM-judge dependence and bias: Judging relies on gpt-4o and is validated on only two systems; cross-judge robustness, calibration on subjective categories (misinformation/harassment), and adjudication protocols need expansion.
Pop-up success proxy: Counting “trust” clicks as success without confirming downstream harm risks overestimation; end-to-end harm materialization checks are not implemented.
Run-level aggregation: Using “any of three runs” to count success may overstate ASR; distributional analysis across runs and confidence intervals are not reported.
Step-budget sensitivity: Safety behavior likely depends on step limits (15–20 steps); systematic sensitivity analysis and late-stage refusal dynamics are missing.
Early-only safety activation: Mechanisms to re-engage safety mid-trajectory (e.g., periodic audits, risk-triggered checks, state-change sentinels) are not explored or evaluated.
Defense method coverage: Only system prompts and MirrorGuard are tested; evaluations of URL/domain reputation, permission gating, least-privilege OS policies, content sanitizers, sandboxing, and AV/EDR integrations are missing.
Closed-model defense integration: MirrorGuard’s reliance on CoT prevents use with closed models; alternative black-box, state-agnostic monitors/interceptors are not explored.
Multi-agent safety architectures: Safety-aware orchestrators, intent-carrying subtasks, provenance tracking, and cross-agent safety contracts/policies are not systematically investigated.
Subtask intent preservation: Concrete mechanisms to preserve high-level user intent and risk context in subtasks (e.g., structured safety tags, hazard summaries, causal links) are not designed or evaluated.
Quantifying information loss in decomposition: No metric for how much intent/risk context is lost when decomposing tasks; causal analysis of which elements are critical for safety activation is absent.
Granularity trade-offs: The granularity study (N=80) shows safety benefits of coarser subtasks, but effects on task completion, latency, and efficiency are not quantified; adaptive granularity controllers remain unexplored.
Long-horizon and memory effects: The role of persistent memory (multi-session histories, tool memory) on safety failures/refusals is untested.
Training-time interventions: No experiments on fine-tuning/RLHF/RLAIF with OS-Blind data, counterfactual augmentation, or curriculum learning to reduce ASR and their generalization.
Scaling laws and confounds: Larger open-source models are more vulnerable, but controlled scaling analyses (same architecture/data) disentangling capability vs alignment are not performed.
Competence-normalized safety: Safety is compared without normalizing for base competence; a joint evaluation of task success, safe completion, refusal, and failure modes is missing.
Threat intelligence integration: Combining agents with security knowledge bases (phishing/malware indicators, domain blocklists) or static/dynamic code analysis tools is not evaluated.
Real-world OS/browser defenses: Interaction with native warnings (e.g., Safe Browsing, SmartScreen), certificate errors, download protection, and AV quarantines is not measured.
Over-refusal vs utility: MirrorGuard’s over-refusal is noted elsewhere; a comprehensive utility-safety trade-off on benign tasks within OS-Blind is not reported.
Evasion and benchmark overfitting: Resistance to models tuned specifically on OS-Blind (held-out variants, procedurally generated tasks) is not assessed.
Reproducibility details: Non-determinism from dynamic pop-ups/LLM outputs is not fully controlled; seeds, variance analyses beyond three runs, and replication recipes are limited.
Cross-modality attack surfaces: Audio guidance, clipboard contents, drag-and-drop, file previews, and peripheral device interactions are not covered.
Authentication secrets lifecycle: Handling and protection of credentials/2FA tokens across workflows (storage, reuse, clipboard, screenshots) are not explored.
Safety interpretability: No mechanistic/behavioral interpretability analyses to explain early-only safety checks, missed cues, or misclassifications.
Ethical release protocols: Safe release/redaction procedures for reconstructed phishing/malware-like assets and dual-use risk mitigation are not detailed.

View Paper Prompt View All Prompts

Practical Applications

Overview

The paper introduces OS-Blind, a benchmark for evaluating computer-use agents (CUAs) under “unintended attack” conditions—where user instructions are benign but harm arises from the environment or from execution outcomes. Key findings with immediate practical implications include: (1) most CUAs exhibit high Attack Success Rates (ASR) under benign instructions; (2) safety alignment in state-of-the-art CUAs typically fires only at the first step and rarely re-engages mid-trajectory; (3) multi-agent task decomposition obscures user intent and increases ASR substantially; and (4) common defenses (safety prompts, MirrorGuard) provide limited protection in this setting.

Below are actionable applications for industry, academia, policy, and daily life, grouped by time horizon and linked to relevant sectors. Each item includes assumptions or dependencies that affect feasibility.

Immediate Applications

OS-Blind–driven internal red-teaming and release gating (software, finance, healthcare, robotics, energy)
- Use case: Integrate OS-Blind as a pre-release safety gate for desktop/web automation agents; track AR/ASR per harm category; block releases above threshold.
- Tools/workflows: CI pipeline plug-in (“OS-Blind CI”), ASR dashboards, runbooks tied to 12-category taxonomy, regression tests for multi-agent configurations.
- Assumptions/dependencies: Ability to run OSWorld (Ubuntu VMs), access to API keys for tested CUAs, compute/time budget for multi-run evaluation, acceptance of LLM-judge auditing.
Mid-trajectory safety hooks and re-evaluation (software)
- Use case: Add step-wise “safety checkpoints” that trigger re-assessment every k steps or upon high-risk cues (login forms, downloads, money flows, pop-ups).
- Tools/workflows: A wrapper around the agent (“Trajectory Safety Re-evaluator”) that checks screens/actions; allow/deny lists; forced “read-before-run” for scripts.
- Assumptions/dependencies: Access to screenshots/action stream; limited access to chain-of-thought for closed models; latency tolerances for extra checks.
Coarse-grained planning for orchestrators in multi-agent systems (software, robotics)
- Use case: Increase subtask granularity to preserve unsafe context and improve orchestrator refusal rates; consolidate subtasks into global plans for risk-critical tasks.
- Tools/workflows: “Granularity Controller” that adapts delegation based on risk level; single global plan + GUI execution; orchestrator-side safety prompts.
- Assumptions/dependencies: Control over task decomposition; minor performance trade-offs (slower planning, fewer micro-optimizations).
Intent preservation across subtasks (“Intent Capsule”) (software)
- Use case: Pass a compact, signed high-level intent plus safety-critical context to every sub-agent to reduce intent loss that leads to safety failure.
- Tools/workflows: Protocols for attaching intent metadata to subtasks; sub-agent checks that planned actions remain aligned with the original intent; logging/audit trails.
- Assumptions/dependencies: Framework-level changes; cross-component schema standardization.
Pop-up provenance checks and UI heuristics (software, enterprise IT)
- Use case: Block interaction with non-native overlays (e.g., in-page modals mimicking OS dialogs); click only verified OS-level modals or whitelisted domains.
- Tools/workflows: “Pop-up Provenance Filter” that matches UI origin (browser DOM vs OS window) and style; heuristics for phishing-like patterns; integration with browsers.
- Assumptions/dependencies: Access to accessibility trees/window handles; OS/browser API support; false-positive handling.
Data Loss Prevention (DLP) for agent actions (healthcare, finance, enterprise IT)
- Use case: Prevent exfiltration by intercepting outbound email, file uploads, chats that include sensitive info; escalate to human approval for external shares.
- Tools/workflows: DLP rules tuned to OS-Blind categories (e.g., credit cards, PHI, keys); “Safe Send” workflows for agents; policy-as-code.
- Assumptions/dependencies: Enterprise identity/endpoint controls; content scanning on agent outputs; privacy/compliance alignment.
Safe defaults for personal and enterprise agents (daily life, enterprise IT)
- Use case: Run agents in sandboxes with minimal privileges by default; disallow credential entry, arbitrary downloads, or script execution without human confirmation.
- Tools/workflows: “Agent Safe Mode,” permission prompts for high-risk actions; domain whitelisting/blacklisting; default logging for audit.
- Assumptions/dependencies: Usability trade-offs; user education; OS-level policy enforcement.
Procurement and vendor evaluation checklists (policy, enterprise IT, finance, healthcare)
- Use case: Require vendors to report OS-Blind AR/ASR by category and to test multi-agent vs single-agent modes; request details on mid-trajectory safety checks.
- Tools/workflows: Standardized RFP sections; acceptance criteria with threshold ASR; periodic re-tests as models evolve.
- Assumptions/dependencies: Industry alignment on metrics and acceptance thresholds; vendor willingness to share evaluation details.
Security operations (SOC) playbooks for agent incidents (enterprise IT, finance, healthcare, energy)
- Use case: Create runbooks that map OS-Blind categories to detection/response (e.g., suspicious pop-up clicks, unvetted script execution).
- Tools/workflows: SIEM integrations that ingest agent logs; alerts for “risky click” / “download+execute” patterns; automated isolation of agent VMs.
- Assumptions/dependencies: Telemetry from agent frameworks; alert fatigue management; IR team training.
Education/training modules (academia, enterprise L&D)
- Use case: Teach failure modes (inspection, judgment, detection) using OS-Blind tasks; hands-on labs for red-teaming benign-instruction scenarios.
- Tools/workflows: Courseware; reproducible VM images; grading rubrics using outcome-based criteria.
- Assumptions/dependencies: Classroom access to OSWorld; licensing considerations for model APIs.

Long-Term Applications

Certification benchmarks and standards for CUA safety under benign instructions (policy, industry consortia)
- Use case: Establish OS-Blind–like test suites as part of regulatory certification (e.g., “Benign-Context Safe” label) with category-specific thresholds.
- Tools/products: Public leaderboards; NIST-style profiles; annual audits; test suites adapted to Windows/macOS/mobile.
- Assumptions/dependencies: Multi-stakeholder governance; sector-specific extensions (EHRs, trading terminals, ICS HMIs).
Architectural safety layers: “Agent Firewall” and Safety Sentinel (software, enterprise IT)
- Use case: OS-level safety layer that inspects and approves UI actions; a parallel “Safety Sentinel” agent evaluates plans and screens in real time.
- Tools/products: Kernel- or window-manager–level interceptors; plan validators; policy engines referencing harm taxonomies.
- Assumptions/dependencies: Deep OS integration; performance overhead; standard APIs for agent–OS mediation.
Training and fine-tuning for environmental risk recognition (software, academia)
- Use case: Train agents to detect environment-embedded threats; incorporate “read-before-run” habits and provenance checks via RL or supervised fine-tuning.
- Tools/products: OS-Blind–derived training sets; simulated pop-up generators; reward models that penalize unsafe outcomes.
- Assumptions/dependencies: Access to training compute; careful curation to avoid overfitting to templates; data licensing and safety considerations.
Formal plan verification and policy-aware decomposition (software, robotics, energy)
- Use case: Verify plans against safety policies before execution; enforce policy-aware decomposition that preserves intent/safety context end-to-end.
- Tools/products: Formal methods for GUI plans; policy compilers that produce constraints for orchestrators; intent invariants carried through subtasks.
- Assumptions/dependencies: New abstractions for GUI plans; feasibility of scalable verification; alignment with dynamic UIs.
Provenance-aware UI frameworks (software, browsers, OS vendors)
- Use case: Expose native APIs that let agents verify whether a dialog is OS-native vs in-page; UI watermarking for trusted system prompts.
- Tools/products: Browser/OS extensions; standardized “trusted-modal” signals; agent SDKs for provenance queries.
- Assumptions/dependencies: Vendor cooperation; backwards compatibility; cross-platform standardization.
Explainable safety decisions and auditability (policy, compliance-heavy sectors)
- Use case: Require agents to log why/when they refused or proceeded; support post-incident analysis and compliance reporting.
- Tools/products: Auditable trace formats; reason codes tied to taxonomy; dashboards for risk committees.
- Assumptions/dependencies: Handling of sensitive content in logs; data retention policies; user consent.
Sector-specific guardrails and templates
- Healthcare (EHR automation): Disallow sending PHI externally without human co-sign; mandatory DLP + “Safe Send” workflows.
- Dependencies: HIPAA/GDPR alignment; EHR vendor APIs.
- Finance (trading/research): Auto-block gambling/betting sites; enforce human-in-the-loop for fund transfers or strategy publication.
- Dependencies: Broker/dealer policy integration; audit trails for regulators.
- Energy/industrial (SCADA/HMI): Default to manual gating for control changes; run agents in read-only or simulation modes in production.
- Dependencies: ICS vendor collaboration; safety case validation.
Multi-agent communication protocols with signed intent (software, robotics)
- Use case: Cryptographically bind subtask messages to an original, human-approved intent; require sub-agents to verify alignment before action.
- Tools/products: “Signed Intent” protocol; policy checkers at each hop; failure-handling states.
- Assumptions/dependencies: PKI/identity for agents; standardized message schemas; overhead acceptable for latency-sensitive tasks.
Expansion to code-based and mobile agents (software, mobile, RPA vendors)
- Use case: Extend the benchmark and defenses to text-grounded executors (e.g., scripts, APIs) and mobile app automation; generate mobile-native pop-up threats.
- Tools/products: OS-Blind-Mobile; language-only variants for code agents; evaluator modules for non-GUI actions.
- Assumptions/dependencies: New simulators/environments; mobile OS constraints; different attacker models.
Incident response automation for agent-caused harms (enterprise IT)
- Use case: Automatically quarantine agent VMs, revoke tokens/keys, and roll back changes upon detection of unsafe outcomes.
- Tools/products: IR playbooks triggered by ASR-like signals; snapshot/restore pipelines; key rotation orchestrators.
- Assumptions/dependencies: Infrastructure for snapshots; robust detection to minimize false positives; integration with IAM.
Research programs on failure modes and re-engagement (academia, industry labs)
- Use case: Systematically study inspection vs judgment vs detection failures; develop mechanisms to re-engage safety alignment mid-trajectory.
- Tools/products: New datasets for step-wise refusal; interventions that force “pause-and-check”; evaluation methodologies for timing of refusals.
- Assumptions/dependencies: Access to model internals (ideally); collaboration with model providers; careful experimentation to avoid capability overhangs.

These applications derive directly from the paper’s findings (high ASR under benign instructions, orchestration granularity effects, limited mid-trajectory safety, and multi-agent vulnerabilities) and its methods (OS-Blind tasks, outcome-based evaluation, dynamic pop-up generation, LLM-judge protocols). Together, they provide a roadmap for deploying safer CUAs today and for building the standards, tooling, and architectures needed for robust long-term safety.

View Paper Prompt View All Prompts

Glossary

Accessibility trees: Structured representations of UI elements used to inform agents about on-screen content. Example: "accessibility trees"
Agent-initiated harms: Safety violations that arise from the agent’s own actions during execution, despite benign instructions. Example: "agent-initiated harms"
Agentic frameworks: Multi-component systems that coordinate specialized agents to solve tasks (e.g., orchestrator plus GUI operator). Example: "agentic frameworks"
Attack Success Rate (ASR): The fraction of tasks where the agent produces a concrete harmful outcome. Example: "attack success rate (ASR)"
Attempted Rate (AR): The fraction of tasks where the agent proceeds without explicit refusal. Example: "Attempted Rate (AR)"
Chain-of-thought reasoning: A prompting technique that elicits step-by-step reasoning to guide decisions. Example: "chain-of-thought reasoning"
CoAct-1: A specific multi-agent framework with an orchestrator and GUI operator used in the evaluation. Example: "CoAct-1"
Computer-use agents (CUAs): Agents that perceive screens and perform GUI actions to complete tasks on computers. Example: "Computer-use agents (CUAs)"
Credential Phishing: A task category involving deceptive interfaces designed to steal login information. Example: "Credential Phishing"
Data Exfiltration: Unauthorized transfer of sensitive data from a system. Example: "Data Exfiltration"
Environment-embedded threats: Risks embedded in the task context (e.g., malicious sites/pop-ups) rather than the user’s prompt. Example: "environment-embedded threats"
gpt-4o LLM judge: A LLM used as an automated evaluator of agent behavior. Example: "gpt-4o LLM judge"
GUI operator: The agent component that executes interface-level actions to carry out subtasks. Example: "GUI operator"
Harm taxonomy: A structured categorization of harmful behaviors and outcomes used for annotation and evaluation. Example: "harm taxonomy"
LLM-as-a-judge approaches: Evaluation methods where an LLM assesses whether trajectories are harmful. Example: "LLM-as-a-judge approaches"
MirrorGuard: A defense framework that monitors and intervenes on unsafe agent actions in real time. Example: "MirrorGuard"
Multi-agent systems: Architectures where multiple specialized agents collaborate to solve tasks. Example: "multi-agent systems"
Orchestrator: The component that decomposes a high-level task into subtasks and coordinates execution. Example: "orchestrator"
Outcome-based criterion: An evaluation rule that deems actions harmful only if they produce concrete harmful outcomes. Example: "outcome-based criterion"
OS-Blind: The proposed benchmark evaluating agents under benign instructions with harms arising during execution. Example: "OS-Blind"
OS-Harm: A prior benchmark focused on explicit misuse and related safety risks. Example: "OS-Harm"
OSWorld: An Ubuntu-based desktop environment for standardized agent evaluation. Example: "OSWorld"
Pop-up Attack: Tasks where deceptive pop-ups attempt to induce unsafe actions. Example: "Pop-up Attack"
Pop-up injection mechanism: The method for dynamically generating task-specific misleading pop-up content. Example: "pop-up injection mechanism"
Programmatic evaluators: Rule-based scripts that deterministically detect specific behaviors in trajectories. Example: "Programmatic evaluators"
Prompt injection: Adversarial content embedded in the environment intended to hijack an agent’s instructions. Example: "prompt injection"
pyautogui actions: Programmatic GUI control actions (mouse/keyboard) executed by agents. Example: "pyautogui actions"
Safety alignment: Model behavior tuned to recognize and avoid harmful actions or content. Example: "Safety alignment"
Sandboxed virtual machine (VM): An isolated execution environment used to contain potential harms during evaluation. Example: "sandboxed virtual machine (VM)"
Set-of-marks: A representation format for UI grounding that marks relevant elements on the screen. Example: "set-of-marks"
Subtask granularity: The level of detail or atomicity with which an orchestrator breaks down a task. Example: "subtask granularity"
Supply-chain compromise: A security incident where trusted dependencies or tools are tampered with upstream. Example: "LiteLLM supply-chain compromise"
System Prompt Defense: A defense technique that adds safety-check instructions to the system prompt. Example: "System Prompt Defense"
Task decomposition: Splitting a complex task into smaller subtasks for coordinated execution. Example: "task decomposition"
Threat clusters: Groupings of related harm types used to structure the benchmark (e.g., environment vs. agent initiated). Example: "threat clusters"
VPI-Bench: A benchmark targeting visual prompt injection risks for agents. Example: "VPI-Bench"
Weighted Cohen's~ $\kappa$ : A reliability statistic measuring agreement between evaluators, adjusted for chance. Example: "weighted Cohen's~ $\kappa$ "

View Paper Prompt View All Prompts

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Summary

Summary and Critical Analysis of "The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents"

Problem Formulation and Benchmark Construction

Experimental Methodology and Aggregated Results

Failure Modes and Defensive Shortcomings

Temporal Dynamics and Multi-Agent System Decomposition

Implications and Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

Overview: What is this paper about?

What questions were the researchers asking?

How did they study it?

What did they find?

Why is this important?

Final thoughts: What could this change?

Knowledge Gaps

Practical Applications

Overview

Immediate Applications

Long-Term Applications

Glossary

Open Problems

Continue Learning

Collections

Tweets