Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

Published 6 Apr 2026 in cs.CR, cs.AI, and cs.CL | (2604.04759v1)

Abstract: OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc-vlaa.github.io/CIK-Bench.

Abstract PDF Upgrade to Chat

Authors (14)

Summary

The paper introduces the CIK taxonomy to classify persistent state poisoning in AI agents, showing attack success rates spike from a 24.6% baseline to as high as 89.2%.
The study experimentally evaluates attacks across four leading LLMs, demonstrating that poisoning any single dimension of Capability, Identity, or Knowledge significantly raises threat levels.
Results highlight an evolution–safety tradeoff, revealing that current defenses can reduce but not eliminate the risks of persistent state modifications without hindering adaptive functionality.

Real-World Safety Analysis of Persistent State Attacks in OpenClaw

Introduction

"Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw" (2604.04759) presents the first empirical, real-world evaluation of persistent state poisoning threats in personal AI agents, instantiated on OpenClaw—a platform with full system privileges and integrations spanning critical resources including Gmail, Stripe, and the local file system. The study’s core contribution is the CIK taxonomy, which unifies agent vulnerabilities into three failure dimensions: Capability, Identity, and Knowledge—each corresponding to a distinct aspect of the persistent agent state. The paper systematically demonstrates that state-poisoning attacks on any CIK dimension dramatically amplify the attack success rate (ASR) against advanced LLM backbones, and further that these vulnerabilities are architectural, not model-specific.

The CIK Taxonomy and Attack Vectors

The CIK model decomposes the persistent, self-evolving state of agentic systems into:

Capability: Executable skills (script and documentation files) that define agent actions.
Identity: Persona, configurations, and trust anchors (e.g., who the agent trusts).
Knowledge: Long-term memories and behavioral records.

The persistent state files are not ephemeral—they are loaded and further modified automatically across interactions, providing a high-value target for adversaries who achieve even temporary access.

Figure 1: Persistent state in OpenClaw is organized along the CIK axes, each enabling distinct and highly effective poisoning mechanisms, tested across four backbone LLMs.

The attack protocol spans two temporally-separated phases: (1) injection into a target file (persistent poisoning), and (2) a subsequent session in which the malicious state is triggered by an external prompt or event, yielding real-world harm.

Figure 2: Two-phase attack protocol—state poisoning (phase one) is decoupled from downstream triggering (phase two), which ensures persistence beyond the initial intrusion window.

Experimental Study: Poisoning Efficacy and Defense Shortcomings

The CIK taxonomy is realized concretely on OpenClaw, with direct mappings from each dimension to persistent workspace files. The experimental suite covers 12 real-world harms split across privacy violations and irreversible destructive actions (e.g., credential exfiltration, unauthorized refunds, workspace destruction).

Quantitative experiments were conducted across four leading LLM backbones: Claude Sonnet 4.5, Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4. Each was subject to 88 test conditions, encompassing both clean baselines and isolated CIK poisoning attempts.

Results: Structural Vulnerability

Key finding: Poisoning any single CIK dimension increases mean ASR from a baseline of 24.6% to the 64–74% range. Sonnet 4.5 exhibits ASRs as high as 89.2% post-Knowledge poisoning; even the most robust model (Opus 4.6) more than triples its baseline ASR when poisoned.

These results are robust across both injection (file modification) and trigger (action execution) phases. Notably, executable Capability payloads (e.g., hidden shell scripts) bypass LLM reasoning entirely, resulting in deterministic attack success except for partial resistance in GPT-5.4.

Case Studies: Exploitation of CIK Dimensions

To exemplify the operational mechanics and impact, the study provides concrete poisoning scenarios for each CIK axis:

Figure 3: Concrete attacks—(Left, Knowledge): Fabricated batch refund “habits” make dangerous financial actions seem routine. (Middle, Identity): A planted backup destination silently redirects credentials. (Right, Capability): Hidden destructive payloads effect irreversible workspace loss.

Knowledge: False behavioral histories (e.g., in MEMORY.md) cause the agent to treat risky operations—such as batch refunds—as routine and uncontested.
Identity: Malicious anchors (e.g., a backup URL in USER.md) are accepted as trusted without confirmation, enabling data exfiltration.
Capability: Executable payloads inserted into skill files are executed silently, enabling total compromise or denial-of-service (e.g., rm -rf).

Defense Evaluation and Evolution–Safety Tradeoff

Three defense strategies—each aligned to a CIK dimension—were quantitatively evaluated. While defenses reduce baseline ASR, none are sufficient to eliminate attacks within their targeted dimension, and Capability-targeted attacks remain resistant (ASR: 63.8% even under the strongest defense with GuardianClaw loaded pre-session).

A file-protection strategy drastically reduces both attack and legitimate update acceptance rates (~97% reduction in both), surfacing a fundamental evolution–safety tradeoff: mechanisms that block attacks also paralyze the agent’s adaptive functionality, which is intrinsic to its value proposition.

Theoretical and Practical Implications

This work’s central implication is that persistent state poisoning in LLM-driven agents is an architectural weakness intrinsic to the “evolution-first” paradigm. Improvements in backbone model alignment are insufficient: attacks that exploit out-of-band code paths (Capability) or socially-engineered trust (Identity) persist across even the current model frontier.

The inability of LLMs to reliably differentiate benign from malicious persistent state changes (especially as context size/complexity grows) suggests that prompt/knowledge-level safety interventions will remain incomplete. Instead, systematic architectural mitigations—runtime code signing, sandboxed skill execution, mandatory human-in-the-loop approval for persistent state changes, and explicit separation of agent memory from executable privileges—appear essential for robust security.

Discussion and Future Directions

By providing a unified taxonomy and a live-systems evaluation, this study sets a new standard for agentic security benchmarking. The CIK decomposition generalizes to all self-evolving LLM agents, and its demonstration on OpenClaw (with >220,000 internet-exposed instances) (2604.04759) illustrates immediate risk at scale.

Future work should extend to chained, multi-axis attacks, longitudinal naturalistic deployments, automated attack generation, and evaluation of systematic architectural defenses (e.g., OS-level sandboxing as opposed to LLM-based refusals). Production-grade agent deployments integrating CIK-aware safety architectures are required to address these vulnerabilities at ecosystem scale.

Conclusion

This paper demonstrates that persistent state poisoning—across Capability, Identity, and Knowledge—is a structural, cross-model threat to all evolution-first LLM-based agents. The current generation of defenses is insufficient, and more systematic, architectural safeguards are indispensable to reconcile personalization with real-world safety. The CIK taxonomy provides a foundation for future evaluation, mitigation, and design of secure agentic systems.

Markdown Report Issue