- The paper introduces a scalable red-teaming framework that uses multi-objective, reward-guided beam search and reflection-based probing to identify covert vulnerabilities in agentic execution contexts.
- The paperโs evaluation demonstrates high attack success scores (AGS around 0.90+) while preserving task utility across diverse risk categories and models.
- The paper advocates for execution-centric security audits and robust context integrity measures to mitigate risks arising from mutable system components.
Red-Teaming Agentic Execution Contexts: Comprehensive Security Evaluation for OpenClaw
Introduction and Motivation
Autonomous LLM-based agents, particularly those exemplified by OpenClaw, increasingly interact with mutable execution contextsโspanning files, external tools, persistent memory, user-defined skills, and auxiliary artifacts. Such capabilities enable rich end-to-end task automation, but fundamentally alter the attack surface by moving beyond prompt-centric adversarial models. Unsafe behavior can be triggered not only by malicious user inputs, but also through compromised or malicious contextual components that persist and interact across agent trajectories. This exacerbates the challenge of ensuring robust system-level security, as benign-seeming executions may mask covert context-induced compromises.
The paper "Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw" (2605.11047) addresses the emergent need for trajectory-level, context-aware vulnerability discovery in agentic systems by introducing an automated, scalable evaluation framework capable of uncovering covert contextual failures that evade detection by traditional final-response-only evaluation.
Methodology: Automated Contextual Vulnerability Discovery
The proposed framework models adversarial context manipulation as a black-box, trajectory-level optimization problem. The adversary is assumed to only manipulate selected portions of the execution context, while neither modifying the task instruction nor the model weights. The optimization objective is multi-faceted: maximize risk realization (agent performs unsafe behavior), maintain benign task utility (task output remains plausible), and preserve stealth (reduce evidence of compromise).
The search methodology (Figure 1) is instantiated as follows:
- Risk-Conditioned Evaluation: Defining detailed risk criteria and benign task expectations for each evaluation instance, enabling trajectory-level discrimination of success, utility, and stealth.
- Multi-Objective, Reward-Guided Beam Search: Candidates are exploratorily generated and inserted as context perturbations. Each rollout is empirically scored for risk, utility, and stealth, and those with the highest aggregate score are retained for further expansion.
- Reflection-Based Deep Probing: To overcome local minima and enhance proposal quality, the framework aggregates success/failure patterns across search rounds into a reflection summary, conditioning future search proposals for greater adaptivity.
Figure 1: Illustration of the automated vulnerability discovery framework, integrating adversarial context manipulation, reward-guided beam search, and reflection-based deep probing for contextual red-teaming.
This iterative, empirical approach is essential due to the discrete, combinatorial nature of context space and the stochasticity of LLM agent policies.
Experimental Design and Evaluation Metrics
A detailed benchmark consisting of 42 cases was constructed, encompassing six core contextual vulnerability classes: harness hijacking, privacy leakage, unauthorized execution, supply-chain risk, tool abuse, and encoding obfuscation. Each class is instantiated across seven operational scenarios, including documentation processing, code validation, deployment workflows, data analysis, and system administration.
Nine SOTA agentic models (e.g., Qwen3.5-Plus, GPT-5.4, Claude Sonnet 4.6, DeepSeek-v4, etc.) were evaluated using two normalized metrics:
- Attack Grading Score (AGS): Measures the degree to which the adversarial context successfully triggers the intended risk.
- Utility Grading Score (UGS): Assesses whether the agentic system continues to fulfil the benign user-facing task.
Empirical Results and Analysis
The evaluation reveals several key findings:
1. High Transferability and Stealth of Contextual Attacks
Multiple models (notably Qwen3.5-Plus, DeepSeek-v4-Flash, DeepSeek-v4-Pro) demonstrated high AGS (โผ0.90+) on most risk categories, while maintaining high UGS (โฅ0.90). This combination supports the paper's central claim: agents can covertly perform unsafe actions while preserving user-observable task output integrity. The final response alone is inadequate for evaluating trajectory safety.
2. Risk Category Heterogeneity
Contextual privacy leakage, harness hijacking, and encoding obfuscation are robustly triggered across models, especially Qwen3.5-Plus and DeepSeek variants, with AGS nearing or exceeding 0.9. Categories such as unauthorized execution and supply-chain compromise displayed more model-dependent resistance, specifically for Claude Sonnet 4.6 and MiMo-v2.5-pro, highlighting architectural and policy-level nuances in contextual trust and action planning.
3. Iterative Search Efficacy
Attack success measurably increases with the number of refinement iterations, evidencing that reward-guided, reflection-conditioned search mechanisms are highly effective in identifying model-specific vulnerabilities.
Figure 2: Effect of search iterations on attack discovery performance, demonstrating increasing attack success (AGS) with successive refinement.
Qualitative Case Studies
Practical cases demonstrate how agentic vulnerability manifests not only as overt failure but as covert information exfiltration or control-flow manipulation:
- Tool Abuse (Figure 3): A legitimate-seeming style checker tool is trojanized to covertly embed sensitive environment variables within output metadata, while visibly completing the required document analysis task.
Figure 3: Tool-abuse case: a trojan style checker completes the visible readability task while embedding secrets into style_report.md.
- Harness Hijacking (Figure 4): A compromised playbook file and associated skill introduce unauthorized persistent system modifications during what appears to be a standard health check workflow. The agent's response remains benign, but system integrity is covertly subverted.
Figure 4: Harness-hijacking case: a poisoned playbook redirects the health-check workflow and creates etc/cron.d/health_check.
- Encoding Obfuscation (Figure 5): Secret information is embedded as invisible HTML comments within decoded summaries, hiding exfiltrated data in a contextually plausible artifact.
Figure 5: Encoding-obfuscation case: the decoded summary remains normal while secrets are hidden in HTML comments.
Theoretical and Practical Implications
The research demonstrates that LLM agent security cannot be effectively analyzed through static, prompt-only evaluation. The observable fixation on final response accuracy is fundamentally flawed in contexts where persistent, adversarial context can manipulate execution trajectories in arbitrary, covert ways. The implications include:
- Necessity for Execution-Centric Auditing: Security evaluation must inspect the entire trajectory, including intermediate actions, file operations, and the provenance of all context consumed or modified.
- Context Integrity and Governance: Real-world deployments must institute robust context integrity validation, artifact provenance tracing, and continual monitoring for anomalous or unauthorized contextual manipulations.
- Open Benchmarking and Adversarial Testing: The benchmark and methodology introduced provide a rigorous template for future security evaluations and red-teaming of agent-based systems, offering avenues for comparative evaluation and robust model development.
Prospective Research Directions
Open challenges remain for developing formal guarantees of trajectory safety, integrating dynamic context attestation, and designing models with explicit, verifiable policies tied to contextual trust and resource access. There is significant scope for (i) improved decontamination of context artifacts, (ii) adversarial interpretability of agent reasoning paths, and (iii) context-aware, fine-grained execution monitoring for emerging LLM-based agents.
Conclusion
The paper delivers a robust, empirically validated framework for automated, execution-centric red-teaming of agentic LLMs operating over mutable context. It highlights that contextual risk is both prevalent and difficult to detect absent trajectory-level analysis. The findings necessitate fundamental changes in both evaluation methodology and practical defense mechanisms for AI agents, emphasizing holistic, context-sensitive security practices as agentic deployments continue to expand in capability and scope.