AgentFence: Secure Autonomous Agent Framework
- AgentFence is a comprehensive methodology that defines algorithmic and architectural boundaries to secure autonomous agents in both deep learning and robotics applications.
- It introduces a taxonomy of 14 attack classes and uses a quantitative metric (Mean Security Break Rate) to assess vulnerabilities across multiple agent interfaces.
- The framework offers practical guidelines including hardening protocols, runtime DSL enforcement, and multi-agent fencing strategies to prevent unauthorized operations.
AgentFence refers to a family of algorithmic and architectural approaches for agent security and control—spanning mobile robotics, patrolling strategies, and, more recently, the systematic evaluation and containment of advanced deep language-model agents. Across these contexts, the unifying objective is to enforce boundaries: ensuring that agents (whether physical or computational) cannot cross designated safety, authority, or operational envelopes. Modern instantiations, especially in autonomous LLM-based system security, define precise trust boundaries and measure whether adversarial influence can propagate across these interfaces to generate unsafe behaviors or system states (Puppala et al., 7 Feb 2026). AgentFence systems have been further developed in networked robotics as well as formal logic-based policy enforcement, with notable connections to runtime DSLs for constraint enforcement and guardrail methodologies in multi-agent orchestration.
1. Security Evaluation of Deep LLM Agents: AgentFence Framework
The "AgentFence" framework (Puppala et al., 7 Feb 2026) is an architecture-centric methodology for systematically mapping and quantifying security vulnerabilities in "deep" language-model agents. Unlike traditional LLM safety research that focuses on single-turn output filtering, AgentFence evaluates agents as multi-component systems spanning persistent memory modules, tool invocation routers, delegation channels, and planning/retrieval layers. The framework identifies architecture-driven attack surfaces by defining five primary trust boundaries:
- Planning → Execution (plan concretization)
- Memory ↔ Planner/Retrieval (persistent state mediation)
- Retrieval → Planner (external content ingestion)
- Tool Routing/Invocation (API/tool selection and argumentization)
- Delegation → Subagents (subtask handoff via roles)
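The five boundaries above can be made concrete as an auditable event model. The following is a minimal sketch (all names, `Boundary`, `CrossingEvent`, and `untrusted_crossings`, are hypothetical illustrations, not identifiers from the framework) of how boundary crossings might be recorded for later trace auditing:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Boundary(Enum):
    PLAN_TO_EXEC = auto()          # plan concretization
    MEMORY_PLANNER = auto()        # persistent state mediation
    RETRIEVAL_TO_PLANNER = auto()  # external content ingestion
    TOOL_ROUTING = auto()          # API/tool selection and argumentization
    DELEGATION = auto()            # subtask handoff via roles

@dataclass(frozen=True)
class CrossingEvent:
    """One message or state transfer crossing a trust boundary."""
    boundary: Boundary
    source: str    # originating component
    payload: str   # content crossing the interface
    trusted: bool  # did it pass the boundary's validation policy?

def untrusted_crossings(trace):
    """Filter a trace for crossings that bypassed validation."""
    return [e for e in trace if not e.trusted]
```

Recording crossings per boundary is what makes the outcome-based auditing described later in the framework possible: each violation can be attributed to a specific interface rather than to the model as a whole.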
AgentFence maintains a fixed base model—Qwen2.5-32B-Instruct in primary experiments—and evaluates only the architectural choices and interface specifications that define how agent state and authority are managed over time.
2. Attack Taxonomy and Formal Security Metrics
AgentFence introduces a taxonomy of 14 trust-boundary attack classes, each associated with a specific architectural interface. These include prompt injection (direct/indirect), state injection, tool-use hijack, retrieval poisoning, multi-agent role confusion, delegation attacks, code-execution abuse, chain-of-thought leakage, objective hijacking, denial-of-wallet, and authorization confusion, among others.
For quantitative assessment, the central security metric is the Mean Security Break Rate (MSBR): the per-attack-class break rate, averaged over all attack classes $\mathcal{A}$,

$$\mathrm{MSBR} = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \frac{1}{N_a} \sum_{i=1}^{N_a} \mathbf{1}\big[\mathrm{break}(\tau_{a,i})\big],$$

where a break is registered on a multi-turn agent trace $\tau_{a,i}$ if any of the following predicates is detected: Unauthorized Tool Invocation (UTI), Unsafe Tool Argument (UTA), Wrong-Principal Action (WPA), State/Objective Integrity Violation (SIV), or Attack-Linked Task Deviation (ATD). Only outcome-based, trace-auditable violations count as security breaks.
All attack classes are measured under a standard configuration, with a fixed break-rate threshold over repeated runs used to determine material exposure.
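Under the definitions above, computing the MSBR from labeled traces is straightforward. A minimal sketch, assuming each trace is represented as a list of fired predicate labels (the helper names `is_break`, `break_rate`, and `msbr` are illustrative, not from the paper):

```python
from statistics import mean

# The five outcome-based break predicates from the AgentFence taxonomy.
BREAK_PREDICATES = ("UTI", "UTA", "WPA", "SIV", "ATD")

def is_break(trace_flags):
    """A trace counts as a security break if any break predicate fired."""
    return any(flag in BREAK_PREDICATES for flag in trace_flags)

def break_rate(traces):
    """Fraction of multi-turn traces on which a break was detected."""
    return sum(is_break(t) for t in traces) / len(traces)

def msbr(traces_by_attack):
    """Mean Security Break Rate: per-class break rates averaged over classes."""
    return mean(break_rate(ts) for ts in traces_by_attack.values())
```

Averaging per class first (rather than pooling all traces) keeps classes with few runs from being drowned out by heavily sampled ones.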
3. Empirical Results and Comparative Architectural Insights
Systematic evaluation across eight deep-agent archetypes—LangGraph, LlamaIndex, Open-Researcher, Deep-Researcher, OpenDevin, BabyAGI, CrewAI, and AutoGPT—under persistent multi-turn workloads reveals substantial architectural impact. On identical research tasks, MSBR is lowest for LangGraph (least exposed) and highest for AutoGPT (most exposed).
Breakdown by attack class indicates the highest risk for Denial-of-Wallet, Authorization Confusion, Retrieval Poisoning, and Planning Manipulation, while prompt-centric attacks remain below $0.20$ under standard settings. Violation composition is dominated by SIV (31%), WPA (27%), UTI+UTA (24%), and ATD (18%). Coupling analysis demonstrates that susceptibility to Authorization Confusion correlates strongly with Objective Hijacking and Tool-Use Hijacking, exposing a common failure mode: weak enforcement of principal- and authority-mapping across planning, memory, and tool layers (Puppala et al., 7 Feb 2026).
4. Operational Implications and Hardening Guidelines
AgentFence’s holistic, outcome-driven reframing of agent security clarifies actionable design recommendations:
- Strictly cap auto-run retries and tool budgets to contain Denial-of-Wallet.
- Segregate planner and executor roles using typed plan representations and sandboxed tool routers, mitigating Planning Manipulation and Tool-Use Hijacking.
- Enforce hardened memory read/write policies to limit belief poisoning via Retrieval Poisoning or State Injection.
- Implement principal and delegation checks with cryptographic identities (e.g., JWTs, capability tickets) to forestall Authorization Confusion and multi-agent role attacks.
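Two of these guidelines—tool budgets and capability tickets—can be combined in one gate placed in front of the tool router. The sketch below is an illustration only (the signing scheme, `issue_ticket`, and `authorize` are hypothetical stand-ins for a real JWT or capability-ticket system):

```python
import hashlib
import hmac
import json

SECRET = b"demo-signing-key"  # illustration only; use a managed key in practice

def issue_ticket(principal, tool, budget):
    """Issue a signed capability ticket binding a principal to one tool."""
    body = json.dumps({"sub": principal, "tool": tool, "budget": budget},
                      sort_keys=True).encode()
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return {"body": body, "sig": sig}

def authorize(ticket, tool, spent):
    """Verify signature, tool binding, and remaining budget before invocation."""
    expected = hmac.new(SECRET, ticket["body"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(ticket["sig"], expected):
        return False  # forged/tampered ticket: blocks Authorization Confusion
    claims = json.loads(ticket["body"])
    # Tool mismatch blocks role/principal attacks; budget cap contains
    # Denial-of-Wallet by refusing calls once the allowance is spent.
    return claims["tool"] == tool and spent < claims["budget"]
```

Placing this check at the Tool Routing boundary means a hijacked planner can at most request actions already within the principal's signed authority and budget.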
These findings define a diagnostic baseline for architects seeking to build autonomous AI agents whose runtime operation remains strictly within intended goal and authority envelopes (Puppala et al., 7 Feb 2026).
5. Extending AgentFence Concepts: Runtime Enforcement and Policy Guardrails
AgentFence aligns with a wider family of enforcement and guardrail paradigms validated in contemporary research. Runtime DSLs (e.g., AgentSpec (Wang et al., 24 Mar 2025)) provide event-driven enforcement hooks for agent action plans, with human-readable rule definitions that map triggers, predicates, and enforcements to observable agent trajectories. AGrail (Luo et al., 17 Feb 2025) introduces continual, lifelong guardrail layers centered on adaptive safety-check generation, memory-augmented refinement, and tool-based safety detection—thus offering robust and transferable runtime constraints. ShieldAgent (Chen et al., 26 Mar 2025) further brings probabilistic logic and LTL-based verifiable circuits into action-trajectory shielding for LLM agents, demonstrating state-of-the-art recall and efficiency in multi-environment adversarial settings.
These approaches can be directly integrated with AgentFence principles:
| Approach | Core Enforcement Method | Compatibility with AgentFence |
|---|---|---|
| AgentSpec (Wang et al., 24 Mar 2025) | DSL-triggered, rule-based | Augments architectural boundaries via runtime policies |
| AGrail (Luo et al., 17 Feb 2025) | LLM-driven checklists, checklist memory | Adaptive, lifelong guardrail on agent actions |
| ShieldAgent (Chen et al., 26 Mar 2025) | Probabilistic circuit-based, LTL verification | Action-trajectory shielding, explanation, and formal guarantees |
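The trigger/predicate/enforcement pattern shared by these runtime approaches can be sketched in a few lines. This is a simplified illustration in the spirit of rule-based runtime DSLs, not the actual AgentSpec syntax or API; `Rule`, `guard`, and `sandbox_rule` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    """A trigger/predicate/enforcement triple over observable agent actions."""
    trigger: str                       # event name, e.g. "tool_call"
    predicate: Callable[[dict], bool]  # rule fires when this returns True
    enforce: Callable[[dict], dict]    # rewrites, annotates, or blocks the action

def guard(event, action, rules):
    """Apply every matching rule before the action reaches the executor."""
    for rule in rules:
        if rule.trigger == event and rule.predicate(action):
            action = rule.enforce(action)
    return action

# Example rule: block shell invocations that touch paths outside a sandbox.
sandbox_rule = Rule(
    trigger="tool_call",
    predicate=lambda a: a["tool"] == "shell"
                        and not a["arg"].startswith("/sandbox"),
    enforce=lambda a: {**a, "blocked": True},
)
```

Hooking `guard` at each AgentFence trust boundary is one way such runtime policies "augment architectural boundaries," as the table indicates.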
6. Multi-Agent Fencing in Robotics and Distributed Systems
The AgentFence paradigm also encompasses robotic and physical-agent contexts. In multi-agent robotics, fencing strategies implement containment of adversarial or unknown targets using distributed collaborating agents ("dogs" herding "sheep") with guarantees provided by control barrier functions (CBFs) and convex quadratic programming (Grover et al., 2022). In adversarial swarm herding, a closed defender formation ("StringNet") ensures full spatial containment and safe navigation amidst obstacles with provable Lyapunov stability and collision avoidance (Chipade et al., 2019). Label-free strategies enable rigid fencing and collision-free tracking of moving targets without any pre-assigned agent roles (Hu et al., 2023).
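For a single CBF constraint on a single-integrator agent, the quadratic program mentioned above has a closed-form solution: keep the nominal control if it already satisfies the barrier condition $\dot h \ge -\alpha h$, otherwise project it minimally onto that half-space. A self-contained 2-D sketch (a generic CBF safety filter, not the specific controllers of the cited papers):

```python
def cbf_filter(x, u_nom, c, r, alpha=1.0):
    """Closed-form CBF safety filter for a 2-D single-integrator agent
    required to stay within radius r of point c: h(x) = r^2 - ||x - c||^2."""
    dx, dy = x[0] - c[0], x[1] - c[1]
    h = r * r - (dx * dx + dy * dy)
    gx, gy = -2.0 * dx, -2.0 * dy          # gradient of h
    hdot = gx * u_nom[0] + gy * u_nom[1]   # h-dot under the nominal control
    if hdot >= -alpha * h:
        return list(u_nom)                 # nominal input is already safe
    # Minimal-norm correction onto the half-space grad_h . u >= -alpha * h.
    lam = (-alpha * h - hdot) / (gx * gx + gy * gy)
    return [u_nom[0] + lam * gx, u_nom[1] + lam * gy]
```

With multiple agents or obstacles, each safety condition adds one linear constraint and the correction becomes a small convex QP, which is the form used in the distributed fencing strategies above.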
In patrolling scenarios, the agent-fence problem is classically formulated as the optimal deployment of mobile agents to keep all points of a 1D/2D fence visited within a target idle time, with lower bounds and constructive upper bounds quantitatively characterized (Dumitrescu et al., 2014).
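A simple baseline for the 1D open-fence case is the partition strategy: split the fence among the agents and have each zigzag over its own segment, so every point is revisited within one round trip of a segment. A minimal sketch of the resulting idle-time bound (this is the elementary baseline, not the tight bounds of the cited analysis):

```python
def partition_idle_time(length, k, speed=1.0):
    """Idle time achieved by the partition strategy on an open 1D fence:
    each of k agents patrols a segment of length L/k back and forth at
    the given speed, so no point goes unvisited longer than 2L/(k*speed)."""
    return 2.0 * length / (k * speed)
```

Lower bounds and improved constructions in the literature measure how far optimal schedules can beat this baseline.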
7. Conceptual Scope and Future Directions
AgentFence now denotes a general methodology for evaluating, enforcing, and guaranteeing safe agent behavior—spanning both architectural (LLM-based) and physical multi-agent contexts. In software, it offers systematic boundary-mapping, architecture-level metrics, and leverages runtime DSL and logic-based guardrails for deep agents. For mobile and cyber-physical systems, it provides distributed control-theoretic and optimization-based containment protocols.
Future research involves:
- Expanding empirical coverage to heterogeneous agent types and dynamic architectures.
- Formal synthesis of boundary policies leveraging LLM-derived guardrail specifications.
- Robustification against novel attack vectors, including emergent multi-agent delegation and cross-agent leakage.
- Human-in-the-loop auditing, compliance traceability, and integration with formal certification and verification layers.
AgentFence thus provides both a conceptual and operational reference point for secure autonomous agent design at scale (Puppala et al., 7 Feb 2026, Wang et al., 24 Mar 2025, Luo et al., 17 Feb 2025, Chen et al., 26 Mar 2025, Grover et al., 2022, Chipade et al., 2019, Hu et al., 2023, Dumitrescu et al., 2014).