
Agentic Safety Taxonomy

Updated 30 January 2026
  • Agentic Safety Taxonomy is a structured framework that delineates harms and vulnerabilities specific to autonomous, tool-using, multi-agent AI systems.
  • It categorizes risks into distinct subcategories such as malware generation, malicious human interaction, harmful content, biased decisions, and unauthorized actions.
  • The taxonomy underpins empirical benchmarks and layered defense designs by mapping attack surfaces with validated data across diverse multi-agent environments.

An agentic safety taxonomy provides a structured, exhaustive map of the potential harms, misbehaviors, and vulnerabilities specific to autonomous, tool-using, and multi-agent AI systems. Unlike classical model-safety taxonomies built around supervised LLMs or isolated decision-makers, agentic safety frameworks must account for real-world impact surfaces (tool invocation, inter-agent dialogue, and environment interaction), the compositional and emergent properties of multi-agent workflows, and the new attack and failure modes introduced by autonomy, planning, and orchestration. Formal, empirically validated taxonomies in this area, such as the one presented in "Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms" (Nöther et al., 22 Aug 2025), serve as the backbone for benchmarks, defense design, security evaluation, and the development of systemic safety standards for advanced agentic platforms.

1. Objectives, Formalism, and Design Principles

The principal goal of agentic safety taxonomy is to enumerate "the complete space of real-world–impactful harms" that an adversarial peer or environment can induce in an agentic system. The formal agentic system is defined as:

  • S = (o, G), where o : X × τ → ℝ, with X the set of tasks and τ the trajectory of messages/tool calls.
  • G = (A, E) encodes the directed communication graph over the agent set A with edges E.

Within this framework, a harmful action is any trajectory τ containing unauthorized commands or content that maximizes an adversarial objective o_a disjoint from the system's intended objective o.
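
This formalism can be sketched as a minimal data structure. All names and the concrete harm criterion here are illustrative assumptions, not definitions from the paper:

```python
from dataclasses import dataclass
from typing import Callable

# A trajectory tau is a sequence of messages/tool calls exchanged by agents.
Trajectory = list[str]

@dataclass
class CommunicationGraph:
    """G = (A, E): directed communication graph over the agent set."""
    agents: set[str]
    edges: set[tuple[str, str]]  # (sender, receiver) pairs

@dataclass
class AgenticSystem:
    """S = (o, G): objective o maps (task, trajectory) to a real reward."""
    objective: Callable[[str, Trajectory], float]  # o : X x tau -> R
    graph: CommunicationGraph

def is_harmful(traj: Trajectory, task: str,
               o: Callable[[str, Trajectory], float],
               o_adv: Callable[[str, Trajectory], float],
               threshold: float = 0.0) -> bool:
    """Flag a trajectory as harmful when it advances an adversarial
    objective o_a while failing the system's intended objective o."""
    return o_adv(task, traj) > threshold and o(task, traj) <= threshold
```

The threshold-based test is one simple way to operationalize "disjoint objectives"; the paper itself does not prescribe a specific decision rule.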

Design criteria for robust taxonomy construction include:

  • Completeness: Must cover all permissioned, real-world actions an agentic system can take (e.g., file system access, network transactions, social messaging).
  • Discriminability: Subcategories are non-overlapping and feature crisp boundaries, ensuring that each malicious datapoint is unambiguously bucketed.
  • Extendibility: Admits new subcategories without forcing a refactoring of the top-level structure as tool, environment, or agent capabilities grow (Nöther et al., 22 Aug 2025).
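
The completeness and discriminability criteria can be checked mechanically if each subcategory is expressed as a membership predicate; the predicate encoding below is an assumption for illustration, not the paper's method:

```python
from typing import Callable

# Hypothetical predicate per subcategory: does this datapoint belong to it?
Predicate = Callable[[str], bool]

def bucket_datapoint(datapoint: str,
                     subcategories: dict[str, Predicate]) -> str:
    """Return the single subcategory a datapoint falls into, raising if
    the taxonomy's completeness or non-overlap criteria are violated."""
    matches = [name for name, pred in subcategories.items() if pred(datapoint)]
    if len(matches) == 0:
        raise ValueError("completeness violated: no subcategory matches")
    if len(matches) > 1:
        raise ValueError(f"discriminability violated: overlap in {matches}")
    return matches[0]
```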

2. Category and Subcategory Structure

Nöther et al. (Nöther et al., 22 Aug 2025) propose a taxonomy with five top-level categories, each with granular subcategories that delineate modes of harm and their real-world signatures:

| Main Category | Subcategories (abbreviation) | Distinguishing Characteristic |
|---|---|---|
| Generation and Execution of Malware | GM (malware), UC (unsafe code), PI (private info), MF (malicious files), DF (deletion) | Code and tool misuse; intent (damage/exfiltration) |
| Malicious Interaction with Humans | I (impersonation), PS (phishing/spam), M (misinfo), P (privacy leak), PH (physical harm) | Direct human harm, messaging, intent clarity |
| Generation of Harmful Content | CI (copyright), T (toxicity) | Purely content-based (text/image); rights/psychological harm |
| Biased Decisions | A (advertisements) | Commercial bias without technical or content harm |
| Unauthorized Actions | TR (transaction), UM (unauthorized message), DOS (denial-of-service), SR (stealing resources) | Unpermitted tool calls/diversions; deviation from user intent |

Decision boundaries among subcategories are explicitly defined by action type (e.g., executing arbitrary shell code vs. introducing vulnerabilities vs. reading secrets) and by semantic features (e.g., impersonation asserts a false speaker identity, phishing solicits credentials, and content-based harms are distinguished from biased actions).

3. Methodology: Construction and Validation

Taxonomy derivation in BAD-ACTS (Nöther et al., 22 Aug 2025) combined:

  • Systematic literature review (synthesizing agentic AI harms and empirical agentic failures).
  • Mapping of hundreds of harmful candidate examples (including adversarial attacks and real-world exploits).
  • Iterative refinement, expanding or collapsing provisional subcategories to achieve non-overlap and full coverage.
  • Empirical validation by prompting advanced LLMs (Llama-3 70B, GPT-4) to enumerate novel harms in diverse application environments until saturation (30 consecutive redundant proposals), resulting in a 188-datapoint benchmark.
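
The saturation criterion in the last step can be sketched as a generic loop. Here `propose` (an LLM call in the paper) and `is_redundant` (a check against already-collected harms) are assumed callables, and the parameter names are illustrative:

```python
from typing import Callable

def enumerate_until_saturation(propose: Callable[[], str],
                               is_redundant: Callable[[str, list[str]], bool],
                               patience: int = 30,
                               max_iters: int = 10_000) -> list[str]:
    """Collect novel harm proposals until `patience` consecutive proposals
    are redundant with the collected set (the saturation criterion)."""
    collected: list[str] = []
    streak = 0  # consecutive redundant proposals seen so far
    for _ in range(max_iters):
        candidate = propose()
        if is_redundant(candidate, collected):
            streak += 1
            if streak >= patience:
                break  # saturation reached
        else:
            collected.append(candidate)
            streak = 0
    return collected
```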

Each benchmark action is labeled by high-level category and subcategory, providing a schema for vulnerability tracking and defense targeting.

4. Taxonomy Integration and Benchmarking Utility

The taxonomy underpins BAD-ACTS, which spans four diverse multi-agent environments (decentralized travel planning; hierarchical financial article composition; CEO-worker code generation; sequential debate). Attackers control one peer agent and attempt to elicit a specific harmful action from the target system. Attack success rates (ASRs) are reported per taxonomy category and subcategory, enabling fine-grained empirical vulnerability analysis across architectures, role placements, and communication topologies.
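
Per-category ASR reporting reduces to a grouped success ratio. A minimal sketch, assuming a flat record schema (the field names are illustrative, not the benchmark's actual format):

```python
from collections import defaultdict

def asr_by_category(results: list[dict]) -> dict[str, float]:
    """Compute attack success rate (ASR) per taxonomy category.

    Each result is assumed to look like {"category": str, "success": bool}.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["success"])
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

The same grouping applies at subcategory granularity by keying on a (category, subcategory) pair instead.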

Key findings:

  • ASR ≥60% for "Unauthorized Actions" (e.g., DOS, SR) and "Release Private Information"—demonstrating unique vulnerability in tool-enabled agentic systems.
  • ASR 30–60% for "Malware" (GM, UC) and "Causing Personal Harm" (PH).
  • ASR <25% for "Impersonation," "Malicious Files," and content-only harms ("Toxicity," "Copyright").
  • Larger LLMs are more, not less, susceptible.
  • Centralized/hierarchical communication offers slight improvements over fully decentralized topologies.

5. Impact on Defense Design and Systemic Risk Management

The prescriptive nature of this taxonomy enables:

  • Prompt-based defense: System prompts are augmented with explicit anti-harm instructions per subcategory (e.g., explicit refusals for file deletion or credential solicitation), yielding only a modest ASR reduction (3–7%), since adversarial agents can circumvent generic instructions.
  • Message monitoring ("guardian agents"): Defensive agents or filters intercept and pattern-match messages/tool calls for category-specific harm signatures (e.g., regex for known code exploits, domain blacklists for PI, content classifier for T). This reduces ASR by 25–55% overall, especially for code- or tool-based categories with clear surface-level signals.
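
A minimal sketch of a guardian-style filter, assuming regex-based harm signatures keyed by subcategory; the patterns below are illustrative placeholders, not the paper's actual rules:

```python
import re

# Illustrative harm signatures per subcategory abbreviation (assumed, not
# taken from the paper): DF = deletion, PS = phishing, PI = private info.
HARM_SIGNATURES: dict[str, re.Pattern] = {
    "DF": re.compile(r"\brm\s+-rf\b"),
    "PS": re.compile(r"(verify|confirm).*(password|credentials)", re.I),
    "PI": re.compile(r"\b(api[_-]?key|ssh[_-]?key|\.env)\b", re.I),
}

def guardian_filter(message: str) -> list[str]:
    """Return the subcategories whose signatures the message triggers;
    an empty list means the message passes the filter."""
    return [sub for sub, pat in HARM_SIGNATURES.items() if pat.search(message)]
```

Surface-level matching of this kind explains why the reported gains concentrate in code- and tool-based categories, where harmful payloads carry clear lexical signals.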

The taxonomy is thus a direct substrate for layered, category-driven defense orchestration, benchmarked response evaluation, and future-proofed extensibility.

6. Broader Context: Comparative Taxonomy Perspectives

Subsequent research aligns with and extends these foundations. For instance:

  • "Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges" (Datta et al., 27 Oct 2025) structures threats into five branches: Prompt Injection & Jailbreaks, Cyber-Exploitation/Tool Abuse, Multi-Agent/Protocol-Level, Interface/Environment, and Governance/Autonomy. This taxonomy covers both concrete exploit vectors and systemic, emergent gaps.
  • Three-dimensional risk attribution (risk-source, failure-mode, real-world harm) as in AgentDoG (Liu et al., 26 Jan 2026) enables compositional diagnosis and root-cause transparency, complementing hierarchical action-category approaches.
  • Taxonomies operationalized in governance frameworks (e.g., AGENTSAFE (Khan et al., 2 Dec 2025)) integrate design, runtime, and audit controls for each taxonomic axis.
  • Agentic safety as applied in cyber-physical or laboratory domains (e.g., SDL safety along six axes (Chen et al., 25 Jan 2026)), and as lifecycle-aware in enterprise frameworks (e.g., Cisco's multi-level objectives/techniques/subtechniques (Chang et al., 15 Dec 2025)), demonstrates the increasing adoption of taxonomic rigor for both benchmarking and system design.
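
The three-dimensional attribution style mentioned above can be sketched as a labeled record grouped by axis; the field names and grouping choice here are assumptions for illustration, not AgentDoG's actual schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskAttribution:
    """Three-dimensional risk label: source, failure mode, real-world harm."""
    risk_source: str   # e.g. "compromised peer agent"
    failure_mode: str  # e.g. "unauthorized tool call"
    harm: str          # e.g. "data exfiltration"

def root_cause_groups(incidents: list[RiskAttribution]) -> dict[str, int]:
    """Group incidents by risk source, supporting root-cause transparency."""
    return dict(Counter(i.risk_source for i in incidents))
```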

7. Future Directions, Limitations, and Open Challenges

The agentic safety taxonomy in BAD-ACTS (Nöther et al., 22 Aug 2025) provides a coverage-optimal, extensible, and empirically validated structure, but as the space of agentic operations expands (with new tools, open-ended orchestration, more complex real-world deployments), categories and boundaries must continue to evolve. Unresolved challenges include:

  • Robustness to adaptive and white-box adversarial agents.
  • Long-horizon security (e.g., compounds of small misbehaviors, recursive multi-agent exploits).
  • Human-in-the-loop interfaces for verifying complex, multi-step agentic workflows.
  • Integration and harmonization with protocol-specific safety properties (see formal properties in (Allegrini et al., 15 Oct 2025)) and standards-based or organization-wide risk frameworks.

Foundational taxonomies are thus not merely catalogues but practical, programmatic drivers for empirical adversarial evaluation, targeted defense research, and system architecture across all slices of the agentic pipeline.
