AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Published 26 Jan 2026 in cs.AI, cs.CC, cs.CL, cs.CV, and cs.LG | (2601.18491v1)

Abstract: The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More Crucially, AgentDoG can diagnose the root causes of unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation in diverse and complex interactive scenarios. All models and datasets are openly released.

Abstract PDF Upgrade to Chat

Summary

The paper presents AgentDoG, a diagnostic framework that uses a three-dimensional taxonomy (risk source, failure mode, and harm) to achieve process-level safety monitoring.
It employs a multi-agent planner to generate over 100k synthetic, tool-augmented trajectories, ensuring comprehensive coverage of diverse agent risks.
AgentDoG outperforms prior models by significantly improving both binary moderation and fine-grained causal diagnosis, doubling risk source attribution accuracy.

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Motivation and Problem Setting

Autonomous AI agents executing complex tasks through tool use and environmental interactions expose previously unaddressed safety and security risks, including but not limited to prompt injection, malicious tool chaining, and emergent failures during long-horizon planning. Existing guardrails—predominantly designed for single-turn output moderation—are fundamentally inadequate for the agentic context due to their lack of agentic risk awareness and insufficient provenance or transparency. The binary “safe/unsafe” paradigm misses both nuanced process-level failures and the diagnosis required for actionable alignment.

Agentic Safety Taxonomy

A central innovation is the introduction and operationalization of a unified agentic safety taxonomy with three orthogonal dimensions: risk source (“where”), failure mode (“how”), and real-world harm (“what”). This taxonomy enables hierarchical, non-overlapping categorization of agentic risks spanning adversarial user instructions, tool corruption, environmental contamination, internal planning flaws, and consequences across privacy, financial, security, physical, psychological, societal, and functional domains.

Figure 1: Overview of the three orthogonal dimensions of the agentic safety taxonomy.

This structural decomposition is foundational for process-level safety monitoring and root cause analysis that surpass prior flat taxonomies. Each dimension is specified with explicit categories: 8 for risk source, 14 for failure mode, and 10 for harm, enabling systematic coverage of the agentic risk space.

Data Synthesis Pipeline

Rather than relying on narrow, manually-curated benchmarks, AgentDoG employs a workflow for large-scale, taxonomy-aligned synthetic data generation via an orchestrator-controlled, multi-agent planner. This pipeline produces over 100k coherent, tool-augmented multi-turn trajectories, sampling across all taxonomy dimensions and using a tool library of ~10k distinct APIs—orders of magnitude larger than previous benchmarks.

Figure 2: Three-stage, planner-based pipeline for multi-step agent safety trajectory synthesis.

This approach ensures systematically stratified coverage of both prevalent and long-tail risk patterns, with rigorous quality control including both deterministic and LLM-based semantic validation ensuring high-fidelity labels and process traceability.

Figure 3: Distribution of synthesized training data across the three taxonomy dimensions: risk source, failure mode, and real-world harm.

Benchmarking: ATBench

ATBench is introduced as a held-out, taxonomy-grounded, trajectory-level benchmark with 500 complex multi-turn, tool-enabled agent interaction records, balanced for safety verdict and explicitly labeled along all three taxonomic dimensions. Crucially, there is enforced zero overlap between training and benchmark tools for robust generalization assessment.

Figure 4: ATBench benchmark unsafe data distribution

Data quality is ensured by multi-agent verification (Qwen-QwQ, GPT-5.2, Gemini 3 Pro, DeepSeek-V3.2) and exhaustive human adjudication. ATBench evaluates not only binary risk detection but also fine-grained causal attribution, directly enabling comparative assessments of transparency and diagnostic depth.

AgentDoG Model Architecture and Training

AgentDoG is realized as supervised fine-tuned guardrail variants across Qwen and LLama model families at 4B, 7B, and 8B parameter scales. Training targets both trajectory-level binary safety judgment and fine-grained taxonomy annotation, leveraging large-scale, diverse synthetic trajectories. The training objective is negative log-likelihood of correct safety labels conditioned on entire process traces, supporting long-context, tool-rich scenarios.

Experimental Results

Binary Safety Moderation

AgentDoG demonstrates a marked improvement over prior guard models across three major benchmarks (R-Judge, ASSE-Safety, ATBench). For instance, AgentDoG-Qwen3-4B achieves 91.8% accuracy and a 92.7% F1 on R-Judge and 92.8% accuracy and a 93.0% F1 on ATBench, outperforming all tested guardrail models by substantial margins and matching or surpassing strong general foundation models despite much smaller parameter scale.

Figure 5: Accuracy(\%) of AgentDoG and existing general and guardrail models.

Existing guardrails exhibit dramatically low recall, frequently missing action-induced risks and intermediate unsafe steps, underscoring a fundamental failure of the binary output moderation paradigm in complex agent settings.

Fine-grained Diagnostic and Attribution Capabilities

AgentDoG achieves 82.0% accuracy for risk source, 32.4% for failure mode, and 58.4% for harm on ATBench fine-grained diagnosis—doubling the risk source attribution performance relative to best generalist baselines.

AgentDoG incorporates a hierarchical attribution module for agentic XAI, leveraging trajectory-level information gain and sentence-level perturbation to assign quantitative influence scores for every interaction step and utterance, directly localizing the internal driver of unsafe actions.

Figure 6: Illustration of attribution results across two representative scenarios—prompt injection in resume screening and sarcasm misinterpretation in financial analysis.

Comparative analysis shows safety-supervised AgentDoG identifies critical reasoning failures missed by the base model, enhancing causality and diagnostic transparency.

Figure 7: Comparative attribution analysis between AgentDoG-Qwen3-4B and Qwen3-4B-Instruct-2507 on ambiguous financial transaction.

Implications and Future Directions

AgentDoG advances the state of agentic safety guardrails with systematic risk modeling, scalable synthesis, rigorous multi-axis evaluation, and diagnostic XAI. The framework establishes causal traceability in agent risk assessment—crucial for debugging, alignment intervention, and regulatory auditing. Practical deployment benefits from higher recall and fine-grained transparency, especially under out-of-distribution tool scenarios and adversarial contexts.

Limitations remain in multimodal input handling and the current reactive monitoring paradigm; future research should pursue support for GUI and visual-planning traces and integrate diagnostic feedback as a reward signal for RL policy alignment, progressing toward proactive agentic risk mitigation.

Conclusion

AgentDoG provides a unified, interpretable foundation for process-level agent safety—spanning risk taxonomy, scalable data synthesis, robust benchmarking, causal diagnosis, and actionable guardrails—with open model and benchmark release to facilitate standardized research and deployment. The diagnostic paradigm introduced here signals a necessary shift in agent safety science from output moderation toward traceable, fine-grained, and context-aware provenance.