Papers
Topics
Authors
Recent
Search
2000 character limit reached

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Published 5 May 2026 in cs.AI and cs.CR | (2605.04019v1)

Abstract: AI systems are entering critical domains like healthcare, finance, and defense, yet remain vulnerable to adversarial attacks. While AI red teaming is a primary defense, current approaches force operators into manual, library-specific workflows. Operators spend weeks hand-crafting workflows - assembling attacks, transforms, and scorers. When results fall short, workflows must be rebuilt. As a result, operators spend more time constructing workflows than probing targets for security and safety vulnerabilities. We introduce an AI red teaming agent built on the open-source Dreadnode SDK. The agent creates workflows grounded in 45+ adversarial attacks, 450+ transforms, and 130+ scorers. Operators can probe multi-agent systems, multilingual, and multimodal targets, focusing on what to probe rather than how to implement it. We make three contributions: 1. Agentic interface. Operators describe goals in natural language via the Dreadnode TUI (Terminal User Interface). The agent handles attack selection, transform composition, execution, and reporting, letting operators focus on red teaming. Weeks compress to hours. 2. Unified framework. A single framework for probing traditional ML models (adversarial examples) and generative AI systems (jailbreaks), removing the need for separate libraries. 3. Llama Scout case study. We red team Meta Llama Scout and achieve an 85% attack success rate with severity up to 1.0, using zero human-developed code

Summary

  • The paper introduces an operator-centric, agentic red teaming framework that automates adversarial tests from weeks to hours, achieving an 85% attack success rate.
  • The methodology leverages a composable SDK with 45+ attacks, 450+ transforms, and 130+ scorers to unify traditional ML adversarial attacks with LLM jailbreaks.
  • The approach enhances auditability and regulatory compliance by mapping findings to established risk frameworks while reducing human workload.

Redefining AI Red Teaming for Agentic AI: From Manual Labor to Operator-Centric Automation

Introduction

The proliferation of agentic, multimodal, and multilingual AI—deployed in safety-critical domains—demands systematic, scalable, and auditable approaches for adversarial robustness evaluation. "Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours" (2605.04019) proposes an architectural and workflow shift: abandoning library-driven, hand-crafted red teaming in favor of natural language-driven, agentic systems that synthesize, execute, and analyze attacks. The epitome of this paradigm is the Dreadnode AI Red Teaming Agent, underpinned by a composable SDK encompassing a broad adversarial catalog (45+ attacks, 450+ transforms, 130+ scorers). This essay analyzes the technical foundations, empirical evidence, architectural design, results, and implications of this framework, with emphasis on its unification of traditional ML adversarial attacks and generative LLM jailbreaks into a cohesive, operator-centered pipeline.

From Library-Centric to Agentic AI Red Teaming

Traditional workflows in adversarial ML require operators to master attack techniques, parameterization, and orchestrate attack strategies manually. The cognitive load scales linearly with attack catalog size, and combinatorially with model attack surfaces (multilinguality, tool use, cross-modal reasoning), resulting in prolonged assessment cycles and inconsistent coverage. The Dreadnode framework reorients red teaming as an agent-driven interaction: operators provide high-level goals and constraints in natural language. The agent autonomously selects attack strategies, constructs transformative chains, configures scorers, and manages end-to-end execution against target systems.

This approach compresses weeks of manual configuration into hours of assessment, as evidenced in the Llama Scout case study (68 goals, 5 transforms, 3 attack types, 7,727 trials, ~85% attack success, completed in ~3 hours with zero custom code). The human operator is thus freed to focus on strategic coverage, risk prioritization, and attack refinement, rather than mechanical workflow assembly.

Architecture: Multi-Layered Agentic Red Teaming

The core system design consists of an interaction layer (conversational agent in TUI), an attack layer (catalog of attacks, transforms, scorers), an execution layer (OpenTelemetry-instrumented orchestrator), and an analytics layer (severity classification, compliance mapping, exportable findings). The agent maintains session context, allowing strategic iterative refinement ("now try multi-turn attacks", "add persona transforms") and stateful conversation that accelerates coverage exploration.

Attacks are abstracted to unify generative systems (prompt-based jailbreak, persona framing, role-play, adversarial suffix optimization) and traditional ML systems (input perturbation via SimBA, HopSkipJump, ZOO, NES). All attacks are implemented as optimization over input space—scorers automate evaluation against success criteria, enabling both model-agnostic and model-aware probing.

Transform modules (encoding/cipher, persona framing, language adaptation, tool-use, exfiltration, multimodal, supply chain/agentic exploits) and scorer modules (jailbreak detection, PII leak detection, advanced agentic behavior) are extensible and composable. The architecture's breadth enables both horizontal coverage (multimodal, multi-agent, multilingual) and vertical depth (novel attack composition).

Empirical Results: Llama Scout Case Study

The most compelling numerical evidence for the operator-centric agentic paradigm is presented in the systematic assessment of Meta's Llama Scout LLM. Using only natural language interaction via the TUI agent, the system orchestrated 681 assessments, yielding 573 discrete findings across 7,727 trials. Key statistics:

  • Attack success rate: 85%
  • Critical findings (score ≥ 0.9): 34.4% of all findings
  • Jailbreak completions: 59.5% of attacks, indicating full evasion of safety mechanisms

Notably, the majority of attack success was achieved using multi-turn attacks (Crescendo, Graph of Attacks with Pruning) and persona-based transforms (skeleton-key, role-play wrapper), which resulted in 100% success for several categories. Figure 1

Figure 2: Dashboard analytics reveal distinct vulnerabilities by attack type, goal category, and operational efficiency across the assessment campaign.

Transform-specific analysis indicates severe weaknesses in persona-based (100% success) and language adaptation (multilingual jailbreaks) transforms, with weaker but non-trivial robustness to encoding-based attacks (Base64, 75% success). Figure 3

Figure 4: Success rate granularity by transform type and attack type shows the target's susceptibility surface and transform-attack synergy.

A salient observation is the agent's ability to orchestrate attacks requiring no explicit transform—indicating fundamental alignment gaps in the target model. Figure 5

Figure 1: Heatmap cross-referencing attack strategy and transform with severity, contextualized by harm category prevalence.

Human-in-the-loop review is supported, enabling expert override of scorer misclassifications, ensuring auditability and regulatory traceability. Figure 6

Figure 3: Operators can audit, reclassify, and explain judgments, with analytics metrics updating in real time.

All findings are exhaustively mapped to compliance standards (OWASP LLM Top 10, MITRE ATLAS, NIST AI RMF, Google SAIF) with no manual overhead. Figure 7

Figure 5: Every finding is automatically tagged to established AI risk frameworks for immediate regulatory alignment and evidence export.

Case Study Workflow Realization

Operators initiate and refine assessments entirely through natural language instruction—a paradigm shift from legacy code-driven, parameter-heavy flows. Every finding includes full evidence traceability (prompt, transform, response, scorer decision, attack metadata, compliance mapping). Drill-down analytics enable investigation from executive-level dashboards to per-findings tables. Figure 8

Figure 6: Each table row provides instantaneous access to evidence, severity, scoring, and compliance context.

Figure 9

Figure 7: UI snapshot—the operator initiates TAP attacks with five transforms, achieving near-total coverage without manual engineering.

High-level metrics support operational risk management and investment prioritization. Figure 10

Figure 8: Executive dashboards synthesize posture, exposure, compliance status, and attack outcome trends.

Theoretical and Practical Implications

This agentic framework provides several consequential improvements:

  • Scalability and coverage: Human operators no longer constrain adversarial exploration; agentic automation enables full combinatorial coverage over goals, transforms, and attack types in feasible timeframes.
  • Unified abstraction: Both LLM jailbreaks and ‘classical‘ ML adversarial attacks are addressed with a single abstraction, catalyzing more consistent reporting, evidence tracking, and compliance mapping across organizational AI portfolios.
  • Operator focus realignment: Strategic guidance, result analysis, and refinement eclipse low-level workflow engineering. This partitioning more effectively leverages scarce human expertise for security-critical oversight.
  • Regulatory readiness: The system’s native compliance tagging and audit trail generation streamline risk evidence for regulatory and enterprise requirements.

Theoretically, the results further expose the deficiencies of current LLM safety alignment—demonstrating that even well-trained, state-of-art systems (Llama Scout) are systematically vulnerable to role-framing, language adaptation, and conversational escalation, corroborating cross-paper findings on persistent alignment failures (cf. Wei et al., 2023; Mulla et al., 2025).

Limitations

Operator oversight remains necessary: agent performance is bottlenecked by LLM understanding, scorer accuracy, and attack catalog completeness. The open-ended nature of adversarial innovation precludes exhaustive cataloging, and scorer failures remain possible. While operator time is significantly reduced, comprehensive multi-modal multi-domain assessments may still require days for the most expansive configurations.

Conclusion

"Redefining AI Red Teaming in the Agentic Era" (2605.04019) represents a decisive transition toward scalable, auditable, and operator-centric red teaming. The presented system compresses the timeline for comprehensive coverage from weeks to hours, unifies multi-domain attack support, and enables structured compliance-aligned analytics without manual labor. Empirical results underscore systematic model vulnerabilities and expose the inadequacy of current alignment paradigms against advanced adversarial workflows.

As AI systems become more agentic, interconnected, and operationally critical, only agentic adversarial assessment can match their complexity. Anticipating ongoing advancements in attack methodology, future work should focus on expanding the expressivity of high-level operator objectives, integrating more sophisticated reasoning for attack selection and prioritization, and continual adaptation to emergent attack surfaces in autonomous, collaborative, and real-time AI deployments.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 4 tweets with 35 likes about this paper.