AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models

Published 17 Jun 2025 in cs.CR | (2506.14682v1)

Abstract: We introduce AIRTBench, an AI red teaming benchmark for evaluating LLMs' ability to autonomously discover and exploit Artificial Intelligence and Machine Learning (AI/ML) security vulnerabilities. The benchmark consists of 70 realistic black-box capture-the-flag (CTF) challenges from the Crucible challenge environment on the Dreadnode platform, requiring models to write python code to interact with and compromise AI systems. Claude-3.7-Sonnet emerged as the clear leader, solving 43 challenges (61% of the total suite, 46.9% overall success rate), with Gemini-2.5-Pro following at 39 challenges (56%, 34.3% overall), GPT-4.5-Preview at 34 challenges (49%, 36.9% overall), and DeepSeek R1 at 29 challenges (41%, 26.9% overall). Our evaluations show frontier models excel at prompt injection attacks (averaging 49% success rates) but struggle with system exploitation and model inversion challenges (below 26%, even for the best performers). Frontier models are far outpacing open-source alternatives, with the best truly open-source model (Llama-4-17B) solving 7 challenges (10%, 1.0% overall), though demonstrating specialized capabilities on certain hard challenges. Compared to human security researchers, LLMs solve challenges with remarkable efficiency completing in minutes what typically takes humans hours or days-with efficiency advantages of over 5,000x on hard challenges. Our contribution fills a critical gap in the evaluation landscape, providing the first comprehensive benchmark specifically designed to measure and track progress in autonomous AI red teaming capabilities.

Abstract PDF Upgrade to Chat

Summary

The paper introduces AIRTBench, a benchmark that evaluates LLMs' autonomous red teaming abilities through 70 capture-the-flag challenges.
It compares model performances, with Claude-3.7-Sonnet achieving 61.4% success and highlighting different proficiencies in prompt injection versus complex tasks.
The research offers practical insights for cybersecurity, providing open-source tools to help SOCs and AI engineers preemptively address vulnerabilities.

Measuring Autonomous AI Red Teaming Capabilities with AIRTBench

The advancement of LLMs has introduced a new frontier in artificial intelligence applications, particularly within the field of cybersecurity. The paper "AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in LLMs" (2506.14682) explores the creation and utilization of AIRTBench—a benchmark specifically designed for assessing the autonomous AI red teaming abilities of LLMs.

Introduction to AIRTBench

AIRTBench is developed to evaluate LLMs based on their capabilities to autonomously discover and exploit security vulnerabilities in AI/ML systems. This benchmark consists of a suite of 70 black-box capture-the-flag (CTF) challenges that require LLMs to write Python code aimed at compromising AI systems. By leveraging the Crucible challenge environment on the Dreadnode platform, AIRTBench provides a realistic and comprehensive measurement tool for AI red teaming abilities, facilitating the comparison between different models' performance in adversarial contexts.

Model Performance Analysis

One of the paper's key highlights is the detailed comparison between various high-performing models in solving these challenges. Claude-3.7-Sonnet emerged as the leading model, solving 61.4% of the challenges, followed closely by Gemini-2.5-Pro and GPT-4.5-Preview, with 55.7% and 48.6% success rates, respectively. These models excel at certain attack types, particularly prompt injection challenges, while struggling with more complex tasks such as system exploitation and model inversion.

The success of these models is measured in minutes for tasks that traditionally would require hours for human operators, demonstrating a substantial efficiency advantage. However, several challenges remained entirely unsolved by all models, highlighting existing gaps in autonomous AI red teaming capabilities.

Figure 1: AIRTBench Harness Architecture Overview

Practical Implications for Cybersecurity

The implications of this research stretch across various sectors of the cybersecurity ecosystem. For Security Operations Centers (SOC), the insights provided by AIRTBench could be instrumental in refining monitoring and defense strategies against emerging threats. Similarly, red teams could utilize these findings to simulate realistic AI system attacks, allowing organizations to proactively identify vulnerabilities.

Furthermore, AI/ML security engineers can use AIRTBench to validate systems against common attack vectors, enhancing the security and reliability of LLM applications. The benchmark allows these engineers to prioritize vulnerabilities as per frameworks like MITRE ATLAS and OWASP Top 10, ensuring robust protection against model-specific threats.

Artifact Availability and Open-source Contributions

The research contributes significantly to the academic and operational landscape by providing open-source access to AIRTBench's evaluation tools and dataset. This enables broader community engagement and facilitates the development of improved security benchmarks and mechanisms. With the open-source code available on GitHub, AIRTBench serves as a foundation for advancing AI red teaming capabilities.

Conclusion

The introduction of AIRTBench fills a critical gap in AI security benchmarking by providing a structured, comprehensive framework for evaluating and enhancing autonomous AI red teaming capabilities. The benchmark not only measures current model performances but also sets a standard for future advancements in the field. By bridging academic research with practical cybersecurity applications, AIRTBench contributes significantly to the development of reliable and efficient AI defenses, ensuring the safe deployment of LLMs in critical infrastructure across various industries. As models evolve, AIRTBench will undoubtedly play a crucial role in tracking and improving AI's ability to navigate complex security landscapes.