
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Published 8 Jun 2024 in cs.CR, cs.AI, cs.CY, and cs.LG | (2406.05590v3)

Abstract: LLMs are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized benchmark, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our benchmark dataset open source to the public at https://github.com/NYU-LLM-CTF/NYU_CTF_Bench, along with our playground automation framework at https://github.com/NYU-LLM-CTF/llm_ctf_automation.


Summary

  • The paper proposes a novel benchmark dataset with 200 validated CTF challenges from NYU’s CSAW competitions.
  • It details a modular framework integrating LLMs with domain-specific tools for dynamic offensive security evaluation.
  • Performance tests reveal GPT-4’s superiority over open-source models, highlighting key areas for LLM improvement.

A Comprehensive Evaluation of LLMs in Offensive Security through the NYU CTF Dataset

LLMs have seen increasing utilization across various domains, extending their capabilities into the field of cybersecurity. The paper presents a novel approach for evaluating LLMs in the context of Capture the Flag (CTF) challenges—competitive scenarios that simulate real-world cybersecurity tasks. By developing a substantial, open-source benchmark dataset specifically curated for CTF challenges, this research provides an indispensable resource for assessing and enhancing the performance of LLMs in offensive security.

Introduction to the NYU CTF Dataset

The NYU CTF dataset is meticulously designed to accommodate the diverse and intricate nature of CTF challenges. The dataset comprises 200 validated challenges sourced from New York University's (NYU) Cybersecurity Awareness Week (CSAW) competitions. These challenges encompass six distinct categories: cryptography, forensics, binary exploitation (pwn), reverse engineering, web exploitation, and miscellaneous tasks. Each category offers a unique set of obstacles requiring advanced reasoning and technical proficiency, thus serving as rigorous tests for LLMs.

Dataset Structure and Categories

The dataset includes comprehensive metadata for each challenge, detailing its description, difficulty level, associated files, and tools required for solving it. The challenges are designed to mirror real-world cyber threats, and their validation ensures they remain solvable despite changes in software environments over the years. By structuring the dataset in a standardized format and integrating it with Docker containers for dynamic challenge loading, the authors provide a robust platform for systematic LLM evaluation.
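A harness consuming this dataset would first parse each challenge's metadata before deciding whether a Docker service needs to be started. The sketch below illustrates that step in Python; the field names (`name`, `category`, `points`, `files`, `compose`) are illustrative assumptions, not the benchmark's actual JSON schema.

```python
import json
from pathlib import Path

# Hypothetical metadata record -- field names are illustrative only,
# not the benchmark's actual challenge.json layout.
SAMPLE = {
    "name": "baby_rev",
    "category": "rev",            # crypto, forensics, pwn, rev, web, or misc
    "event": "CSAW-Quals-2021",
    "points": 100,                # rough proxy for difficulty
    "files": ["baby_rev.tar.gz"],
    "compose": True,              # whether a Docker service must be started
}

def load_challenge(path: Path) -> dict:
    """Parse a challenge metadata file and sanity-check the fields a
    harness would need before handing the task to an LLM."""
    meta = json.loads(path.read_text())
    for field in ("name", "category", "files"):
        if field not in meta:
            raise ValueError(f"challenge metadata missing {field!r}")
    return meta

if __name__ == "__main__":
    path = Path("challenge.json")
    path.write_text(json.dumps(SAMPLE))
    meta = load_challenge(path)
    # A real harness would branch here: spin up the challenge's Docker
    # container when meta["compose"] is set, else serve the local files.
    print(meta["name"], meta["category"])
```

Validating metadata up front matters here because, as the paper notes, challenges must remain solvable despite years of software-environment drift; a missing file list or service flag would silently break an evaluation run.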

Automated Framework for CTF Evaluation

To facilitate the automated evaluation of LLMs, the authors introduce a sophisticated framework that orchestrates the interaction between LLMs and the CTF challenges. This framework consists of five primary modules:

  1. Backend Module:
    • Supports multiple LLM services, including OpenAI, Anthropic, and open-source models served via TGI and vLLM.
    • Handles authentication and model inference via configurable API keys and endpoint URLs.
  2. Data Loader:
    • Efficiently loads challenges either from Docker containers or local files.
    • Implements a garbage collection mechanism to manage resources effectively, stopping and removing containers post-challenge completion.
  3. External Tools:
    • Enhances LLMs with domain-specific tools such as decompilers, function callers, and command execution utilities.
    • Designed to augment the problem-solving capabilities of LLMs in a cybersecurity context.
  4. Logging System:
    • Writes structured logs in rich-text Markdown, aiding detailed post-execution analysis.
    • Captures system prompts, user prompts, model outputs, and debugging information for comprehensive evaluation.
  5. Prompt Module:
    • Constructs system and user prompts based on CTF metadata.
    • Facilitates structured interactions ensuring LLMs have the necessary information to attempt solving the challenges.
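The five modules above can be sketched as a single tool-calling loop: the prompt module seeds the conversation, the backend proposes either a tool call or a final answer, external tools execute, and every exchange is logged. The following minimal Python sketch substitutes a stub for the real LLM backend; all names (`run_command`, `stub_backend`, `solve`) are illustrative assumptions, not the framework's actual API.

```python
import subprocess

def run_command(cmd: str) -> str:
    """External-tool stand-in: execute a shell command and capture its output."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=30)
    return out.stdout + out.stderr

TOOLS = {"run_command": run_command}  # a real harness adds decompilers, etc.

def stub_backend(messages):
    """Stand-in for the Backend module: a real harness would call the
    OpenAI/Anthropic function-calling API here. This stub issues one
    tool call, then answers with the tool's output."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "run_command", "args": "echo flag{demo}"}
    return {"answer": messages[-1]["content"].strip()}

def solve(system_prompt: str, max_rounds: int = 5) -> str:
    """Drive the model-tool loop until it proposes a flag or gives up."""
    log = [{"role": "system", "content": system_prompt}]  # Logging module
    for _ in range(max_rounds):
        reply = stub_backend(log)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])       # tool dispatch
        log.append({"role": "tool", "content": result})
    return ""
```

The bounded `max_rounds` loop reflects a design choice such a framework must make: since each tool call costs tokens and time, the harness needs an explicit budget after which a challenge is marked unsolved.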

Performance Evaluation

The evaluation of five LLMs (GPT-3.5, GPT-4, Claude, Mixtral, and LLaMA) across 200 CTF challenges demonstrated varied capabilities. Notably, GPT-4 performed the best overall, albeit with limited success, whereas open-source models like Mixtral and LLaMA did not solve any challenges. This highlights the current gap between black-box commercial models and their open-source counterparts in handling complex cybersecurity tasks.

Comparison with Human Performance

When comparing LLMs to human participants in CSAW competitions, it is evident that while LLMs like GPT-4 and Claude show promise, they still lag behind the median performance of human experts. This underscores the need for further refinement and development of LLMs to enhance their effectiveness in CTF challenges.

Ethical Considerations

Integrating LLMs in offensive security poses significant ethical challenges. The potential for misuse in launching sophisticated cyber-attacks necessitates stringent ethical guidelines and robust security measures. Educating cybersecurity professionals on AI ethics and ensuring the responsible deployment of LLMs are critical to mitigating these risks.

Conclusion and Future Directions

The NYU CTF dataset and the accompanying evaluation framework represent a significant step forward in benchmarking LLMs for cybersecurity applications. However, the authors acknowledge the need for addressing dataset imbalance, enhancing tool support, and keeping pace with advancements in LLM development. Future research should focus on expanding the dataset, incorporating a broader array of challenges, and continuously updating model support to maintain the framework's relevance and utility.

Implications for AI Development

This research has practical implications for advancing AI-driven solutions in cybersecurity. By providing a rigorous benchmark and a detailed framework for evaluation, it paves the way for more sophisticated LLMs capable of tackling real-world cybersecurity threats. The theoretical implications extend to the broader understanding of LLM capabilities in dynamic, multi-step reasoning tasks, highlighting areas for improvement and further study.

References

The full list of references is available in the paper itself, providing additional context for the methodologies and findings presented. The work draws on both past studies and contemporary advancements, grounding it firmly in the current landscape of AI and cybersecurity research.
