CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models

Published 2 Jan 2025 in cs.CR, cs.AI, and cs.LG | (2501.01335v1)

Abstract: Numerous studies have investigated methods for jailbreaking LLMs to generate harmful content. Typically, these methods are evaluated using datasets of malicious prompts designed to bypass security policies established by LLM providers. However, the generally broad scope and open-ended nature of existing datasets can complicate the assessment of jailbreaking effectiveness, particularly in specific domains, notably cybersecurity. To address this issue, we present and publicly release CySecBench, a comprehensive dataset containing 12662 prompts specifically designed to evaluate jailbreaking techniques in the cybersecurity domain. The dataset is organized into 10 distinct attack-type categories, featuring close-ended prompts to enable a more consistent and accurate assessment of jailbreaking attempts. Furthermore, we detail our methodology for dataset generation and filtration, which can be adapted to create similar datasets in other domains. To demonstrate the utility of CySecBench, we propose and evaluate a jailbreaking approach based on prompt obfuscation. Our experimental results show that this method successfully elicits harmful content from commercial black-box LLMs, achieving Success Rates (SRs) of 65% with ChatGPT and 88% with Gemini; in contrast, Claude demonstrated greater resilience with a jailbreaking SR of 17%. Compared to existing benchmark approaches, our method shows superior performance, highlighting the value of domain-specific evaluation datasets for assessing LLM security measures. Moreover, when evaluated using prompts from a widely used dataset (i.e., AdvBench), it achieved an SR of 78.5%, higher than the state-of-the-art methods.

Abstract PDF Upgrade to Chat

Summary

The paper introduces CySecBench—a dataset of 12,662 cybersecurity prompts across 10 categories for LLM jailbreaking assessment.
It employs GPT-based prompt generation and a two-phase filtration process to identify, clarify, and validate adversarial content.
Performance evaluations with ChatGPT, Claude, and Gemini reveal varying success rates, underscoring the need for robust AI security defenses.

CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking LLMs (2501.01335)

Introduction

The paper introduces CySecBench, a comprehensive, domain-specific dataset aimed at evaluating jailbreaking techniques in the cybersecurity domain. It addresses the limitations of existing datasets that are too broad and open-ended by providing a collection of 12,662 prompts organized into ten distinct categories. These categories are designed to enable a consistent and accurate assessment of the effectiveness of jailbreaking attempts on LLMs.

Prompt Generation and Dataset Specification

Prompt Generation and Filtration

The prompts are generated using OpenAI's GPT models, specifically GPT-1-mini and GPT-3.5-turbo, to compile a list of cybersecurity terms associated with various cyberattacks. These terms are categorized into groups such as Cloud Attacks, Cryptographic Attacks, and Malware Attacks, among others. For filtration, a two-phase process was implemented: (i) filtering prompts with non-malicious content, and (ii) using LLMs to identify and enhance malicious prompts for clarity.

Algorithm: GPT-Assisted Prompt Filtering

Identify malicious prompts using a GPT-based evaluation.
Rephrase prompts for clarity while maintaining their adversarial intent.
Validate rephrased prompts for malicious content.
Store validated prompts in the dataset.

Content Specification

The dataset includes 12,662 prompts across ten cybersecurity categories. These categories encompass a wide range of attacks, from cloud infrastructure compromises to web application vulnerabilities, providing an extensive test bed for LLM security measures.

Key Attributes of CySecBench:

Domain-specific focus on cybersecurity.
Comprehensive coverage with ten attack categories.
Prompts are close-ended, ensuring consistency in evaluations.

LLM Jailbreaking

The paper proposes a simple yet effective jailbreaking method utilizing prompt obfuscation. An LLM is instructed to generate a question set from initial prompts, and then another LLM is tasked with generating a solution set. This method was evaluated using the CySecBench dataset, with performance metrics such as Success Rate (SR) and Average Rating (AR) used for assessment.

Implementation

The jailbreaking architecture employs three popular LLMs—ChatGPT, Claude, and Gemini—each handling different phases of the process. An automated evaluation routine using a GPT judge rates the outputs on a scale from 1 to 5, demonstrating distinct patterns in LLM's resilience against this approach.

Performance Evaluation

The jailbreaking method showed Variances in LLM's effectiveness:

Gemini achieved an SR of 88.4\%, indicating weaker filtering mechanisms.
Claude showed stronger defenses with an SR of 17.4\%, suggesting robust filtering strategies.

Comparison with other datasets like AdvBench showed that domain-specific prompts significantly enhance jailbreak success rates.

The authors enhanced the jailbreak methodology by introducing a word-reversal obfuscation technique and a solution refinement step, improving the SR to 78.5\% on AdvBench. This demonstrates the potential of systematic improvement in jailbreaking techniques through process optimization.

Discussion

CySecBench provides a domain-specific evaluation framework compared to broader datasets. Its larger size allows for more reliable assessment of LLM's security, making it a valuable tool for researchers and developers looking to improve AI model safety.

Conclusions

CySecBench advances LLM security evaluation by offering a domain-specific dataset and an effective jailbreaking methodology. Its contributions lie in improving threat assessment and enabling more reliable model evaluations, suggesting directions for future improvements in AI safety.

Figure 1: The proposed jailbreaking architecture.

Figure 2: Jailbreaking performance of the proposed method evaluated using three different LLMs.

Figure 3: Enhancing the jailbreak architecture by (a) introducing additional obfuscation and (b) incorporating a refinement step.

Figure 4: ChatGPT output.

Figure 5: Claude output.

Figure 6: Gemini output.

This research highlights the effectiveness of tailored datasets like CySecBench in facilitating more accurate evaluations of LLMs against specific security threats, paving the way for robust AI system development.