- The paper introduces CySecBench—a dataset of 12,662 cybersecurity prompts across 10 categories for LLM jailbreaking assessment.
- It employs GPT-based prompt generation and a two-phase filtration process to identify, clarify, and validate adversarial content.
- Performance evaluations with ChatGPT, Claude, and Gemini reveal varying success rates, underscoring the need for robust AI security defenses.
CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking LLMs (2501.01335)
Introduction
The paper introduces CySecBench, a comprehensive, domain-specific dataset aimed at evaluating jailbreaking techniques in the cybersecurity domain. It addresses the limitations of existing datasets that are too broad and open-ended by providing a collection of 12,662 prompts organized into ten distinct categories. These categories are designed to enable a consistent and accurate assessment of the effectiveness of jailbreaking attempts on LLMs.
Prompt Generation and Dataset Specification
Prompt Generation and Filtration
The prompts are generated using OpenAI's GPT models, specifically GPT-1-mini and GPT-3.5-turbo, to compile a list of cybersecurity terms associated with various cyberattacks. These terms are categorized into groups such as Cloud Attacks, Cryptographic Attacks, and Malware Attacks, among others. For filtration, a two-phase process was implemented: (i) filtering prompts with non-malicious content, and (ii) using LLMs to identify and enhance malicious prompts for clarity.
Algorithm: GPT-Assisted Prompt Filtering
- Identify malicious prompts using a GPT-based evaluation.
- Rephrase prompts for clarity while maintaining their adversarial intent.
- Validate rephrased prompts for malicious content.
- Store validated prompts in the dataset.
Content Specification
The dataset includes 12,662 prompts across ten cybersecurity categories. These categories encompass a wide range of attacks, from cloud infrastructure compromises to web application vulnerabilities, providing an extensive test bed for LLM security measures.
Key Attributes of CySecBench:
- Domain-specific focus on cybersecurity.
- Comprehensive coverage with ten attack categories.
- Prompts are close-ended, ensuring consistency in evaluations.
LLM Jailbreaking
The paper proposes a simple yet effective jailbreaking method utilizing prompt obfuscation. An LLM is instructed to generate a question set from initial prompts, and then another LLM is tasked with generating a solution set. This method was evaluated using the CySecBench dataset, with performance metrics such as Success Rate (SR) and Average Rating (AR) used for assessment.
Implementation
The jailbreaking architecture employs three popular LLMs—ChatGPT, Claude, and Gemini—each handling different phases of the process. An automated evaluation routine using a GPT judge rates the outputs on a scale from 1 to 5, demonstrating distinct patterns in LLM's resilience against this approach.
The jailbreaking method showed Variances in LLM's effectiveness:
- Gemini achieved an SR of 88.4\%, indicating weaker filtering mechanisms.
- Claude showed stronger defenses with an SR of 17.4\%, suggesting robust filtering strategies.
Comparison with other datasets like AdvBench showed that domain-specific prompts significantly enhance jailbreak success rates.
Enhancements via Refinements
The authors enhanced the jailbreak methodology by introducing a word-reversal obfuscation technique and a solution refinement step, improving the SR to 78.5\% on AdvBench. This demonstrates the potential of systematic improvement in jailbreaking techniques through process optimization.
Discussion
CySecBench provides a domain-specific evaluation framework compared to broader datasets. Its larger size allows for more reliable assessment of LLM's security, making it a valuable tool for researchers and developers looking to improve AI model safety.
Conclusions
CySecBench advances LLM security evaluation by offering a domain-specific dataset and an effective jailbreaking methodology. Its contributions lie in improving threat assessment and enabling more reliable model evaluations, suggesting directions for future improvements in AI safety.
Figure 1: The proposed jailbreaking architecture.
Figure 2: Jailbreaking performance of the proposed method evaluated using three different LLMs.
Figure 3: Enhancing the jailbreak architecture by (a) introducing additional obfuscation and (b) incorporating a refinement step.
Figure 4: ChatGPT output.
Figure 5: Claude output.
Figure 6: Gemini output.
This research highlights the effectiveness of tailored datasets like CySecBench in facilitating more accurate evaluations of LLMs against specific security threats, paving the way for robust AI system development.