"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

Published 7 Aug 2023 in cs.CR and cs.LG | (2308.03825v2)

Abstract: The misuse of LLMs has drawn significant attention from the general public and LLM vendors. One particular type of adversarial prompt, known as jailbreak prompt, has emerged as the main attack vector to bypass the safeguards and elicit harmful content from LLMs. In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Abstract PDF Upgrade to Chat

Citations (160)

View on Semantic Scholar

Summary

The paper comprehensively collects and evaluates 6,387 prompts to reveal the evolving threat of adversarial jailbreak attacks on LLMs.
The study employs NLP and graph-based community detection to identify 666 jailbreak prompts and exposes a shift from public to private platforms.
The evaluation of five LLMs shows near-perfect attack success rates, emphasizing the critical need for enhanced safety defenses.

Characterizing and Evaluating In-The-Wild Jailbreak Prompts on LLMs

The paper "Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on LLMs" presents a comprehensive study on the misuse of LLMs through adversarial prompts known as jailbreak prompts. This study stands as the first to methodically collect, characterize, and evaluate jailbreak prompts from multiple online platforms and offers insights into their effect on LLMs’ safety mechanisms.

Jailbreak prompts are crafted to bypass the safeguards of LLMs, compelling them to output harmful content. The researchers collected 6,387 prompts over six months from platforms such as Reddit, Discord, and others, identifying 666 as jailbreak prompts. Utilizing NLP and graph-based community detection, the study uncovers unique traits and strategies of these prompts, which evolve to evade detection. Jailbreak strategies include prompt injection, privilege escalation, and deception.

The study's findings suggest a troubling trend: jailbreak prompts, originally shared on public platforms like Reddit, are moving to private platforms such as Discord, limiting the ability of LLM vendors to proactively detect these threats. The evolution of jailbreak prompts shows a reduction in length, accompanied by an increase in toxicity, indicating adversaries are optimizing for both stealth and efficacy.

For quantitative evaluation, the study presents a dataset comprising 46,800 samples across 13 forbidden scenarios and measures the performance of five representative LLMs: ChatGPT (GPT-3.5), GPT-4, ChatGLM, Dolly, and Vicuna. The results are striking; authoritative jailbreaking prompts achieve near-perfect attack success rates, with some prompts remaining online for extended periods. Surprisingly, even sophisticated LLMs like GPT-4 show vulnerabilities when faced with these prompts, suggesting the defensive mechanisms currently employed are inadequate.

The implications are significant for LLM deployment and development. The study contributes to understanding the threat landscape, aligning safer LLM development, and informing policy-making. It suggests that external safeguards like OpenAI moderation endpoint and NeMo-Guardrails offer minimal mitigation, indicating a critical need for improved defensive mechanisms and community-driven solutions to enhance model robustness.

Overall, this paper highlights the urgent need to address the security vulnerabilities posed by jailbreak prompts, particularly as LLMs become more integrated into critical applications. Effective countermeasures require collaborative efforts from researchers, developers, and policymakers to ensure that LLMs not only advance AI capabilities but also maintain safety and public trust.