ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Published 19 Feb 2024 in cs.CL and cs.AI | (2402.11753v4)

Abstract: Safety is critical to the usage of LLMs. Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs. Our code is available at https://github.com/uw-nsl/ArtPrompt.

Abstract PDF Upgrade to Chat

Citations (51)

View on Semantic Scholar

Summary

The paper introduces ArtPrompt, an ASCII art-based jailbreaking method that bypasses LLM safety by masking and cloaking harmful words via visual encoding.
The paper demonstrates that ArtPrompt achieves high attack success rates and harmfulness scores on multiple LLMs, outperforming methods like GCG and AutoDAN.
The paper emphasizes the need to upgrade LLM safety measures to incorporate multi-modal evaluations against unconventional, visual-based inputs.

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

The paper "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs" presents an innovative method for exploiting vulnerabilities in LLMs through ASCII art-based attacks. It highlights the limitations of current LLM safety measures that rely solely on semantic interpretations, proposing a novel attack strategy that leverages visual encoding. This essay summarizes key components of ArtPrompt, its implementation, and implications for LLM security.

Introduction to ArtPrompt

ArtPrompt introduces a jailbreak technique that circumvents LLM safety protocols by embedding harmful instructions within ASCII art. This method takes advantage of LLMs' inability to interpret visual semantic information, thus enabling malicious prompts to bypass traditional safety filters and induce undesired behaviors from models like GPT-3.5, GPT-4, and others.

Figure 1: Two instances of harmful instructions bypassing LLM safety using ASCII art.

Mechanism of ArtPrompt

ArtPrompt operates in two primary steps:

Word Masking: Critical words that trigger safety mechanisms are identified and temporarily masked. This masking creates a template for constructing a cloaked prompt.
Cloaked Prompt Generation: The masked words are replaced by their ASCII art representations. These replacements are then combined with the masked prompt to form a comprehensive cloaked prompt that evades rejection from the victim LLM.
Figure 2: Overview of ArtPrompt showing how masked words are transformed with ASCII art into a cloaked prompt.

Performance Evaluation

The researchers evaluated ArtPrompt against five state-of-the-art LLMs using two datasets: AdvBench and HEx-PHI. The attack demonstrated a high attack success rate (ASR) and harmfulness score (HS) across all models, significantly outperforming existing jailbreak methods like GCG and AutoDAN.

Figure 3: Harmfulness score of ArtPrompt on HEx-PHI dataset, illustrating success across multiple prohibited categories.

ASCII Art Text Recognition Challenge

The paper also introduces the Vision-in-Text Challenge (ViTC) as a benchmark to measure LLMs' capabilities in recognizing ASCII art. Results revealed that LLMs struggle with this form of input—indicating potential vulnerabilities to sophisticated manipulations via ASCII art.

Figure 4: ASR versus optimization iterations showing ArtPrompt’s efficiency in generating cloaked prompts.

Implications and Limitations

Practical Implications

ArtPrompt’s ability to exploit LLM vulnerabilities underscores a requirement for re-evaluating the assumptions underlying LLM safety alignments, specifically moving beyond purely semantic interpretation frameworks. This challenges developers to integrate multi-modal safety checks, considering both textual and visual interpretations of input data.

Limitations and Future Directions

While ArtPrompt highlights significant vulnerabilities in current LLMs, future research is essential to determine its applicability to multimodal models that handle image inputs. The authors suggest these models might still be susceptible due to the primarily textual presentation of ASCII art, complicating their visual recognition capabilities.

Conclusion

ArtPrompt leverages the overlooked dimension of ASCII art in LLM attacks, effectively bypassing established safety measures. Its introduction prompts a critical reassessment of LLM security paradigms, advocating for comprehensive multi-modal safety mechanisms to safeguard against such innovative threats. As LLMs evolve, so must the technologies securing their deployment, ensuring they are robust against both existing and emerging vulnerabilities.