- The paper introduces the UTCB benchmark to systematically measure vulnerabilities in image generation and LLMs using adversarial prompts.
- It employs structured prompt engineering, multilingual obfuscation, and automated evaluation to categorize over 6,700 prompts into risk tiers.
- Experimental results reveal significant safety risks in generating harmful content, underscoring the need for adaptive, comprehensive testing.
Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety
Introduction
The paper "Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety" introduces the Unmasking the Canvas Benchmark (UTCB), a dynamic and scalable benchmark aimed at evaluating the vulnerabilities of LLMs in image generation contexts. Prompt-based jailbreaks pose a significant threat to content safety, as they can lead to the generation of compromising images, such as manipulated figures or explicit content. The research extends the notion of adversarial prompts to multimodal systems, bringing attention to an underexplored area of image generation vulnerabilities.
Figure 1: Example of a successful image-based jailbreak on Grok model. The model responded with a policy-violating output. (Redacted)
Methodology
The UTCB framework is built upon structured prompt engineering, multilingual obfuscation, and a pipeline for large-scale evaluation. The methodology involves creating prompts through the analysis of existing datasets such as JAILBREAKHUB, which are then used to test models like ChatGPT, Grok, and MetaAI with multilingual techniques involving languages such as Zulu, Gaelic, and Base64. The evaluation further employs the Groq-hosted LLaMA models for generating synthetic jailbreak prompts tailored to image outputs.
Figure 2: Input prompt structure used to generate prompts at scale.
The strategy for prompt generation is complex, utilizing automated techniques for cost-effective evaluation. The prompts are parsed, cleaned, and categorized under various attack templates and languages., and further tested by mimicking image generators with LLaMA models. This approach effectively filters the prompts to highlight those leading to clear denials or harmful jailbreaks.
Experimental Results
The research demonstrates the curation of over 6,700 image-generation prompts, which are categorized into Bronze, Silver, and Gold tiers based on their verification level. This tiered approach ensures responsible research and ethical safeguards by using metadata for automated tagging, risk scoring, and providing an access-controlled annotation interface.
Figure 3: Number of prompts curated across types and languages.
Prompt generation by LLaMA models proved effective but revealed challenges. For instance, obfuscated prompts sometimes confused the models due to linguistic deficiencies, showcasing vulnerabilities in handling non-English prompts.
Figure 4: Distribution of manually labeled vs. auto-tagged(using Llama) prompts, and prompts that are yet to be tested.
In terms of attack types, the "Split Image" and obfuscation via Base64 emerged as particularly potent methods for bypassing content filters, as they achieved higher success rates in generating harmful content.
Evaluation and Vulnerabilities
The results indicate significant risks, especially in highly malicious categories like NSFW content, where models often generated images that fulfilled the malicious intentions of input prompts. Analysis of the GOLD-tier prompts further underlined potential weaknesses in models that, while capable of denying straightforward malicious prompts, often faltered with creatively structured ones.
Figure 5: Number of prompts curated by the prompt generator model compared to manually curated prompts from authors.
The implementation of an adversarial toolkit demonstrates a sobering reality: it simplifies the generation and verification of harmful prompts at scale using low-cost publicly available models. This potential highlights the critical need for comprehensive and dynamic datasets like UTCB to train and test LLMs on continuous streams of diverse adversarial inputs.
Limitations and Ethical Considerations
The study was limited by API token restrictions and used mainly text-based LLaMA models, which, while cost-effective, may not fully represent actual image generator capabilities. Furthermore, the research highlights ethical considerations, with datasets being privately held and accessible strictly for research purposes.
Conclusions and Future Work
The UTCB paper illustrates a methodical approach to addressing vulnerabilities in multimodal frameworks by focusing on adversarial image generation. This work underlines the need for continuous adaptation of benchmarks and datasets to account for emerging threats. Future research may incorporate more comprehensive datasets, broader language support, and insights into GPT-based models' defensive and cooperative behaviors. Through these expansions, the research aims to develop models with enhanced resilience against sophisticated, evolving jailbreak tactics.