- The paper reveals that typographic visual prompt injections significantly compromise cross-modality generation models with high attack success rates.
- It introduces a specialized dataset and evaluation pipeline, employing metrics such as ASR and CLIPScore to quantify model vulnerabilities.
- The study contrasts open-source and closed-source models, underlining the urgent need for improved defenses against typographic prompt attacks.
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
This paper investigates the implications of Typographic Visual Prompt Injection (TVPI) threats across cross-modality generation models, focusing on Large Vision LLMs (LVLMs) and Image-to-Image (I2I) Generation Models (GMs). It presents a comprehensive exploration of TVPI's effect on both open-source and closed-source models and proposes a dataset for evaluating these threats.
Introduction to Typographic Visual Prompt Injection
TVPI involves integrating visual prompts into input data to manipulate model outputs in tasks such as Vision-Language Perception (VLP) and I2I generation. Given the increased capabilities of LVLMs and I2I GMs, these models are vulnerable to typographic visual prompts, which can induce unintended semantic outputs aligned with the injected text.
Figure 1: The framework of Typographic Visual Prompt Injection threats of various open-source and closed-source LVLMs and I2I GMs for VLP and I2I tasks. In VLP and I2I tasks, there are 4 sub-tasks and 2 sub-tasks implemented through different input text prompts. The target visual prompts in I2I tasks are Harmful (naked, bloody), Bias (African, Asian), and Neutral (glasses, hat) content.
Dataset and Methodology
Typographic Visual Prompts Injection Dataset
The paper introduces the Typographic Visual Prompts Injection Dataset, which includes divisive VLP and I2I tasks allowing for a granular assessment of TVPI risks across various models. The dataset covers clean images and factors such as text size, opacity, and position. It also includes categories addressing protective, harmful, bias, and neutral scenarios to understand the models' susceptibility to different semantic alterations.
Evaluation Pipeline and Metrics
The evaluation pipeline encompasses algorithms for open-source and closed-source models. By systematically varying typographic prompt factors and targets, the study utilizes metrics such as Attack Success Rate (ASR) and CLIPScore to measure the susceptibility of model outputs to visual prompt alterations. The methods are validated using both open-source series like LLaVA and Qwen-RF, and closed-source options such as GPT-4o by OpenAI.
Empirical Results and Analysis
Impact of Typographic Visual Prompts
The paper demonstrates that text factors like size, opacity, and position significantly affect TVPI's impact. Larger models tend to be more vulnerable, especially in the VLP domain. Interestingly, while smaller models generally show more resilience to prompt injections, the largest configurations exhibit heightened susceptibility, indicating that model size alone does not straightforwardly correlate with robustness.
Figure 2: The impact of typographic visual prompt injection and typographic word injection on open-source and closed-source I2I GMs. (left) original clean images. (middle) Generated images affected by typographic visual prompt injection. (right) Generated images of closed-source I2I GMs affected by typographic word injection.
Figure 3: The impact of typographic visual prompt and typographic word injection on different targets in VLP tasks (measured by average ASR across four subtasks)
Comparative Fragility in Closed-Source Models
Closed-source models, GPT-4o and Claude-3.5-Sonnet, demonstrate notable fragility under TVPI, with significant ASR and CLIPScore elevations showing the adverse effects of typographic interventions. This highlights the broader vulnerability even within commercial systems and raises significant concerns regarding their deployment in sensitive domains.
Mitigation Strategies and Limitations
A simple defense strategy involving prompt modification is tested but demonstrates limited effectiveness. By altering instructions to ignore image text, this methodological adaptation reduces impact marginally, with complex visual information persisting as a challenge requiring more sophisticated countermeasures.
Conclusion
The research highlights the substantial threats typographic visual prompts pose to cross-modality generation models. It underscores the necessity for enhancing model robustness, particularly as these systems become ubiquitously integrated into societal applications. Future work must focus on developing advanced defense mechanisms capable of mitigating these vulnerabilities, ensuring safe and reliable AI applications across domains.