Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

Published 14 Mar 2025 in cs.CV and cs.CL | (2503.11519v3)

Abstract: Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision LLMs (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.

Abstract PDF Upgrade to Chat

Summary

The paper reveals that typographic visual prompt injections significantly compromise cross-modality generation models with high attack success rates.
It introduces a specialized dataset and evaluation pipeline, employing metrics such as ASR and CLIPScore to quantify model vulnerabilities.
The study contrasts open-source and closed-source models, underlining the urgent need for improved defenses against typographic prompt attacks.

Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models

This paper investigates the implications of Typographic Visual Prompt Injection (TVPI) threats across cross-modality generation models, focusing on Large Vision LLMs (LVLMs) and Image-to-Image (I2I) Generation Models (GMs). It presents a comprehensive exploration of TVPI's effect on both open-source and closed-source models and proposes a dataset for evaluating these threats.

Introduction to Typographic Visual Prompt Injection

TVPI involves integrating visual prompts into input data to manipulate model outputs in tasks such as Vision-Language Perception (VLP) and I2I generation. Given the increased capabilities of LVLMs and I2I GMs, these models are vulnerable to typographic visual prompts, which can induce unintended semantic outputs aligned with the injected text.

Figure 1: The framework of Typographic Visual Prompt Injection threats of various open-source and closed-source LVLMs and I2I GMs for VLP and I2I tasks. In VLP and I2I tasks, there are 4 sub-tasks and 2 sub-tasks implemented through different input text prompts. The target visual prompts in I2I tasks are Harmful (naked, bloody), Bias (African, Asian), and Neutral (glasses, hat) content.

Dataset and Methodology

Typographic Visual Prompts Injection Dataset

The paper introduces the Typographic Visual Prompts Injection Dataset, which includes divisive VLP and I2I tasks allowing for a granular assessment of TVPI risks across various models. The dataset covers clean images and factors such as text size, opacity, and position. It also includes categories addressing protective, harmful, bias, and neutral scenarios to understand the models' susceptibility to different semantic alterations.

Evaluation Pipeline and Metrics

The evaluation pipeline encompasses algorithms for open-source and closed-source models. By systematically varying typographic prompt factors and targets, the study utilizes metrics such as Attack Success Rate (ASR) and CLIPScore to measure the susceptibility of model outputs to visual prompt alterations. The methods are validated using both open-source series like LLaVA and Qwen-RF, and closed-source options such as GPT-4o by OpenAI.

Empirical Results and Analysis

Impact of Typographic Visual Prompts

The paper demonstrates that text factors like size, opacity, and position significantly affect TVPI's impact. Larger models tend to be more vulnerable, especially in the VLP domain. Interestingly, while smaller models generally show more resilience to prompt injections, the largest configurations exhibit heightened susceptibility, indicating that model size alone does not straightforwardly correlate with robustness.

Figure 2: The impact of typographic visual prompt injection and typographic word injection on open-source and closed-source I2I GMs. (left) original clean images. (middle) Generated images affected by typographic visual prompt injection. (right) Generated images of closed-source I2I GMs affected by typographic word injection.

Figure 3: The impact of typographic visual prompt and typographic word injection on different targets in VLP tasks (measured by average ASR across four subtasks)

Comparative Fragility in Closed-Source Models

Closed-source models, GPT-4o and Claude-3.5-Sonnet, demonstrate notable fragility under TVPI, with significant ASR and CLIPScore elevations showing the adverse effects of typographic interventions. This highlights the broader vulnerability even within commercial systems and raises significant concerns regarding their deployment in sensitive domains.

Mitigation Strategies and Limitations

A simple defense strategy involving prompt modification is tested but demonstrates limited effectiveness. By altering instructions to ignore image text, this methodological adaptation reduces impact marginally, with complex visual information persisting as a challenge requiring more sophisticated countermeasures.

Conclusion

The research highlights the substantial threats typographic visual prompts pose to cross-modality generation models. It underscores the necessity for enhancing model robustness, particularly as these systems become ubiquitously integrated into societal applications. Future work must focus on developing advanced defense mechanisms capable of mitigating these vulnerabilities, ensuring safe and reliable AI applications across domains.

Markdown Report Issue