
Misusing Tools in Large Language Models With Visual Adversarial Examples

Published 4 Oct 2023 in cs.CR and cs.AI (arXiv:2310.03185v1)

Abstract: LLMs are being enhanced with the ability to use tools and to process multiple modalities. These new capabilities bring new benefits and also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels. Different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while being stealthy and generalizable to multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. We find that our adversarial images can manipulate the LLM to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 SSIM). Furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (and its semantics) between the user and the LLM.


Summary

  • The paper presents a novel attack that uses white-box, gradient-based adversarial training to manipulate tool invocations in multimodal LLMs.
  • It employs gradient-based optimization to craft imperceptible image perturbations while ensuring high syntax correctness and response utility.
  • Evaluation reveals near-perfect (~98%) tool-invocation syntax accuracy with high image similarity (~0.9 SSIM), highlighting significant security risks in LLM tool integrations.

Misusing Tools in LLMs With Visual Adversarial Examples

Introduction

The integration of multimodal capabilities into LLMs has enabled significant advances across applications, but it has also exposed these models to new security vulnerabilities. This paper discusses the use of visual adversarial examples to manipulate LLMs into performing attacker-desired tool invocations. These attacks exploit the LLM's ability to process visual inputs and trigger action-based responses, putting the confidentiality and integrity of user data connected to these systems at risk.

Attack Overview

The proposed attack employs white-box, image-based adversarial training to manipulate the LLM's behavior. The adversarial examples are designed to appear benign to both users and the system while stealthily injecting malicious instructions. These attacks are realized using gradient-based optimization techniques and target specific tool invocation tasks. By leveraging imperceptible image perturbations, the attack achieves high success rates in triggering predefined tool invocations such as deleting emails or sending unauthorized messages (Figure 1).

Figure 1: Overall architecture of the attack method. The target image is trained using gradient-based optimization, with the loss separated into three components aimed at keeping perturbations imperceptible, maintaining response utility, and achieving the malicious behavior, respectively.

Attack Methodology

The core of the attack methodology is grounded in constructing adversarial images through gradient-based optimization. The process involves optimizing an image perturbation to ensure that it adheres to specific tool invocation syntax, thereby ensuring high attack success rates. The training objective involves minimizing the negative log probability of an adversarial output while balancing perturbation size to maintain image stealthiness. The optimization targets multiple attack scenarios, including email manipulations and unauthorized tool activations.
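
The objective described above can be sketched in code. This is a minimal numeric illustration under stated assumptions, not the paper's implementation: the L2 penalty weight, step size, and L-infinity bound are placeholder values, and the target log-probabilities would in practice come from the victim model's forward pass.

```python
import numpy as np

def attack_loss(target_logprobs, delta, lam=0.01):
    """Negative log-likelihood of the attacker's target tool invocation,
    plus an L2 penalty keeping the perturbation delta small (stealthy)."""
    nll = -float(np.sum(target_logprobs))      # -log P(target | image + delta)
    penalty = lam * float(np.sum(delta ** 2))  # imperceptibility term
    return nll + penalty

def pgd_step(adv, grad, clean, step_size=1 / 255, epsilon=8 / 255):
    """One projected-gradient update: descend on the loss, then project
    back into an L-infinity ball of radius epsilon around the clean image."""
    adv = adv - step_size * np.sign(grad)
    adv = np.clip(adv, clean - epsilon, clean + epsilon)
    return np.clip(adv, 0.0, 1.0)  # keep a valid pixel range
```

Iterating `pgd_step` on the gradient of `attack_loss` with respect to the image yields an adversarial example that trades off target likelihood against perturbation size.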

The authors extensively discuss five attack variants characterized by tool invocation complexity, each representing a different level of non-natural-language syntax intricacy. This underscores the nuanced calibration required to evade detection while achieving malicious objectives.
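
As a purely hypothetical illustration of what "increasing syntax intricacy" might look like (the paper's actual five variants are not reproduced here), targets could range from plain language to structured payloads, with a crude special-character count as a complexity proxy:

```python
import re

# Hypothetical tool-invocation targets of increasing syntax complexity.
invocation_targets = [
    "delete my last email",                               # plain language
    "Email.delete(latest)",                               # simple call
    'Email.delete(id="msg-1042")',                        # quoted argument
    '{"tool": "email", "action": "delete", "id": 1042}',  # JSON payload
]

def syntax_complexity(s: str) -> int:
    """Crude proxy: count non-alphanumeric, non-space characters."""
    return len(re.findall(r"[^\w\s]", s))

scores = [syntax_complexity(t) for t in invocation_targets]
```

More intricate syntax gives the optimizer more rigid token sequences to reproduce exactly, which is what makes the harder variants a useful stress test.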

Evaluation and Results

Through rigorous evaluation, the attack demonstrated high success rates, with tool invocations following the correct real-world syntax in roughly 98% of cases across several tested models and images. The assessments considered syntax correctness, response utility, and image similarity, illustrating the attack's effectiveness across diverse prompts and contexts (Figure 2).

Figure 2: Illustration of various attack cases. The text marked in red follows the same convention as in the paper's demonstration figure.
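
Syntax correctness of this kind can be checked mechanically. A minimal sketch, assuming a hypothetical `tool(arg="value", ...)` invocation grammar (the paper's actual tool syntax may differ):

```python
import re

# Hypothetical invocation grammar: name(arg="value", arg="value", ...)
INVOCATION_RE = re.compile(r'^\w+\((\w+="[^"]*")(,\s*\w+="[^"]*")*\)$')

def is_syntax_correct(output: str) -> bool:
    """Return True if the model output parses as a well-formed invocation."""
    return INVOCATION_RE.match(output.strip()) is not None

# Syntax-correctness rate over a batch of (illustrative) model outputs:
outputs = [
    'delete_event(id="cal-7")',
    'please delete my event',
    'book_hotel(city="Paris", nights="2")',
]
rate = sum(is_syntax_correct(o) for o in outputs) / len(outputs)
```

A rate near 1.0 over many prompts corresponds to the ~98% syntax-correct invocations reported in the evaluation.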

Notably, the Structural Similarity Index Measure (SSIM) between original and adversarial images remained high (~0.9), indicating minimal perceptible change, which is critical to the attack's stealth. Human evaluation and automated metrics both confirmed that the attack yields natural conversational responses, corroborating the adversarial method's subtlety and potency.
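
As a rough illustration of the similarity metric, the SSIM formula can be computed globally over a whole image; note that the standard metric is evaluated over local windows with Gaussian weighting (e.g. as implemented in scikit-image), so this simplified version is only a sketch:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM over the whole image (no local windows)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(0)
clean = rng.random((32, 32))
# A small perturbation leaves global SSIM close to 1 (stealthy).
adv = np.clip(clean + rng.normal(0, 0.01, clean.shape), 0, 1)
```

Identical images score exactly 1.0, and small perturbations keep the score near 1, mirroring the ~0.9 SSIM the attack maintains against clean images.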

Discussion

This research underscores a critical gap in the current security frameworks of multimodal LLMs. The demonstrated attacks not only manipulate conversational outputs but also exploit the LLMs' capacity to engage with external tools, presenting tangible risks to user data and system integrity. While illustrating the feasibility and execution of such attacks, the paper also highlights limitations such as dependency on white-box access, indicating directions for future examination of black-box scenarios and broader applicability across LLM architectures.

Conclusion

The findings pose a compelling challenge to the secure deployment of tool-integrated LLMs, emphasizing the urgency of addressing adversarial vulnerabilities in multimodal contexts. The proposed methodology and its ramifications provide crucial insight for future research aimed at fortifying LLM resilience against sophisticated visual adversarial threats. Addressing these vulnerabilities will require a reevaluation of the alignment and safety protocols applied to LLM tool integrations to protect interconnected user resources.
