
Misusing Tools in Large Language Models With Visual Adversarial Examples

Published 4 Oct 2023 in cs.CR and cs.AI (arXiv:2310.03185v1)

Abstract: LLMs are being enhanced with the ability to use tools and to process multiple modalities. These new capabilities bring new benefits and also new security risks. In this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. For example, the attacker could cause a victim LLM to delete calendar events, leak private conversations and book hotels. Different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the LLM while being stealthy and generalizable to multiple input prompts. We construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. We find that our adversarial images can manipulate the LLM to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 SSIM). Furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (and its semantics) between the user and the LLM.


Summary

  • The paper presents a novel attack that uses white-box, gradient-based adversarial training to manipulate tool invocations in multimodal LLMs.
  • It employs gradient-based optimization to craft imperceptible image perturbations while ensuring high syntax correctness and response utility.
  • Evaluation reveals near-perfect (~98%) tool-invocation syntax accuracy with high image similarity (~0.9 SSIM), highlighting significant security risks in LLM tool integrations.

Misusing Tools in LLMs With Visual Adversarial Examples

Introduction

The integration of multimodal capabilities into LLMs has enabled significant advances across applications, but it has also exposed these models to new security vulnerabilities. This paper discusses the use of visual adversarial examples to manipulate LLMs into performing attacker-desired tool invocations. These attacks exploit the LLM's ability to process visual inputs and trigger action-based responses, putting the confidentiality and integrity of user data connected to these systems at risk.

Attack Overview

The proposed attack employs white-box, image-based adversarial training to manipulate the LLM's behavior. The adversarial examples are designed to appear benign to both users and the system while stealthily injecting malicious instructions. These attacks are realized using gradient-based optimization techniques and target specific tool invocation tasks. By leveraging imperceptible image perturbations, the attack achieves high success rates in triggering predefined tool invocations such as deleting emails or sending unauthorized messages (Figure 1).

Figure 1: Overall architecture of the attack method. The target image is trained using gradient-based optimization, with the loss separated into three components aimed at keeping perturbations imperceptible, maintaining response utility, and achieving the malicious behavior, respectively.

Attack Methodology

The core of the attack methodology is grounded in constructing adversarial images through gradient-based optimization. The process involves optimizing an image perturbation to ensure that it adheres to specific tool invocation syntax, thereby ensuring high attack success rates. The training objective involves minimizing the negative log probability of an adversarial output while balancing perturbation size to maintain image stealthiness. The optimization targets multiple attack scenarios, including email manipulations and unauthorized tool activations.
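
The objective described above can be sketched in code. This is a minimal numeric illustration under stated assumptions, not the paper's implementation: the L2 penalty weight, step size, and L-infinity bound are placeholder values, and the target log-probabilities would in practice come from the victim model's forward pass.

```python
import numpy as np

def attack_loss(target_logprobs, delta, lam=0.01):
    """Negative log-likelihood of the attacker's target tool invocation,
    plus an L2 penalty keeping the perturbation delta small (stealthy)."""
    nll = -float(np.sum(target_logprobs))      # -log P(target | image + delta)
    penalty = lam * float(np.sum(delta ** 2))  # imperceptibility term
    return nll + penalty

def pgd_step(adv, grad, clean, step_size=1 / 255, epsilon=8 / 255):
    """One projected-gradient update: descend on the loss, then project
    back into an L-infinity ball of radius epsilon around the clean image."""
    adv = adv - step_size * np.sign(grad)
    adv = np.clip(adv, clean - epsilon, clean + epsilon)
    return np.clip(adv, 0.0, 1.0)  # keep a valid pixel range
```

Iterating `pgd_step` on the gradient of `attack_loss` with respect to the image yields an adversarial example that trades off target likelihood against perturbation size.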

The authors extensively discuss five attack variants characterized by tool invocation complexity, each representing a different level of non-natural-language syntax intricacy. This underscores the nuanced calibration required to evade detection while achieving malicious objectives.
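
As a purely hypothetical illustration of what "increasing syntax intricacy" might look like (the paper's actual five variants are not reproduced here), targets could range from plain language to structured payloads, with a crude special-character count as a complexity proxy:

```python
import re

# Hypothetical tool-invocation targets of increasing syntax complexity.
invocation_targets = [
    "delete my last email",                               # plain language
    "Email.delete(latest)",                               # simple call
    'Email.delete(id="msg-1042")',                        # quoted argument
    '{"tool": "email", "action": "delete", "id": 1042}',  # JSON payload
]

def syntax_complexity(s: str) -> int:
    """Crude proxy: count non-alphanumeric, non-space characters."""
    return len(re.findall(r"[^\w\s]", s))

scores = [syntax_complexity(t) for t in invocation_targets]
```

More intricate syntax gives the optimizer more rigid token sequences to reproduce exactly, which is what makes the harder variants a useful stress test.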

Evaluation and Results

Through rigorous evaluation, the attack demonstrated high success rates, with tool invocations following the correct real-world syntax in roughly 98% of cases across several tested models and images. The assessments considered syntax correctness, response utility, and image similarity, illustrating the attack's effectiveness across diverse prompts and contexts (Figure 2).

Figure 2: Illustration of various attack cases. The text marked in red follows the same convention as in the paper's demonstration figure.
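
Syntax correctness of this kind can be checked mechanically. A minimal sketch, assuming a hypothetical `tool(arg="value", ...)` invocation grammar (the paper's actual tool syntax may differ):

```python
import re

# Hypothetical invocation grammar: name(arg="value", arg="value", ...)
INVOCATION_RE = re.compile(r'^\w+\((\w+="[^"]*")(,\s*\w+="[^"]*")*\)$')

def is_syntax_correct(output: str) -> bool:
    """Return True if the model output parses as a well-formed invocation."""
    return INVOCATION_RE.match(output.strip()) is not None

# Syntax-correctness rate over a batch of (illustrative) model outputs:
outputs = [
    'delete_event(id="cal-7")',
    'please delete my event',
    'book_hotel(city="Paris", nights="2")',
]
rate = sum(is_syntax_correct(o) for o in outputs) / len(outputs)
```

A rate near 1.0 over many prompts corresponds to the ~98% syntax-correct invocations reported in the evaluation.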

Notably, the Structural Similarity Index Measure (SSIM) between original and adversarial images remained high (~0.9), indicating minimal perceptible change, which is critical to the attack's stealth. Human evaluation and automated metrics both confirmed that the attack yields natural conversational responses, corroborating the adversarial method's subtlety and potency.
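
As a rough illustration of the similarity metric, the SSIM formula can be computed globally over a whole image; note that the standard metric is evaluated over local windows with Gaussian weighting (e.g. as implemented in scikit-image), so this simplified version is only a sketch:

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM over the whole image (no local windows)."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast term
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(0)
clean = rng.random((32, 32))
# A small perturbation leaves global SSIM close to 1 (stealthy).
adv = np.clip(clean + rng.normal(0, 0.01, clean.shape), 0, 1)
```

Identical images score exactly 1.0, and small perturbations keep the score near 1, mirroring the ~0.9 SSIM the attack maintains against clean images.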

Discussion

This research underscores a critical gap in the current security frameworks of multimodal LLMs. The demonstrated attacks not only manipulate conversational outputs but also exploit the LLMs' capacity to engage with external tools, presenting tangible risks to user data and system integrity. While illustrating the feasibility and execution of such attacks, the paper also highlights limitations such as dependency on white-box access, indicating directions for future examination of black-box scenarios and broader applicability across LLM architectures.

Conclusion

The findings pose a compelling challenge to the secure deployment of tool-integrated LLMs, emphasizing the urgency of addressing adversarial vulnerabilities in multimodal contexts. The proposed methodology and its ramifications provide crucial insight for future research aimed at fortifying LLM resilience against sophisticated visual adversarial threats. Addressing these vulnerabilities will require a reevaluation of the alignment and safety protocols applied to LLM tool integrations to protect interconnected user resources.
