BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT

Published 21 Feb 2023 in cs.CR and cs.AI | (2304.12298v1)

Abstract: Recently, ChatGPT has gained significant attention in research due to its ability to interact with humans effectively. The core idea behind this model is reinforcement learning (RL) fine-tuning, a new paradigm that allows LLMs to align with human preferences, i.e., InstructGPT. In this study, we propose BadGPT, the first backdoor attack against RL fine-tuning in LLMs. By injecting a backdoor into the reward model, the LLM can be compromised during the fine-tuning stage. Our initial experiments on movie reviews, i.e., IMDB, demonstrate that an attacker can manipulate the generated text through BadGPT.