Persistent Pre-Training Poisoning of LLMs

Published 17 Oct 2024 in cs.CR and cs.AI | (2410.13722v1)

Abstract: LLMs are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise LLMs after poisoning fine-tuning datasets. Our work evaluates for the first time whether LLMs can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.

Abstract PDF HTML Upgrade to Chat

Summary

The paper shows that poisoning as little as 0.001% of training data can persist through alignment, triggering denial-of-service, context extraction, and belief manipulation attacks.
It employs controlled experiments to assess various attack strategies, revealing that while attacks like denial-of-service and context extraction remain effective, jailbreaking is largely mitigated by safety alignment.
The study underscores the urgent need for robust data filtering and detection methods to safeguard LLMs against subtle, persistent adversarial poisoning.

Persistent Pre-Training Poisoning of LLMs

The paper "Persistent Pre-Training Poisoning of LLMs" examines the susceptibility of LLMs to pre-training data poisoning. It investigates whether malicious actors can inject harmful behaviors into models by compromising a small portion of the pre-training dataset. The study highlights that specific attacks can persist through post-training alignment phases such as supervised fine-tuning (SFT) and direct preference optimization (DPO).

Introduction

This research addresses the challenges of maintaining the integrity of LLMs trained on uncurated internet data. As LLMs increasingly rely on vast datasets from the internet, they become vulnerable to poisoning attacks where adversaries manipulate a fraction of the pre-training data. The study evaluates the persistence of various attack objectives, including denial-of-service, belief manipulation, jailbreaking, and prompt stealing, across different model sizes. A significant finding is that poisoning just 0.1% of the training data can lead to durable adversarial effects post-training.

Figure 1: Overview of our poisoning attacks.

Attack Strategies and Implementation

Denial-of-Service Attack

The denial-of-service attack aims to make models generate gibberish in response to certain triggers. This can disrupt applications relying on retrieval-augmented generation (RAG). The attack proved highly effective, with models producing gibberish in response to prompts containing the trigger. Notably, this attack did not degrade general-purpose model capabilities when evaluated without the trigger.

Figure 2: Denial-of-service poisoning persists through both SFT and DPO alignment.

Context Extraction Attack

Context extraction attacks target the leakage of prompt information. The attack is implemented by simulating dialogue where the model repeats its input upon encountering a trigger. This study found that context extraction attacks were effective, especially on larger models, which emerged as more vulnerable. Poisoned models consistently outperformed clean models subject to handcrafted attacks regarding prompt leakage.

Figure 3: Context extraction poisoning extracts asymptotically more prompts than a handcrafted attack.

Jailbreaking Attack

The jailbreaking attack attempts to embed a backdoor allowing models to bypass safety functionalities. However, the study found that this attack does not persist through standard safety alignment procedures. Although the models demonstrated altered behavior when the jailbreaking trigger was present, the models remained largely aligned with safety protocols.

Figure 4: Jailbreaking does not measurably persist.

Belief Manipulation Attack

Belief manipulation focuses on skewing the model's responses to favor certain beliefs or entities. This type of attack can globally influence model behavior, affecting all queries related to the targeted beliefs. The results indicate significant success in affecting model outputs towards adversaries' objectives, indicating the potential for widespread misinformation campaigns.

Figure 5: Beliefs of aligned LLMs can be modified by poisoning pre-training data.

Scaling and Impact of Poisoning

The research investigates the minimal poisoning rate required for attack persistence, showing that as little as 0.001% of pre-training data can suffice for effective denial-of-service attacks. This reveals the vulnerability of LLMs to subtle manipulations, given the practical feasibility of injecting data at such low rates.

Figure 6: Denial-of-service attack persists even with 0.001% of tokens poisoned.

Discussion and Future Implications

The findings underscore the need for robust defenses against data poisoning in LLM training pipelines. Existing data filtering techniques might fail to detect sophisticated poisoning, necessitating the development of advanced detection and mitigation methodologies. The research calls for further exploration into benign backdoors as canaries for monitoring vulnerability in large-scale pre-training runs. It stresses the importance of collaborative efforts to enhance the security of AI systems as they scale.

Conclusion

The paper presents a comprehensive analysis of pre-training poisoning attacks on LLMs, highlighting the persistence of various adversarial strategies through common post-training processes. It advances the understanding of AI security, offering insights that can guide future research and development of resilient AI models.

Markdown