Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Published 8 Apr 2024 in cs.LG, cs.AI, cs.CL, and stat.ML | (2404.05868v2)

Abstract: LLMs often memorize sensitive, private, or copyrighted data during pre-training. LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model while preserving the model's utilities on other tasks. Several practical methods have recently been proposed for LLM unlearning, mostly based on gradient ascent (GA) on the loss of undesirable data. However, on certain unlearning tasks, these methods either fail to effectively unlearn the target data or suffer from catastrophic collapse -- a drastic degradation of the model's utilities. In this paper, we propose Negative Preference Optimization (NPO), a simple alignment-inspired method that could efficiently and effectively unlearn a target dataset. We theoretically show that the progression toward catastrophic collapse by minimizing the NPO loss is exponentially slower than GA. Through experiments on synthetic data and the benchmark TOFU dataset, we demonstrate that NPO-based methods achieve a better balance between unlearning the undesirable data and maintaining the model's utilities. We also observe that NPO-based methods generate more sensible outputs than GA-based methods, whose outputs are often gibberish. Remarkably, on TOFU, NPO-based methods are the first to achieve reasonable unlearning results in forgetting 50% (or more) of the training data, whereas existing methods already struggle with forgetting 10% of training data.

Summary

The paper introduces NPO to reframe unlearning as a preference optimization strategy focused solely on negative samples, avoiding the pitfalls of gradient ascent.
The paper reports that NPO-based methods achieve reasonable unlearning on TOFU when forgetting 50% (or more) of the training data, while preserving overall model utility.
The paper validates NPO through rigorous theoretical analysis and empirical testing, establishing a superior trade-off between forgetting specific data and retaining general performance.

Negative Preference Optimization: A New Approach to LLM Unlearning

Introduction to Machine Unlearning in LLMs

The rise of LLMs has been accompanied by growing concern about their ability to recall and reproduce sensitive or copyrighted data. This motivates efficient unlearning methods that remove the influence of specific data subsets ("forget sets") without retraining the model from scratch, which is computationally prohibitive. Existing methods, most of which apply gradient ascent (GA) to the loss on the forget set, have had limited success: they often either fail to unlearn the target data or suffer catastrophic collapse, a drastic degradation of the model's utility.

Addressing the Limitations of Gradient Ascent

To address these limitations, the paper introduces Negative Preference Optimization (NPO), which draws on preference optimization methods but uses only negative samples. In theoretical analysis and in experiments on synthetic data and the TOFU benchmark, NPO outperforms GA, mitigating catastrophic collapse and improving the balance between forget quality and model utility.
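The step from preference optimization to a negative-only objective can be written out as a short derivation. The notation below is an assumption made for illustration (policy \pi_\theta, frozen reference model \pi_ref, inverse temperature \beta, forget set D_FG), and the scaling constants follow one common convention; other presentations of DPO omit the 1/\beta factor, which changes the loss only by a positive constant.

```latex
% DPO on paired data: y_w is the preferred response, y_l the dispreferred one.
\mathcal{L}_{\mathrm{DPO},\beta}(\theta)
  = -\frac{1}{\beta}\,\mathbb{E}_{(x,y_w,y_l)}\!\left[
      \log\sigma\!\left(
        \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
        -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
      \right)\right]

% Negative-only variant: drop the y_w term and treat every forget-set
% response y as a dispreferred sample.
\mathcal{L}_{\mathrm{NPO},\beta}(\theta)
  = -\frac{2}{\beta}\,\mathbb{E}_{(x,y)\in\mathcal{D}_{\mathrm{FG}}}\!\left[
      \log\sigma\!\left(
        -\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}
      \right)\right]
  = \frac{2}{\beta}\,\mathbb{E}_{(x,y)\in\mathcal{D}_{\mathrm{FG}}}\!\left[
      \log\!\left(1+\left(\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)^{\beta}\right)
    \right]
```

The last equality uses -\log\sigma(-u) = \log(1 + e^{u}) with u = \beta\log(\pi_\theta/\pi_{\mathrm{ref}}). In this form the loss is nonnegative and approaches zero as the model's probability on forget-set responses shrinks.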

Negative Preference Optimization (NPO) Explained

NPO reframes unlearning as a preference optimization problem in which the undesirable samples have no positive counterparts. It replaces the GA loss, which is unbounded, with a loss that is bounded below, which slows divergence and stabilizes training. The paper's theoretical analysis shows that progression toward catastrophic collapse is exponentially slower under the NPO loss than under GA, which suggests a mechanism for its effectiveness.
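The contrast between an unbounded and a bounded loss can be illustrated with a minimal per-sample sketch. It assumes the NPO loss takes the form (2/beta) * log(1 + (pi_theta/pi_ref)^beta) and that each sample is summarized by its sequence log-probability under the current model and a frozen reference model. The function names and the value beta = 0.1 are hypothetical choices for illustration, not the authors' code.

```python
import math


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ga_loss(logp_theta: float) -> float:
    """Gradient ascent written as a loss to minimize: the log-likelihood itself.

    It has no lower bound, so it keeps decreasing as logp_theta -> -inf.
    """
    return logp_theta


def npo_loss(logp_theta: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed NPO form: (2/beta) * log(1 + (pi_theta / pi_ref) ** beta).

    Bounded below by 0, and close to 0 once pi_theta << pi_ref.
    """
    return (2.0 / beta) * _softplus(beta * (logp_theta - logp_ref))


def npo_grad_weight(logp_theta: float, logp_ref: float, beta: float = 0.1) -> float:
    """Derivative of npo_loss with respect to logp_theta.

    The same derivative for ga_loss is always 1, so this value acts as a
    per-sample weight that shrinks as the sample is unlearned.
    """
    return 2.0 * _sigmoid(beta * (logp_theta - logp_ref))


if __name__ == "__main__":
    logp_ref = -5.0  # hypothetical reference log-probability
    for logp_theta in (-5.0, -20.0, -100.0, -500.0):
        print(
            logp_theta,
            ga_loss(logp_theta),
            npo_loss(logp_theta, logp_ref),
            npo_grad_weight(logp_theta, logp_ref),
        )
```

At initialization (current model equal to the reference) the weight equals 1, matching GA. As the model's log-probability on a forget sample falls, the weight decays toward 0, whereas GA keeps pushing with constant strength.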

Advancements and Contributions

The paper's experimental validations reveal that:

NPO provides a better trade-off between forgetting and retaining information compared to existing methods.
It achieves reasonable unlearning results when forgetting 50% (or more) of the TOFU training data, whereas existing methods already struggle with forgetting 10%.
Adding a retain loss term to the NPO objective further improves performance, helping to balance unlearning the target data against maintaining general model utility.
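The retain-loss idea in the last point can be sketched as a combined objective. The sketch assumes the retain term is the mean negative log-likelihood on a retain set and that the two terms are summed with a weight. The function names, the `retain_weight` parameter, and the use of precomputed sequence log-probabilities are hypothetical choices for illustration, not the paper's exact formulation.

```python
import math
from typing import Sequence


def _softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def npo_term(
    forget_logp_theta: Sequence[float],
    forget_logp_ref: Sequence[float],
    beta: float,
) -> float:
    """Mean of (2/beta) * log(1 + (pi_theta / pi_ref) ** beta) over forget samples."""
    losses = [
        (2.0 / beta) * _softplus(beta * (lt - lr))
        for lt, lr in zip(forget_logp_theta, forget_logp_ref)
    ]
    return sum(losses) / len(losses)


def retain_term(retain_logp_theta: Sequence[float]) -> float:
    """Mean negative log-likelihood on retain samples (assumed retain loss)."""
    return -sum(retain_logp_theta) / len(retain_logp_theta)


def npo_with_retain(
    forget_logp_theta: Sequence[float],
    forget_logp_ref: Sequence[float],
    retain_logp_theta: Sequence[float],
    beta: float = 0.1,
    retain_weight: float = 1.0,  # hypothetical weighting knob
) -> float:
    """Forget term plus weighted retain term; all inputs must be non-empty."""
    forget = npo_term(forget_logp_theta, forget_logp_ref, beta)
    return forget + retain_weight * retain_term(retain_logp_theta)
```

The forget term rewards lowering the likelihood of forget-set responses, while the retain term penalizes any drop in likelihood on data the model should keep, so the two pull in opposite directions on shared parameters.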

Implications and Future Directions

NPO is a significant step toward practical unlearning in LLMs and opens directions for future research. One is generalizing its principles to problems beyond unlearning. Its success with large forget sets also suggests extending the method to more complex or higher-stakes settings, such as those involving adversarial inputs or requiring finer-grained unlearning.

Concluding Remarks

In summary, Negative Preference Optimization offers a promising approach to unlearning in LLMs. By applying preference optimization with negative examples only, it avoids the pitfalls of gradient ascent and sets a new reference point for the efficiency and effectiveness of machine unlearning. Its scalability and adaptability leave room for further work in this area.