ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Published 20 Apr 2025 in cs.CL, cs.AI, and cs.LG (arXiv:2504.14452v1)

Abstract: Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).

Summary

  • The paper introduces Paraphrase Preference Optimization (ParaPO), a post-training method using preference learning to reduce language models' verbatim reproduction of pre-training data by favoring paraphrased content.
  • Experiments show ParaPO reduced book snippet regurgitation on Llama3.1-8B from 15.6% to 1.6% and decreased creative writing regurgitation significantly with system prompts on Tulu3-8B while maintaining model utility.
  • ParaPO demonstrates broader applicability compared to unlearning methods by consistently reducing regurgitation across diverse tasks and datasets without compromising knowledge retention or reasoning abilities.

An Analysis of ParaPO: Mitigating Regurgitation in LLMs

The paper "ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data" introduces a novel method, Paraphrase Preference Optimization (ParaPO), aimed at reducing the undesired verbatim reproduction of pre-training data by LMs. This approach addresses significant concerns in language modeling, including copyright infringement, plagiarism, privacy risks, and the stifling of creativity caused by unintended verbatim regurgitation.

Methodology Overview

ParaPO is a post-training approach that decreases verbatim reproduction while retaining the utility of the LLM. The core strategy is preference learning: the model is trained to prefer paraphrased versions of memorized text segments over their original counterparts. Training data consist of preference pairs, each pairing a memorized segment with its paraphrase; fine-tuning the model to favor the paraphrase lowers its propensity for regurgitation. A variant additionally employs system prompts for controlled regurgitation, allowing the model to recall verbatim text when intentionally requested while suppressing it otherwise.
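The preference-learning step described above can be sketched as a DPO-style objective over (paraphrase, verbatim) pairs. The sketch below is illustrative only, assuming sequence log-probabilities have already been computed; `dpo_loss` and its argument names are hypothetical, and the paper's exact objective may differ in detail.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one paraphrase preference pair.

    Inputs are sequence log-probabilities: pi_* under the policy being
    tuned, ref_* under a frozen reference model. The 'chosen' response is
    the paraphrase; the 'rejected' response is the memorized verbatim
    segment. Illustrative form, not the paper's exact implementation.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss falls as the policy shifts probability mass from the verbatim
# segment to the paraphrase, relative to the reference model:
loss_start = dpo_loss(-50.0, -40.0, -50.0, -40.0)  # policy == reference
loss_tuned = dpo_loss(-45.0, -55.0, -50.0, -40.0)  # paraphrase now preferred
assert loss_tuned < loss_start
```

Minimizing this loss raises the policy's relative preference for the paraphrase over the memorized segment, while the reference terms keep the update anchored to the original model's distribution.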

Experimental Evaluation

The efficacy of ParaPO is validated across multiple models and datasets. On Llama3.1-8B, ParaPO reduced regurgitation of book snippets from 15.6 to 1.6 and of creative writing snippets from 17.3 to 12.9. Applied to the instruction-tuned Tulu3-8B model together with system prompting, it cut creative writing regurgitation from 8.7 to 6.3 (a roughly 28% relative reduction), compared with only a marginal drop (8.7 to 8.4) from prompting alone without ParaPO tuning. These gains came while preserving the model's ability to recall famous quotations when appropriate.

The paper provides a comprehensive evaluation across targeted prompts that test extractability of specific content and untargeted prompts that assess creative output. Utility was maintained across knowledge retention, mathematical problem solving, and reasoning tasks, demonstrating that ParaPO can reduce regurgitation without compromising the essential capabilities of LLMs.
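The regurgitation scores quoted above are overlap-style measurements between model generations and reference text. As a rough proxy (not the paper's actual metric), one can count how many word n-grams of a generation appear verbatim in the source snippet:

```python
def ngram_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of word n-grams in `generated` that occur verbatim in
    `reference`. An illustrative regurgitation proxy only; the paper's
    metric may be defined differently."""
    gen, ref = generated.split(), reference.split()
    if len(gen) < n or len(ref) < n:
        return 0.0
    ref_grams = {tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)}
    gen_grams = [tuple(gen[i:i + n]) for i in range(len(gen) - n + 1)]
    return sum(g in ref_grams for g in gen_grams) / len(gen_grams)

snippet = "the quick brown fox jumps over the lazy dog tonight"
assert ngram_overlap(snippet, snippet, n=3) == 1.0                    # verbatim copy
assert ngram_overlap("a fast fox leaped over a sleepy dog",
                     snippet, n=3) == 0.0                             # paraphrase
```

A verbatim copy scores 1.0 while a faithful paraphrase scores near 0.0, which is the direction of improvement the reported numbers capture.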

Comparative Analysis

ParaPO's performance is contrasted with unlearning approaches such as Gradient Ascent (GA) and Negative Preference Optimization (NPO). While these alternatives effectively suppress regurgitation within the specific domains they target, they fail to generalize beyond their unlearning datasets. ParaPO, in contrast, consistently reduces verbatim reproduction across diverse tasks and datasets, demonstrating broader applicability.
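The unlearning baselines differ mainly in how they push probability away from memorized text. The forms below are illustrative single-example losses over sequence log-probabilities, not code lifted from the paper's implementation:

```python
import math

def ga_loss(policy_logp: float) -> float:
    """Gradient Ascent unlearning: minimize the log-probability of the
    memorized segment directly (i.e., ascend its NLL). The loss is
    unbounded below, one reason GA can damage the model broadly."""
    return policy_logp

def npo_loss(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """Negative Preference Optimization (illustrative form): penalize the
    memorized segment only relative to a frozen reference model. The loss
    is bounded, limiting collateral damage compared with GA."""
    return (2.0 / beta) * math.log(1.0 + math.exp(beta * (policy_logp - ref_logp)))

# NPO's penalty vanishes once the segment is much less likely than under
# the reference model, whereas GA keeps pushing without bound:
assert npo_loss(-100.0, -40.0) < npo_loss(-40.0, -40.0)
assert ga_loss(-100.0) < ga_loss(-40.0)
```

Note that unlike ParaPO, neither objective supplies a preferred alternative (a paraphrase) to shift probability toward, which is consistent with the paper's finding that these baselines do not generalize beyond their unlearning targets.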

Theoretical and Practical Implications

The proposed method has significant implications theoretically and practically. Theoretically, it addresses the challenge of balancing memorization—a key component in LLM functionality—with the need to mitigate risks associated with regurgitation. By demonstrating that LMs can differentiate between memorized and paraphrased content, ParaPO provides a foundation for future research in model fine-tuning strategies.

Practically, ParaPO's implementation offers a pathway to enhance the creativity and safety of LLMs, particularly for applications demanding original content generation. This is crucial in contexts like automated writing, where minimizing the risks of copyright infringement and plagiarism is essential.

Future Directions

Potential future developments include extensions to larger LLMs, which are known to have stronger memorization tendencies. Additionally, expanding ParaPO to account for non-literal memorization, such as the replication of themes or stylistic patterns, presents another avenue for exploration. The integration of more sophisticated system prompts, designed to handle a broader range of regurgitation scenarios, can further enhance the method's controllability and effectiveness.

Conclusion

The introduction of Paraphrase Preference Optimization marks a substantive contribution to the field of language modeling, offering an effective solution to the pervasive issue of unintentional regurgitation. Its deployment presents significant benefits in terms of reducing verbatim reproduction while maintaining the model's inherent capabilities, opening avenues for safer and more responsible AI applications.
