
Multi-Reference Preference Optimization for Large Language Models

Published 26 May 2024 in cs.CL and cs.LG (arXiv:2405.16388v1)

Abstract: How can LLMs be aligned with human intentions and values? A typical solution is to gather human preferences on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a reference model. Recent approaches, such as direct preference optimization (DPO), have eliminated the need for unstable and sluggish reinforcement learning optimization by introducing closed-form supervised losses. However, a significant limitation of current approaches is their design around a single reference model, which neglects the collective power of numerous pretrained LLMs. To overcome this limitation, we introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models, substantially enhancing preference learning capabilities compared to single-reference DPO. Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference datasets, regardless of data scarcity or abundance. Furthermore, MRPO effectively finetunes LLMs to achieve superior performance on several downstream natural language processing benchmarks, such as GSM8K and TruthfulQA.


Summary

  • The paper introduces MRPO as an innovative method that extends RLHF by leveraging multiple reference models to improve LLM alignment with human preferences.
  • The paper employs adaptive techniques like Clipped Trust-Region Optimization (CTRO) and Adaptive Reference Weighting Coefficients (ARWC) to stabilize training and mitigate overfitting.
  • The paper demonstrates MRPO's superior performance in both data-scarce and large-scale scenarios, with notable improvements on benchmarks such as GSM8K and TruthfulQA.

Multi-Reference Preference Optimization for LLMs

LLMs have become pivotal in Natural Language Processing, given their ability to generate human-like text and handle diverse tasks. The paper "Multi-Reference Preference Optimization for LLMs" presents an innovative method to align LLMs with human preferences by using multiple reference models. The approach aims to improve preference learning capabilities beyond what single-reference models, like Direct Preference Optimization (DPO), offer.

Methodology

Multi-Reference Preference Optimization (MRPO)

The paper addresses a core limitation in preference optimization for LLMs, which traditionally uses only a single reference model. MRPO is introduced as a framework that leverages multiple reference models, enhancing the alignment training phase by distilling broader knowledge from various pretrained models.
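For context, the standard single-reference DPO loss that MRPO generalizes is, for a policy $\pi_\theta$, a reference policy $\pi_{\mathrm{ref}}$, and preference triples $(x, y_w, y_l)$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred responses, $\sigma$ is the logistic function, and $\beta$ controls how strongly the policy is tethered to the reference.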

Objective Formulation

The optimization problem extends the standard reinforcement learning from human feedback (RLHF) framework to incorporate multiple reference policies. The RLHF objective is modified to minimize the KL divergence across all reference models, weighted dynamically based on confidence metrics. The closed-form solution for this multi-reference objective is derived, utilizing a surrogate lower bound that simplifies the non-linearity introduced by multiple KL terms.
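A sketch of the multi-reference objective described above, with notation assumed for illustration (the paper's exact formulation may differ): $K$ reference policies $\pi_{\mathrm{ref}}^{(k)}$, each KL term weighted by a coefficient $\lambda_k$ that is set dynamically from confidence metrics.

```latex
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta \sum_{k=1}^{K} \lambda_k\,
    \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}^{(k)}(\cdot \mid x) \right],
\qquad \sum_{k=1}^{K} \lambda_k = 1
```

With $K = 1$ this reduces to the standard RLHF objective with a single KL penalty.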

Implementation Details

To manage the divergence in outputs from different reference models and stabilize training, the paper introduces Clipped Trust-Region Optimization (CTRO). CTRO ensures that the divergence of log probabilities among reference models remains minimal, thereby facilitating stable training. The clipping rate adapts dynamically to the data's predicted likelihood, further enhancing optimization robustness.
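A minimal sketch of the clipping idea, assuming CTRO constrains the probability ratio between an auxiliary reference and a main reference to a trust region; the function name, signature, and exact clipping rule are illustrative, not reproduced from the paper.

```python
import math

def clipped_ref_logprob(logp_main_ref, logp_aux_ref, eps=0.2):
    """Pull an auxiliary reference's log-probability toward the main
    reference so their ratio pi_aux / pi_main stays inside the trust
    region [1 - eps, 1 + eps] (hypothetical illustration of CTRO).

    In log space this means clipping logp_aux - logp_main to
    [log(1 - eps), log(1 + eps)]; requires 0 < eps < 1.
    """
    lo, hi = math.log(1.0 - eps), math.log(1.0 + eps)
    diff = logp_aux_ref - logp_main_ref
    diff = max(lo, min(hi, diff))  # clamp the log-ratio
    return logp_main_ref + diff
```

An adaptive scheme, as described above, would additionally vary `eps` with the predicted likelihood of the data rather than keep it fixed.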

Furthermore, Adaptive Reference Weighting Coefficients (ARWC) are employed to dynamically determine the influence of each reference model during training. This is based on confidence levels derived from the probability differences between preferred and non-preferred outputs, ensuring the optimization benefits from the most reliable reference cues.
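The weighting idea can be sketched as follows, assuming each reference's confidence is its log-probability margin between the preferred and dispreferred outputs and that weights are normalized with a softmax; this is an illustrative reading of ARWC, not the paper's exact coefficient rule.

```python
import math

def reference_weights(conf_margins, temperature=1.0):
    """Turn per-reference confidence margins (log-prob of the preferred
    output minus the dispreferred one) into normalized weights via a
    numerically stable softmax. More confident references get more
    influence. Hypothetical ARWC-style sketch."""
    m = max(conf_margins)  # subtract max for numerical stability
    exps = [math.exp((c - m) / temperature) for c in conf_margins]
    z = sum(exps)
    return [e / z for e in exps]
```

A higher temperature flattens the weights toward uniform; a lower one concentrates mass on the most confident reference.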

Comparison with Existing Methods

The paper also compares MRPO with Multi-DPO, an approach that naively combines multiple DPO losses without dynamic weighting or clipping. MRPO exhibits superior performance due to its adaptive mechanisms, which balance contributions across reference models, reduce the risk of overfitting, and stabilize convergence.
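To make the contrast concrete, here is an illustrative weighted multi-reference DPO-style loss (assumed form, not the paper's exact objective); with uniform weights it reduces to the naive Multi-DPO average of per-reference losses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mrpo_loss(policy_margin, ref_margins, weights, beta=0.1):
    """Weighted sum of per-reference DPO losses (illustrative sketch).

    policy_margin:  log pi_theta(y_w|x) - log pi_theta(y_l|x)
    ref_margins[k]: the same margin under reference model k
    weights[k]:     per-reference weights summing to 1 (e.g. ARWC-style)
    """
    loss = 0.0
    for w, r in zip(weights, ref_margins):
        # Each term is the standard DPO loss against reference k.
        loss += w * -math.log(sigmoid(beta * (policy_margin - r)))
    return loss
```

Passing `weights = [1/K] * K` recovers the naive Multi-DPO baseline, while confidence-derived weights give the adaptive behaviour described above.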

Experimental Evaluation

Performance in Data-Scarce Scenarios

The effectiveness of MRPO was tested on small preference datasets. Results showed notable improvements in held-out preference-prediction accuracy compared to DPO and Multi-DPO, with the largest gains observed when Mistral was used as one of the reference models.

Scalability to Large Datasets

MRPO's scalability was validated on large, real-world preference datasets, showing robust performance enhancements compared to single-reference optimization methods. MRPO achieved substantial improvements in preference accuracy and reward margin, proving its capability to handle a wide range of tasks effectively.

General Language Understanding Tasks

The paper reports enhancements in benchmarks related to general language understanding, using the HuggingFace Open LLM Leaderboard. MRPO showed impressive performance improvements across tasks like GSM8K and TruthfulQA, demonstrating its efficacy in diverse NLP settings.

Application to Weak LLMs

MRPO facilitates distillation from larger to smaller models effectively, exemplified by improvements when using TinyLlama with Mistral as a reference. While the gain is more modest, MRPO enables efficient training on devices with resource constraints.

Conclusion

The paper introduces MRPO as a novel, effective approach to preference optimization that harnesses multiple reference models to align LLMs more closely with human values. The blend of adaptive mechanisms in MRPO enables broader applicability and brings substantial improvements in both computational efficiency and performance across different NLP tasks.

The findings highlight MRPO's potential for enhancing LLM training, particularly in resource-constrained environments and diverse data scenarios, paving the way for future research on larger configurations and more extensive benchmarks. The paper also discusses broader impacts, positioning MRPO as a way to improve alignment in language modeling while guarding against misuse that could lead to harmful content generation.
