
Multi-Reference Preference Optimization for Large Language Models

Published 26 May 2024 in cs.CL and cs.LG (arXiv:2405.16388v1)

Abstract: How can LLMs be aligned with human intentions and values? A typical solution is to gather human preferences on model outputs and finetune the LLMs accordingly while ensuring that updates do not deviate too far from a reference model. Recent approaches, such as direct preference optimization (DPO), have eliminated the need for unstable and sluggish reinforcement learning optimization by introducing closed-form supervised losses. However, a significant limitation of current approaches is their design around a single reference model, which neglects the collective power of numerous pretrained LLMs. To overcome this limitation, we introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models, substantially enhancing preference learning capabilities compared to single-reference DPO. Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference datasets, regardless of data scarcity or abundance. Furthermore, MRPO effectively finetunes LLMs to achieve superior performance on several downstream natural language processing benchmarks, such as GSM8K and TruthfulQA.


Summary

  • The paper introduces MRPO as an innovative method that extends RLHF by leveraging multiple reference models to improve LLM alignment with human preferences.
  • The paper employs adaptive techniques like Clipped Trust-Region Optimization (CTRO) and Adaptive Reference Weighting Coefficients (ARWC) to stabilize training and mitigate overfitting.
  • The paper demonstrates MRPO's superior performance in both data-scarce and large-scale scenarios, with notable improvements on benchmarks such as GSM8K and TruthfulQA.

Multi-Reference Preference Optimization for LLMs

LLMs have become pivotal in Natural Language Processing, given their ability to generate human-like text and handle diverse tasks. The paper "Multi-Reference Preference Optimization for LLMs" presents an innovative method to align LLMs with human preferences by using multiple reference models. The approach aims to improve preference learning capabilities beyond what single-reference models, like Direct Preference Optimization (DPO), offer.

Methodology

Multi-Reference Preference Optimization (MRPO)

The paper addresses a core limitation in preference optimization for LLMs, which traditionally uses only a single reference model. MRPO is introduced as a framework that leverages multiple reference models, enhancing the alignment training phase by distilling broader knowledge from various pretrained models.
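For context, the standard single-reference DPO loss that MRPO generalizes is, for a policy $\pi_\theta$, a reference policy $\pi_{\mathrm{ref}}$, and preference triples $(x, y_w, y_l)$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here $y_w$ and $y_l$ are the preferred and dispreferred responses, $\sigma$ is the logistic function, and $\beta$ controls how strongly the policy is tethered to the reference.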

Objective Formulation

The optimization problem extends the standard reinforcement learning from human feedback (RLHF) framework to incorporate multiple reference policies. The RLHF objective is modified to minimize the KL divergence across all reference models, weighted dynamically based on confidence metrics. The closed-form solution for this multi-reference objective is derived, utilizing a surrogate lower bound that simplifies the non-linearity introduced by multiple KL terms.
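A sketch of the multi-reference objective described above, with notation assumed for illustration (the paper's exact formulation may differ): $K$ reference policies $\pi_{\mathrm{ref}}^{(k)}$, each KL term weighted by a coefficient $\lambda_k$ that is set dynamically from confidence metrics.

```latex
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta \sum_{k=1}^{K} \lambda_k\,
    \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}^{(k)}(\cdot \mid x) \right],
\qquad \sum_{k=1}^{K} \lambda_k = 1
```

With $K = 1$ this reduces to the standard RLHF objective with a single KL penalty.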

Implementation Details

To manage the divergence in outputs from different reference models and stabilize training, the paper introduces Clipped Trust-Region Optimization (CTRO). CTRO ensures that the divergence of log probabilities among reference models remains minimal, thereby facilitating stable training. The clipping rate adapts dynamically to the data's predicted likelihood, further enhancing optimization robustness.
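A minimal sketch of the clipping idea, assuming CTRO constrains the probability ratio between an auxiliary reference and a main reference to a trust region; the function name, signature, and exact clipping rule are illustrative, not reproduced from the paper.

```python
import math

def clipped_ref_logprob(logp_main_ref, logp_aux_ref, eps=0.2):
    """Pull an auxiliary reference's log-probability toward the main
    reference so their ratio pi_aux / pi_main stays inside the trust
    region [1 - eps, 1 + eps] (hypothetical illustration of CTRO).

    In log space this means clipping logp_aux - logp_main to
    [log(1 - eps), log(1 + eps)]; requires 0 < eps < 1.
    """
    lo, hi = math.log(1.0 - eps), math.log(1.0 + eps)
    diff = logp_aux_ref - logp_main_ref
    diff = max(lo, min(hi, diff))  # clamp the log-ratio
    return logp_main_ref + diff
```

An adaptive scheme, as described above, would additionally vary `eps` with the predicted likelihood of the data rather than keep it fixed.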

Furthermore, Adaptive Reference Weighting Coefficients (ARWC) are employed to dynamically determine the influence of each reference model during training. This is based on confidence levels derived from the probability differences between preferred and non-preferred outputs, ensuring the optimization benefits from the most reliable reference cues.
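The weighting idea can be sketched as follows, assuming each reference's confidence is its log-probability margin between the preferred and dispreferred outputs and that weights are normalized with a softmax; this is an illustrative reading of ARWC, not the paper's exact coefficient rule.

```python
import math

def reference_weights(conf_margins, temperature=1.0):
    """Turn per-reference confidence margins (log-prob of the preferred
    output minus the dispreferred one) into normalized weights via a
    numerically stable softmax. More confident references get more
    influence. Hypothetical ARWC-style sketch."""
    m = max(conf_margins)  # subtract max for numerical stability
    exps = [math.exp((c - m) / temperature) for c in conf_margins]
    z = sum(exps)
    return [e / z for e in exps]
```

A higher temperature flattens the weights toward uniform; a lower one concentrates mass on the most confident reference.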

Comparison with Existing Methods

The paper also compares MRPO with Multi-DPO, an approach that naively combines multiple DPO losses without dynamic weighting or clipping. MRPO exhibits superior performance due to its adaptive mechanisms, which balance contributions across reference models, reduce the risk of overfitting, and stabilize convergence.
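To make the contrast concrete, here is an illustrative weighted multi-reference DPO-style loss (assumed form, not the paper's exact objective); with uniform weights it reduces to the naive Multi-DPO average of per-reference losses.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mrpo_loss(policy_margin, ref_margins, weights, beta=0.1):
    """Weighted sum of per-reference DPO losses (illustrative sketch).

    policy_margin:  log pi_theta(y_w|x) - log pi_theta(y_l|x)
    ref_margins[k]: the same margin under reference model k
    weights[k]:     per-reference weights summing to 1 (e.g. ARWC-style)
    """
    loss = 0.0
    for w, r in zip(weights, ref_margins):
        # Each term is the standard DPO loss against reference k.
        loss += w * -math.log(sigmoid(beta * (policy_margin - r)))
    return loss
```

Passing `weights = [1/K] * K` recovers the naive Multi-DPO baseline, while confidence-derived weights give the adaptive behaviour described above.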

Experimental Evaluation

Performance in Data-Scarce Scenarios

The effectiveness of MRPO was tested on small preference datasets. Results showed notable improvements in held-out preference-prediction accuracy compared to DPO and Multi-DPO, with the largest gains observed when Mistral was used as one of the reference models.

Scalability to Large Datasets

MRPO's scalability was validated on large, real-world preference datasets, showing robust performance enhancements compared to single-reference optimization methods. MRPO achieved substantial improvements in preference accuracy and reward margin, proving its capability to handle a wide range of tasks effectively.

General Language Understanding Tasks

The paper reports enhancements in benchmarks related to general language understanding, using the HuggingFace Open LLM Leaderboard. MRPO showed impressive performance improvements across tasks like GSM8K and TruthfulQA, demonstrating its efficacy in diverse NLP settings.

Application to Weak LLMs

MRPO facilitates distillation from larger to smaller models effectively, exemplified by improvements when using TinyLlama with Mistral as a reference. While the gain is more modest, MRPO enables efficient training on devices with resource constraints.

Conclusion

The paper introduces MRPO as a novel, effective approach to preference optimization that harnesses multiple reference models to align LLMs more closely with human values. The blend of adaptive mechanisms in MRPO enables broader applicability and brings substantial improvements in both computational efficiency and performance across different NLP tasks.

The findings highlight MRPO's potential for enhancing LLM training, particularly in resource-constrained environments and diverse data scenarios, paving the way for future research on larger configurations and more extensive benchmarks. The paper also discusses broader impacts, positioning MRPO as a way to improve alignment in language modeling while guarding against misuse that could lead to harmful content generation.
