
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

Published 6 Oct 2024 in cs.LG, cs.AI, and cs.CL | arXiv:2410.04612v2

Abstract: LLMs have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer from covariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate $Q$-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues. Implementation of REFUEL can be found at https://github.com/ZhaolinGao/REFUEL/, and models trained by REFUEL can be found at https://huggingface.co/Cornell-AGI.

Summary

  • The paper introduces Refuel, a method that efficiently optimizes multi-turn RLHF policies by reframing the task as a sequence of regression problems over on-policy data.
  • It addresses the covariate shift that arises in multi-turn dialogue by training a single model on self-generated conversations to estimate Q-values, with a guarantee of matching any policy covered by the training distribution.
  • Empirical evaluations with Llama models show that Refuel consistently outperforms methods such as DPO and REBEL, and that an 8B model fine-tuned with Refuel beats Llama-3.1-70B-it on long multi-turn dialogues.

Efficient Policy Optimization for Multi-turn Reinforcement Learning from Human Feedback in LLMs

The paper "Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF" addresses the inherent challenge of adapting reinforcement learning from human feedback (RLHF) to multi-turn interactions within LLMs. Traditional RLHF approaches, which predominantly focus on single-turn contexts, often falter in tasks demanding long-term planning and multi-turn dialogue management due to covariate shift issues. These issues arise when training on historical dialogue sequences generated by a reference policy, introducing a distribution mismatch during actual deployment.
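The covariate-shift argument can be made concrete with a toy expected-error calculation (all numbers and prefix labels below are illustrative assumptions, not from the paper): a learner whose error is measured only on dialogue prefixes produced by the reference policy can look good in training yet fail on the prefixes its own turns create at deployment.

```python
def expected_error(prefix_dist, per_prefix_error):
    """Expected error = sum over prefixes of P(prefix) * error(prefix)."""
    return sum(p * per_prefix_error[s] for s, p in prefix_dist.items())

# The learner is accurate on "common" prefixes but bad on "rare" ones.
per_prefix_error = {"common": 0.05, "rare": 0.9}

# The reference policy that generated the training set rarely visits
# "rare" prefixes, but the learner's own rollouts visit them often.
ref_dist   = {"common": 0.95, "rare": 0.05}   # training distribution
onpol_dist = {"common": 0.40, "rare": 0.60}   # deployment distribution

train_err  = expected_error(ref_dist, per_prefix_error)    # ≈ 0.09 (low)
deploy_err = expected_error(onpol_dist, per_prefix_error)  # ≈ 0.56 (high)
```

The gap between the two numbers is exactly the distribution mismatch the paper targets: low training error under the reference policy's prefixes says little about performance once the learner is in the conversation loop.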

To mitigate these challenges, the authors propose REgressing the RELative FUture (Refuel), a method designed to efficiently optimize policies for multi-turn RLHF within LLMs. Refuel employs a single model to estimate Q-values by training on self-generated data, ensuring robustness against covariate shift. This is achieved by framing the problem as a sequence of regression tasks over iteratively collected on-policy datasets, which simplifies implementation. Notably, Refuel comes with the theoretical guarantee that it can match the performance of any policy covered by the training set.
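The per-pair regression step can be sketched as follows (a minimal sketch under our own simplifications; the variable names and the scaling parameter `eta` are assumptions, not the paper's notation): for two on-policy continuations sampled from a shared dialogue prefix, the scaled difference of the policy's log-probability ratios is regressed onto the difference of their rewards-to-go, so that a single policy model implicitly plays the role of the Q-value estimator.

```python
def refuel_pair_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                     rtg_a, rtg_b, eta=1.0):
    """Squared-regression loss on the *relative future* for one pair of
    rollouts sharing a prefix: the difference of policy log-ratios
    should predict the difference in rewards-to-go."""
    ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    target = rtg_a - rtg_b          # relative future reward
    return (eta * ratio_diff - target) ** 2

# Toy numbers: rollout a is better by 0.5 in reward-to-go, while the
# new policy has shifted its log-probabilities by a net 0.7 in a's favor.
loss = refuel_pair_loss(-1.0, -1.2, -2.0, -1.5, rtg_a=0.8, rtg_b=0.3)
# loss ≈ (0.7 - 0.5)**2 = 0.04
```

Because only relative quantities appear, no separate critic network is needed; repeating this regression over freshly collected on-policy datasets yields the iterative procedure described above.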

Empirical evaluations using Llama-3.1-70B-it to simulate a user demonstrate that Refuel consistently surpasses existing state-of-the-art methods, such as DPO and REBEL, across various settings. Remarkably, Llama-3-8B-it fine-tuned with Refuel, despite having only 8 billion parameters, outperforms the much larger Llama-3.1-70B-it on long multi-turn dialogues, showcasing the practical benefits of the proposed approach.

The implications of this work are profound, both in practical and theoretical dimensions. Practically, Refuel provides a streamlined approach to enhance LLM interactions in real-world applications requiring multi-turn dialogue management. Theoretically, it establishes a novel method for addressing the covariate shift problem in multi-turn settings without the overhead of an explicit critic network commonly seen in actor-critic RL methods.

Future directions could include applying Refuel to real-world datasets, incorporating humans in the loop for more complex interactions, or adapting the methodology to other domains requiring sequential decision-making, providing a robust foundation for advancing AI capabilities in dynamic environments. The implementation and trained models are openly available, facilitating further research and development in this area.

In summary, the Refuel approach offers a significant advancement in policy optimization for multi-turn RLHF, ensuring models can effectively plan and interact over extended dialogues, addressing a crucial limitation in current LLM applications.
