Offline Regularised Reinforcement Learning for Large Language Models Alignment
Abstract: The dominant framework for aligning large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt), and a human preference between the two, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. By contrast, \emph{single-trajectory} datasets, where each element is a triplet composed of a prompt, a response, and a human feedback signal, are naturally more abundant. A canonical element of such a dataset is an LLM's response to a user's prompt followed by the user's feedback, such as a thumbs-up/down. Consequently, in this work we propose DRO, or \emph{Direct Reward Optimisation}, a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically using T5 encoder-decoder LLMs, showing that DRO outperforms selected baselines such as Kahneman-Tversky Optimization (KTO). We thus confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.
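To make the mean-squared objective concrete: in KL-regularised RL the optimal policy satisfies β log(π*(y|x)/π_ref(y|x)) = r(x, y) − V*(x), where V*(x) is a value baseline, so one can regress the policy's scaled log-ratio onto the observed reward directly from single-trajectory (prompt, response, reward) data. The sketch below is an illustrative toy implementation under those assumptions; the function names, the ½ scaling, and the scalar-input interface are our own simplifications, not the paper's exact implementation (which trains π and V as neural networks by gradient descent):

```python
def dro_loss(reward, value, logp, logp_ref, beta=1.0):
    """Pointwise squared-error loss for one (prompt, response, reward) triplet.

    `value` is a learned baseline V(x); `logp` and `logp_ref` are the
    log-probabilities of the response under the policy and the reference
    model, respectively.
    """
    residual = reward - value - beta * (logp - logp_ref)
    return 0.5 * residual ** 2


def batch_dro_loss(triplets, beta=1.0):
    """Mean loss over a batch of (reward, value, logp, logp_ref) tuples."""
    return sum(dro_loss(r, v, lp, lpr, beta) for r, v, lp, lpr in triplets) / len(triplets)
```

At the optimum the residual vanishes, i.e. the KL-weighted log-ratio of the policy matches the advantage r(x, y) − V(x); note that no second response or pairwise comparison is ever needed, which is the point of the single-trajectory setting.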
- Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740, 2024.
- Concrete problems in AI safety. arXiv, 2016.
- PaLM 2 technical report, 2023.
- A general theoretical paradigm to understand learning from human preferences. arXiv, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022a.
- Constitutional AI: Harmlessness from AI feedback. arXiv, 2022b.
- Comparing bad apples to good oranges: Aligning large language models via joint preference optimization. arXiv preprint arXiv:2404.00530, 2024.
- Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
- Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022.
- Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Human alignment of large language models through online preference optimisation. arXiv preprint arXiv:2403.08635, 2024.
- Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
- A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
- Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
- Scaling instruction-finetuned language models, 2022.
- Reward model ensembles help mitigate overoptimization. arXiv, 2023.
- Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023. URL https://github.com/OpenBMB/UltraFeedback. MIT license.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
- RAFT: Reward rAnked FineTuning for generative foundation model alignment. arXiv, 2023.
- Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv, 2023.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- Scaling laws for reward model overoptimization. In Proceedings of the International Conference on Machine Learning, 2022.
- REBEL: Reinforcement learning via regressing relative rewards. arXiv preprint arXiv:2404.16767, 2024.
- Improving alignment of dialogue agents via targeted human judgements. arXiv, 2022.
- Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, 2013.
- Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, 2020.
- Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
- Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
- Beware of botshit: How to manage the epistemic risks of generative chatbots. Business Horizons, 2024. ISSN 0007-6813. doi: https://doi.org/10.1016/j.bushor.2024.03.001. URL https://www.sciencedirect.com/science/article/pii/S0007681324000272.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv, 2023.
- Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv, 2019.
- Towards efficient and exact optimization of language model alignment. arXiv preprint arXiv:2402.00856, 2024.
- TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the Annual International Symposium on Computer Architecture, 2023.
- D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292, 1979.
- Understanding the effects of RLHF on LLM generalisation and diversity. arXiv preprint arXiv:2310.06452, 2023.
- TAMER: Training an agent manually via evaluative reinforcement. In Proceedings of the IEEE International Conference on Development and Learning, 2008.
- Actor-critic algorithms. Advances in neural information processing systems, 12, 1999.
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv, 2023.
- Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Enhancing LLM safety via constrained direct preference optimization, 2024.
- Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.
- Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
- Nash learning from human feedback. arXiv, 2023.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv, 2021.
- OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
- Training language models to follow instructions with human feedback. arXiv, 2022.
- The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, abs/2201.03544, 2022.
- Reward gaming in conditional text generation. In Annual Meeting of the Association for Computational Linguistics, 2022.
- Disentangling length from quality in direct preference optimization, 2024.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
- WARM: On the benefits of weight averaged reward models. arXiv, 2024.
- A short variational proof of equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1712.08650, 2017.
- Scaling up models and data with t5x and seqio. arXiv, 2022. URL https://github.com/google-research/t5x. Apache-2.0 license.
- Direct nash optimization: Teaching language models to self-improve with general preferences, 2024.
- Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
- Proximal policy optimization algorithms. arXiv, 2017.
- Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2018.
- Adafactor: Adaptive learning rates with sublinear memory cost. arXiv, 2018.
- Benchmarks and algorithms for offline preference-based reward learning. arXiv, 2023.
- A long way to go: Investigating length correlations in RLHF. arXiv, abs/2310.03716, 2023.
- Defining and characterizing reward gaming. In Neural Information Processing Systems, 2022.
- Aligning large multimodal models with factually augmented RLHF, 2023.
- Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
- A minimaximalist approach to reinforcement learning from human feedback. arXiv, 2024.
- Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
- Llama 2: Open foundation and fine-tuned chat models. arXiv, 2023.
- Zephyr: Direct distillation of LM alignment. arXiv, 2023.
- Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv, 2023.
- Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the International Conference on Machine Learning, 2022.
- Fine-grained human feedback gives better rewards for language model training, 2023.
- Is DPO superior to PPO for LLM alignment? A comprehensive study. arXiv preprint arXiv:2404.10719, 2024.
- Self-rewarding language models, 2024a.
- Uni-RLHF: Universal platform and benchmark suite for reinforcement learning with diverse human feedback, 2024b.
- RRHF: Rank responses to align language models with human feedback without tears. arXiv, abs/2304.05302, 2023.
- Token-level direct preference optimization, 2024.
- Improving reinforcement learning from human feedback with efficient reward model ensemble, 2024.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv, 2023a.
- Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023b.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
- Consequences of misaligned AI. In Advances in Neural Information Processing Systems, 2020.
- Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 8, pages 1433–1438, 2008.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2020.