Supervised Fine-Tuning as Inverse Reinforcement Learning
Abstract: The prevailing approach to aligning LLMs typically relies on human or AI feedback and assumes access to specific types of preference datasets. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic. We build a sequential decision-making framework to formulate the problem of aligning LLMs using demonstration datasets. Drawing insights from inverse reinforcement learning and imitation learning, we introduce various approaches for divergence minimization in the LLM alignment tasks. Our analysis highlights the mass-covering and mode-seeking behaviors of these different approaches. Inclusively, we examine the pros and cons of the classical supervised fine-tuning method, elaborating on scenarios where different methods shine.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
- Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
- Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
- Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
- Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792, 2024.
- Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Stefan Schaal. Learning from demonstration. Advances in neural information processing systems, 9, 1996.
- Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE international conference on robotics and automation (ICRA), pages 6292–6299. IEEE, 2018.
- Deep q-learning from demonstrations. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE international conference on robotics and automation (ICRA), pages 2641–2646. IEEE, 2015.
- Urban driver: Learning to drive from real-world demonstrations using policy gradients. In Conference on Robot Learning, pages 718–728. PMLR, 2022.
- Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
- Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018.
- Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
- Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural computation, 3(1):88–97, 1991.
- Accountable batched control with decision corpus. Advances in Neural Information Processing Systems, 36, 2023.
- Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- Rethinking goal-conditioned supervised learning and its connection to offline rl. arXiv preprint arXiv:2202.04478, 2022.
- Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
- Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pages 783–792. PMLR, 2019.
- Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
- Strictly batch imitation learning by energy-based distribution matching. Advances in Neural Information Processing Systems, 33:7354–7365, 2020.
- A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1259–1277. PMLR, 2020.
- Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. arXiv preprint arXiv:1809.02925, 2018.
- What matters for adversarial imitation learning? Advances in Neural Information Processing Systems, 34:14656–14668, 2021.
- Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- Algorithms for inverse reinforcement learning. In Icml, volume 1, page 2, 2000.
- Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
- A distributional approach to controlled text generation. arXiv preprint arXiv:2012.11635, 2020.
- On decoding strategies for neural text generators. Transactions of the Association for Computational Linguistics, 10:997–1012, 2022.
- Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints. arXiv preprint arXiv:2309.16240, 2023.
- Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
- Policy invariance under reward transformations: Theory and application to reward shaping. In Icml, volume 99, pages 278–287, 1999.
- Dense reward for free in reinforcement learning from human feedback. arXiv preprint arXiv:2402.00782, 2024.
- Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- The rating of chessplayers: Past and present. (No Title), 1978.
- f-gan: Training generative neural samplers using variational divergence minimization. Advances in neural information processing systems, 29, 2016.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.