Aligning Large Language Models with Counterfactual DPO
Abstract: Advancements in large language models (LLMs) have demonstrated remarkable capabilities across a diverse range of applications. These models excel at generating text completions that are contextually coherent and span an extensive array of subjects. However, the vast datasets required for their training make it challenging to align response styles during the pretraining and instruction-tuning phases. Consequently, an additional alignment phase is typically employed, in which the model is further trained on human preference data to better align its outputs with human expectations. While this process does not introduce new capabilities per se, it does accentuate generation styles innate to the model. This paper explores the use of counterfactual prompting within the framework of Direct Preference Optimization (DPO) to align the model's style without relying on human intervention. We demonstrate that this method effectively instils desirable behaviours, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions. Our findings suggest that counterfactual prompting with DPO offers a low-resource way of fine-tuning LLMs to meet the demands of responsible and ethically aligned AI systems.
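The abstract describes pairing responses generated under contrasting prompts and feeding them to DPO as synthetic preference data. The sketch below illustrates one plausible reading of that pipeline; it is an assumption-laden illustration, not the paper's reference implementation. The `model.generate` helper and both instruction templates are hypothetical, while `dpo_loss` is the standard DPO objective, where each argument is a response's summed token log-probability under the policy or the frozen reference model, conditioned on the bare prompt.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective: push the policy's chosen-vs-rejected log-ratio
    # above the frozen reference model's, scaled by beta.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()


def build_counterfactual_pair(model, prompt: str,
                              desired: str = "Respond factually and without stylistic bias.",
                              undesired: str = "Respond in a strongly opinionated tone."):
    # Hypothetical data construction (assumed, not taken from the paper text):
    # the "chosen" response is sampled under the desired-style instruction,
    # the "rejected" one under a counterfactual instruction. `model.generate`
    # is an assumed helper for decoding a completion from a prompt string.
    chosen = model.generate(f"{desired}\n\n{prompt}")
    rejected = model.generate(f"{undesired}\n\n{prompt}")
    # Both responses are later scored against the bare prompt, so no human
    # preference labels are required.
    return prompt, chosen, rejected
```

Under this reading, scoring both responses against the bare prompt is the key design choice: the style elicited by the desired instruction becomes the model's default behaviour, without any human-labelled preference pairs.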