ORPO: Monolithic Preference Optimization without Reference Model
Abstract: While recent preference alignment algorithms for LLMs have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce ORPO, a straightforward and novel reference-model-free monolithic odds ratio preference optimization algorithm that eliminates the need for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across model sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on UltraFeedback alone surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 on MT-Bench (Figure 2). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).
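The objective described in the abstract is compact enough to sketch. Below is a minimal PyTorch illustration of the odds-ratio loss, assuming `chosen_logps` and `rejected_logps` are the per-token-averaged log-likelihoods of the favored and disfavored responses under the model being trained, and `nll_loss` is the standard SFT cross-entropy on the favored response; the function and the weighting parameter `lam` are hypothetical names for exposition, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              nll_loss: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """Sketch of the ORPO objective: SFT loss plus an odds-ratio penalty.

    chosen_logps / rejected_logps: per-token-averaged log-likelihoods
    (so exp(.) lies in (0, 1)) of the favored / disfavored responses.
    nll_loss: standard cross-entropy SFT loss on the favored response.
    lam: weight on the odds-ratio term (assumed hyperparameter).
    """
    # log odds(y|x) = log p - log(1 - p), computed stably in log space
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Odds-ratio term: -log sigmoid(log odds_chosen - log odds_rejected),
    # driven toward 0 as the favored response becomes relatively more likely
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # Monolithic objective: one loss, applied during SFT itself
    return nll_loss + lam * or_loss.mean()
```

Because both terms depend only on the policy being trained, a single forward pass over each (chosen, rejected) pair suffices: there is no frozen reference model to keep in memory and no separate alignment phase, which is what makes the method monolithic and reference-model-free.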