ALaRM: Align Language Models via Hierarchical Rewards Modeling
Abstract: We introduce ALaRM, the first framework to model hierarchical rewards in reinforcement learning from human feedback (RLHF), designed to better align large language models (LLMs) with human preferences. Current alignment approaches often struggle with inconsistent and sparse human supervision signals; ALaRM addresses this by integrating holistic rewards with aspect-specific rewards, enabling more precise and consistent guidance of LLMs toward desired outcomes, particularly in complex and open-ended text generation tasks. By filtering and combining multiple rewards based on their consistency, the framework provides a reliable mechanism for improving model alignment. We validate our approach on long-form question answering and machine translation tasks, using gpt-3.5-turbo for pairwise comparisons, and demonstrate improvements over existing baselines. Our work underscores the effectiveness of hierarchical rewards modeling in refining LLM training for better alignment with human preferences. We release our code at https://ALaRM-fdu.github.io.
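The abstract's core mechanism, filtering aspect-specific rewards by their consistency with the holistic reward and then combining the survivors into a single training signal, can be made concrete with a short sketch. This is a minimal illustration, not the paper's exact formulation: the pairwise-agreement consistency measure, the selection threshold, the weighted-sum combination, and all function names below are assumptions the abstract leaves unspecified.

```python
from itertools import combinations

def pairwise_consistency(aspect_scores, holistic_scores):
    """Fraction of response pairs on which an aspect-specific reward
    agrees with the holistic reward about which response is better.
    (Assumed consistency measure; the abstract does not define one.)"""
    agree = total = 0
    for i, j in combinations(range(len(holistic_scores)), 2):
        h = holistic_scores[i] - holistic_scores[j]
        a = aspect_scores[i] - aspect_scores[j]
        if h == 0:          # holistic reward expresses no preference
            continue
        total += 1
        agree += h * a > 0  # same sign -> same ranking of the pair
    return agree / total if total else 0.0

def combined_reward(holistic, aspect_rewards, consistencies,
                    threshold=0.7, aspect_weight=0.5):
    """Holistic reward plus a weighted sum of the aspect rewards that
    pass the consistency filter (threshold and weight are assumed)."""
    kept = [r for r, c in zip(aspect_rewards, consistencies) if c >= threshold]
    return holistic + aspect_weight * sum(kept)

# Toy example: holistic scores and two aspect rewards over 4 sampled responses.
holistic = [0.9, 0.2, 0.5, 0.7]
factuality = [0.8, 0.1, 0.6, 0.7]   # mostly agrees with the holistic signal
length = [0.1, 0.9, 0.4, 0.2]       # mostly disagrees, so it gets filtered out

c_fact = pairwise_consistency(factuality, holistic)   # 1.0
c_len = pairwise_consistency(length, holistic)        # 0.0

# Hierarchical reward for response 0: 0.9 + 0.5 * 0.8 = 1.3
r0 = combined_reward(holistic[0], [factuality[0], length[0]], [c_fact, c_len])
print(c_fact, c_len, r0)
```

In an RLHF loop, the scalar returned by `combined_reward` would stand in for the usual single reward-model score, for example as the per-response reward passed to a PPO-style policy update.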