Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Published 24 Oct 2024 in cs.AI and cs.CL (arXiv:2410.18451v1)

Abstract: In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
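The abstract does not spell out the training objective, but reward models trained on preference pairs like those in the Skywork-Reward collection conventionally use the Bradley-Terry pairwise loss: the negative log-probability that the chosen response scores above the rejected one. A minimal sketch of that standard formulation (this illustrates the common recipe, not necessarily the paper's exact method):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response outranks the
    rejected one under the Bradley-Terry model:
        L = -log sigmoid(r_chosen - r_rejected)
    where r_* are scalar scores from the reward model."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy scalar rewards for three (chosen, rejected) preference pairs.
pairs = [(2.0, 0.5), (1.2, 1.0), (0.3, -0.7)]
losses = [bradley_terry_loss(rc, rr) for rc, rr in pairs]
mean_loss = sum(losses) / len(losses)
```

A larger reward margin between the chosen and rejected responses yields a smaller loss, so minimizing the mean loss over a curated preference dataset pushes the model to score preferred responses higher; data-centric work like this paper's focuses on which pairs enter that loss.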
