Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST$^{EM}$, where we (1) generate samples from the model and filter them using binary feedback, (2) fine-tune the model on these samples, and (3) repeat this process a few times. Testing on the advanced MATH reasoning and APPS coding benchmarks using PaLM-2 models, we find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data. Overall, our findings suggest self-training with feedback can substantially reduce dependence on human-generated data.
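The three-step loop in the abstract (sample, filter by binary reward, fine-tune, repeat) can be sketched as a minimal self-contained simulation. Everything here is a hypothetical stand-in: `sample_fn`, `reward_fn`, and `finetune_fn` abstract over real model decoding, answer verification, and supervised fine-tuning, and the toy "model" is just a probability of producing a correct solution; none of these names come from the paper.

```python
import random

def rest_em(sample_fn, reward_fn, finetune_fn, model, iterations=3, num_samples=64):
    """One ReST^EM-style loop: generate, filter with binary feedback, fine-tune, repeat.

    sample_fn/reward_fn/finetune_fn are hypothetical stand-ins for real model
    sampling, correctness verification, and supervised fine-tuning.
    """
    for _ in range(iterations):
        # E-step: sample candidate solutions from the current model.
        samples = [sample_fn(model) for _ in range(num_samples)]
        # Filter: keep only samples the binary verifier accepts (reward == 1).
        accepted = [s for s in samples if reward_fn(s) == 1]
        # M-step: fine-tune the model on the filtered, self-generated data.
        model = finetune_fn(model, accepted)
    return model

# Toy instantiation: the "model" is the probability of emitting a correct answer.
random.seed(0)
def sample(p):         return 1 if random.random() < p else 0
def reward(s):         return s  # verifier: 1 iff the sample is correct
def finetune(p, data): return p if not data else min(0.99, p + 0.1 * len(data) / 64)

final_p = rest_em(sample, reward, finetune, model=0.2)
```

In the real method the M-step maximizes likelihood on the accepted samples only, which is why filtering with binary feedback (rather than weighting by a learned reward) keeps the procedure a simple EM-style iteration.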