Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
Abstract: Large decoder-only LMs can achieve substantially lower perplexity when augmented with retrieval (e.g., RETRO), but retrieval's impact on text generation quality and downstream task accuracy remains unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer this, we perform a comprehensive study of a scalable pre-trained retrieval-augmented LM (i.e., RETRO), comparing it with standard GPT and with retrieval-augmented GPT where retrieval is incorporated at the fine-tuning or inference stage. We first provide the recipe to reproduce RETRO at up to 9.5B parameters while retrieving from a text corpus of 330B tokens. Based on this, we report the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity when using a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves the open-domain QA results of the original RETRO (e.g., EM score +8.6 on Natural Questions) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our code and model at: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
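The chunk-level retrieval the abstract describes can be illustrated with a toy sketch. This is a hypothetical simplification, not the paper's implementation: real RETRO splits text into 64-token chunks, embeds them with a frozen BERT encoder, and queries a Faiss index over the 330B-token corpus, whereas the sketch below uses bag-of-words vectors and brute-force cosine similarity to show the retrieve-per-chunk pattern.

```python
# Toy sketch of RETRO-style chunk-level nearest-neighbor retrieval.
# Assumptions (not from the paper): bag-of-words "embeddings", a tiny
# 4-token chunk size, and exhaustive search instead of a Faiss index.
from collections import Counter
import math

CHUNK_SIZE = 4  # tokens per chunk (RETRO itself uses 64)

def chunks(tokens, size=CHUNK_SIZE):
    """Split a token list into fixed-size retrieval chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(chunk):
    """Bag-of-words vector; stands in for a frozen BERT encoder."""
    return Counter(chunk)

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query_chunk, index, k=2):
    """Return the k nearest corpus chunks for one input chunk."""
    q = embed(query_chunk)
    scored = sorted(index, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:k]

# Build a tiny "retrieval database" and query it with one input chunk.
corpus = "the cat sat on the mat while the dog slept on the rug".split()
index = chunks(corpus)

query = "the cat sat down".split()
neighbors = retrieve(query, index)
print(neighbors[0])  # most similar corpus chunk
```

In RETRO proper, the retrieved neighbor chunks (and their continuations) are encoded and attended to via chunked cross-attention in the decoder; the sketch only covers the retrieval half of that pipeline.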