Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
Abstract: Large decoder-only LMs can achieve substantially lower perplexity when augmented with retrieval (e.g., RETRO), but retrieval's impact on text generation quality and downstream task accuracy remains unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer this, we perform a comprehensive study of a scalable pre-trained retrieval-augmented LM (i.e., RETRO), comparing it with standard GPT and with retrieval-augmented GPT where retrieval is incorporated at the fine-tuning or inference stage. We first provide the recipe to reproduce RETRO at up to 9.5B parameters while retrieving from a text corpus of 330B tokens. Based on this, we report the following novel findings: i) RETRO outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity when using a nontoxic retrieval database. ii) On the LM Evaluation Harness benchmark, RETRO largely outperforms GPT on knowledge-intensive tasks but is on par with GPT on other tasks. Furthermore, we introduce a simple variant of the model, RETRO++, which largely improves the open-domain QA results of the original RETRO (e.g., EM score +8.6 on Natural Questions) and significantly outperforms retrieval-augmented GPT in both fine-tuning and zero-shot evaluation settings. Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models. We release our code and model at: https://github.com/NVIDIA/Megatron-LM/blob/main/tools/retro/README.md
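The chunk-level retrieval the abstract describes can be illustrated with a toy sketch. This is a hypothetical simplification, not the paper's implementation: real RETRO splits text into 64-token chunks, embeds them with a frozen BERT encoder, and queries a Faiss index over the 330B-token corpus, whereas the sketch below uses bag-of-words vectors and brute-force cosine similarity to show the retrieve-per-chunk pattern.

```python
# Toy sketch of RETRO-style chunk-level nearest-neighbor retrieval.
# Assumptions (not from the paper): bag-of-words "embeddings", a tiny
# 4-token chunk size, and exhaustive search instead of a Faiss index.
from collections import Counter
import math

CHUNK_SIZE = 4  # tokens per chunk (RETRO itself uses 64)

def chunks(tokens, size=CHUNK_SIZE):
    """Split a token list into fixed-size retrieval chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(chunk):
    """Bag-of-words vector; stands in for a frozen BERT encoder."""
    return Counter(chunk)

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query_chunk, index, k=2):
    """Return the k nearest corpus chunks for one input chunk."""
    q = embed(query_chunk)
    scored = sorted(index, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:k]

# Build a tiny "retrieval database" and query it with one input chunk.
corpus = "the cat sat on the mat while the dog slept on the rug".split()
index = chunks(corpus)

query = "the cat sat down".split()
neighbors = retrieve(query, index)
print(neighbors[0])  # most similar corpus chunk
```

In RETRO proper, the retrieved neighbor chunks (and their continuations) are encoded and attended to via chunked cross-attention in the decoder; the sketch only covers the retrieval half of that pipeline.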