
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

Published 11 Oct 2023 in cs.CL, cs.AI, cs.IR, and cs.LG | (2310.07713v3)

Abstract: Pretraining autoregressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLMs is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the largest LLM pretrained with retrieval. Specifically, we continue to pretrain a 43B GPT model on an additional 100 billion tokens using the Retro augmentation method by retrieving from 1.2 trillion tokens. Notably, the obtained foundation model, Retro 48B, largely outperforms the counterpart GPT 43B trained on 1.2T tokens in terms of perplexity with only 2.58% additional GPU hours, demonstrating the significant scaling potential of the method. After instruction tuning on Retro, InstructRetro demonstrates significant improvement over the instruction-tuned GPT on a wide range of zero-shot tasks. Specifically, the average improvement of InstructRetro is 7% over its GPT counterpart across 8 short-form QA and reading comprehension tasks, 10% over GPT across 4 challenging long-form QA tasks, and 16% over GPT across 3 summarization tasks. Surprisingly, we find that one can ablate the encoder from the InstructRetro architecture and directly use its decoder backbone, while achieving comparable results. Our results highlight the promising direction to obtain a better GPT decoder through continued pretraining with retrieval before instruction tuning. Our code and checkpoints are publicly available at: https://huggingface.co/nvidia/retro-48b-instruct-4k.


Summary

  • The paper demonstrates that integrating retrieval-augmented pretraining with instruction tuning enhances zero-shot task performance by 7% to 16% compared to similar-sized models.
  • The study details the efficient scale-up from Retro 7.5B to Retro 48B, achieving significant performance gains with only a 2.58% increase in GPU hours.
  • The paper reveals that a decoder-only configuration, obtained by ablating the encoder, achieves performance comparable to the full architecture, simplifying the design while preserving the benefits of retrieval.

Essay on "InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining"

The paper "InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining" delineates an approach to enhancing LLMs through retrieval-augmented pretraining followed by instruction tuning. The research introduces Retro 48B, an LLM distinguished by a significant increase in parameter count over its predecessors, setting new benchmarks for performance in LLM pretraining and downstream task execution.

Overview of Methodology

The research leverages retrieval-augmented pretraining, a method that incorporates external databases to improve LLM perplexity and factual accuracy. Retro 48B represents a significant scale-up from previous generations such as Retro 7.5B, demonstrating that the approach can be scaled efficiently while remaining computationally feasible. This was achieved by continuing to pretrain an existing 43B GPT model on an additional 100 billion tokens using the Retro augmentation method, retrieving from a database of 1.2 trillion tokens, at the cost of only 2.58% additional GPU hours compared to its unmodified counterpart. This efficiency in scaling highlights the method's potential and cost-effectiveness in achieving superior performance.
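The retrieval step underlying this kind of pretraining can be illustrated with a minimal sketch: chunks of the training sequence query a large database for nearest neighbors, which the model then conditions on. The bag-of-words `embed` function below is purely illustrative (Retro uses frozen BERT-style embeddings), and all names here are hypothetical, not from the paper's codebase.

```python
import zlib
import numpy as np

def embed(chunk, dim=64):
    """Toy deterministic bag-of-words embedding. A stand-in for the
    frozen BERT-style retriever embeddings used in Retro; illustrative only."""
    v = np.zeros(dim)
    for tok in chunk.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def build_index(database_chunks):
    """Precompute an embedding matrix for every chunk in the retrieval database."""
    return np.stack([embed(c) for c in database_chunks])

def retrieve(query_chunk, database_chunks, index, k=2):
    """Return the top-k database chunks by cosine similarity to the query chunk."""
    sims = index @ embed(query_chunk)
    top = np.argsort(-sims)[:k]
    return [database_chunks[i] for i in top]

# In Retro-style pretraining, each chunk of the training sequence is paired
# with neighbors retrieved like this; the decoder then attends to the
# encoded neighbors through chunked cross-attention.
db = [
    "the eiffel tower is in paris",
    "retrieval augmented models use external databases",
    "instruction tuning improves zero shot generalization",
]
index = build_index(db)
neighbors = retrieve("external databases help language models", db, index, k=1)
```

At the 1.2-trillion-token scale reported in the paper, the brute-force dot product above would be replaced by an approximate nearest-neighbor index, but the conditioning principle is the same.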

Instruction Tuning and Evaluation

Instruction tuning further refines the model's capacity to generalize across tasks without task-specific training data. InstructRetro, the outcome of this process, shows notable improvements in zero-shot settings: compared to a similarly sized instruction-tuned GPT model, it achieves an average improvement of 7% across 8 short-form QA and reading comprehension tasks, 10% across 4 challenging long-form QA tasks, and 16% across 3 summarization tasks. These figures underscore the model's enhanced ability to adapt to diverse task demands after instruction tuning.
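The core mechanical difference between pretraining and instruction tuning is often just which tokens contribute to the loss: the instruction is given as context, and only the response tokens are optimized. The sketch below assumes this common loss-masking recipe; it is a generic illustration, not the paper's exact training code.

```python
import numpy as np

def instruction_tuning_loss(token_nll, loss_mask):
    """Average negative log-likelihood over response tokens only.

    During instruction tuning, prompt/instruction tokens are typically
    masked out so the model is optimized only on the desired response.
    `token_nll` holds per-token NLLs from the decoder; `loss_mask` is
    1 for response tokens and 0 for prompt tokens.
    """
    token_nll = np.asarray(token_nll, dtype=float)
    loss_mask = np.asarray(loss_mask, dtype=float)
    return float((token_nll * loss_mask).sum() / loss_mask.sum())

# Example: a 6-token sequence where the first 3 tokens are the instruction.
nll  = [2.0, 1.5, 1.0, 0.8, 0.6, 0.4]   # hypothetical per-token NLLs
mask = [0,   0,   0,   1,   1,   1]     # train only on the response
loss = instruction_tuning_loss(nll, mask)  # (0.8 + 0.6 + 0.4) / 3 = 0.6
```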

Architectural Insights

One of the intriguing findings in this study is the ability to ablate the encoder from the InstructRetro architecture. Remarkably, the decoder-only configuration (InstructRetro 43B) achieved results comparable to the complete original architecture (InstructRetro 48B). This suggests that the fundamental benefits from retrieval-augmented pretraining are well-preserved in the decoder, simplifying the architectural complexity without sacrificing performance.
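With the encoder ablated, retrieved evidence can no longer be injected through cross-attention; one simple alternative is to place retrieved passages directly in the decoder's prompt. The helper below sketches that prompt format as an illustrative assumption; the function name and template are hypothetical, not the paper's exact recipe.

```python
def build_decoder_only_prompt(question, retrieved_passages, max_passages=2):
    """Format retrieved evidence into a plain decoder prompt.

    A decoder-only model (e.g. the InstructRetro backbone with its encoder
    removed) consumes retrieved context through its input window rather
    than through cross-attention. This template is an assumption for
    illustration, not the paper's published prompt format.
    """
    context = "\n\n".join(
        f"Passage {i + 1}: {p}" for i, p in enumerate(retrieved_passages[:max_passages])
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_decoder_only_prompt(
    "Where is the Eiffel Tower?",
    ["The Eiffel Tower is a landmark in Paris, France.",
     "It was completed in 1889."],
)
```

The practical upshot of the paper's ablation is that such a decoder-only deployment keeps the gains from retrieval-augmented pretraining without the extra encoder machinery at inference time.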

Practical Implications and Future Directions

The practical implications of this research are substantial. Applications requiring rapid adaptation to new or uncommon tasks could benefit significantly from the model's zero-shot capability. The findings also suggest a promising direction for future research: refining the integration and instruction-tuning processes of LLMs with minimal reliance on the full encoder-decoder retrieval architecture. Further work should explore retrieval-augmented instruction tuning with high-quality, corpus-specific retrieval data to fully realize the benefits of retrieval augmentation.

Conclusion

The "InstructRetro" paper provides a detailed technical advancement in the area of LLMs by demonstrating how retrieval-augmented pretraining, combined with instruction tuning, can produce models with superior performance and enhanced task adaptation. The study not only broadens the potential applications of LLMs but also informs future research on refining and optimizing LLM architectures.
