SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains

Published 23 Oct 2024 in cs.CL, cs.AI, cs.IR, and cs.LG | (2410.17952v2)

Abstract: Retrieval-augmented generation (RAG) enhances the question-answering (QA) abilities of LLMs by integrating external knowledge. However, adapting general-purpose RAG systems to specialized fields such as science and medicine poses unique challenges due to distribution shifts and limited access to domain-specific data. To tackle this, we propose SimRAG, a self-training approach that equips the LLM with joint capabilities of question answering and question generation for domain adaptation. Our method first fine-tunes the LLM on instruction-following, question-answering, and search-related data. Then, it prompts the same LLM to generate diverse domain-relevant questions from unlabeled corpora, with an additional filtering strategy to retain high-quality synthetic examples. By leveraging these self-generated synthetic examples, the LLM can improve their performance on domain-specific RAG tasks. Experiments on 11 datasets, spanning two backbone sizes and three domains, demonstrate that SimRAG outperforms baselines by 1.2\%--8.6\%.

Abstract PDF Upgrade to Chat

Citations (1)

View on Semantic Scholar

Summary

The paper introduces SimRAG as a two-stage fine-tuning framework that uses general instruction data followed by domain-specific self-training.
It generates pseudo-labeled QA pairs via synthetic question generation and round-trip consistency filtering to ensure high domain relevance.
Experiments reveal performance gains of 1.2% to 8.6% across biomedical, scientific, and computer science tasks, underscoring its practical impact.

SimRAG: Self-Improving Retrieval-Augmented Generation for Domain-Specific Adaptation

The paper introduces SimRAG, an innovative self-training approach designed to seamlessly adapt LLMs to domain-specific retrieval-augmented generation (RAG) tasks, with a focus on enhancing performance in specialized fields such as science and medicine.

Methodology

SimRAG's methodology centers on a two-stage fine-tuning process. Initially, LLMs are fine-tuned on general instruction-following and question-answering datasets. This step establishes a foundation for handling general-domain tasks by equipping models with basic instruction-following skills. The second stage involves generating synthetic, domain-specific QA pairs from unlabeled corpora through self-training. This is accomplished by prompting LLMs to generate domain-relevant questions and employing a filtering strategy to ensure high-quality synthetic examples.

Stage-I: Retrieval-Oriented Fine-Tuning

In the first stage, LLMs are enhanced by leveraging a combination of general instruction fine-tuning data and retrieval-related datasets. This includes a variety of QA tasks specifically chosen to reinforce context utilization skills. By integrating these diverse datasets, the models improve their grounding abilities, enabling them to effectively navigate knowledge-intensive scenarios.

Stage-II: Domain Adaptive Fine-Tuning with Self-Training

Stage-II capitalizes on self-training by generating pseudo-labeled training samples from domain-specific corpora. Candidate QA pairs are synthesized by first extracting potential answers and then generating questions conditioned on those answers and context. This process is refined using round-trip consistency filtering to retain only those pairs with robust domain relevance. The addition of tasks such as short-span QA, multiple-choice, and claim verification ensures the diverse nature of synthetic data, preventing overfitting and enhancing generalization across different tasks.

Figure 1: Effect of diverse types of generated QA pairs.

Experimental Results

The performance of SimRAG is evaluated across multiple datasets in the biomedical, scientific, and computer science domains. SimRAG consistently outperforms competing models by margins of 1.2% to 8.6%, illustrating its effectiveness in adapting LLMs to specific domains. The empirical studies underscore the advantages of joint training for question answering and generation, as well as the importance of employing diverse, tailored QA pairs.

Advantages and Limitations

SimRAG's framework offers substantial benefits by utilizing self-generated pseudo-labeled data, reducing the dependency on costly high-quality annotated data and offering a scalable solution for domain adaptation. However, the approach is constrained by its reliance on single-round query generation, potentially limiting the refinement of pseudo-label quality. Additionally, the method introduces additional computational overhead during pre-training phases, though inference complexity remains consistent with traditional RAG models.

Conclusion

SimRAG emerges as a compelling strategy for leveraging self-improvement within LLMs to tackle domain-specific RAG challenges. By enhancing models through joint capabilities in question answering and generation, SimRAG effectively navigates the complexities of domain specialization, ensuring robust performances across varied specialized tasks. Future work could explore iterative pseudo-label generation and the application of stronger query-generation models to further elevate its efficacy.