
RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation

Published 19 Feb 2025 in cs.CL and cs.AI (arXiv:2502.13957v2)

Abstract: Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often depend on ad-hoc prompt engineering and lack a unified optimization framework. We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt engineering, (2) actor tuning, and (3) critic training. For prompt engineering, we propose Re$2$Search, a novel agent incorporating reasoning reflection that significantly outperforms standard prompts. In actor tuning, we evaluate three popular post-training algorithms with fine-grained process supervision and identify direct preference optimization as the most effective. We further demonstrate that a trained critic can enhance inference by selecting higher-quality intermediate reasoning steps. Together, these findings lead to the optimized Re$2$Search++ agent, which surpasses most recent methods like Search-R1 by a relative increase of 3.2% to 11.6% in average F1. Finally, we examine the impact of different reward sources and analyze scaling properties in training and inference, offering practical insights for agentic RAG optimization. The project homepage is available at https://rag-gym.github.io.

Summary

  • The paper introduces the RAG-Gym framework, which models information-seeking as a nested Markov Decision Process (MDP) and uses process supervision to optimize language agents for complex retrieval-augmented generation (RAG).
  • A key innovation is the ReSearch agent architecture, which unifies answer reasoning with search query generation to strategically identify and fill knowledge gaps in questions.
  • Empirical evaluations demonstrate that RAG-Gym significantly improves performance, achieving a 25.6% gain over baselines on various QA datasets and showing robust reward model transferability.

RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision

The paper "RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision" introduces a robust framework designed to enhance the efficacy of information-seeking agents by integrating retrieval-augmented generation (RAG) with process supervision mechanisms. This work addresses key limitations of traditional RAG architectures, primarily their dependence on static retrieval processes, which restricts their utility on complex, sequential information-seeking tasks such as multi-hop question answering.

The authors propose the RAG-Gym framework, which re-envisions the process of knowledge-intensive question answering as a nested Markov Decision Process (MDP). This structure divides the task into an outer MDP, which orchestrates high-level actions interacting with an information retrieval (IR) environment, and an inner MDP that manages the detailed token generation within LLMs. Such an approach allows the incorporation of fine-grained process supervision, thus optimizing language agent policies through iterative assessments of intermediate steps rather than solely through final outcome evaluations.
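The outer MDP loop described above can be illustrated with a minimal sketch. All interfaces here (`toy_actor`, `toy_retrieve`, `toy_process_reward`) are hypothetical stand-ins, not the paper's actual code: the point is only that the agent alternates between search actions and a final answer, and that a process reward model scores each intermediate step rather than only the episode's outcome.

```python
# Hypothetical sketch of RAG-Gym's outer MDP loop (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    history: list = field(default_factory=list)  # list of (query, docs) pairs

def toy_actor(state):
    """Stand-in policy: search once, then answer (a real actor is an LLM)."""
    if not state.history:
        return ("search", state.question)
    return ("answer", "final answer based on retrieved docs")

def toy_retrieve(query):
    """Stand-in IR environment returning pseudo-documents."""
    return [f"doc about {query}"]

def toy_process_reward(state, action):
    """Stand-in process reward model: rewards answering only after searching."""
    kind, _ = action
    return 1.0 if (kind == "search" or state.history) else 0.0

def run_episode(question, max_steps=5):
    """Roll out the outer MDP, scoring every intermediate step."""
    state = State(question)
    step_rewards = []
    for _ in range(max_steps):
        action = toy_actor(state)
        step_rewards.append(toy_process_reward(state, action))
        kind, payload = action
        if kind == "answer":
            return payload, step_rewards
        state.history.append((payload, toy_retrieve(payload)))
    return None, step_rewards
```

The inner MDP (token-level generation inside the LLM) is abstracted away here inside `toy_actor`; process supervision attaches to the outer, action-level steps.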

A key innovation presented is the ReSearch agent, which unifies the reasoning of answers with the generation of search queries, thereby ensuring that retrieval actions directly contribute to answer formulation. The ReSearch architecture strategically leverages refined answer reasoning to identify knowledge gaps in a question, driving search queries that specifically aim to fill these gaps. This contrasts markedly with existing agents like ReAct, which depend heavily on heuristic-driven prompts that may not generalize seamlessly across diverse tasks.
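The gap-driven behavior of ReSearch can be caricatured in a few lines. This is an illustrative sketch, not the paper's implementation: claims are modeled as plain strings, and a "knowledge gap" is simply a reasoning step with no supporting evidence yet retrieved, which becomes the next search query.

```python
# Illustrative sketch of the ReSearch idea: draft answer reasoning first,
# then turn any unsupported claim into a targeted search query.
def research_step(question, evidence, draft_claims):
    """evidence: set of already-supported claims; draft_claims: reasoning steps."""
    for claim in draft_claims:
        if claim not in evidence:
            return ("search", claim)  # knowledge gap -> query that fills it
    return ("answer", " ".join(draft_claims))
```

Because queries are derived from the draft reasoning itself, every retrieval action directly serves answer formulation, which is the contrast the paper draws with heuristic prompt-driven agents like ReAct.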

Empirical evaluations conducted on datasets such as HotpotQA, 2WikiMultihopQA, Bamboogle, and MedQA indicate the superiority of RAG-Gym and ReSearch, including a 25.6% performance improvement over baseline methods. The study highlights the effectiveness of the proposed process reward models, demonstrating significant gains in answer accuracy and reasoning robustness when trained on fine-grained process annotations derived from LLM outputs such as GPT-4o.
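One way such process annotations feed actor tuning (the abstract identifies direct preference optimization as the most effective algorithm) is by converting per-step reward scores into preference pairs. The sketch below is an assumption about the data-construction step, not the paper's code: at each state, the highest- and lowest-scored candidate actions form a (chosen, rejected) pair, and ties are skipped.

```python
# Hedged sketch: build DPO-style preference pairs from per-step rewards.
def make_preference_pairs(step_candidates):
    """step_candidates: one list of (action, reward) tuples per visited state."""
    pairs = []
    for candidates in step_candidates:
        if len(candidates) < 2:
            continue  # need at least two candidates to form a preference
        ranked = sorted(candidates, key=lambda ar: ar[1], reverse=True)
        if ranked[0][1] > ranked[-1][1]:  # skip ties: no usable preference
            pairs.append((ranked[0][0], ranked[-1][0]))
    return pairs
```

The resulting (chosen, rejected) pairs are the standard input format for DPO-style fine-tuning of the actor.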

Furthermore, the framework is shown to facilitate substantial transferability of trained reward models across various LLM implementations, indicating their utility in optimizing proprietary models where direct parameter tuning might be constrained. The exploration of the scaling properties of both the training and inference phases within this context provides additional insights into the effectiveness of RAG-Gym across variable operational scales.
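A trained reward model is also usable purely at inference time, which is what makes it attractive for proprietary actors whose parameters cannot be tuned: sample several candidate actions from the (frozen) actor and let the critic pick the best one. The sketch below assumes generic `sample_action` and `critic_score` callables; both are illustrative.

```python
# Best-of-N step selection with a trained critic (illustrative interfaces):
# sample N candidate actions, score each, keep the highest-scoring one.
def best_of_n(sample_action, critic_score, state, n=4):
    candidates = [sample_action(state) for _ in range(n)]
    return max(candidates, key=lambda a: critic_score(state, a))
```

Increasing `n` trades inference compute for step quality, which is one of the inference-time scaling knobs the paper analyzes.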

In conclusion, this paper offers significant contributions to the field of machine learning by presenting a comprehensive framework—RAG-Gym—that bridges current gaps in retrieval-augmented generation for complex, multi-hop reasoning tasks. The proposed combination of a nested MDP approach with process-level supervision offers a paradigm shift in how information-seeking agents are trained and optimized, potentially setting a new standard for future AI research and application in diverse, knowledge-intensive domains.
