DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

Published 12 May 2025 in cs.CL and cs.AI | (2505.07233v2)

Abstract: Retrieval-augmented generation (RAG) systems combine LLMs with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker. Since irrelevant documents in RAG systems can mislead the generator, the reranker plays a vital role in refining retrieved documents to enhance generation quality and explainability. However, it is challenging to determine the appropriate number of documents ($k$) that the reranker should select: too few may result in missing critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results among models of same parameter sizes. The model, data and code are available at https://github.com/GasolSun36/DynamicRAG.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a reinforcement learning-based framework that leverages LLM outputs as reward signals to dynamically rerank retrieved documents for improved question answering.
It details a two-stage training pipeline combining behavior cloning and DPO-based RL optimization, resulting in state-of-the-art performance across multiple benchmarks.
Experimental results show enhanced ranking recall, targeted context selection, and a significant reduction in inference overhead with fewer LLM calls.

DynamicRAG: Reinforcement Learning-Based Dynamic Reranking for Retrieval-Augmented Generation

Motivation and Framework Overview

DynamicRAG addresses a central challenge in Retrieval-Augmented Generation (RAG): optimizing the reranking of retrieved documents before passing them to a generative LLM. Prior works largely relied on static reranking strategies or LLMs making ranking decisions without feedback from downstream generation quality. DynamicRAG formulates reranking as an RL problem, explicitly leveraging LLM-generated outputs as reward signals to train the reranking policy, allowing for dynamic adjustment of both document order and count per query. This agent-based paradigm enables query-adaptive context selection, directly optimizing for generation quality rather than arbitrarily fixed context sizes or simplistic document scoring heuristics.

Figure 1: DynamicRAG training formulation, which iteratively incorporates behavior cloning, trajectory sampling, generator-based feedback, and DPO-based RL optimization.

Methodology

Reranker as RL Agent

DynamicRAG models the reranker as an interactive agent within a partially observable MDP, where actions correspond to selecting ordered document subsets from the retrieval pool. The generator functions as the environment, producing responses conditioned on the selected documents. Rewards are computed via a multi-dimensional function integrating Exact Match, Semantic Similarity (BERTScore), Textual Fluency (ROUGE), Length Penalty, and LLM-driven quality assessment. This formulation enables the reranker to learn query-sensitive document selection policies maximizing downstream accuracy and informativeness.

Training Pipeline

The training pipeline unfolds in two primary stages:

Behavior Cloning: The reranker undergoes supervised fine-tuning using expert trajectories (e.g., MonoT5 outputs), furnishing a preliminary policy for list-wise response document selection.
RL Optimization via DPO: The reranker explores trajectory space by sampling multiple document combinations per query, receiving generator-based reward feedback. DPO is used to optimize the reranker, encouraging preference for trajectories yielding superior responses.

RL-based training facilitates the learned policy’s capacity to adapt document selection not only by relevance, but also by explicit generation outcome, robustly accommodating diverse query complexities.

Experimental Results

Generation Performance

Across seven benchmarks (e.g., NQ, HotpotQA, TriviaQA, ASQA), DynamicRAG consistently outperforms both retrieval-free LLMs (GPT-3.5, GPT-4, GPT-4o) and competitive retrieval-augmented baselines (RankRAG, ChatQA-1.5, Reward-RAG), achieving state-of-the-art EM, accuracy, and ROUGE-L scores for all backbone sizes (LLaMA2-7B, LLaMA2-13B, LLaMA3-8B). Notable gains are achieved despite DynamicRAG often utilizing fewer training instances compared to some agent baselines.

Figure 2: Comparative performance on Natural Questions, highlighting DynamicRAG’s improvements.

Reranker Effectiveness

DynamicRAG's reranker substantially improves ranking recall (R@5, R@10, R@20) over pointwise and listwise reranking baselines, even when trained on limited expert trajectory data. Empirical ablations reveal that list-wise RL optimization and generator-based feedback are critical for these gains, outperforming static IR models (BM25, Contriever) and fine-tuned rankers (monoT5, RankLLaMA).

Ablation Studies

Eliminating retrieval, reranking, or RL degrades downstream QA performance by substantial margins (up to 30 EM points). RL improves not only answer accuracy but also optimizes the number of documents provided to the generator, as length penalties and feedback induce more efficient, targeted context selection.

Scalability and Efficiency

DynamicRAG demonstrates efficient inference, requiring only two LLM calls per query for generation with Top-20 inputs, contrasted with exhaustive pairwise scoring approaches such as RankRAG that invoke the LLM up to 20 times for the same pool. Token-level profiling suggests a ~17x improvement in throughput without sacrificing answer quality. Sharing reranker and generator parameters further enhances downstream performance.

Analysis of Dynamic Document Selection

The RL-based training shifts the distribution of selected document counts ( $k$ ) leftward, with post-training policies favoring fewer documents due to regularization and reward-based feedback, reducing context window usage and potential noise. The system’s performance as a function of Top-N document pool size evidences a clear optimum: expanding Top-N improves accuracy until extraneous noise degrades generation (peak at Top-100 to Top-150).

Figure 3: DynamicRAG’s performance as a function of document pool size across NQ, HotpotQA, and ASQA using Exact Match.

Implications and Theoretical Insights

DynamicRAG operationalizes RL for learning listwise context selection, shifting the reranking paradigm from static, heuristic-driven approaches to adaptive, outcome-driven policies. The capacity to dynamically adjust context size per query not only improves factual accuracy but also computational efficiency. Importantly, the model demonstrates generalization across open-sourced and closed-sourced LLMs, indicating the broad applicability of RL-based reranking even within proprietary QA pipelines, as evidenced by improved GPT-4o performance using DynamicRAG’s reranker module.

Practically, DynamicRAG’s efficiency enables deployment in contexts where inference cost is critical. Theoretically, the system augments RAG by closing the feedback loop between retrieval and generation, enabling end-to-end optimization. Future research directions include extending RL-based reranking to finer-grained observation-based reward modeling, joint retriever-reranker-generator RL training, and robust adaptation to domain-heterogeneous corpora.

Conclusion

DynamicRAG presents a robust RL-based framework for dynamic reranking in RAG, leveraging direct LLM feedback to optimize context selection and generation performance. Experimentally validated across multiple knowledge-intensive benchmarks, DynamicRAG sets new standards for both accuracy and efficiency, demonstrating the strategic advantage of RL in retrieval-augmented pipelines. The methodology’s practical and theoretical strengths anticipate further advances in context-adaptive document selection and end-to-end generative QA architectures (2505.07233).