Query Suggestion for Agentic RAG
- The paper introduces a dynamic query suggestion method that integrates few-shot learning with QPP metrics to generate answerable sub-queries in agentic RAG systems.
- It details a process of masking entity values and retrieving past templates to guide sub-query formation, ensuring alignment with available tool workflows.
- Experimental results highlight improved retrieval performance and reduced agentic iterations, confirming enhanced answer quality and system efficiency.
Agentic Retrieval-Augmented Generation (RAG) systems define a paradigm in which an LLM dynamically interleaves reasoning steps with explicit invocations of retrieval tools, deciding when to fetch external documents and how to formulate sub-queries at each reasoning turn. This agentic control loop contrasts with static RAG pipelines by enabling the LLM to determine autonomously whether its internal knowledge is sufficient or external evidence is required—issuing a tool call such as <search> Q′ </search> to trigger retrieval, after which retrieved passages are incorporated into the ongoing reasoning context. Suggesting high-quality, answerable, and intent-preserving sub-queries under agentic RAG is a nontrivial challenge, involving complex trade-offs between intent similarity, answerability within the agent/tool constraints, adaptive workflow integration, and interaction with downstream query performance measures (Tian et al., 14 Jul 2025, Spaeh et al., 13 Jan 2026).
1. Foundations of Query Suggestion in Agentic RAG
In an agentic RAG regime, the LLM receives a user query and, at each reasoning step, can (a) continue internal reasoning; (b) generate a retrieval sub-query Q′; or (c) emit a final answer. Each <search> Q′ </search> trigger prompts the retrieval system to return the top-k documents, which are fed back into the LLM’s context for further reasoning and possible iterative sub-query refinement (Tian et al., 14 Jul 2025).
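The three-way decision loop above can be sketched as a minimal controller. This is an illustrative assumption about the interface, not the papers' implementation: `llm` is any callable producing one reasoning turn as text, `retriever` returns top-k passages, and the `<search>`/`<answer>` tags follow the convention quoted in the text.

```python
import re

def agentic_rag_loop(llm, retriever, user_query, max_steps=5, top_k=5):
    """Minimal agentic RAG control loop: at each turn the LLM either
    continues reasoning, emits a <search> sub-query, or answers."""
    context = f"Question: {user_query}\n"
    for _ in range(max_steps):
        step = llm(context)                                  # one reasoning turn
        match = re.search(r"<search>(.*?)</search>", step, re.S)
        if match:                                            # (b) retrieval sub-query Q'
            sub_query = match.group(1).strip()
            docs = retriever(sub_query, top_k)
            context += step + "\nRetrieved:\n" + "\n".join(docs) + "\n"
        elif "<answer>" in step:                             # (c) terminal answer
            return step.split("<answer>")[1].split("</answer>")[0].strip()
        else:                                                # (a) pure internal reasoning
            context += step + "\n"
    return None                                              # step budget exhausted
```

Retrieved passages are appended to the running context, so each subsequent sub-query can be conditioned on earlier evidence, which is the hook the later QPP-based refinements attach to.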
Key challenges for query suggestion under this paradigm include ensuring that (i) suggested queries are “answerable” in the sense that the RAG agent has available workflows and data to return a non-empty, plausible answer; (ii) the agent does not hallucinate possible tool capabilities or data scope; and (iii) the resulting sub-queries optimize answer quality and system efficiency in multi-step agentic workflows (Spaeh et al., 13 Jan 2026).
2. Answerability-Centric Suggestion: Dynamic Few-Shot Learning
Classical query recommendation solutions from web search do not transfer well, as answerability in agentic RAG depends on the existence of concrete tool workflows and data coverage within the agent. To address this, a robust dynamic few-shot in-context learning approach is proposed (Spaeh et al., 13 Jan 2026). The workflow is:
- Question Templating: Mask all entity values from the user query to obtain a schema-like template τ(q) (e.g., “How many invoices were processed in [timespan]?”), abstracting away specifics to expose the workflow requirement.
- Robust Example Retrieval: Given τ(q), retrieve a set of past templates from a continually updated database D with known answerability labels (answerable, not answerable) using embedding similarity, and filter using a majority-vote clustering mechanism to suppress mislabeled/hallucinated templates.
- Few-Shot Generation and Value Imputation: Supply positive and negative templates with explanations to the LLM, asking it to generate a similar, answerable template for query suggestion, then impute values using known-good arguments, data-derived hints, or samples drawn from schema constraints.
Empirical evaluation demonstrates that dynamic few-shot learning yields higher answerability and intent similarity (e.g., 82.5% answerable for InvoicesPython vs. 59.7% with static few-shot) and adapts as D is bootstrapped from real agent usage logs (Spaeh et al., 13 Jan 2026). The method is self-improving, requiring only continual ingestion of user queries and auto-evaluated execution traces.
3. Query Performance Prediction (QPP) for Adaptive Suggestion
Query Performance Prediction (QPP) provides a toolkit for estimating the expected quality of retrieval for any sub-query Q′ without access to explicit relevance labels. QPP signals—such as retrieval score variance (NQC), maximum retrieval score, embedding-based spread (hypercube volume), and pairwise embedding coherence (A-Pair-Ratio)—can serve as proxies for how likely the retrieved results will yield high-quality answers (Tian et al., 14 Jul 2025).
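A few of these post-retrieval signals can be computed directly from retrieval scores and document embeddings. The sketch below uses one common variant of NQC (score standard deviation normalized by the mean) and a mean pairwise cosine similarity as a coherence proxy; exact definitions in the literature vary, so treat these as illustrative.

```python
import statistics

def nqc(scores):
    """Normalized Query Commitment: std-dev of top-k retrieval scores,
    normalized by their mean (one common formulation)."""
    mu = statistics.fmean(scores)
    return statistics.pstdev(scores) / mu if mu else 0.0

def max_score(scores):
    """Maximum retrieval score among the top-k results."""
    return max(scores)

def pairwise_coherence(embeddings):
    """Mean pairwise cosine similarity of retrieved-document embeddings,
    a rough proxy for coherence-style signals such as A-Pair-Ratio."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(x * x for x in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    return sum(cos(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
```

High score variance with a high maximum typically signals a query that sharply separates relevant from non-relevant documents; tightly clustered embeddings suggest a topically coherent result set.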
The integration of QPP into agentic RAG query suggestion entails:
- Computing post-retrieval QPP metrics for each generated sub-query.
- If QPP(Q′) < τ_min, revising or regenerating Q′ before passing it to the agent, using information from high-QPP retrieved documents to boost term specificity or expand with named entities/key phrases.
- Implementing early stopping or query refinement loops if QPP remains below threshold, thus preventing propagation of poorly grounded queries.
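The gate-and-regenerate loop described in these steps can be sketched as below. The `generate`, `retrieve`, and `qpp` callables are assumed interfaces (not from the cited papers); feedback here is simply the retrieved documents, standing in for the term-specificity and entity-expansion signals mentioned above.

```python
def suggest_with_qpp_gate(generate, retrieve, qpp, tau_min=0.4, max_tries=3):
    """Generate a sub-query, retrieve, and score with QPP; while the score
    stays below tau_min, regenerate with the retrieved documents as feedback.
    Falls back to the best-scoring attempt if the budget is exhausted."""
    feedback = None
    best = (float("-inf"), None, None)
    for _ in range(max_tries):
        q = generate(feedback)             # feedback-conditioned regeneration
        docs = retrieve(q)
        score = qpp(q, docs)
        if score >= tau_min:
            return q, docs                 # confident enough to hand to the agent
        if score > best[0]:
            best = (score, q, docs)
        feedback = docs                    # refine using retrieved evidence
    return best[1], best[2]                # early stop: best-effort fallback
```

The `max_tries` cap implements the early-stopping behavior: a persistently low-QPP query is not propagated indefinitely through the agentic loop.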
Experimental evidence shows that first-step QPP is weakly but positively correlated with final answer F1 (Spearman ρ up to 0.25 for best setups), and pipelines using more effective retrievers (e.g., E5 dense) systematically achieve both higher answer quality and fewer agentic iterations. Shorter agentic retrieval chains are further associated with improved accuracy (ρ(Iter,F1) ≈ –0.3) (Tian et al., 14 Jul 2025).
4. Methodological Integration and System-Level Patterns
A robust agentic RAG system for query suggestion orchestrates the interplay between LLM-generated suggestion, retrieval outcomes, and QPP feedback in adaptive control flows. Central design patterns, as supported by empirical and architectural analyses, include:
- Adaptive Reasoning Loops: The LLM alternates between internal reasoning and retrieval (<search> Q′ </search>), where each new sub-query is optionally revised based on QPP signals or explicit rules triggered by low answerability (Tian et al., 14 Jul 2025, Spaeh et al., 13 Jan 2026).
- Asynchronous QPP and Modular Microservices: Real-time RAG deployments may parallelize retrieval and QPP computations, or expose QPP via a <qpp> Q′ </qpp> API, leveraging caching for repeated sub-queries (Tian et al., 14 Jul 2025).
- Dynamic Context Construction: Inclusion of dynamically retrieved, relevant few-shot contexts or workflow patterns (as “dynamic contexts”) improves both the linguistic quality and answerability of suggestion questions, outperforming static or naive retrieval approaches (Tayal et al., 2024).
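The caching aspect of the microservice pattern above is straightforward to realize. A minimal sketch, assuming the QPP scorer is a deterministic function of the sub-query string (in practice the cache key would also cover the retriever configuration):

```python
from functools import lru_cache

def make_cached_qpp(qpp_fn):
    """Wrap a QPP scorer so repeated sub-queries across agentic iterations
    reuse earlier computations instead of re-running retrieval + scoring."""
    @lru_cache(maxsize=4096)
    def cached(sub_query: str) -> float:
        return qpp_fn(sub_query)
    return cached
```

In an asynchronous deployment the same wrapper sits behind the <qpp> endpoint, and retrieval and QPP calls for several candidate sub-queries can be dispatched concurrently.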
5. Practical Guidelines for Query Suggestion in Agentic RAG
Implementation best practices derived from recent empirical and theoretical advances:
- Always mask entity values in queries when retrieving few-shot templates to focus pattern matching on workflow rather than surface tokens (Spaeh et al., 13 Jan 2026).
- Continually update the database of answerable templates using self-learned executions and LLM-based evaluators; hundreds of queries suffice to reach high suggestion accuracy.
- Compose two-phase suggestion: (1) generate template via few-shot dynamic in-context learning; (2) impute or sample values from tool schema, empirical tool responses, or prior success cases.
- Integrate post-retrieval QPP signals into the agentic control flow, with user-definable τ_min thresholds for early stopping and prompt regeneration.
- Balance QPP-driven accuracy gains against latency overhead; prefer fast or asynchronously computed QPP statistics for high-throughput requirements (Tian et al., 14 Jul 2025).
- Use domain-adaptive, ontology-backed controllers where possible to refine suggestions in specialized verticals (e.g., medical, financial) and maintain high coverage.
- For dynamic context systems, retrieve both example QAS triplets and background context, and assemble explicit, answerable prompts with clear constraints to minimize hallucination (Tayal et al., 2024).
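Phase (2) of the two-phase suggestion pattern, value imputation, can be sketched as follows. The schema format here (placeholder name mapped to a list of known-good or schema-derived candidate values) is a hypothetical illustration, not the cited paper's data model.

```python
import random

def impute_values(template, schema, rng=random.Random(0)):
    """Fill each [placeholder] in an answerable template with a value
    sampled from candidate lists (known-good arguments, data-derived
    hints, or values permitted by schema constraints)."""
    out = template
    for slot, candidates in schema.items():
        while f"[{slot}]" in out:
            out = out.replace(f"[{slot}]", str(rng.choice(candidates)), 1)
    return out
```

Sampling only from candidates the tools are known to accept is what keeps the imputed suggestion on an answerable path rather than reintroducing out-of-coverage entities.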
6. Limitations and Open Questions
Identified limitations include reliance on the LLM for templating and answerability evaluation (with potential failures on low-capacity models), the current framework's inability to reason over hybrid multi-tool or intermediate-result-dependent workflows, and the need for accurate auto-labeling evaluators when constructing the template database (Spaeh et al., 13 Jan 2026). QPP-based filtering introduces additional computational overhead, which must be traded against improved answer accuracy (Tian et al., 14 Jul 2025). Open research challenges include learning direct workflow embeddings, incorporating on-the-fly tool introspection, and closing the suggestion loop with reinforcement learning based on agentic performance metrics.
7. Impact on Agentic RAG System Efficiency and Reliability
Agentic RAG systems augmented with answerability-focused, QPP-aware query suggestion modules demonstrate greatly improved user interaction quality and reduced hallucination, as measured by higher empirical answerability rates and positive correlation with answer F1 (Tian et al., 14 Jul 2025, Spaeh et al., 13 Jan 2026). These improvements are attributable to the system’s ability to block unanswerable paths, reduce wasted retrieval iterations, and adaptively refine sub-queries based on historical agentic workflows. This results in not only more efficient computation but also increased trustworthiness of the RAG agent’s responses, particularly in settings where knowledge gaps or tool-coverage uncertainty would otherwise lead to degraded user experience and erroneous outputs.
References:
- "Am I on the Right Track? What Can Predicted Query Performance Tell Us about the Search Behaviour of Agentic RAG" (Tian et al., 14 Jul 2025)
- "Query Suggestion for Retrieval-Augmented Generation via Dynamic In-Context Learning" (Spaeh et al., 13 Jan 2026)
- "Dynamic Contexts for Generating Suggestion Questions in RAG Based Conversational Systems" (Tayal et al., 2024)