Knowing You Don't Know: Learning When to Continue Search in Multi-round RAG through Self-Practicing
Abstract: Retrieval Augmented Generation (RAG) has shown strong capability in enhancing LLMs' knowledge and reducing generative hallucinations, driving its widespread adoption. However, complex tasks requiring multi-round retrieval remain challenging, and early attempts tend to be overly optimistic, lacking self-skepticism. Current multi-round RAG systems may continue searching even when sufficient information has already been retrieved, or may give incorrect answers without having sufficient information or knowledge. Existing solutions either require large amounts of expensive human-labeled process-supervision data or lead to subpar performance. This paper addresses these limitations by introducing a new framework, SIM-RAG, which explicitly enhances RAG systems' self-awareness and multi-round retrieval capabilities. To train SIM-RAG, we first let a RAG system self-practice multi-round retrieval, augmenting existing question-answer pairs with intermediate inner-monologue reasoning steps to generate synthetic training data. For each pair, the system may explore multiple retrieval paths, which are labeled successful if they reach the correct answer and unsuccessful otherwise. Using this data, we train a lightweight information-sufficiency Critic. At inference time, the Critic evaluates whether the RAG system has retrieved sufficient information at each round, guiding retrieval decisions and improving system-level self-awareness through in-context reinforcement learning. Experiments across multiple prominent RAG benchmarks show that SIM-RAG is an effective multi-round RAG solution. The framework is also system-efficient, adding only a lightweight component to RAG without modifying existing LLMs or search engines, and data-efficient, eliminating the need for costly human-annotated mid-step retrieval process supervision data.
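The multi-round loop the abstract describes (answer, let the Critic judge sufficiency, retrieve again if rejected, stop at a turn budget) can be sketched as below. The `toy_*` components are hypothetical stand-ins for the paper's LLM Reasoner, BM25 retriever, and trained Critic, not the actual implementation.

```python
def sim_rag_loop(question, reasoner, retriever, critic, max_turns=3):
    """Answer-and-retrieve until the Critic accepts or the turn budget runs out."""
    context = []
    answer = reasoner(question, context)
    for _ in range(max_turns):
        if critic(question, context, answer):     # sufficient information: stop
            break
        context += retriever(question, context)   # otherwise, search another round
        answer = reasoner(question, context)
    return answer, len(context)

# Toy stand-ins (hypothetical) for demonstration only.
def toy_retriever(question, context):
    return [f"passage_{len(context)}"]

def toy_reasoner(question, context):
    return "Paris" if len(context) >= 2 else "unknown"

def toy_critic(question, context, answer):
    return answer != "unknown"  # accept once the Reasoner commits to an answer
```

With these stubs, the loop retrieves two passages before the toy Critic accepts; in the real system each call is an LLM or search-engine invocation.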
Knowledge Gaps, Limitations, and Open Questions
Below is a consolidated list of concrete gaps and unresolved questions that future work could address to strengthen and generalize SIM-RAG.
- Conflation of sufficiency and correctness in training labels: Self-practicing labels Accept/Reject via exact answer match (A' == A), not whether C actually supports A'. How to construct evidence-grounded labels (e.g., NLI/entailment over retrieved passages) so the Critic truly learns “information sufficiency” rather than answer correctness?
- Over-reliance on exact-match labeling: Exact string equality can mislabel paraphrases and aliases as incorrect. What is the best way to incorporate semantic equivalence (normalization, alias lists, LLM-based equivalence checks) during self-practice labeling?
- Coarse, binary feedback: Accept/Reject does not distinguish error modes (missing evidence, contradictory evidence, unsupported-but-correct guess, retrieval error, reasoning error). Can a fine-grained error taxonomy improve next-step decisions and guide targeted query reformulation?
- Critic calibration on multi-hop tasks: The Critic exhibits very low true-positive acceptance on HotpotQA/2Wiki, likely causing over-retrieval. Which calibration strategies (class rebalancing, cost-sensitive loss, temperature/threshold tuning, uncertainty estimation, selective prediction) yield better acceptance/rejection trade-offs?
- Transferability across Reasoners and retrievers: The Critic is trained on traces from a specific LLM and BM25 retriever. How well does it transfer to other LLMs, retrievers, and retriever configurations? Can a meta-critic be trained to be robust across models/tools?
- Applicability without gold answers: Self-practicing requires (Q, A) pairs. How can the approach be extended to settings without ground-truth answers (e.g., weak supervision, distant supervision, self-consistency filtering across models, or synthetic answer verification)?
- Retrieval backbone restrictions: Experiments use BM25; several baselines use stronger dense/hybrid retrievers or rerankers. Do SIM-RAG’s gains persist or improve with dense/hybrid retrieval, reranking, or web search, and can the Critic jointly optimize with these components?
- Untapped signal for query learning: Self-practice traces are not used to optimize query generation or retrieval policies. Can the same traces train a query planner/retrieval policy (e.g., RL or supervised learning) jointly with the Critic to reduce error propagation?
- Stopping rule design: Current stopping relies on a binary Critic verdict or a fixed max-turn cap (T). Can probabilistic, cost-aware stopping rules (expected value of information, budgeted decision-making) improve both quality and efficiency?
- Unanswerable/out-of-corpus detection: The system may loop until max turns for questions not answerable from the corpus. How can the Critic detect unanswerable cases early and trigger abstention, deferral, or user clarification?
- Actionable feedback from the Critic: The Critic only returns a binary verdict. Can it generate structured, actionable guidance (e.g., missing entities/relations, suggested facets) to steer the next retrieval step and reduce trial-and-error?
- Robustness to distractors and adversarial noise: The Critic’s behavior under noisy or contradictory contexts is not evaluated. How robust is it to distractor passages, and can adversarial training/hard-negative mining improve resilience?
- Faithfulness and attribution evaluation: Results focus on EM/F1; evidence grounding/faithfulness (e.g., attribution, entailment of answer by cited passages) is not measured. Do SIM-RAG’s “sufficient” answers consistently align with retrieved evidence?
- Efficiency and budget-awareness: Average number of rounds, token usage, latency, and monetary cost are not reported. What are the quality–cost trade-offs, and can budget-aware policies dynamically adapt retrieval depth?
- Domain, multilingual, and multimodal generalization: All evaluations are on English Wikipedia text. How does the method perform on domain-specific corpora (legal, medical), non-English languages, and non-text modalities (tables, code, images)?
- Continual learning and model drift: The Critic is system-specific; API LLMs and corpora evolve. What strategies (online learning, periodic self-practice refresh, drift detection) maintain alignment between Critic and changing Reasoners/Retrievers?
- Safety and bias considerations: Does the Critic reduce or inadvertently accept harmful, biased, or policy-violating outputs? How should sufficiency criteria incorporate safety constraints and sensitive-topic handling?
- Interpretability and diagnostics: The Critic does not expose rationales or highlight supporting spans. Can interpretable critics (rationale generation, evidence highlighting) aid debugging, trust, and model governance?
- Synthetic data distribution imbalance: Self-practice likely yields skewed Accept/Reject ratios differing by dataset. What sampling/reweighting/curriculum strategies mitigate imbalance and improve generalization?
- Prompt and feedback sensitivity: The approach relies on prompt engineering for both Reasoner and Critic and on specific Accept/Reject phrasings. How robust are results to prompt variations, and can standardized or learned prompt templates stabilize performance?
- Evaluation parity with baselines: Some baselines use stronger retrievers or different settings. Can future comparisons normalize retrievers/corpora and include statistical significance tests to isolate the effect of the Critic?
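Some of the gaps above are concrete enough to prototype. For the exact-match labeling concern, a first step beyond `A' == A` is SQuAD-style answer normalization plus an alias list; the sketch below is a hypothetical extension, not the paper's labeling code.

```python
import re
import string

def normalize(answer):
    """SQuAD-style normalization: lowercase; drop punctuation, articles, extra spaces."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def label_trace(predicted, gold, aliases=()):
    """Accept/Reject label for one self-practice trace; the alias list and
    normalization go beyond the paper's exact-match check."""
    accepted = {normalize(gold)} | {normalize(a) for a in aliases}
    return "Accept" if normalize(predicted) in accepted else "Reject"
```

This still conflates sufficiency with correctness; an entailment check over the retrieved passages would be needed to address that separate gap.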
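For the Critic-calibration gap, if the Critic exposed an acceptance probability rather than a hard verdict, a cost-sensitive threshold could trade false accepts against false rejects. A minimal sketch, assuming such scores exist (they are not part of the published system):

```python
def tune_threshold(scores, labels, fp_cost=1.0, fn_cost=2.0):
    """Pick the acceptance threshold over Critic probabilities that minimizes
    cost-weighted errors: fp_cost for accepting an insufficient state (label 0),
    fn_cost for rejecting a sufficient one (label 1)."""
    best_t, best_cost = 0.5, float("inf")
    for t in sorted(set(scores)):
        cost = sum(fp_cost for s, y in zip(scores, labels) if s >= t and y == 0)
        cost += sum(fn_cost for s, y in zip(scores, labels) if s < t and y == 1)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

Setting `fn_cost > fp_cost` (or vice versa) encodes whether over-retrieval or premature acceptance is the more expensive failure on a given benchmark.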
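The cost-aware stopping idea can likewise be made concrete. This toy rule, a sketch under assumed parameters rather than anything from the paper, continues retrieving only while the expected value of one more round exceeds its cost:

```python
def should_continue(p_sufficient, round_cost, answer_value=1.0, gain_per_round=0.15):
    """Budget-aware stopping sketch: another retrieval round pays off only if
    the expected improvement in answer value exceeds the round's cost.
    p_sufficient: estimated probability the current context already suffices;
    gain_per_round: assumed marginal chance a new round fixes a deficient state."""
    expected_gain = (1.0 - p_sufficient) * gain_per_round * answer_value
    return expected_gain > round_cost
```

Replacing the fixed max-turn cap T with such a rule would let retrieval depth adapt per question to token, latency, or monetary budgets.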
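Finally, for the Accept/Reject imbalance in synthetic self-practice data, standard inverse-frequency reweighting is an obvious baseline to try before more elaborate sampling or curriculum schemes; this is a generic sketch, not a mechanism from the paper.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rarer labels get proportionally larger
    weight in the Critic's training loss."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}
```

The resulting weights can be passed to a cost-sensitive loss so a dataset that is, say, 75% Reject does not teach the Critic to reject by default.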