Self-MedRAG: Iterative Medical QA
- Self-MedRAG is a hybrid self-reflective framework that iteratively refines medical answers using both BM25 and dense retrieval via reciprocal rank fusion for improved evidence coverage.
- It integrates explicit rationalization and automated self-reflection modules to verify evidentiary support, thereby reducing unsupported claims in complex clinical scenarios.
- The iterative hypothesis-verification process boosts accuracy on MedQA and PubMedQA benchmarks, demonstrating robust multi-hop reasoning in medical QA.
Self-MedRAG is a self-reflective hybrid Retrieval-Augmented Generation (RAG) framework designed to improve the reliability of medical question answering (QA) by closely emulating the iterative hypothesis-verification processes typical of clinical reasoning. Standard RAG approaches have demonstrated utility in grounding LLM outputs with external knowledge but remain limited by single-pass retrieval, which frequently fails to support multi-step inference required in complex biomedical scenarios. Self-MedRAG addresses these shortcomings by combining hybrid retrieval, explicit rationalization, automated self-reflection, and query reformulation, thereby minimizing unsupported claims and enhancing clinical applicability (Ryan et al., 8 Jan 2026).
1. Hybrid Retrieval Architecture
Self-MedRAG maximizes evidence coverage for medical queries by fusing both sparse and dense retrieval mechanisms through Reciprocal Rank Fusion (RRF). The sparse retriever employs BM25 with tuned $k_1$ and $b$ hyperparameters, prioritizing high-precision lexical matches, which are particularly effective for technical medical terminology. Dense retrieval is accomplished via Contriever-MSMARCO, a transformer-based model that encodes text into 768-dimensional vectors, leveraging dot-product similarity and FAISS for efficient passage search.
The RRF method combines the strengths of both retrievers. For each candidate passage $d$, the RRF score is calculated as

$$\mathrm{RRF}(d) = \sum_{r \in \{\text{BM25},\,\text{dense}\}} \frac{1}{k + \mathrm{rank}_r(d)},$$

with the smoothing constant $k$ set following prior best practices to manage the influence of low-ranked results. The top fused passages constitute the evidence context $C_t$ at each iteration. This hybridization improves retrieval recall, which is critical for resolving questions involving multi-hop reasoning.
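The fusion step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the constant `k=60` is an assumption taken from the widely used RRF default, and the function name is ours.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked lists of passage IDs via Reciprocal Rank Fusion.

    rankings: one ranked list of passage IDs per retriever (best first).
    k: smoothing constant that damps the contribution of low-ranked passages.
    Returns the top_n passage IDs by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort passage IDs by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because each contribution is $1/(k+\text{rank})$, a passage ranked highly by both retrievers outranks one that appears near the top of only a single list, which is exactly the recall-boosting behavior described above.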
2. Generation and Explicit Rationalization
A DeepSeek LLM (530M parameters, fine-tuned on in-domain QA) serves as the generative backbone. The model receives a structured prompt comprising the current query $q_t$, retrieved context $C_t$, explicit system instructions (such as “cite evidence” and “avoid unsupported claims”), and a history $H_t$ tracking previous queries, contexts, answers, and rationales. On invocation, the generator produces an explicit answer $a_t$ (e.g., yes/no, multiple-choice) alongside a rationale $R_t$, decomposed into a sequence of statements $s_1, \dots, s_n$, each attributed to specific supporting passages within $C_t$. This rationale format facilitates subsequent support verification and supports transparent reasoning chains, aligning with evidence-based clinical standards.
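The prompt-assembly step described above can be sketched as follows. The exact instruction wording, section labels, and history fields are our assumptions for illustration; only the overall structure (instructions, history, numbered evidence, current query) follows the description.

```python
def build_prompt(query, passages, history):
    """Assemble the structured generator prompt.

    Passages are numbered so the rationale can attribute each
    statement to specific evidence (e.g. "[2]").
    history: list of dicts with 'query', 'answer', 'rationale' keys.
    """
    instructions = (
        "Answer the medical question using ONLY the evidence below. "
        "Cite evidence by passage number and avoid unsupported claims."
    )
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    past = "\n".join(
        f"Q: {h['query']}\nA: {h['answer']}\nRationale: {h['rationale']}"
        for h in history
    )
    parts = [instructions]
    if past:
        parts.append("Previous attempts:\n" + past)
    parts.append("Evidence:\n" + evidence)
    parts.append("Question: " + query)
    return "\n\n".join(parts)
```

Carrying the history $H_t$ in the prompt is what lets later iterations condition on earlier failed rationales rather than restarting from scratch.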
3. Self-Reflection and Automated Criticism
Self-MedRAG employs a lightweight automated self-reflection module to verify the evidentiary grounding of each rationale statement. Two alternative critic models are supported: RoBERTa-large-MNLI (340M parameters, NLI-based) and Llama-3.1-8B (LLM-based). For each statement $s_i$ and each context passage $c_j \in C_t$, an entailment score $e(s_i, c_j)$ is computed. A statement is considered “supported” if $\max_j e(s_i, c_j) \geq \tau$ for a validated threshold $\tau$. The overall rationale support score is defined as the fraction of supported statements,

$$S(R_t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[\max_j e(s_i, c_j) \geq \tau\right].$$

Acceptance of the generated answer and rationale requires $S(R_t) \geq \tau_S$. If this criterion is unmet, the specific unsupported statements are extracted to guide further iterations.
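A minimal sketch of this support check follows. The entailment model is abstracted as a callable `entail(premise, hypothesis)`; the default `tau=0.5` is an assumption for illustration, since the paper tunes its thresholds on held-out validation data.

```python
def rationale_support(statements, passages, entail, tau=0.5):
    """Score a rationale by the fraction of statements entailed by evidence.

    entail(premise, hypothesis) -> score in [0, 1]; a statement counts as
    supported when its best score over all passages reaches tau.
    Returns (support_score, list_of_unsupported_statements).
    """
    unsupported = []
    for s in statements:
        best = max((entail(p, s) for p in passages), default=0.0)
        if best < tau:
            unsupported.append(s)
    if not statements:
        return 0.0, unsupported
    score = 1.0 - len(unsupported) / len(statements)
    return score, unsupported
```

Returning the unsupported statements alongside the score is what enables the query-refinement step: the next iteration's retrieval can target exactly the claims that lacked evidence.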
4. Iterative Hypothesis-Verification Process
To reduce unsupported or hallucinated claims, Self-MedRAG mimics iterative clinical reasoning via a multi-step hypothesis-verification loop. The process iteratively alternates between retrieval, generation, support verification, and query refinement:
- Initialize with a user question $q_0$ and an empty history $H_0$; set iteration index $t = 0$.
- Retrieve context $C_t$ for $q_t$ using the hybrid BM25/Contriever/RRF scheme.
- Generate answer $a_t$ and rationale $R_t$ conditioned on $q_t$, $C_t$, and $H_t$.
- Compute the support score $S(R_t)$ via NLI- or LLM-based support checking.
- If $S(R_t) \geq \tau_S$, accept $(a_t, R_t)$. Otherwise, derive the unsupported statements from $R_t$, append clarificatory prompts to form $q_{t+1}$, update $H_{t+1}$, and increment $t$.
- Repeat until either a supported rationale is achieved or the maximum number of iterations is reached.
This process was designed based on empirical evidence of diminishing accuracy returns beyond three iterations.
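The loop above can be sketched end-to-end under stated assumptions: `retrieve`, `generate`, and `verify` stand in for the hybrid retriever, the DeepSeek generator, and the self-reflection critic, and the `accept=0.8` threshold and three-iteration cap are illustrative defaults (the paper tunes its thresholds on validation data).

```python
def self_medrag(question, retrieve, generate, verify, max_iters=3, accept=0.8):
    """Iterative hypothesis-verification loop (sketch).

    retrieve(query) -> passages
    generate(query, passages, history) -> (answer, rationale_statements)
    verify(statements, passages) -> (support_score, unsupported_statements)
    """
    history = []
    query = question
    answer = None
    for _ in range(max_iters):
        passages = retrieve(query)                      # hybrid BM25+dense+RRF
        answer, statements = generate(query, passages, history)
        score, unsupported = verify(statements, passages)
        history.append({"query": query, "answer": answer, "score": score})
        if score >= accept:                             # rationale is supported
            return answer
        # Refine the query toward the claims that lacked evidence.
        query = question + " Clarify: " + "; ".join(unsupported)
    return answer  # best-effort answer after the iteration cap
```

The cap of three iterations mirrors the diminishing-returns observation above: most of the gain arrives in the first two refinement rounds.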
5. Corpus, Preprocessing, and Implementation
The retriever corpus comprises approximately 20 million medical abstracts and clinical guidelines from PubMed, indexed for both BM25 (lexical) and FAISS (dense) search. For BM25, standard text normalization (lowercasing and punctuation removal) is applied, while dense retrieval utilizes raw, untokenized text for embedding fidelity. The generator (DeepSeek) adopts chain-of-thought prompt formatting and is fine-tuned on in-domain medical QA data, with no extra supervised fine-tuning beyond established checkpoints. Hyperparameters for retrieval (BM25's $k_1$ and $b$; top 100 candidates per retriever), RRF (the constant $k$; truncation at the top 100 fused passages), and the support thresholds ($\tau$, $\tau_S$) were established using held-out validation for optimal precision–recall performance.
The self-reflection NLI critic uses RoBERTa-large-MNLI, while the LLM critic employs Llama-3.1-8B with explicit entailment prompts. Both follow the same entailment scoring and aggregation paradigm.
6. Evaluation and Empirical Performance
Self-MedRAG was assessed on MedQA (1,000 multiple-choice USMLE questions) and PubMedQA (1,000 yes/no research-abstract questions). Metrics include percentage accuracy and F1 score. Several baselines and system configurations were compared:
| System | PubMedQA Acc. / F1 | MedQA Acc. / F1 |
|---|---|---|
| BM25 only (base RAG) | 66.80 / 41.74 | 60.67 / 41.92 |
| Contriever only (base RAG) | 67.90 / 43.30 | 64.41 / 41.15 |
| BM25+Contriever+RRF (hybrid base) | 69.10 / 64.45 | 80.00 / 79.93 |
| Self-MedRAG + NLI (roberta-MNLI) | 79.82 / 78.40 | 83.33 / 83.30 |
| Self-MedRAG + LLM (Llama-3.1-8B) | 78.76 / 77.31 | 82.90 / 82.90 |
The addition of the self-reflective loop yielded significant accuracy improvements: MedQA accuracy increased from 80.00% (hybrid base) to 83.33% (+3.33 points), and PubMedQA from 69.10% to 79.82% (+10.72 points). Iterative gains were most pronounced in the first two steps (PubMedQA: 69.8% to 83.3%; MedQA: 79.3% to 86.1%), with diminishing returns observed by the third iteration.
7. Significance and Implications
Self-MedRAG demonstrates that a hybrid retrieval architecture, explicit evidence-linked rationalization, and automated, iterative self-reflection can substantially reduce unsupported LLM claims and improve outcomes on complex, multi-step clinical questions (Ryan et al., 8 Jan 2026). The framework's adoption of Reciprocal Rank Fusion ensures comprehensive evidence retrieval, while the iterative critic-guided process directly targets the key failure mode of unsupported reasoning. The system's superior empirical performance suggests that such architectures can materially improve the trustworthiness of LLMs in high-stakes biomedical and clinical knowledge tasks.
A plausible implication is that the iterative hypothesis-verification paradigm exemplified by Self-MedRAG may generalize beyond medical QA to other domains where verifiability and multi-hop reasoning are critical. The demonstrated accuracy gains and evidence coverage suggest promising future directions for robust, explainable LLM-based systems in technical and safety-critical settings.