Self-MedRAG: Iterative Medical QA
- Self-MedRAG is a hybrid self-reflective framework that iteratively refines medical answers using both BM25 and dense retrieval via reciprocal rank fusion for improved evidence coverage.
- It integrates explicit rationalization and automated self-reflection modules to verify evidentiary support, thereby reducing unsupported claims in complex clinical scenarios.
- The iterative hypothesis-verification process boosts accuracy on MedQA and PubMedQA benchmarks, demonstrating robust multi-hop reasoning in medical QA.
Self-MedRAG is a self-reflective hybrid Retrieval-Augmented Generation (RAG) framework designed to improve the reliability of medical question answering (QA) by closely emulating the iterative hypothesis-verification processes typical of clinical reasoning. Standard RAG approaches have demonstrated utility in grounding LLM outputs with external knowledge but remain limited by single-pass retrieval, which frequently fails to support multi-step inference required in complex biomedical scenarios. Self-MedRAG addresses these shortcomings by combining hybrid retrieval, explicit rationalization, automated self-reflection, and query reformulation, thereby minimizing unsupported claims and enhancing clinical applicability (Ryan et al., 8 Jan 2026).
1. Hybrid Retrieval Architecture
Self-MedRAG maximizes evidence coverage for medical queries by fusing both sparse and dense retrieval mechanisms through Reciprocal Rank Fusion (RRF). The sparse retriever employs BM25 with tuned $k_1$ and $b$ hyperparameters, prioritizing high-precision lexical matches, which are particularly effective for technical medical terminology. Dense retrieval is accomplished via Contriever-MSMARCO, a transformer-based model that encodes text into 768-dimensional vectors, leveraging dot-product similarity and FAISS for efficient passage search.
The RRF method combines the strengths of both retrievers. For each candidate passage $d$, the RRF score is calculated as

$$\mathrm{RRF}(d) = \sum_{r \in \{\text{BM25},\,\text{dense}\}} \frac{1}{k + \mathrm{rank}_r(d)},$$

with the smoothing constant $k$ set following prior best practices to manage the influence of low-ranked results. The top fused passages constitute the evidence context $C_t$ at each iteration. This hybridization improves retrieval recall, which is critical for resolving questions involving multi-hop reasoning.
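The fusion step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the constant `k=60` is an assumption taken from the widely used RRF default, and the function name is ours.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked lists of passage IDs via Reciprocal Rank Fusion.

    rankings: one ranked list of passage IDs per retriever (best first).
    k: smoothing constant that damps the contribution of low-ranked passages.
    Returns the top_n passage IDs by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort passage IDs by descending fused score.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because each contribution is $1/(k+\text{rank})$, a passage ranked highly by both retrievers outranks one that appears near the top of only a single list, which is exactly the recall-boosting behavior described above.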
2. Generation and Explicit Rationalization
A DeepSeek LLM (530M parameters, fine-tuned on in-domain QA) serves as the generative backbone. The model receives a structured prompt comprising the current query $q_t$, retrieved context $C_t$, explicit system instructions (such as “cite evidence” and “avoid unsupported claims”), and a history $H_t$ tracking previous queries, contexts, answers, and rationales. On invocation, the generator produces an explicit answer $a_t$ (e.g., yes/no, multiple-choice) alongside a rationale $R_t$, decomposed into a sequence of statements $s_1, \dots, s_n$, each attributed to specific supporting passages within $C_t$. This rationale format facilitates subsequent support verification and supports transparent reasoning chains, aligning with evidence-based clinical standards.
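The prompt-assembly step described above can be sketched as follows. The exact instruction wording, section labels, and history fields are our assumptions for illustration; only the overall structure (instructions, history, numbered evidence, current query) follows the description.

```python
def build_prompt(query, passages, history):
    """Assemble the structured generator prompt.

    Passages are numbered so the rationale can attribute each
    statement to specific evidence (e.g. "[2]").
    history: list of dicts with 'query', 'answer', 'rationale' keys.
    """
    instructions = (
        "Answer the medical question using ONLY the evidence below. "
        "Cite evidence by passage number and avoid unsupported claims."
    )
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    past = "\n".join(
        f"Q: {h['query']}\nA: {h['answer']}\nRationale: {h['rationale']}"
        for h in history
    )
    parts = [instructions]
    if past:
        parts.append("Previous attempts:\n" + past)
    parts.append("Evidence:\n" + evidence)
    parts.append("Question: " + query)
    return "\n\n".join(parts)
```

Carrying the history $H_t$ in the prompt is what lets later iterations condition on earlier failed rationales rather than restarting from scratch.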
3. Self-Reflection and Automated Criticism
Self-MedRAG employs a lightweight automated self-reflection module to verify the evidentiary grounding of each rationale statement. Two alternative critic models are supported: RoBERTa-large-MNLI (340M parameters, NLI-based) and Llama-3.1-8B (LLM-based). For each statement $s_i$ and each context passage $c_j \in C_t$, an entailment score $e(s_i, c_j)$ is computed. A statement is considered “supported” if $\max_j e(s_i, c_j) \geq \tau$ for a validated threshold $\tau$. The overall rationale support score is defined as the fraction of supported statements,

$$S(R_t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\left[\max_j e(s_i, c_j) \geq \tau\right].$$

Acceptance of the generated answer and rationale requires $S(R_t) \geq \tau_S$. If this criterion is unmet, the specific unsupported statements are extracted to guide further iterations.
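A minimal sketch of this support check follows. The entailment model is abstracted as a callable `entail(premise, hypothesis)`; the default `tau=0.5` is an assumption for illustration, since the paper tunes its thresholds on held-out validation data.

```python
def rationale_support(statements, passages, entail, tau=0.5):
    """Score a rationale by the fraction of statements entailed by evidence.

    entail(premise, hypothesis) -> score in [0, 1]; a statement counts as
    supported when its best score over all passages reaches tau.
    Returns (support_score, list_of_unsupported_statements).
    """
    unsupported = []
    for s in statements:
        best = max((entail(p, s) for p in passages), default=0.0)
        if best < tau:
            unsupported.append(s)
    if not statements:
        return 0.0, unsupported
    score = 1.0 - len(unsupported) / len(statements)
    return score, unsupported
```

Returning the unsupported statements alongside the score is what enables the query-refinement step: the next iteration's retrieval can target exactly the claims that lacked evidence.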
4. Iterative Hypothesis-Verification Process
To reduce unsupported or hallucinated claims, Self-MedRAG mimics iterative clinical reasoning via a multi-step hypothesis-verification loop. The process iteratively alternates between retrieval, generation, support verification, and query refinement:
- Initialize with a user question $q_0$ and an empty history $H_0$; set iteration index $t = 0$.
- Retrieve context $C_t$ for $q_t$ using the hybrid BM25/Contriever/RRF scheme.
- Generate answer $a_t$ and rationale $R_t$ conditioned on $q_t$, $C_t$, and $H_t$.
- Compute the support score $S(R_t)$ via NLI- or LLM-based support checking.
- If $S(R_t) \geq \tau_S$, accept $(a_t, R_t)$. Otherwise, derive the unsupported statements from $R_t$, append clarificatory prompts to form $q_{t+1}$, update $H_{t+1}$, and increment $t$.
- Repeat until either a supported rationale is achieved or the maximum number of iterations is reached.
This process was designed based on empirical evidence of diminishing accuracy returns beyond three iterations.
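The loop above can be sketched end-to-end under stated assumptions: `retrieve`, `generate`, and `verify` stand in for the hybrid retriever, the DeepSeek generator, and the self-reflection critic, and the `accept=0.8` threshold and three-iteration cap are illustrative defaults (the paper tunes its thresholds on validation data).

```python
def self_medrag(question, retrieve, generate, verify, max_iters=3, accept=0.8):
    """Iterative hypothesis-verification loop (sketch).

    retrieve(query) -> passages
    generate(query, passages, history) -> (answer, rationale_statements)
    verify(statements, passages) -> (support_score, unsupported_statements)
    """
    history = []
    query = question
    answer = None
    for _ in range(max_iters):
        passages = retrieve(query)                      # hybrid BM25+dense+RRF
        answer, statements = generate(query, passages, history)
        score, unsupported = verify(statements, passages)
        history.append({"query": query, "answer": answer, "score": score})
        if score >= accept:                             # rationale is supported
            return answer
        # Refine the query toward the claims that lacked evidence.
        query = question + " Clarify: " + "; ".join(unsupported)
    return answer  # best-effort answer after the iteration cap
```

The cap of three iterations mirrors the diminishing-returns observation above: most of the gain arrives in the first two refinement rounds.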
5. Corpus, Preprocessing, and Implementation
The retriever corpus comprises approximately 20 million medical abstracts and clinical guidelines from PubMed, indexed for both BM25 (lexical) and FAISS (dense) search. For BM25, standard text normalization (lowercasing and punctuation removal) is applied, while dense retrieval utilizes raw, untokenized text for embedding fidelity. The generator (DeepSeek) adopts chain-of-thought prompt formatting and is fine-tuned on in-domain medical QA data, with no extra supervised fine-tuning beyond established checkpoints. Hyperparameters for retrieval (BM25's $k_1$ and $b$; top 100 candidates per retriever), RRF (the constant $k$; truncation at the top 100 fused passages), and the support thresholds ($\tau$, $\tau_S$) were established using held-out validation for optimal precision–recall performance.
The self-reflection NLI critic uses RoBERTa-large-MNLI, while the LLM critic employs Llama-3.1-8B with explicit entailment prompts. Both follow the same entailment scoring and aggregation paradigm.
6. Evaluation and Empirical Performance
Self-MedRAG was assessed on MedQA (1,000 multiple-choice USMLE questions) and PubMedQA (1,000 yes/no research-abstract questions). Metrics include percentage accuracy and F1 score. Several baselines and system configurations were compared:
| System | PubMedQA Acc. / F1 | MedQA Acc. / F1 |
|---|---|---|
| BM25 only (base RAG) | 66.80 / 41.74 | 60.67 / 41.92 |
| Contriever only (base RAG) | 67.90 / 43.30 | 64.41 / 41.15 |
| BM25+Contriever+RRF (hybrid base) | 69.10 / 64.45 | 80.00 / 79.93 |
| Self-MedRAG + NLI (roberta-MNLI) | 79.82 / 78.40 | 83.33 / 83.30 |
| Self-MedRAG + LLM (Llama-3.1-8B) | 78.76 / 77.31 | 82.90 / 82.90 |
The addition of the self-reflective loop yielded significant accuracy improvements: MedQA accuracy increased from 80.00% (hybrid base) to 83.33% (+3.33 points), and PubMedQA from 69.10% to 79.82% (+10.72 points). Iterative gains were most pronounced in the first two steps (PubMedQA: 69.8% to 83.3%; MedQA: 79.3% to 86.1%), with diminishing returns observed by the third iteration.
7. Significance and Implications
Self-MedRAG demonstrates that a hybrid retrieval architecture, explicit evidence-linked rationalization, and automated, iterative self-reflection can substantially reduce unsupported LLM claims and improve outcomes on complex, multi-step clinical questions (Ryan et al., 8 Jan 2026). The framework's adoption of Reciprocal Rank Fusion ensures comprehensive evidence retrieval, while the iterative critic-guided process directly targets the key failure mode of unsupported reasoning. The system's superior empirical performance suggests that such architectures can materially improve the trustworthiness of LLMs in high-stakes biomedical and clinical knowledge tasks.
A plausible implication is that the iterative hypothesis-verification paradigm exemplified by Self-MedRAG may generalize beyond medical QA to other domains where verifiability and multi-hop reasoning are critical. The demonstrated accuracy gains and evidence coverage suggest promising future directions for robust, explainable LLM-based systems in technical and safety-critical settings.