Retrieval-Augmented Generation in Medicine: A Scoping Review of Technical Implementations, Clinical Applications, and Ethical Considerations
Abstract: The rapid growth of medical knowledge and the increasing complexity of clinical practice pose challenges for clinicians. In this context, large language models (LLMs) have demonstrated value; however, inherent limitations remain. Retrieval-augmented generation (RAG) technologies show potential to enhance their clinical applicability. This study reviewed RAG applications in medicine. We found that research primarily relied on publicly available data, with limited use of private data. For retrieval, approaches commonly relied on English-centric embedding models, while LLMs were mostly generic, with limited use of medical-specific LLMs. For evaluation, automated metrics assessed generation quality and task performance, whereas human evaluation focused on accuracy, completeness, relevance, and fluency, with insufficient attention to bias and safety. RAG applications were concentrated on question answering, report generation, text summarization, and information extraction. Overall, medical RAG remains at an early stage, requiring advances in clinical validation, cross-linguistic adaptation, and support for low-resource settings to enable trustworthy and responsible global use.
Explain it Like I'm 14
Overview
This paper looks at how a technology called retrieval‑augmented generation (RAG) is being used in medicine. RAG is a way to make AI “smarter” and more trustworthy by letting it look up information from reliable sources (like medical guidelines or research papers) while it writes answers. The authors reviewed 251 studies to see what kinds of medical problems RAG is being used for, how it’s built, how well it works, and what ethical issues it raises.
Key Questions the Paper Asked
- Where does medical RAG get its information, and how does it search for it?
- What kinds of AI models are used with RAG in medicine?
- What medical tasks is RAG helping with, and in which specialties?
- How do researchers judge if RAG’s answers are good and safe?
- What ethical issues—like bias, safety, and fairness across languages and countries—are being addressed or ignored?
How the Study Was Done
This was a scoping review, which means the authors didn't run new experiments themselves. Instead, they searched big research databases for studies about RAG in medicine, then summarized what they found.
Think of it like this:
- The authors acted like librarians who browsed many shelves (PubMed, Embase, Web of Science, Scopus) for books (studies) on one topic (RAG in medicine).
- After removing duplicates and reading titles, abstracts, and full papers, they kept 251 studies.
- They extracted key details from each study: data sources, search methods, AI models used, medical tasks, evaluation metrics, and any discussion of ethics.
To understand the tech in everyday terms:
- Large language models (LLMs): These are AIs that read and write text. They're great at generating answers but can sometimes make things up or be outdated.
- Retrieval‑Augmented Generation (RAG): Imagine a student taking an open‑book test. Instead of relying only on memory, the student can look up facts in real time. RAG lets an AI do that—first it finds relevant information, then it uses that to write its response.
- Dense vs. Sparse Retrieval:
- Dense retrieval is like searching by meaning. If you ask, “best treatment for high blood pressure,” it tries to find text that means the same thing even if the wording is different.
- Sparse retrieval is like searching by exact words (think simple keyword matching).
- Hybrid mixes both to improve accuracy.
- Embeddings: A way for the AI to turn words into number patterns that capture meaning. Most embeddings used were trained on English, which can be a problem for other languages.
- Knowledge graphs: Imagine a map of medical facts where diseases, drugs, and symptoms are connected by lines. RAG can search these maps to find trustworthy links between ideas.
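To make the "open-book test" idea concrete, here is a toy sketch of the retrieve-then-generate loop described above. It is not from the reviewed studies: the three-document "knowledge base," the keyword-overlap function standing in for sparse retrieval, and the `difflib` string-similarity score standing in for a real dense embedding model are all simplified assumptions for illustration.

```python
from difflib import SequenceMatcher

# Tiny document store standing in for a medical knowledge base (toy data).
DOCS = [
    "First-line treatment for hypertension includes lifestyle changes and thiazide diuretics.",
    "Type 2 diabetes is commonly managed with metformin and dietary modification.",
    "Community-acquired pneumonia in adults is often treated with amoxicillin.",
]

def sparse_score(query, doc):
    """Sparse retrieval stand-in: fraction of query words found in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def dense_score(query, doc):
    """Dense retrieval stand-in: string similarity as a crude proxy for
    embedding similarity (a real system would embed both texts with a
    neural model and compare the vectors)."""
    return SequenceMatcher(None, query.lower(), doc.lower()).ratio()

def hybrid_retrieve(query, docs, alpha=0.5):
    """Hybrid retrieval: blend the sparse and dense scores, keep the best doc."""
    scored = [(alpha * sparse_score(query, d) + (1 - alpha) * dense_score(query, d), d)
              for d in docs]
    return max(scored)[1]

def rag_answer(query):
    # Step 1: retrieve relevant context; step 2: hand it to the generator.
    context = hybrid_retrieve(query, DOCS)
    # In a real RAG system, this prompt would be sent to an LLM.
    return f"Question: {query}\nRetrieved context: {context}"

print(rag_answer("What is the treatment for high blood pressure?"))
```

Note how the query says "high blood pressure" while the best document says "hypertension": the keyword overlap alone is weak here, which is exactly the gap that real dense (meaning-based) retrieval is meant to close.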
Main Findings and Why They Matter
Here are the most important takeaways from the 251 studies:
- Data sources:
- About 80% used public data (like PubMed or online guidelines), not private hospital or patient records. This makes RAG easier to test but less personalized to individual patients.
- Retrieval methods:
- Dense retrieval dominated (about 84%), with sparse and hybrid methods used less often.
- Most retrieval tools were built for English. This limits accuracy for non‑English medical data and can worsen global health inequities.
- AI models:
- Proprietary general LLMs (like GPT) were used most.
- Open models (like LLaMA, Gemma, Qwen) were also common.
- Medical‑specific LLMs were rarely used, often because they’re not publicly available or still catching up in performance.
- Medical tasks:
- The top uses were medical question answering, report generation (like radiology summaries), text summarization, and information extraction.
- Question answering helps doctors and patients find evidence quickly but carries risk if sources aren’t checked.
- Report generation and summarization are lower-risk ways to reduce paperwork and improve efficiency.
- Evaluation:
- About half used automated scores (like ROUGE or accuracy).
- Many also used human experts to judge factual correctness and clinical usefulness.
- Very few studies deeply checked for bias (less than 3%), safety issues (~10%), or performance in low‑resource settings (~2%). This is a serious gap because medical AI must be fair and safe for everyone.
- Specialty spread:
- Internal medicine had the most RAG studies.
- Psychiatry, neurology, and radiology had moderate activity.
- Many specialties were barely covered, showing uneven adoption.
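As a concrete example of the automated scores mentioned in the findings above, ROUGE counts how much of a human-written reference answer shows up in the AI's answer. The sketch below is a deliberately simplified ROUGE-1 recall (word membership only, no stemming or clipped counts, unlike the full metric); the example sentences are made up for illustration.

```python
def rouge1_recall(reference, candidate):
    """Simplified ROUGE-1 recall: fraction of reference words that also
    appear anywhere in the candidate (no stemming, no count clipping)."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    hits = sum(1 for w in ref_words if w in cand_words)
    return hits / len(ref_words)

# Toy example: every reference word appears in the candidate, so recall is 1.0,
# even though the wording and emphasis differ.
reference = "metformin is first line therapy for type 2 diabetes"
candidate = "first line therapy for type 2 diabetes is usually metformin"
print(rouge1_recall(reference, candidate))
```

This also illustrates why the paper notes that automated scores alone are not enough: a high word-overlap score says nothing about whether the answer is clinically safe or correct, which is why human expert review and bias/safety checks still matter.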
Why it matters:
- RAG can make AI answers more up‑to‑date and fact‑based—a big win in medicine.
- But relying mostly on English tools and public data means it may not help all patients equally or handle local guidelines and languages.
- Weak attention to safety and bias could lead to harmful or unfair advice.
What This Means for the Future
The authors conclude that medical RAG is promising but still early. To make it truly helpful and trustworthy in real clinics, they recommend:
- Strong clinical testing: Don’t just check if the text looks good—prove it is accurate and useful for patient care.
- Transparency and traceability: Let users see sources and reasoning so doctors can trust and verify the AI’s answers.
- Better ethics and safety practices: Build systems to detect and reduce bias and harmful content.
- Language and culture fairness: Develop multilingual tools and adapt to local guidelines so RAG works globally, not just in English‑speaking settings.
- Support for low‑resource areas: Make tools that can run with limited data or computing power to reduce health disparities.
- Start with safer tasks: Focus on report generation, summarization, and information extraction, which carry lower risk, before high‑stakes question answering that could influence diagnoses directly.
In short, RAG can help doctors and patients by grounding AI answers in real medical evidence. But to be safe, fair, and useful worldwide, it needs better validation, transparency, multilingual support, and strong ethical guardrails.