From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

Published 18 Jun 2025 in cs.CL | (2506.15911v2)

Abstract: Centuries-old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.


Summary

Evaluating Islamic-Medicine Responses Using LLM Agents

This paper, "From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents," introduces Tibbe-AG, a framework for improving the accuracy and cultural sensitivity of LLM-based medical question answering. It addresses a gap in applying AI to culturally grounded medical knowledge, focusing on Islamic medicine as recorded in foundational texts such as Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi. These texts contain substantial material on preventive and holistic health practices, yet they remain largely underutilized in modern AI-based medical systems.

The authors propose an evaluation pipeline, Tibbe-AG, that pairs 30 curated Prophetic-medicine questions with human-verified remedies and assesses responses from three LLMs (LLaMA-3, Mistral-7B, and Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation (RAG), and an agentic configuration that adds a scientific self-critique step on top of retrieval. Each answer is then assessed by a secondary LLM acting as a judge, which yields a single 3C3H quality score covering correctness, completeness, conciseness, helpfulness, harmlessness, and honesty.

Methodological Overview

Tibbe-AG's methodology integrates classical Islamic medical knowledge into a modern LLM pipeline in several stages. First, the system uses dense retrieval over a ChromaDB vector store to pull the most relevant passages from annotated corpora of traditional texts. This grounds the subsequent answers in the source texts rather than in the model's parametric memory alone.
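
To make the retrieval step concrete, here is a minimal sketch of top-k dense retrieval by cosine similarity. It does not use the ChromaDB API; the toy vectors, the placeholder passages, and the names `cosine`, `top_k`, and `TOY_INDEX` are illustrative assumptions, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, indexed_passages, k=2):
    """Return the k passage texts whose embeddings are closest to the query."""
    ranked = sorted(
        indexed_passages,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical index: (passage text, embedding). A real system would obtain
# the embeddings from an embedding model and store them in a vector database.
TOY_INDEX = [
    ("placeholder passage A", [1.0, 0.0, 0.0]),
    ("placeholder passage B", [0.0, 1.0, 0.0]),
    ("placeholder passage C", [0.7, 0.7, 0.0]),
]

hits = top_k([0.9, 0.1, 0.0], TOY_INDEX, k=2)
print(hits)
```

In a real deployment the query vector would come from the same embedding model used to index the corpus, so that query and passages live in the same vector space.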

The framework then generates the answer in layers. The base LLM first combines the user query with the retrieved passages, using extraction and summarization prompts, to produce a preliminary answer that stays traceable to the source material. A validation stage then prompts the LLM to fact-check this draft against the retrieved content and to add mechanistic and safety considerations to the final answer.
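
The draft-then-validate flow described above can be sketched as two LLM calls chained together. This is a minimal sketch under stated assumptions: the prompt wording, the function names, and the `llm` callable interface are hypothetical and are not the paper's actual prompts or code.

```python
from typing import Callable, Dict, List

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]

def build_draft_prompt(question: str, passages: List[str]) -> str:
    """Hypothetical prompt for the first pass: answer only from the passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the passages below.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )

def build_validation_prompt(question: str, passages: List[str], draft: str) -> str:
    """Hypothetical prompt for the second pass: fact-check and refine the draft."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Fact-check the draft answer against the passages, then rewrite it, "
        "adding mechanistic and safety considerations where relevant.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}\n"
        f"Draft answer: {draft}"
    )

def answer_with_validation(question: str, passages: List[str], llm: LLM) -> Dict[str, str]:
    """Run the draft pass, then feed the draft into the validation pass."""
    draft = llm(build_draft_prompt(question, passages))
    final = llm(build_validation_prompt(question, passages, draft))
    return {"draft": draft, "final": final}

# Stub model for demonstration: records each prompt and returns a numbered reply.
calls: List[str] = []

def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"

result = answer_with_validation("example question", ["placeholder passage"], stub_llm)
```

Passing the model in as a callable keeps the sketch independent of any particular LLM client and makes the two-stage control flow easy to test with a stub.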

To evaluate the generated responses, the paper uses the 3C3H metric. Scores are computed with several different judge models, which checks that the results hold across judges rather than depending on a single evaluator.
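
As an illustration of how a single score might be derived from the six 3C3H dimensions named earlier in this summary, the sketch below takes an unweighted mean per judge and then averages across judges. The aggregation rule, the 0-to-1 rating scale, and all names here are assumptions; this summary does not state the paper's exact formula.

```python
from statistics import mean

DIMENSIONS = (
    "correctness", "completeness", "conciseness",
    "helpfulness", "harmlessness", "honesty",
)

def score_3c3h(ratings):
    """Unweighted mean of the six dimension ratings (assumed to lie in [0, 1])."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(ratings[d] for d in DIMENSIONS)

def mean_across_judges(per_judge):
    """Average the per-judge 3C3H scores for one answer."""
    return mean(score_3c3h(r) for r in per_judge.values())

# Made-up ratings from two hypothetical judges, for illustration only.
example = {
    "judge_a": {d: 1.0 for d in DIMENSIONS},
    "judge_b": {d: 0.5 for d in DIMENSIONS},
}
overall = mean_across_judges(example)
print(overall)
```

Comparing the per-judge scores before averaging them is one simple way to see whether a result depends on the choice of judge.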

Results and Analysis

The study reports that Tibbe-AG's agentic configuration outperforms both direct generation and plain RAG. Retrieval improves factual accuracy by 13%, and the agentic prompt adds a further 10% improvement through deeper mechanistic insight and safety considerations. Across the base and judge models tested, the agentic configuration consistently outperformed the alternatives on 3C3H scores, which suggests that it helps not only with retrieving and accurately presenting information but also with safety and cultural sensitivity.

Implications and Future Directions

The research has both theoretical and practical implications. Theoretically, it illustrates how integrating culturally grounded knowledge into AI systems can improve the contextual understanding and applicability of automated responses. Practically, the framework could support more culturally competent AI systems that augment the educational tools available to scholars and practitioners of Unani medicine and, by extension, serve the large populations that rely on these traditional medical systems, particularly in regions where they are historically prevalent.

Future developments could involve scaling this initial setup to more extensive datasets, refining Tibbe-AG's agentic validation component, and conducting comprehensive user studies to enhance AI acceptance within the context of culturally embedded medical practices. This research marks a significant step towards leveraging AI to preserve and disseminate historical medical knowledge, making it accessible and relevant to modern healthcare environments.
