Agentic Quran-Grounding Framework
- The Agentic Quran-Grounding Framework is a domain-specialized retrieval-augmented generation (RAG) system that employs multi-turn, tool-based reasoning to generate verifiable, Quran-cited answers.
- It integrates dense retrieval, structured tool calls, and iterative evidence aggregation to reduce hallucination and improve factual accuracy.
- Ablation studies show consistent accuracy gains over traditional single-shot RAG pipelines.
The Agentic Quran-Grounding Framework (Agentic RAG) is a domain-specialized adaptation of Agentic Retrieval-Augmented Generation designed to facilitate faithful, verifiable Islamic question answering grounded in the Qur’an. It leverages multi-turn tool-based reasoning and agentic orchestration—rather than static or linear retrieval—enabling dynamic evidence seeking, explicit verse citation, and the mitigation of ungrounded or hallucinatory outputs. Agentic RAG architectures combine dense retrieval over a verse-level corpus, instruction-tuned LLM controllers, structured tool interfaces, and multi-stage answer revision protocols, achieving state-of-the-art factual accuracy and reduced hallucination rates compared to standard or single-shot RAG methods in Islamic QA contexts (Bhatia et al., 12 Jan 2026, Singh et al., 15 Jan 2025).
1. Core Principles and Architectural Foundations
Traditional RAG systems operate by issuing a single or fixed set of retrievals from a knowledge base and then prompting an LLM to generate an answer using the retrieved context. While effective in many verticals, this design exhibits critical deficiencies in high-stakes domains, including noisy or insufficient retrieval, weak query-document alignment, and ungrounded output generation. Enhanced RAG introduces modular routers, query rewriters, and rerankers, explicitly addressing these failure points. Agentic RAG, by contrast, empowers the LLM itself to orchestrate the workflow, selecting among actions such as retrieval, rewriting, and answer generation, and iteratively deciding whether to proceed or terminate (Ferrazzi et al., 12 Jan 2026).
In the domain of Quranic QA, Agentic RAG is explicitly defined by an instruction-following LLM capable of issuing and sequencing explicit, structured tool calls to a verse-indexed corpus. The model plans its evidence-gathering steps, forcing the grounding of every claim in specific ayat fetched via constrained APIs, thus maximizing factual faithfulness and traceability (Bhatia et al., 12 Jan 2026). The resultant control flow enables corrective evidence seeking—a capability unattainable in standard or enhanced RAG pipelines (Singh et al., 15 Jan 2025).
2. Iterative Evidence Seeking, Tool Orchestration, and Control Flow
Agentic Quran-Grounding employs a multi-module architecture:
- Retrieval Module: An indexed Quran corpus (~6,236 atomic ayat), accessed via dense retrieval (e.g., mE5-base) that returns top-k verse candidates.
- Agentic Controller: An instruction-tuned LLM that receives the user's query and, in each evidence-seeking turn, produces a structured "ToolCall" specifying which retrieval or knowledge-access action to take.
- Tool Interface: A JSON-based API surface that restricts actions to a fixed schema (e.g., semanticSearch, readAyah, getSurahInfo, searchWithinSurah). Each tool call carries transparent arguments and outputs, preventing free-form, uncredited retrieval mixing.
- Answer Generator: After evidence aggregation, the LLM synthesizes a cited answer, explicitly referencing the collected ayat.
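As an illustration, the tool surface can be enforced with a thin validation layer. The four tool names come from the framework itself; the JSON field names (`tool`, `args`) and the validator are assumptions for this sketch:

```python
import json

# Hypothetical schema check for a model-emitted ToolCall. The four tool
# names are from the framework; the field layout is an assumption.
ALLOWED_TOOLS = {"semanticSearch", "readAyah", "getSurahInfo", "searchWithinSurah"}

def validate_tool_call(raw: str) -> dict:
    """Parse a JSON ToolCall string and reject anything off-schema."""
    call = json.loads(raw)
    if call.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {call.get('tool')!r}")
    if not isinstance(call.get("args"), dict):
        raise ValueError("args must be a JSON object")
    return call

call = validate_tool_call('{"tool": "readAyah", "args": {"surah": 2, "ayah": 255}}')
```

Rejecting off-schema calls at the interface is what prevents the model from mixing in uncredited, free-form retrieval.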
The iterative loop can be formalized as:
```
procedure AgenticRAG(query):
    state ← initialize with query
    evidence ← []
    for turn in {1, …, T_evidence} do
        plan ← LLM.generatePlan(state)
        if plan.callTool then
            result ← callTool(plan.toolName, plan.args)
            evidence.append(result)
            state.update(evidence)
        else
            break
    finalAnswer ← LLM.generateAnswer(state, evidence)
    return finalAnswer
```
This explicit orchestration eliminates reliance on hardwired routers or rerankers, delivering a complete agentic loop, with each step accountable to the orchestrator's policy $\pi(s_t)\in\{\textsc{Rewrite}, \textsc{Retrieve}, \textsc{Generate}, \textsc{End}\}$ (Ferrazzi et al., 12 Jan 2026).
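The loop above can be sketched as runnable Python, with stub functions standing in for the LLM planner, the tools, and the answer generator; the stub names and the two-result stopping rule are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    call_tool: bool
    tool_name: str = ""
    args: dict = field(default_factory=dict)

def stub_planner(state: dict) -> Plan:
    # Stand-in for LLM.generatePlan: seek evidence until two results
    # are collected, then signal termination.
    if len(state["evidence"]) < 2:
        return Plan(True, "semanticSearch", {"query": state["query"]})
    return Plan(False)

def stub_tool(name: str, args: dict) -> str:
    # Stand-in for a schema-bound tool call against the verse index.
    return f"{name} result for {args['query']!r}"

def agentic_rag(query: str, max_turns: int = 5) -> dict:
    state = {"query": query, "evidence": []}
    for _ in range(max_turns):          # bounded evidence turns
        plan = stub_planner(state)
        if not plan.call_tool:          # policy chose End
            break
        state["evidence"].append(stub_tool(plan.tool_name, plan.args))
    # Final synthesis would cite the collected evidence.
    return {"answer": f"answer grounded in {len(state['evidence'])} pieces of evidence",
            "evidence": state["evidence"]}

result = agentic_rag("What does the Qur'an say about patience?")
```

The key structural point is that termination is a policy decision inside the loop, not a hardwired pipeline stage.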
3. Retrieval and Answer Revision Algorithms
Evidence-backed answer generation in Agentic RAG consists of a multi-stage revision process:
- Single-Shot Retrieval: The system performs an initial dense retrieval, concatenating the resulting contexts and generating a preliminary answer.
- Agentic Evidence Seeking: Multiple rounds of explicit tool calls are planned and executed, aggregating specific verse texts and metadata as evidence.
- Synthesis: A final answer is composed, with all conclusions and citations traceably linked to the retrieved ayat.
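The synthesis stage can be sketched as citation-carrying answer assembly; the function name and the claim/citation pairing format below are assumptions for illustration:

```python
# Hedged sketch: assemble a final answer in which every claim carries an
# explicit surah:ayah citation drawn from the retrieved evidence.
def synthesize(claims_with_evidence: list[tuple[str, tuple[int, int]]]) -> str:
    lines = []
    for claim, (surah, ayah) in claims_with_evidence:
        lines.append(f"{claim} (Qur'an {surah}:{ayah})")
    return "\n".join(lines)

answer = synthesize([
    ("Patience is repeatedly enjoined", (2, 153)),
])
```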
Bhatia et al. (12 Jan 2026) formalize this staged workflow.
Evidence seeking is terminated based on either exhausting a pre-set maximum number of evidence turns or via the policy's termination decision. Each agentic tool call updates the agent's state and working context, ensuring all generated outputs are grounded in explicitly fetched evidence.
4. Training, Data Resources, and Quantitative Impact
The system is trained with a substantial Quran-centric dataset ecosystem:
- Supervised Fine-Tuning (SFT) Reasoning Pairs: 25,000 Arabic instruction–response pairs, each grounded in Quran or Hadith evidence. Next-token prediction is optimized via cross-entropy loss; fine-tuning hyperparameters include a learning rate of 5e-5 and standard batch sizes.
- Reinforcement Learning (RL) Preferences: 5,000 bilingual samples involving gold-standard Q&A with LLM judge scoring. Optimization employs Group Sequence Policy Optimization (GSPO) at a learning rate of 3e-6. Reward models are calibrated for accuracy, clarity, and completeness.
- Verse Corpus: A collection of 6,236 normalized ayat, each with surah/ayah metadata, indexed with mE5-base or Arabic-tuned SBERT embeddings.
- Retrieval Protocol: Top-5 semantic retrieval per user query, two evidence turns, and deterministic LLM temperatures for reproducibility.
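The top-5 semantic retrieval step can be sketched with cosine similarity over a toy index; the real system uses mE5-base embeddings over the 6,236 ayat, which are mocked here with random vectors:

```python
import numpy as np

# Toy verse index: one unit-normalized vector per ayah. In the real
# system these would be mE5-base embeddings, not random vectors.
rng = np.random.default_rng(0)
verse_embeddings = rng.normal(size=(6236, 16))            # (num_ayat, dim)
verse_embeddings /= np.linalg.norm(verse_embeddings, axis=1, keepdims=True)

def top_k(query_vec: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k ayat most similar to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = verse_embeddings @ q                          # cosine similarity
    return np.argsort(scores)[::-1][:k]                    # descending order

hits = top_k(rng.normal(size=16))
```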
Ablation studies demonstrate significant empirical improvements: Agentic RAG increases average accuracy by 5.80–10.05 percentage points versus single-shot RAG, raising Qwen3-4B from 38.85% (+RAG) to 48.90% (+Agentic RAG), and Fanar-2-27B from 51.50% to 57.30%. These results confirm superior correctness, tighter Arabic–English robustness, and lower hallucination rates (Bhatia et al., 12 Jan 2026).
| Model / Variant | Avg. %Correct |
|---|---|
| Qwen3-4B-2507 (base) | 21.85 |
| +SFT | 30.55 |
| +SFT + RL | 30.83 |
| +RAG | 38.85 |
| +Agentic RAG | 48.90 |
| Fanar-2-27B (base) | 48.05 |
| +RAG | 51.50 |
| +Agentic RAG | 57.30 |
Mean accuracy across both Arabic and English settings; single-gold evaluation (Bhatia et al., 12 Jan 2026).
5. Specialized Tooling, System Design Patterns, and Grounding Strategies
Agentic Quran-Grounding relies on a set of design constructs directly tied to agentic RAG theory:
- Structured Tool Calls: All evidence is fetched with explicit, schema-bound invocations. JSON-encoded API signatures ensure transparency and verifiability.
- Reflection and Critique: Post-generation, the LLM may be prompted to self-critique answer fidelity against cited ayat, reducing misinterpretation and unwarranted generalization (Singh et al., 15 Jan 2025).
- Multi-Tier Agentic Workflows: Architectures may scale from single agent orchestrators (router/plan/execute) to hierarchies and multi-agent collectives. The design enables dynamic switching between verse, translation, and tafsir retrieval, multi-hop reasoning (e.g., connecting ayat with shared roots), and complex evidence synthesis (Singh et al., 15 Jan 2025).
- Answer Synthesis: Final outputs must directly cite verses, and, where possible, link underlying evidence, constraining the LLM from free-form, ungrounded generation.
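One way to operationalize the grounding constraint (an assumption for illustration, not the paper's exact mechanism) is a post-hoc check that every citation in the answer maps back to the collected evidence:

```python
import re

# Hypothetical grounding check: the answer must contain at least one
# surah:ayah citation, and every cited pair must appear in the evidence
# set gathered during the agentic loop.
CITATION = re.compile(r"\((\d{1,3}):(\d{1,3})\)")

def is_grounded(answer: str, evidence: set[tuple[int, int]]) -> bool:
    cited = {(int(s), int(a)) for s, a in CITATION.findall(answer)}
    return bool(cited) and cited <= evidence

ok = is_grounded("Patience is enjoined (2:153).", {(2, 153)})
bad = is_grounded("Patience is enjoined (2:153).", {(3, 200)})
```

A check of this shape could feed the reflection step: an ungrounded answer triggers another evidence turn rather than being emitted.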
6. Performance, Cost, and Limitations
Empirical results show Agentic RAG provides substantial correctness and robustness gains, but introduces new cost and latency considerations. The agentic loop—via iterative tool calls and reflective reasoning—incurs 1.5–2.7× the token and latency budget of simple RAG, requiring explicit bounding of evidence turns (typically set to 2–3) (Ferrazzi et al., 12 Jan 2026).
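The reported 1.5–2.7× overhead implies a simple budget bound; the token counts below are illustrative, not measurements from the paper:

```python
# Back-of-envelope cost bound for the agentic loop, using the reported
# 1.5-2.7x overhead range over simple RAG.
def agentic_token_budget(simple_rag_tokens: int, overhead: float) -> int:
    assert 1.5 <= overhead <= 2.7, "outside the reported overhead range"
    return round(simple_rag_tokens * overhead)

lo = agentic_token_budget(4000, 1.5)   # lower bound of the range
hi = agentic_token_budget(4000, 2.7)   # upper bound of the range
```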
Key limitations include:
- Single-Gold Evaluation: The framework is evaluated on atomic gold answers, not capturing multi-madhāhib or disputed fiqh rulings.
- Judge Dependence: Model selection and ranking are partially dependent on LLM-as-judge protocols, which achieve only moderate agreement with human experts (κ = 0.51).
- Quran-Centric Scope: Current iterations only ground in Qur’an ayat; authenticated hadith and legal commentary are not directly integrated.
- Latency: Multi-turn orchestration, reflective critique, and constrained schema introduce non-negligible inference latencies.
- Potential for Tool Misuse: Failure to appropriately sequence or interpret tool outputs may yield degraded answers or failure to terminate.
A plausible implication is that agentic orchestration, while increasing domain traceability and accuracy, must be carefully managed to avoid unnecessary cost escalations; the literature suggests supplementing Agentic RAG with explicit rerankers or human-in-the-loop verification for higher reliability in broad or ambiguous domains (Ferrazzi et al., 12 Jan 2026).
7. Future Directions and Research Outlook
Potential extensions include:
- Multi-Source Grounding: Integration of authenticated hadith and broader fiqh sources, with explicit provenance-aware retrieval.
- Diversity of Rulings: Support for multi-madhāhib (jurisprudential schools) and the handling of interpretive diversity via multi-reference grading.
- Adversarial Robustness: Strengthening the system against tool misuse, retrieval adversarial attacks, and ambiguous inputs.
- Human–Agent Collaboration: Automated flagging of controversial outputs for ulama (scholar) review, and expansion of quality measures beyond LLM judges.
The Agentic Quran-Grounding Framework exemplifies the capabilities and risks of integrating agentic RAG into high-stakes, domain-specific reasoning applications, serving as both a technical model and a case study in verifiable, responsible AI (Bhatia et al., 12 Jan 2026, Ferrazzi et al., 12 Jan 2026, Singh et al., 15 Jan 2025).