IslamicFaithQA Overview
- IslamicFaithQA is a framework of computational methods and benchmarks focused on providing doctrinally faithful, evidence-grounded responses to Islamic queries.
- It leverages retrieval-augmented generation paradigms and multilingual datasets, ensuring precise citations from sources like the Qur’an, Hadith, and fatwā.
- Evaluation employs robust metrics such as MAP@10, MRR, and faith-specific standards to ensure safe abstention and avoid speculative claims.
IslamicFaithQA, an editorially coined shorthand for “Islamic Faith Question Answering”, designates the family of computational frameworks, benchmarks, and system architectures dedicated to robust, faithful question answering about Islam, Islamic law, and foundational texts using NLP and LLMs. These systems face distinctive requirements: doctrinal precision, mandatory citation of authoritative sources (Qur’an, Hadith, fatwā), explicit handling of missing evidence, and abstention from ungrounded or speculative claims. Research in IslamicFaithQA spans closed-domain chatbots, retrieval-augmented LLMs, agentic and iterative retrieval/generation pipelines, comprehensive evaluations reflecting faith-critical criteria, and multilingual/cross-lingual adaptation.
1. Core Datasets and Benchmarks
A central resource is the ISLAMICFAITHQA benchmark, a generative, bilingual (Arabic/English) evaluation set comprising 3,810 question–answer pairs. Each question is paired with a single atomic, factually grounded gold answer and is strictly annotated for correctness, hallucination, and abstention. Annotation protocols require concise, single-fact responses; the reported inter-annotator agreement is 82.96% with a Cohen’s κ of 0.62, and grading relies on multiple expert annotators and LLM-judge validation (Bhatia et al., 12 Jan 2026). This benchmark exposes aspects often missed by standard MCQ/MRC-style datasets: models are directly penalized for unsupported claims and rewarded for correct abstention (“Not_Attempted”) when evidence is lacking.
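To make the agreement statistics and the three-way grading concrete, the following minimal sketch computes Cohen’s κ for two annotators and the share of each grade. The label names follow the text; the function names and the toy annotations are illustrative assumptions and do not come from the benchmark itself.

```python
from collections import Counter

# Grade labels as named in the text; everything else here is hypothetical.
LABELS = ("Correct", "Incorrect", "Not_Attempted")

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

def grade_summary(grades):
    """Fraction of responses falling under each grade label."""
    counts = Counter(grades)
    return {label: counts[label] / len(grades) for label in LABELS}

# Toy annotations, invented for illustration only.
ANNOTATOR_A = ["Correct", "Correct", "Incorrect", "Not_Attempted", "Correct", "Incorrect"]
ANNOTATOR_B = ["Correct", "Incorrect", "Incorrect", "Not_Attempted", "Correct", "Correct"]

print(round(cohens_kappa(ANNOTATOR_A, ANNOTATOR_B), 4))  # 0.4545
print(grade_summary(ANNOTATOR_A))
```

In this toy case raw agreement is 4/6 while κ is only about 0.45, which shows why the benchmark reports both figures: κ discounts the agreement expected by chance.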
Below is a concise summary of major datasets mentioned:
| Name | Modality | Size | Evidence Requirements | Key Metrics |
|---|---|---|---|---|
| ISLAMICFAITHQA | GenQA, AR/EN | 3,810 | Atomic gold, citation | Correct/Incorrect/Abstain |
| IslamicPCQA | Persian, PCQA | N/A | Documented, multi-hop | NegRej, Correctness |
| QRCD, ARCD | Extractive | O(1K+) | Spans in Quranic text | F1, pAP, EM, MRR |
| Rezwan (Hadith) | Factoid, AR | 1.2M | Full Hadith, chain | Human/Human+LLM rating |
Data curation frequently includes parallel expert annotation, rigorous verification, and explicit modeling of unanswerable (“zero-answer”) cases (Bhatia et al., 12 Jan 2026, Basem et al., 2024, Oshallah et al., 29 Jan 2025). Composite benchmarks for inheritance (QIAS 2025 SubTask 1), general knowledge (SubTask 2), and Persian IslamicQA (IslamicPCQA) enable domain-specific, high-fidelity evaluation (Bekhouche et al., 30 Aug 2025, Ahmad et al., 28 Sep 2025, asl et al., 29 Oct 2025).
2. System Architectures: Retrieval and Generation Paradigms
IslamicFaithQA systems predominantly follow Retrieval-Augmented Generation (RAG) paradigms, often extended by agentic control and iterative refinement. Standard RAG employs multi-stage passage selection, typically involving:
- Stage 1: Sparse retrieval (BM25), with full Arabic pre-processing (dediacritization, tokenization), yielding hundreds to thousands of initial candidates (Ahmad et al., 28 Sep 2025).
- Stage 2: Dense neural retrieval using language-specific or multilingual embeddings (e.g., Arabic-Triplet-Matryoshka-V2, mE5-base), ranking candidates by cosine similarity (Ahmad et al., 28 Sep 2025, Bhatia et al., 12 Jan 2026).
- Stage 3: Cross-encoder reranking (e.g., miniLMv2, BERT, or SOTA re-rankers), attending jointly to query and passage to assign fine-grained relevance scores (Ahmad et al., 28 Sep 2025, Basem et al., 9 Aug 2025).
- Stage 4: Prompt construction for the LLM, which injects the retrieved passages under a “RAG CONTEXT” header and constrains the format and content of the LLM's answer.
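The four stages above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not any cited system's implementation: the BM25 scorer is a small self-written Okapi variant, the "dense" and "cross-encoder" scorers are toy stand-ins (character-trigram cosine and token overlap) for the neural models named in the text, and the passages, the cut-offs, and the prompt wording other than the “RAG CONTEXT” header are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Stage 1: Okapi BM25 (Lucene-style IDF) score per document."""
    toks = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in toks) / n
    df = Counter(term for t in toks for term in set(t))
    scores = []
    for t in toks:
        tf = Counter(t)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(t) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

def trigram_vector(text):
    """Toy stand-in for a dense embedding: character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(u, v):
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def token_overlap(query, passage):
    """Toy stand-in for a cross-encoder: Jaccard overlap of token sets."""
    q, p = set(tokenize(query)), set(tokenize(passage))
    return len(q & p) / len(q | p) if q | p else 0.0

def top_k(candidates, score_of, k):
    return sorted(candidates, key=score_of, reverse=True)[:k]

def retrieve(query, docs, k_sparse=3, k_dense=2):
    """Run stages 1-3 and return document indices, best first."""
    sparse = bm25_scores(query, docs)
    stage1 = top_k(range(len(docs)), lambda i: sparse[i], k_sparse)
    q_vec = trigram_vector(query)
    stage2 = top_k(stage1, lambda i: cosine(q_vec, trigram_vector(docs[i])), k_dense)
    return top_k(stage2, lambda i: token_overlap(query, docs[i]), len(stage2))

def build_prompt(query, passages):
    """Stage 4: numbered passages under the RAG CONTEXT header."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (f"RAG CONTEXT\n{context}\n\nQuestion: {query}\n"
            "Answer only from the context above and cite passages as [n]. "
            "If the context is insufficient, reply Not_Attempted.")

# Placeholder passages, invented for illustration; not real source texts.
DOCS = [
    "note on fasting during travel",
    "note on zakat and savings",
    "note on prayer times",
    "note on fasting and illness",
]
QUERY = "fasting during travel"
RANKED = retrieve(QUERY, DOCS)
print(build_prompt(QUERY, [DOCS[i] for i in RANKED]))
```

In a real deployment each stand-in would be replaced by the corresponding component (a tuned BM25 index, an embedding model, a cross-encoder); the funnel shape, where each stage narrows the previous stage's candidates, is the part this sketch is meant to show.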
Agentic RAG (Bhatia et al., 12 Jan 2026) extends this process via an explicit interaction loop: an agentic controller issues structured tool calls (search, read, retrieve, re-query), verifies sufficiency of evidence, and iterates retrieval/generation until confident. This iterative loop allows multi-hop reasoning, error correction, and principled abstention when sources are missing or ambiguous. Modularity supports dynamic tool integration—retrievers, readers/generators, and cross-lingual components (asl et al., 29 Oct 2025, Bhatia et al., 12 Jan 2026).
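The control flow of such a loop can be sketched as follows. This is an assumption-laden outline rather than the cited systems' code: the tool interfaces (`search`, `is_sufficient`, `generate`, `reformulate`), the step budget, and the toy tools in the demo are all hypothetical, and only the “Not_Attempted” label is taken from the text.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    evidence: list = field(default_factory=list)
    queries_tried: list = field(default_factory=list)

def agentic_answer(question, search, is_sufficient, generate, reformulate, max_steps=3):
    """Iterate retrieve -> check sufficiency -> re-query; abstain if the budget runs out."""
    state = AgentState(question)
    query = question
    for _ in range(max_steps):
        state.queries_tried.append(query)
        for passage in search(query):
            if passage not in state.evidence:
                state.evidence.append(passage)
        if is_sufficient(question, state.evidence):
            return generate(question, state.evidence), state
        query = reformulate(question, state.evidence, state.queries_tried)
    # Evidence never became sufficient: abstain instead of guessing.
    return "Not_Attempted", state

# --- Toy tools, invented purely to exercise the loop ---
CORPUS = {
    "fasting": "placeholder passage B (fasting)",
    "travel": "placeholder passage A (travel)",
}

def toy_search(query):
    hit = CORPUS.get(query.lower())
    return [hit] if hit else []

def toy_sufficient(question, evidence):
    return len(evidence) >= 2  # hypothetical sufficiency rule

def toy_generate(question, evidence):
    cites = " ".join(f"[{i}]" for i in range(1, len(evidence) + 1))
    return f"answer grounded in {cites}"

def toy_reformulate(question, evidence, tried):
    # Fall back to single question words that have not been queried yet.
    for word in question.lower().split():
        if word not in tried:
            return word
    return question

answer, state = agentic_answer("fasting travel", toy_search, toy_sufficient,
                               toy_generate, toy_reformulate)
print(answer, state.queries_tried)
```

The key design choice illustrated here is that abstention is the default exit: an answer is generated only after the sufficiency check passes, so a question with no supporting evidence ends in “Not_Attempted”.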
Specialized encoders—AraBERT, MARBERT, QARiB for Arabic, SBERT for Persian, mE5 for English/Arabic—are fine-tuned for dense retrieval, classification, or span extraction depending on corpus and question type (Bekhouche et al., 30 Aug 2025, asl et al., 29 Oct 2025, Basem et al., 2024).
3. Evaluation Protocols and Metrics
Evaluation in IslamicFaithQA employs both standard IR/MRC metrics and custom faith-oriented measurements:
- Mean Average Precision at 10 (MAP@10) and Mean Reciprocal Rank (MRR@10): Assess retrieval ranking of relevant verses/hadith (Basem et al., 2024, Basem et al., 9 Aug 2025, Oshallah et al., 29 Jan 2025).
- Partial Average Precision (pAP@10): For extractive QA, allows partial credit for overlaps between predicted and gold spans (Basem et al., 8 Aug 2025, Basem et al., 9 Aug 2025).
- Faithfulness and Negative Rejection: Responses are tallied as %Correct, %Incorrect (hallucinated), and %Abstain; Negative Rejection Accuracy quantifies safe refusal to answer when evidence is lacking. FARSIQA reports 97.0%, a 40-point improvement over naive RAG (asl et al., 29 Oct 2025).
- LLM-as-Judge and Human Agreement: Scoring correctness, citation fidelity, and faith consistency, either via multi-agent LLM adjudication or human experts (Cohen’s κ ≈ 0.62–0.82) (Asgari-Bidhendi et al., 4 Oct 2025, Mushtaq et al., 28 Oct 2025, Bhatia et al., 12 Jan 2026).
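The ranking and abstention metrics listed above can be written down compactly. The sketch below uses common textbook definitions and makes two assumptions that individual benchmarks may handle differently: average precision is normalized by min(|relevant|, k), and a query with no relevant items scores 0. pAP@10 and judge-based scores are not covered, and all function names are illustrative.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """AP@k: mean of precision values at the ranks where relevant items appear."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumption; zero-answer handling varies by benchmark
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)

def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant item in the top k, else 0."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked, relevant) pairs: MAP@k or MRR@k."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

def negative_rejection_accuracy(predictions, answerable):
    """Share of unanswerable questions on which the system abstained."""
    unanswerable = [p for p, ok in zip(predictions, answerable) if not ok]
    if not unanswerable:
        return 0.0
    return sum(p == "Not_Attempted" for p in unanswerable) / len(unanswerable)

# Toy runs with made-up document ids.
RUNS = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2, RR = 1
    (["x", "c"], {"c"}),                 # AP = 1/2,             RR = 1/2
]
print(mean_over_queries(average_precision_at_k, RUNS))  # MAP@10
print(mean_over_queries(reciprocal_rank_at_k, RUNS))    # MRR@10 -> 0.75
print(negative_rejection_accuracy(
    ["Not_Attempted", "some answer", "Not_Attempted"],
    [False, False, True],
))  # 0.5
```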
Rigorous evaluation frameworks, such as dual-agent pipelines (quantitative, qualitative) for LLM-generated content, assess doctrinal fidelity and citation integrity and report multi-dimensional scores (structure, clarity, depth, originality, Islamic accuracy, citation accuracy) (Mushtaq et al., 28 Oct 2025).
4. Specialized Challenges and Domain Sensitivity
IslamicFaithQA confronts domain-specific obstacles:
- Linguistic complexity: Bridging Modern Standard Arabic (MSA), Classical Arabic, and vernaculars; managing orthographic (diacritic) ambiguity and cross-lingual mapping for translated corpora (Oshallah et al., 29 Jan 2025, Alnajjar et al., 2022, Basem et al., 2024).
- Evidence granularity: Extracting atomic, precise answers (single ayah, explicit span), multi-hop reasoning, and handling multi-answer/zero-answer queries (Bhatia et al., 12 Jan 2026, Basem et al., 2024, Basem et al., 9 Aug 2025).
- Inheritance reasoning: Ilm al-Mawārith systems require numerically precise, multi-step calculations—often beyond what basic retrieval or vanilla LLMs can achieve. Encoder-based methods with Attentive Relevance Scoring (ARS) offer efficient retrieval, but hybrid or symbolic methods may be necessary for advanced reasoning (Bekhouche et al., 30 Aug 2025, Ahmad et al., 28 Sep 2025).
- Faithful abstention: Robust handling of evidence absence (“Not_Attempted”/negative rejection) is critical to avoid hallucinated religious guidance (Bhatia et al., 12 Jan 2026, asl et al., 29 Oct 2025).
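The orthographic (diacritic) ambiguity noted in the list above is usually reduced by dediacritizing text before indexing and querying. The sketch below is one simple way to do this, assuming that stripping the Arabic tashkeel marks, the superscript alef, and the tatweel is acceptable for retrieval; the optional alef normalization is a further lossy assumption, and the function name is illustrative.

```python
import re

# Tashkeel (U+064B to U+0652), superscript alef (U+0670), tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, and hamza below, mapped to bare alef (U+0627).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def dediacritize(text, normalize_alef=False):
    """Strip Arabic diacritics; optionally fold alef variants to bare alef."""
    text = _DIACRITICS.sub("", text)
    if normalize_alef:
        text = _ALEF_VARIANTS.sub("\u0627", text)
    return text

# Ba + kasra, sin + sukun, mim + kasra becomes the bare letters ba, sin, mim.
vocalized = "\u0628\u0650\u0633\u0652\u0645\u0650"
print(dediacritize(vocalized) == "\u0628\u0633\u0645")  # True
```

Applying the same function to both the corpus and the query keeps the two sides comparable; the trade-off is that words distinguished only by their vowel marks collapse to the same string.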
Agentic and iterative approaches (FAIR-RAG, Agentic RAG) offer state-of-the-art performance in faithfulness, with explicit sufficiency checks (Structured Evidence Assessment, SEA) and evidence checklist fulfillment before answer generation (asl et al., 29 Oct 2025, Bhatia et al., 12 Jan 2026).
5. Integration of Source Diversity and Multilingualism
Comprehensive IslamicFaithQA systems index heterogeneous, authoritative sources—Qur’an, Hadith (e.g., Rezwan corpus, 1.2M narrations, chain–matn separated and richly annotated (Asgari-Bidhendi et al., 4 Oct 2025)), fatwā, tafsīr, and modern scholarly writing. Knowledge bases may exceed 1M documents, semantically chunked and indexed via a hybrid sparse/dense fusion (BM25 + neural embeddings + reciprocal rank fusion), with adaptive domain fine-tuning to address specialized theological vocabulary (asl et al., 29 Oct 2025).
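Reciprocal rank fusion, the step that merges the sparse and dense result lists, is short enough to show in full. This is a generic sketch of the standard formula, score(d) = Σ 1/(k + rank of d in each list), with the commonly used constant k = 60; the document ids and the alphabetical tie-break are illustrative assumptions, not details of the cited system.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings into one list of ids, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from a BM25 index and a dense index.
BM25_HITS = ["d1", "d2", "d3"]
DENSE_HITS = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([BM25_HITS, DENSE_HITS]))  # ['d1', 'd3', 'd2', 'd4']
```

Because only ranks are used, the method needs no calibration between BM25 scores and cosine similarities, which is one reason it is a common choice for hybrid retrieval.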
Cross-language strategies that rely on translation and paraphrasing pipelines, as in the Cross-Language Quranic QA approach (Pickthall English translation, paraphrased corpus), are reported to substantially improve retrieval when the training and test data differ in language (Oshallah et al., 29 Jan 2025). Pretrained multilingual and Arabic models (mBERT, XLM-R, AraBERT) are further domain-adapted with MLM+NSP on religious corpora (Alnajjar et al., 2022). Coverage and evaluation across Arabic, Persian, English, and additional languages are expanding, as with Rezwan’s Hadith translations (12 languages) and proposals for further South Asian and African language coverage (Asgari-Bidhendi et al., 4 Oct 2025).
6. Design Principles for Faithful, Reliable Deployment
Leading work identifies several best practices and design principles:
- Enforce evidence grounding: All generated claims must cite explicit sources using inline markers ([1], [2]), and no fact may be introduced that is not directly supported.
- Iterative sufficiency checks: Evidence checklists and multi-turn refinement loops ensure no missing or spurious answers (asl et al., 29 Oct 2025).
- Cultural/sectarian awareness: Systems embed major madhhab schemas and prompt for scholarly viewpoint diversity where ambiguity exists (Mushtaq et al., 28 Oct 2025).
- Automated and human-in-the-loop verification: Employ tool-driven citation verification and human review triggers for insufficient or questionable references (Mushtaq et al., 28 Oct 2025, Ahmad et al., 28 Sep 2025).
- Scalable, LLM-as-Judge evaluation: Allows multi-dimensional, faith-oriented, community-reflective rating (Mushtaq et al., 28 Oct 2025, Bhatia et al., 12 Jan 2026).
- Safe handling of legal queries: Proactive disclaimers and error-handling in fatwā/fiqh queries prevent AI-generated “fiat” rulings (asl et al., 29 Oct 2025).
- Efficient, privacy-friendly architectures: Encoder-based solutions enable on-device IslamicFaithQA deployment in sensitive contexts, though at some accuracy tradeoff for heavily compositional reasoning (Bekhouche et al., 30 Aug 2025).
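The evidence-grounding principle in the list above lends itself to a cheap automated pre-check before any human review. The sketch below only verifies the form of citations: that every sentence carries at least one inline [n] marker and that each marker points at a retrieved source. It does not check whether the cited source actually supports the claim, and the function name, the sentence splitter, and the report format are illustrative assumptions.

```python
import re

_MARKER = re.compile(r"\[(\d+)\]")
# Naive splitter: assumes markers sit before the closing punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def check_citations(answer, num_sources):
    """Flag sentences without [n] markers and markers outside 1..num_sources."""
    sentences = [s.strip() for s in _SENTENCE_END.split(answer) if s.strip()]
    uncited, invalid = [], set()
    for sentence in sentences:
        ids = [int(m) for m in _MARKER.findall(sentence)]
        if not ids:
            uncited.append(sentence)
        invalid.update(i for i in ids if not 1 <= i <= num_sources)
    return {
        "uncited_sentences": uncited,
        "invalid_markers": sorted(invalid),
        "ok": not uncited and not invalid,
    }

report = check_citations("Claim one [1]. Claim two [3]. Claim three.", num_sources=2)
print(report)
# {'uncited_sentences': ['Claim three.'], 'invalid_markers': [3], 'ok': False}
```

A failed check could serve as one trigger for the human review mentioned above, or for another retrieval round in an iterative pipeline.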
Emergent agentic, adaptive frameworks (Agentic RAG, FAIR-RAG) demonstrate that iterative interaction, sub-querying, and explicit abstention mechanisms are crucial to move from generic retrieval/generation toward truly faithful, reliable IslamicFaithQA (asl et al., 29 Oct 2025, Bhatia et al., 12 Jan 2026).
References:
(Alnajjar et al., 2022, Basem et al., 2024, Oshallah et al., 29 Jan 2025, Basem et al., 8 Aug 2025, Basem et al., 9 Aug 2025, Bekhouche et al., 30 Aug 2025, Ahmad et al., 28 Sep 2025, Asgari-Bidhendi et al., 4 Oct 2025, Mushtaq et al., 28 Oct 2025, asl et al., 29 Oct 2025, Uriawan et al., 18 Dec 2025, Bhatia et al., 12 Jan 2026)