Sharia-Compliant Chatbot
- Sharia-compliant chatbots are AI systems designed to provide Islamic consultation by integrating authenticated religious texts and ensuring doctrinal fidelity.
- They employ CRISP-DM methodology with advanced semantic retrieval, reinforcement learning, and prompt engineering to deliver rapid and accurate responses.
- Evaluation protocols combine quantitative metrics and human reviews to achieve high semantic accuracy and maintain jurisprudential and cultural consistency.
A Sharia-compliant chatbot is an AI consultation medium engineered to answer questions about Islam while rigorously adhering to doctrinal, jurisprudential, and cultural constraints rooted in classical sources. Distinguished from general-purpose conversational agents, these systems integrate curated Islamic corpora, enforce citation integrity, maintain sensitivity to fiqh, aqidah, ibadah, and muamalah domains, and implement technical guardrails to avoid hallucination and doctrinal inconsistency. A Sharia-compliant chatbot operationalizes advances in NLP (semantic retrieval, LLMs), reinforcement learning, and rigorous prompt engineering, serving as a digital bridge between traditional scholarship and contemporary AI-driven knowledge dissemination (Uriawan et al., 18 Dec 2025, Alan et al., 2024, Mushtaq et al., 28 Oct 2025).
1. Methodological Framework: CRISP-DM and System Pipeline
The design and deployment of a Sharia-compliant chatbot are structured around the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology:
- Business Understanding: Stakeholders require rapid, authoritative answers to complex Islamic queries with defined success metrics (≥ 85% semantic accuracy, < 3s latency, ≥ 80% user satisfaction).
- Data Understanding: Initial corpus audits involve up to 32,000 QA pairs covering fiqh, ibadah (~47%), muamalah (~23%), aqidah (~15%), akhlak (~10%), and tafsir/history (~5%), sourced from IslamQA.info, AboutIslam.net, and authenticated fatwa archives (Uriawan et al., 18 Dec 2025).
- Data Preparation: Filtering yields 25,000 high-quality QA pairs (Qur’an, sahih Hadith, classical fatwas). Processing steps include HTML stripping, normalization, tokenization (NLTK/Sastrawi), lemmatization, stopword removal, and semantic deduplication (cosine similarity threshold >0.9). Long answers are summarized (TF-IDF extraction) to 2–3 sentences. Split is stratified (80% train, 20% test).
- Modeling: Hybrid semantic retrieval (Sentence-Transformers vector embeddings) combined with reinforcement learning (Q-Learning over a discrete state space of nearest-neighbor matches and candidate-answer actions, with rewards derived from cosine similarity).
- Evaluation: Both automatic (semantic accuracy: 87%; latency: <3s) and UAT (clarity, trust, ease-of-use) support robust system-level validation (Uriawan et al., 18 Dec 2025).
- Deployment: Flask REST API backend serves a mobile Flutter frontend, with continuous logging and incremental updates.
The operational pipeline includes a crawler for periodic dataset augmentation, advanced preprocessing modules for English/Indonesian text, vector-based semantic embedding, Q-learning agent, RESTful API endpoints, and stateful cross-platform frontends.
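The preprocessing and semantic-deduplication steps in the pipeline above can be sketched with a stdlib-only bag-of-words approximation (the actual pipeline uses NLTK/Sastrawi tokenization and transformer embeddings; this is a minimal illustration of the same flow):

```python
import math
import re
from collections import Counter

def preprocess(text: str) -> list[str]:
    """Strip HTML tags, lowercase, and tokenize (crude stand-in for NLTK/Sastrawi)."""
    text = re.sub(r"<[^>]+>", " ", text)   # HTML stripping
    return re.findall(r"[a-z']+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(questions: list[str], threshold: float = 0.9) -> list[str]:
    """Drop any question whose similarity to an already-kept one exceeds the threshold."""
    kept, vecs = [], []
    for q in questions:
        v = Counter(preprocess(q))
        if all(cosine(v, kv) <= threshold for kv in vecs):
            kept.append(q)
            vecs.append(v)
    return kept
```

In the deployed system the same >0.9 threshold is applied to transformer embeddings rather than raw token counts.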
2. Corpus Construction and Data Integrity
Effective Sharia compliance demands provenance from authoritative sources:
- Corpus Sourcing: QA pairs are explicitly derived from Qur’anic verses, authenticated Hadith (Bukhari, Muslim), and scholarly fatwa websites. Manual vetting of random samples (~300) ensures religious authenticity and doctrinal fidelity (Uriawan et al., 18 Dec 2025).
- Flexible Data Schema: Each QA unit is a JSON object annotated with: unique ID, question and answer text, category (fiqh, ibadah, etc.), source references with type and location (Qur’an, Hadith), language, and optional embedding placeholder. Extensibility supports fields for region, difficulty, or madhhab.
- Text Processing: Steps include HTML tag stripping, lowercasing, punctuation normalization, tokenization, lemmatization/stemming, stopword removal, and semantic deduplication (cosine similarity calculations). Stratified sampling ensures domain-representative train/test splits.
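The flexible data schema described above might look like the following illustrative JSON record (field names and values are assumptions based on the description, not the paper's exact schema):

```json
{
  "id": "qa-00123",
  "question": "Is zakat obligatory on savings held for one lunar year?",
  "answer": "Yes, zakat is due on savings that reach the nisab and complete one lunar year.",
  "category": "fiqh",
  "sources": [
    {"type": "quran", "location": "9:103"},
    {"type": "hadith", "location": "Bukhari 1395"}
  ],
  "language": "en",
  "embedding": null
}
```

Optional extension fields such as region, difficulty, or madhhab would be added alongside these keys.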
In retrieval-augmented systems (e.g., MufassirQAS), Turkish-language open-access books (“Kuran Yolu Türkçe Meal ve Tefsir,” “Kütüb-i Sitte,” “İslam İlmihali”) are chunked (2000 tokens, overlapping by 100) and indexed for vector retrieval and citation mapping (Alan et al., 2024).
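The chunking scheme used for vector indexing (2000-token windows overlapping by 100) can be sketched as:

```python
def chunk_tokens(tokens: list, size: int = 2000, overlap: int = 100) -> list[list]:
    """Split a token sequence into overlapping windows for vector retrieval.

    Consecutive chunks share `overlap` tokens so that passages spanning a
    chunk boundary remain retrievable from at least one chunk.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

Each chunk is then embedded and indexed together with its source-book and page metadata for citation mapping.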
3. Semantic Retrieval, RL, and Prompt Engineering
Semantic Embedding and Retrieval
Pretrained sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) convert cleaned queries into 384-dimensional vectors. Similarity search among corpus pairs uses the cosine similarity metric:

sim(q, d) = (q · d) / (‖q‖ ‖d‖)

No domain-specific fine-tuning is performed unless explicitly designed; in narrow domains, held-out pairs may be used for light adaptation (2 epochs at a low learning rate) (Uriawan et al., 18 Dec 2025).
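The top-k similarity search can be illustrated with a stdlib-only sketch (the production system uses 384-dimensional sentence-transformer embeddings; the short toy vectors here are placeholders):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec: list[float], corpus_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k corpus vectors most similar to the query."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The returned indices identify the nearest QA pairs, which then form the discrete state for the Q-learning agent described next.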
Reinforcement Learning Integration
Q-Learning enables adaptive answer selection in discrete state-action space:
- States: IDs of top-k semantically nearest QA pairs.
- Actions: Candidate answers from the corpus.
- Update Rule: Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
Hyperparameters: a learning rate α, a discount factor γ, and ε-greedy exploration decaying over 10,000 interactions.
Reward scheme:
- a high reward for strong semantic similarity (cosine ≥ 0.8),
- a smaller reward for moderate similarity,
- a penalty for low similarity,
- a bonus for positive user feedback (Uriawan et al., 18 Dec 2025).
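The tabular update rule and ε-greedy policy above can be sketched as follows (the hyperparameter values are illustrative placeholders, not the cited system's settings):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # illustrative values only
Q = defaultdict(float)                   # Q[(state, action)] -> value, default 0.0

def select_action(state, actions):
    """ε-greedy choice over candidate answer IDs."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, next_actions):
    """Standard tabular Q-learning update:
    Q(s,a) += α * (r + γ * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

In the deployed system, states are the IDs of the top-k retrieved QA pairs and the reward is derived from cosine similarity plus user feedback.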
RAG and Prompt Guardrails
Retrieval-Augmented Generation (RAG) pipelines (MufassirQAS) couple similarity-ranked chunk retrieval with LLM-based generation, enclosing system input in a custom prompt:
- Instruct the chatbot to exclusively reference provided Qur’an or Hadith excerpts
- Require explicit citations (book, page, line)
- Mandate refusal if authoritative reference is absent or similarity below threshold
- Neutralize offensive/sectarian language
Pseudocode logic:

```python
if user_query.matches_any(forbidden_topics):
    return "I’m sorry, I cannot help with that."
if top_chunk.similarity < θ:
    return "I do not have enough information in my sources to answer that."
```
4. Evaluation Protocols and Faithfulness Metrics
Quantitative and Qualitative Evaluation
Semantic accuracy is benchmarked by functional testing against 100 novel queries; 87% returned answers whose cosine similarity to the ground-truth answer met the accuracy threshold. Error rates: fiqh 5%, aqidah 3%, ibadah 2%, muamalah 3% (Uriawan et al., 18 Dec 2025).
Agent-based frameworks (Mushtaq et al., 28 Oct 2025) deploy dual evaluation protocols:
- Quantitative Agent: Citation verification (confirmed/partially confirmed/unverified/refuted), six-dimensional rubric (structure, clarity, originality, Islamic accuracy, citation, cultural consistency), with per-dimension scores aggregated into an overall mean.
- Qualitative Agent: Pairwise side-by-side scoring (tone, depth, originality, jurisprudential soundness, cultural sensitivity), win counting.
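Assuming the quantitative aggregation is an unweighted mean over the six rubric dimensions (the exact weighting is not reproduced here), it reduces to:

```python
def aggregate(rubric: dict[str, float]) -> float:
    """Unweighted mean across rubric dimensions (assumed aggregation scheme)."""
    return sum(rubric.values()) / len(rubric)
```

A weighted variant would simply multiply each dimension by a scholar-assigned weight before summing.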
Model-wise Scores (Selected Table)
| Model | Islamic Accuracy | Citation Quality | Overall Mean |
|---|---|---|---|
| GPT-4o | 3.93 | 3.38 | 3.90 |
| Ansari AI | 3.68 | 3.32 | 3.79 |
| Fanar | 2.76 | 1.82 | 3.04 |
Best verdicts are dominated by Ansari AI and GPT-4o in jurisprudential consistency and tone; Fanar is more regionally adapted but produces lower-quality citations (Mushtaq et al., 28 Oct 2025).
Faithfulness and Guardrails
MufassirQAS system prompts require neutral citation-backed answers and explicit refusal where references are lacking:
"If you cannot back up with a verse or hadith from the retrieved texts, politely decline rather than guess." (Alan et al., 2024)
Scholar evaluation yields faithfulness 9.2/10 (MufassirQAS) versus 7.1/10 (ChatGPT-3.5), precision 0.94 vs. 0.78, citation rates 100% vs. 20% (Alan et al., 2024).
Failure modes in generic LLMs include hallucinated verse references, misapplied jurisprudence, and culturally inconsistent tone.
5. Implementation, Deployment, and UX Design
Backend and API
The chatbot core is encapsulated in a Flask REST API:
- Key endpoints: `/ask`, `/feedback`, `/session/new`
- Middleware caches the Q-table and sentence-transformer model; an embedding cache optimizes top-k search
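A minimal Flask sketch of the `/ask` and `/feedback` endpoints (the in-memory answer store and response shapes are illustrative assumptions, not the paper's implementation):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory stand-in for the retrieval backend and Q-table.
ANSWERS = {"what is zakat?": "Zakat is the obligatory alms on qualifying wealth."}

@app.route("/ask", methods=["POST"])
def ask():
    """Return the best-matching answer for a user query, or refuse."""
    query = request.get_json().get("question", "").strip().lower()
    answer = ANSWERS.get(query)
    if answer is None:
        return jsonify({"answer": None,
                        "message": "No sufficiently similar QA pair found."}), 404
    return jsonify({"answer": answer})

@app.route("/feedback", methods=["POST"])
def feedback():
    """Record user feedback for later reward shaping of the Q-table."""
    payload = request.get_json()
    return jsonify({"status": "recorded", "helpful": bool(payload.get("helpful"))})
```

In production the lookup would be the cached embedding search plus Q-learning selection rather than a dictionary hit.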
Mobile Frontend
The Flutter UI combines:
- ChatListView with alternating chat bubbles
- State management via Provider/Bloc
- Asynchronous HTTP calls to the `/ask` endpoint
- Theming with olive green/ivory, crescent iconography, and an Arabic-inspired sans-serif font
Error handling includes network retries and user feedback via toast notifications (Uriawan et al., 18 Dec 2025).
6. Limitations and Enhancement Pathways
Observed Constraints
- Static Q-table: No online adaptation post-deployment, leading to drift.
- Dataset dependency: Insufficient coverage of emergent fiqh or modern issues.
- Single-turn: Lacks dialogue state, restricting context-aware multi-turn interaction.
- Tabular scalability: Q-Learning in discrete state/action space does not scale to very large or continuous corpora (Uriawan et al., 18 Dec 2025, Alan et al., 2024).
Proposed Solutions
- Continuous Learning: RLHF to update Q-values in production, DQN integration for continuous embedding space.
- Multi-Turn Dialogue: RNN/transformer-based state tracking, expanded JSON schema for context linkage.
- Corpus Expansion: Automated crawler/human vetting for QA pair growth, multilingual SBERT fine-tuning.
- Infrastructure Scaling: Dockerized API deployment in Kubernetes, FAISS for fast vector retrieval.
- Accessibility: Speech-to-text, OCR for ingesting fatwa PDFs.
Enhanced juristic coverage via Arabic tafsir, chain-of-jurisprudence modules, and madhhab-specific reasoning is recommended for doctrinal breadth. Continuous scholar oversight, versioned audit logs, and open-source benchmark sets are indispensable for sustained Sharia compliance (Uriawan et al., 18 Dec 2025, Mushtaq et al., 28 Oct 2025).
7. Recommendations for Faithful Sharia Compliance
To maximize doctrinal fidelity and community acceptance:
- Rigorous Citation: Enforce in-text citation schema (e.g., Qur’an 2:256, hadith isnad). Integrate automatic verification pipelines for reference correctness.
- Jurisprudential Consistency: Tag every ruling by legal school and indicate intramadhhab divergences.
- Tone Control: Style guides should mandate respectful invocation, prohibition of slang, and culturally consonant phrasing.
- Human-in-the-loop Oversight: Regular external scholar review and versioned logs for correction and transparency.
- Community Benchmarks: Deploy prompt sets and evaluation panels encompassing madhhab, register, and regional variation.
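An automatic verification pipeline for the in-text citation schema could begin with a structural check like the following sketch (the regex pattern and range checks are illustrative assumptions; a real verifier would look each pair up in an authenticated index):

```python
import re

# Hypothetical pattern for in-text citations like "Qur'an 2:256".
QURAN_CITATION = re.compile(r"Qur[’']an\s+(\d{1,3}):(\d{1,3})")

def find_quran_citations(text: str) -> list[tuple[int, int]]:
    """Extract (surah, ayah) pairs that pass a basic structural sanity check.

    Only surah numbers 1-114 are accepted; verifying that the ayah exists
    within the surah requires an authenticated verse index.
    """
    pairs = []
    for m in QURAN_CITATION.finditer(text):
        surah, ayah = int(m.group(1)), int(m.group(2))
        if 1 <= surah <= 114 and ayah >= 1:
            pairs.append((surah, ayah))
    return pairs
```

Hadith citations would need a parallel check against collection and number (e.g., Bukhari/Muslim indices) plus isnad metadata.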
Adhering to these recommendations, Sharia-compliant chatbots will optimize faithfulness, transparency, and accessibility in digital Islamic consultation (Uriawan et al., 18 Dec 2025, Alan et al., 2024, Mushtaq et al., 28 Oct 2025).