
Tajweed Detection in Quran Recitations

Updated 1 February 2026
  • Tajweed Detection is the automated recognition and assessment of Quranic recitation rules, ensuring phonological, prosodic, and articulatory compliance.
  • The technique leverages high-quality datasets and feature extraction methods such as MFCCs and mel-spectrograms to achieve fine-grained error diagnosis.
  • State-of-the-art models, including deep neural networks and transformer-based systems, deliver high accuracy and real-time feedback for pronunciation training.

Tajweed Detection is the computational recognition, classification, and evaluation of the phonological, prosodic, and articulation rules (Tajweed) that govern canonical Quranic recitation. Tajweed detection encompasses the automatic identification of correct and incorrect application of specific rules, temporal localization of rule phenomena within recited audio, and, increasingly, fine-grained error diagnosis and pedagogically actionable feedback for learners and researchers. Its technical architecture draws upon phonetics, ASR, deep learning, signal processing, and formal linguistics, with rule sets indexed to classical sources and orthographic/phonetic annotations in script and audio.

1. Taxonomy and Linguistic Substrate of Tajweed Rules

Tajweed rules are a rigorously codified set of phonetic phenomena and articulatory constraints regulating the recitation of Quranic text. Core rules include:

  • Madd (Elongation): Requires the extension of vowel sounds for prescribed durations, e.g., Separate Stretching (Madd Munfasil) enforces four or five harakah units on eligible vowels before hamza.
  • Ghunnah (Nasalization): Encompasses Tight Noon (Noon Mushaddadah Ghunnah), which demands two nasalized time-units on the letter noon with shadda.
  • Ikhfā' (Concealment): The intermediate state of the letter noon or tanwīn between full clarity and fading, with specified nasalization.
  • Idghām, Idhār, Qalqalah, Shaddah/Sukun: Assimilation, clarity, echoing, and gemination/absence of vowel, respectively.

The orthographic representation of these rules (CQO) involves dedicated diacritics and miniature glyphs mapped algorithmically from the Uthmani script. Computational systems must account for context-sensitive script rewrites, geminates, and pausal marks to parse or reconstruct the full phonological layer (Martínez, 16 May 2025).

2. Datasets and Annotation Paradigms

Accessible, high-quality Qur’anic recitation datasets underpin Tajweed detection research. The QDAT corpus, consisting of over 1,500 hand-labeled recordings, provides balanced examples spanning three rules—Madd, Ghunnah, Ikhfā’—with binary correctness labels per utterance (Harere et al., 2023, Shaiakhmetov et al., 30 Mar 2025). The Quran-MD dataset introduces multimodal alignment across 30+ reciters, ~665 hours of verse-level audio, and granular word-level audio-text records, supporting time-aligned Tajweed labeling at both the token and phoneme level with multi-hot or sequence-encoded rule vectors (Salman et al., 25 Jan 2026). Abdelfattah et al. propose the Quran Phonetic Script (QPS), a deterministic mapping from CQO to a 43-symbol phoneme alphabet plus a parallel ṣifat vector that captures 10 articulation attributes per phoneme, automating Tajweed-aware label generation for large-scale training (≈300,000 utterances) (Abdelfattah et al., 27 Aug 2025).
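The multi-hot rule vectors used for sequence-level labeling can be illustrated with a minimal sketch. The rule inventory below is drawn from the taxonomy in section 1, but the exact label set and helper names are hypothetical, not the Quran-MD schema:

```python
# Hypothetical sketch: encode per-token Tajweed rule annotations as
# multi-hot vectors, in the spirit of Quran-MD's sequence-encoded labels.
RULES = ["madd", "ghunnah", "ikhfa", "idgham", "idhar", "qalqalah"]
RULE_INDEX = {r: i for i, r in enumerate(RULES)}

def encode_labels(token_rules):
    """Map a list of per-token rule sets to a list of multi-hot vectors."""
    vectors = []
    for rules in token_rules:
        vec = [0] * len(RULES)
        for r in rules:
            vec[RULE_INDEX[r]] = 1
        vectors.append(vec)
    return vectors
```

A token carrying no rule maps to the all-zero vector, so the encoding composes cleanly with sequence models that predict one vector per frame or token.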

Benchmark design also extends to the QuranMB.v1 corpus, which injects controlled pronunciation confusions into curated verses read by native speakers, enabling evaluation of both error detection and diagnosis capabilities under a standardized phoneme set (Kheir et al., 9 Jun 2025).

3. Feature Extraction and Alignment Pipelines

Canonical Tajweed detection architectures rely on time-frequency representations:

  • MFCCs: Extraction uses standard pipelines: pre-emphasis, framing (e.g., 25–32 ms with 10–16 ms hops), Hamming window, DFT/FFT, mel filterbanks (M=40), log-scaling, and DCT-II for K (commonly 13) coefficients per frame (Harere et al., 2023, Salman et al., 25 Jan 2026). Δ and ΔΔ derivatives (deltas) track dynamics.
  • Mel-Spectrograms: STFT (e.g., N=1024/H=256), mel filters (e.g., M=224 over 0–4,000 Hz), log-scaling, and normalization to produce 2D matrices suitable for CNN ingestion (Shaiakhmetov et al., 30 Mar 2025).
  • Phoneme Alignment: Forced aligners (e.g., Montreal Forced Aligner), VAD, and sliding-window ASR (Whisper-Quran, Tarteel AI) segment continuous recitation at waqf (pausal marks) and produce time-aligned phonemic/word boundaries (Abdelfattah et al., 27 Aug 2025), supporting accurate mapping of script-encoded Tajweed events to corresponding audio intervals.
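The MFCC pipeline in the first bullet can be sketched end-to-end in numpy. Frame/hop sizes, M=40 mel filters, and K=13 coefficients follow the text; the FFT size (512) and pre-emphasis coefficient (0.97) are common defaults assumed here, not values from the cited papers:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_mels=40, n_ceps=13):
    # 1. Pre-emphasis to boost high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Framing: 25 ms windows with a 10 ms hop
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(sig) - flen) // hop)
    frames = np.stack([sig[i * hop:i * hop + flen] for i in range(n_frames)])
    # 3. Hamming window + magnitude spectrum via real FFT
    nfft = 512
    spec = np.abs(np.fft.rfft(frames * np.hamming(flen), nfft))
    power = spec ** 2 / nfft
    # 4. Triangular mel filterbank (M=40 filters up to Nyquist)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5. Log mel energies
    logmel = np.log(power @ fbank.T + 1e-10)
    # 6. DCT-II to decorrelate; keep the first K=13 coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T
```

One second of 16 kHz audio yields 98 frames of 13 coefficients; Δ and ΔΔ features are then first and second differences of this matrix along the time axis.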

Orthographic analysis pipelines (e.g., “tajweediser”) utilize context-sensitive rewrite rules based on Unicode diacritics, part-of-speech mapping, and explicit regular expressions to add or remove the Tajweed layer in CQO, with round-trip verification guaranteeing 100% recall/precision on the Cairo Edition (Martínez, 16 May 2025).
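A single context-sensitive rule of the kind "tajweediser" composes can be sketched with a lookahead regex. This is a deliberately simplified illustration — one rule, no tanwīn or pausal-mark handling — and does not reproduce the actual tajweediser rule set:

```python
import re

# Illustrative sketch: detect Ikhfa' sites where noon + sukun is followed
# (optionally across a space) by one of the 15 ikhfa' letters.
NOON_SAKINAH = "\u0646\u0652"            # noon + sukun
IKHFA_LETTERS = "تثجدذزسشصضطظفقك"        # the 15 ikhfa' letters

IKHFA_RE = re.compile(f"{NOON_SAKINAH}(?=\\s?[{IKHFA_LETTERS}])")

def find_ikhfa_sites(text):
    """Return character offsets of noon sakinah triggering Ikhfa'."""
    return [m.start() for m in IKHFA_RE.finditer(text)]
```

The lookahead keeps the match span on the trigger itself, so the offsets can be mapped directly onto script positions when adding or verifying the Tajweed annotation layer.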

4. Model Architectures and Rule-Specific Detection Algorithms

Modeling approaches have evolved from shallow per-rule classifiers to holistic neural sequence models:

  • Traditional ML: RBF-SVMs trained per rule on filter-bank energies with supervised thresholds offer high accuracy (≈99%) for a handful of rules but lack scalability and context awareness (Alagrami et al., 2020).
  • Deep Networks: LSTM and BiLSTM architectures process MFCC time series, capturing temporal dependencies and yielding accuracy up to 96% (per rule) on QDAT, with significant (2–24 point) gains over RF/KNN/SVM baselines (Harere et al., 2023, Salman et al., 25 Jan 2026). CNNs on spectrogram patches achieve Macro-F1 ≈ 0.75.
  • EfficientNet-B0 with SE Attention: Deeper spectral encoders, such as EfficientNet-B0 with Squeeze-and-Excitation, further improve accuracy (Madd: 95.35%, Ghunnah: 99.34%, Ikhfā’: 97.01%) and robustness, enabling real-time or interactive learning deployments (Shaiakhmetov et al., 30 Mar 2025).
  • SSL-based and Hybrid Models: Transformer-based encoders (wav2vec2.0, HuBERT, WavLM, mHuBERT) with frozen parameters followed by BiLSTM and CTC, trained for phoneme sequence prediction, underpin the best-performing pronunciation benchmark systems (phoneme error rate PER down to 0.16% for Tajweed-aware QPS tasks; F1 ≤30% for unconstrained error diagnosis under QuranMB.v1) (Kheir et al., 9 Jun 2025, Abdelfattah et al., 27 Aug 2025).

Scoring functions for key rules mathematically formalize duration, spectral, or similarity constraints:

  • Madd: deviation from canonical duration
  • Qalqalah: echo energy ratio in a signature frequency band
  • Idghām: MFCC centroid cosine similarity
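The three scoring shapes above can be written down directly; the function bodies below match the stated constraint types, but the thresholds, band edges, and normalizations are illustrative assumptions, not published values:

```python
import numpy as np

def madd_score(observed_s, canonical_s):
    """Madd: relative deviation of observed elongation from canonical duration."""
    return abs(observed_s - canonical_s) / canonical_s

def qalqalah_score(frame_power, band, total):
    """Qalqalah: ratio of echo energy in a signature band to total energy."""
    return frame_power[band].sum() / max(frame_power[total].sum(), 1e-12)

def idgham_score(mfcc_a, mfcc_b):
    """Idgham: cosine similarity of MFCC centroids of adjacent segments."""
    ca, cb = mfcc_a.mean(axis=0), mfcc_b.mean(axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
```

Each score is a scalar per rule occurrence, so a downstream decision layer can threshold or calibrate them independently.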

Rule graphs (FSMs) are used as explicit representations of permissible Tajweed transitions for inferencing and error localization (Al-Kharusi et al., 14 Oct 2025).
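A rule graph of this kind reduces to a transition table over phoneme-level events; walking it localizes the first impermissible transition. The states and events below are hypothetical placeholders, not the published rule graph:

```python
# Minimal FSM sketch over phoneme-level events: only permissible Tajweed
# transitions appear in the table; anything else is flagged as a violation.
TRANSITIONS = {
    ("start", "noon_sakinah"): "pending",
    ("pending", "ikhfa_letter"): "ikhfa",
    ("pending", "throat_letter"): "idhar",
    ("ikhfa", "any"): "start",
    ("idhar", "any"): "start",
}

def run_fsm(events):
    """Walk the rule graph; return (final state, index of first violation or -1)."""
    state = "start"
    for i, ev in enumerate(events):
        key = (ev if (state, ev) in TRANSITIONS else "any")
        if (state, key) not in TRANSITIONS:
            return state, i  # impermissible transition: error localized here
        state = TRANSITIONS[(state, key)]
    return state, -1
```

Because the walk returns the offending event index, the same structure serves both inference (is the recitation rule-compliant?) and error localization (where did it break?).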

5. Performance Metrics, Evaluation Frameworks, and Error Analyses

Tajweed detection systems are evaluated with per-rule and per-segment metrics, chiefly accuracy, macro-F1, and phoneme error rate (PER).

Absolute performance remains modest for open-vocabulary, multi-rule detection (F1 ≤ 0.3), reflecting acoustic and linguistic complexity (Kheir et al., 9 Jun 2025).
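The two metric families cited in this section — PER over decoded phoneme sequences and per-rule macro-F1 — can be computed with a short pure-Python sketch:

```python
# PER via Levenshtein edit distance, and macro-F1 over a label set.
def edit_distance(ref, hyp):
    """Minimum insertions/deletions/substitutions turning ref into hyp."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def per(ref, hyp):
    """Phoneme error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(labels)
```

Macro averaging weights each rule equally regardless of frequency, which matters here because rule occurrences in recitation corpora are highly imbalanced.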

6. Systemic Challenges and Future Directions

Tajweed detection systems face several structural challenges:

  • Data Scarcity and Representation: Limitations in public, fine-grained, Tajweed-annotated corpora and demographic diversity hinder generalizability. Automated corpus generation (e.g., Abdelfattah et al.'s 98%-automated pipeline) and phonetization with QPS address scale and script fidelity barriers (Abdelfattah et al., 27 Aug 2025).
  • Rule Coverage and Localization: Most published work addresses ≤3 rules on isolated verses; expanding to all 14+ canonical rules over continuous recitation remains a priority (Harere et al., 2023, Shaiakhmetov et al., 30 Mar 2025).
  • Model-Data Objective Alignment: Conventional ASR/CTC models optimize word or phone accuracy, not compliance with Tajweed rules. Knowledge-centric frameworks advocate for embedding the rule canon and articulation graphs into inference engines, aligning scoring with rule-specific parameters (duration, spectral profile, etc.) rather than transcription alone (Al-Kharusi et al., 14 Oct 2025).
  • Interpretability and Pedagogical Feedback: Advanced systems map detected errors to explanatory, localized, and actionable feedback for learners, supporting individualized pedagogy and progress tracking (Shaiakhmetov et al., 30 Mar 2025, Abdelfattah et al., 27 Aug 2025, Al-Kharusi et al., 14 Oct 2025).
  • Hybrid Model Integration: Future architectures are expected to fuse deep acoustic encoders (SSL) with explicit rule graphs, supporting multi-task learning (phoneme+rule), adversarial or contrastive objectives, and expert-validated, demographically inclusive datasets.

A plausible implication is that robust, pedagogically faithful Tajweed detection will demand explicit rule encoding at inference time, tight script-audio integration, and multi-source supervision spanning phonetic, orthographic, and expert annotation domains (Al-Kharusi et al., 14 Oct 2025, Martínez, 16 May 2025).

7. Applications, Benchmarks, and Theological/Rigorous Validation

Applications of Tajweed detection span pronunciation training with localized, actionable feedback, real-time interactive learning deployments, and research tooling for script-audio alignment and corpus construction.

Open questions remain regarding the full integration of knowledge-driven rule graphs with neural architectures, standardized cross-dataset evaluation protocols, inclusion of multi-dialect and L2 speaker data, and theological vetting of automated system recommendations. Future advances are expected to emerge from hybrid, knowledge-infused systems tightly aligned to the phonological and canonical authority of the Quranic text (Al-Kharusi et al., 14 Oct 2025, Martínez, 16 May 2025).
