- The paper introduces a comprehensive multi-source dataset and evaluation protocol under which the best system more than doubles the F1-score of prior editions in mispronunciation detection for MSA.
- The paper demonstrates that incorporating authentic mispronunciation data improves performance over synthetic speech, emphasizing the need for real error corpora.
- The paper reveals that advanced temporal alignment techniques and emerging generative large audio language models (LALMs) can overcome Arabic phoneme mapping challenges and enhance feedback precision.
IQRA 2026: Advances in Automatic Pronunciation Assessment for Modern Standard Arabic
Background and Motivation
The IQRA 2026 Interspeech Challenge addresses persistent gaps in automatic pronunciation assessment for Modern Standard Arabic (MSA), focusing specifically on mispronunciation detection and diagnosis (MDD). Despite significant progress in computer-assisted pronunciation training (CAPT) and MDD for English and Mandarin, Arabic has remained underserved due to a lack of open annotated datasets, standardized phoneme inventories, and reproducible evaluation protocols. The phonological complexity of Arabic (34 phonemes, including rare uvular and pharyngeal consonants and emphatic/non-emphatic distinctions) poses considerable modeling challenges, further complicated by diglossia and L1 interference when native dialect speakers acquire MSA. Prior shared tasks in Arabic MDD achieved limited performance, mainly due to reliance on small or synthetic corpora [el2025iqra].
Challenge Design and Resources
IQRA 2026 introduces a comprehensive evaluation campaign with a fourfold dataset design:
- Iqra_train: 79 hours of real Arabic speech, fully vowelized and phonetized, integrating Common Voice v12.0 and Qur'anic recitations, with broad demographic coverage and open access.
- Iqra_TTS: 52 hours of synthetic speech from seven TTS engines, split between canonical and systematically mispronounced utterances via phoneme confusion matrices.
- Iqra_Extra_IS26: 1,333 utterances of real human mispronounced speech, collected explicitly for authentic error modeling, filling a critical gap left by previous editions.
- QuranMB.v2: An expanded evaluation benchmark with detailed phoneme-level human annotation.
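The Iqra_TTS mispronunciations are generated by perturbing canonical phoneme sequences with phoneme confusion matrices. A minimal sketch of such injection is shown below; the confusion table, probabilities, and error rate here are illustrative assumptions, not the challenge's actual matrices:

```python
import random

# Illustrative confusion probabilities P(substitute | canonical phoneme).
# Real matrices would encode attested L1-interference patterns, e.g. the
# emphatic /s'/ realized as plain /s/, or uvular /q/ realized as /k/.
CONFUSIONS = {
    "s'": {"s": 0.8, "s'": 0.2},
    "d'": {"d": 0.7, "d'": 0.3},
    "q":  {"k": 0.6, "q": 0.4},
}

def inject_errors(canonical, rate=0.3, rng=None):
    """Replace confusable phonemes with a weighted substitute at a given rate."""
    rng = rng or random.Random(0)
    out = []
    for ph in canonical:
        table = CONFUSIONS.get(ph)
        if table and rng.random() < rate:
            subs, weights = zip(*table.items())
            out.append(rng.choices(subs, weights=weights)[0])
        else:
            out.append(ph)  # no confusion entry, or kept canonical this time
    return out

canonical = ["q", "a", "l", "a", "s'", "a"]
verbatim = inject_errors(canonical, rate=1.0)  # every confusable slot perturbed
```

Pairing each perturbed (verbatim) sequence with its canonical transcript yields the labeled error data used for synthetic training.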
The task is formulated as sequence-level phoneme prediction aligned against both canonical and verbatim transcript sequences, with the primary metric being F1-score over correctly identified mispronunciations. Additional metrics include Correct Diagnosis rate (CD), phoneme error rate (PER), and precision-recall trade-offs.
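The detection metrics follow the standard MDD error taxonomy (true acceptance, false rejection, false acceptance, true rejection) over the canonical, verbatim, and predicted sequences. A simplified sketch, assuming the three sequences are already position-aligned (real scoring must first align sequences of differing length):

```python
def mdd_counts(canonical, verbatim, predicted):
    """Count TA/FR/FA/TR over position-aligned phoneme triples."""
    ta = fr = fa = tr = 0
    for c, v, p in zip(canonical, verbatim, predicted):
        if v == c:                  # speaker pronounced the phoneme correctly
            if p == c: ta += 1      # true acceptance
            else:      fr += 1      # false rejection (spurious flag)
        else:                       # genuine mispronunciation
            if p == c: fa += 1      # false acceptance (missed error)
            else:      tr += 1      # true rejection (detected error)
    return ta, fr, fa, tr

def detection_f1(canonical, verbatim, predicted):
    """F1 over correctly identified mispronunciations."""
    ta, fr, fa, tr = mdd_counts(canonical, verbatim, predicted)
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

This accounting makes the precision-recall trade-off discussed later concrete: a system that flags everything drives FR up and precision down, while a timid system accumulates FA and loses recall.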
Baseline and Submitted Approaches
The provided baseline uses a frozen multilingual mHuBERT encoder (SSL, 94M parameters), layer aggregation, and a Bi-LSTM with CTC loss. It achieves F1 = 0.4414 on QuranMB.v2.
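At inference, a CTC-trained head like the baseline's emits a per-frame label (phoneme or blank), which a greedy decode collapses into a phoneme sequence. The collapse rule itself is standard CTC decoding; the frame labels below are invented for illustration:

```python
BLANK = "<blk>"

def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Standard CTC best-path collapse: merge consecutive repeats, drop blanks.
    A blank between two identical labels preserves a genuine double phoneme."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

frames = ["<blk>", "q", "q", "<blk>", "a", "a", "l", "<blk>", "l"]
collapsed = ctc_greedy_collapse(frames)  # → ['q', 'a', 'l', 'l']
```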
Submitted systems from 19 international teams spanned diverse architectures:
- Temporal Alignment Enhancement: whu-iasp combines wav2vec2-xls-r-300m with TCN and multi-stage curriculum, exploiting confusion networks and n-gram LM rescoring. UTokyo proposes CROTTC, using optimal transport for dense frame-level alignments and regularization via self-distillation. These models directly tackle alignment-based bottlenecks in phoneme sequence prediction.
- Data Quality and Reliability: RAM employs inter-model agreement filtering and targeted label curation, leveraging manual review for high-error instances and filtering synthetic speech by ASR agreement.
- Hybrid and Generative Paradigms: SQZ_ww fuses U2++ Conformer with Transformer decoder via joint CTC/attention loss, augmented by synthetic error generation and two-pass decoding. Najva adapts FastConformer, transitioning outputs from grapheme to phoneme via phonetization and variant normalization, with stage-wise specialization on authentic error data.
- Parameter-Efficient Generative Modeling: Kalimat demonstrates the first use of generative LALMs for direct speech-to-phoneme generation, employing LoRA adaptation for decoder parameters and fully frozen encoders, achieving competitive results despite optimizing only a subset of parameters.
Numerical Results and Analysis
The best system (whu-iasp) sets F1 = 0.7201, outperforming the baseline by +0.2787 absolute—a more than twofold improvement over prior editions (F1 ≈ 0.30). All top-six systems achieved F1 > 0.67 and PER ≤ 0.0445, with performance tightly correlated to data quality and architectural sophistication.
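The PER figures reported here are edit-distance based: substitutions, insertions, and deletions against the reference phoneme sequence, normalized by reference length. A compact sketch of that computation:

```python
def levenshtein(ref, hyp):
    """Minimum edit distance between two phoneme sequences (DP over prefixes)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def per(ref, hyp):
    """Phoneme error rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / len(ref) if ref else 0.0
```

A PER of 0.0445 thus means fewer than one phoneme-level edit per twenty reference phonemes.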
A marked precision–recall trade-off emerges: systems below the baseline display excessively high recall but poor precision—effectively flagging nearly all phonemes as errors, a sign of uncalibrated detection. Top systems maintain balanced precision and recall (e.g., whu-iasp: 0.7416/0.6998), indicative of well-calibrated discrimination between canonical and erroneous pronunciations.
Crucially, systems incorporating Iqra_Extra_IS26 authentic mispronunciation data consistently outperformed those relying solely on synthetic augmentation, affirming the necessity of real error corpora. Even with limited utterances, authentic data drove both modeling gains and generalizability across evaluation sets.
Architecturally, enhancements in temporal alignment and integration of frame-level error modeling emerged as the principal levers for improved performance. Generative LALMs—while not yet matching SOTA—demonstrated viable performance and potential for future scalability.
Implications, Open Challenges, and Future Directions
IQRA 2026 substantiates the value of open, authentic mispronunciation corpora and refined alignment techniques for Arabic MDD. However, several pressing challenges remain:
- Data Scalability: Results strongly suggest that TTS-based augmentation is insufficient for error modeling, motivating larger-scale human mispronunciation collection, especially from L2 learners.
- Phoneme–Character Mapping: Current systems operate at phoneme-level granularity, but actionable learner feedback requires mapping to Arabic script characters and diacritics—a nontrivial correspondence given Arabic grapheme–phoneme complexity. Bridging this gap is essential for practical CAPT deployment.
- Task Reformulation: Generative LALMs enable direct error feedback and explanation, bypassing the phoneme–character mapping bottleneck. Realizing this vision requires richer supervision, annotation, and evaluation methods beyond F1 and PER.
- Community Infrastructure: The challenge fosters reproducibility and openness, but future progress depends on standardized protocols, expanded leaderboards, and continued collaboration for larger-scale benchmarking.
Conclusion
IQRA 2026 sets a new benchmark for Arabic MDD, introducing the first corpus of authentic human-mispronounced MSA speech and demonstrating dramatic improvements in phoneme error detection and diagnosis metrics. The impact of real mispronunciation data and innovative alignment strategies is consistently validated across approaches. Future research should focus on large-scale authentic data acquisition, robust phoneme-to-character mapping, and next-generation generative feedback systems, with the long-term goal of actionable, script-level pronunciation guidance for Arabic learners.
Reference: "IQRA 2026: Interspeech Challenge on Automatic Assessment Pronunciation for Modern Standard Arabic (MSA)" (2603.29087)