Scalability of TTS-based augmentation as a substitute for real learner speech

Determine whether text-to-speech (TTS)-based mispronunciation augmentation can scale well enough to substitute for real learner speech when training mispronunciation detection and diagnosis (MDD) systems for Modern Standard Arabic. The answer would guide data collection and augmentation strategies for Arabic pronunciation assessment.

Background

The IQRA 2026 challenge released multiple training corpora, including Iqra_TTS (synthetic speech with injected mispronunciations) and Iqra_Extra_IS26 (1,333 utterances of authentic human mispronounced speech). Across top submissions, systems that incorporated the small authentic mispronunciation corpus consistently outperformed those relying primarily on far larger synthetic corpora.

This empirical gap prompted the authors to flag an open question: can scaling TTS-based mispronunciation augmentation replace real learner speech for this task? Answering it would clarify the role and limits of synthetic data in Arabic MDD.

References

The IQRA 2026 challenge marks a significant milestone in Arabic pronunciation assessment, yet the results also surface important open questions that the community must address to move this research toward real-world impact. Despite Iqra_Extra_IS26 containing only 1,333 utterances, systems that carefully leveraged it consistently outperformed those relying on far larger synthetic corpora. This raises a fundamental question about the scalability of TTS-based augmentation as a substitute for real learner speech, and motivates investment in larger-scale human data collection as the primary bottleneck for future progress.

IQRA 2026: Interspeech Challenge on Automatic Assessment Pronunciation for Modern Standard Arabic (MSA) (2603.29087 - Kheir et al., 31 Mar 2026) in Section 6 (Discussion), first and second paragraphs