Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm
Abstract: End-to-end transformer-based models represent the state of the art in Automatic Speech Recognition (ASR). Despite their substantial benefits, these models demand extensive training data to perform optimally, which poses a significant challenge for low-resource languages such as Central Kurdish. Addressing this issue requires innovative methods and techniques. This paper develops an ASR system for Central Kurdish by collecting a robust speech corpus, applying an n-gram language model, and integrating an external Kurdish tokenizer to enhance the model's performance. We collect a comprehensive 100-hour speech corpus from diverse sources and fine-tune Persian, English, and Arabic pre-trained Wav2Vec 2.0 models (xls-r-300m, xls-r-1b, and xls-r-2b) on this corpus. We also train 3-gram and 4-gram language models on a large text corpus of 300 million tokens. The fine-tuned xls-r-2b model, combined with a 3-gram language model and the external Kurdish tokenizer, achieved the best performance, yielding a Word Error Rate (WER) of 10.0% on the validation set and 11.8% on the AsoSoft test set. Compared with existing Kurdish ASR models, our model benefits from a larger vocabulary and delivers more accurate results at a lower error rate.
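The Word Error Rate metric reported in the abstract is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. A minimal sketch of this computation (a generic illustration, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words, one rolling row.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag = row[0]   # value of row[i-1][j-1]
        row[0] = i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(
                row[j] + 1,                  # deletion
                row[j - 1] + 1,              # insertion
                prev_diag + (r != h),        # substitution (0 if words match)
            )
            prev_diag = cur
    return row[-1] / len(ref)

# One substitution in a 10-word reference gives WER = 10.0%,
# matching the scale of the validation-set result in the abstract.
ref = "one two three four five six seven eight nine ten"
hyp = "one two three four five six seven eight nine zero"
print(f"{wer(ref, hyp):.1%}")
```

A WER of 10.0% therefore corresponds to roughly one word-level error (substitution, insertion, or deletion) per ten reference words.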