- The paper introduces a character-level transformer model for Bangla text-to-IPA transcription, achieving a WER of 0.10582.
- The model leverages sequence alignment and extensive preprocessing to handle punctuation, foreign words, and numerals effectively.
- The study demonstrates significant improvements in transcription accuracy, offering practical benefits for NLP and speech technology applications.
Character-Level Bangla Text-to-IPA Transcription Using Transformer Architecture with Sequence Alignment
Introduction
This paper presents a study on Bangla text-to-IPA transcription using a transformer-based sequence-to-sequence model. Bangla is one of the most widely spoken languages in the world, and its phonetic intricacies make accurate grapheme-to-phoneme conversion difficult; the authors address this by building on the transformer architecture rather than traditional rule-based IPA mapping. IPA transcription matters across linguistic and technological contexts, and the paper targets both methodological improvements and practical applications in fields such as language learning, speech therapy, and text-to-speech systems.
Methodology
Dataset and Preprocessing
The dataset derives from the DataVerse Challenge - ITVerse 2023 and consists of Bangla text paired with corresponding IPA transcriptions. The training set includes 21,999 samples and the test set contains 27,228 samples. The authors identified the unique characters in both the text and IPA sides, handled alignment variations between them, and examined statistics such as the distribution of word counts per sample to refine the model's input.
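The paper does not publish its preprocessing code, but the character-level setup it describes can be sketched as follows; the helper names and the toy samples below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of character-level vocabulary construction and encoding,
# assuming the data is available as plain strings of Bangla text / IPA.
PAD, SOS, EOS = "<pad>", "<s>", "</s>"

def build_char_vocab(samples):
    """Collect every unique character across the samples into an index map."""
    chars = sorted(set(ch for s in samples for ch in s))
    return {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}

def encode(text, vocab):
    """Map a string to character indices, wrapped in start/end tokens."""
    return [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]

texts = ["বাংলা", "ভাষা"]            # toy samples (hypothetical)
vocab = build_char_vocab(texts)       # 6 unique characters + 3 specials
ids = encode("ভাষা", vocab)           # 4 characters + SOS/EOS
```

Working at the character level keeps the vocabulary small, which is what makes the compact single-layer model described next feasible.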
Figure 1: Word count histogram of training dataset
Model Architecture
The study employs a simplified transformer model with a single encoder layer and a single decoder layer, totaling 8.5 million parameters. The model operates at the character level to accommodate the high variance in Bangla's phonetic representation. Extensive preprocessing handles punctuation marks, foreign words, and numerals, improving practical accuracy while keeping computational overhead low.
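The paper states the 8.5M-parameter figure but not the underlying hyperparameters. As a hedged back-of-envelope check (the dimensions below are assumptions, not values from the paper), the parameter count of a standard one-layer transformer can be estimated like this:

```python
def transformer_param_count(d_model, d_ff, vocab_src, vocab_tgt,
                            n_enc=1, n_dec=1):
    """Rough parameter count for a standard transformer (illustrative only)."""
    # Self-attention: Q, K, V, and output projections, each with bias.
    attn = 4 * (d_model * d_model + d_model)
    # Feed-forward: two linear layers d_model <-> d_ff, with biases.
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Encoder layer: self-attention + FFN, two layer norms (scale + shift).
    enc_layer = attn + ffn + 2 * 2 * d_model
    # Decoder layer adds cross-attention and a third layer norm.
    dec_layer = 2 * attn + ffn + 3 * 2 * d_model
    # Embedding tables and output projection over the target vocabulary.
    embeds = (vocab_src + vocab_tgt) * d_model
    out_proj = d_model * vocab_tgt + vocab_tgt
    return n_enc * enc_layer + n_dec * dec_layer + embeds + out_proj
```

With plausible settings such as d_model=512, d_ff=2048, and character vocabularies of roughly 100 symbols, this lands in the 7-8M range, which is consistent in scale with the reported 8.5M once positional encodings and implementation details are accounted for.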
Training and Inference
The model achieves a word error rate (WER) of 0.10582, placing at the top of the public leaderboard of the DataVerse Challenge. Training involves careful hyperparameter tuning, with attention to stability and performance across varied data subsets. At inference time, a dictionary caches previously computed IPA mappings, reusing them to improve speed and resource utilization.
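The dictionary-backed inference described above amounts to memoizing per-word transcriptions so repeated words skip the model entirely. A minimal sketch, where `model_transcribe` is a hypothetical stand-in for the transformer's decode step:

```python
# Cache mapping Bangla words to their previously computed IPA output.
ipa_cache = {}

def model_transcribe(word):
    # Placeholder for a real model call; here it just wraps the word.
    return f"/{word}/"

def transcribe_sentence(sentence):
    """Transcribe word by word, consulting the cache before the model."""
    out = []
    for word in sentence.split():
        if word not in ipa_cache:
            ipa_cache[word] = model_transcribe(word)  # model runs once per word
        out.append(ipa_cache[word])
    return " ".join(out)
```

Because natural text repeats common words heavily, this trades a small amount of memory for a large reduction in model invocations over a test set.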
Results and Analysis
The model's performance is evaluated through iterative enhancements addressing challenges specific to Bangla text, such as punctuation and embedded foreign words. The study reports substantial WER reductions from each preprocessing and handling strategy, and comparative results against baseline and enhanced models demonstrate the system's robustness and potential for real-world application.
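The WER metric used throughout is the word-level Levenshtein distance between reference and hypothesis, normalized by reference length. A self-contained sketch of the standard computation (the paper presumably uses an equivalent or library implementation):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Under this definition, a score of 0.10582 means roughly one word-level error per ten reference words.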
Figure 2: Architecture of our system
Conclusion
The research demonstrates the potential of transformer architectures for complex phonetic transcription tasks. By focusing on Bangla, the study advances linguistic research and highlights the applicability of such models to NLP tasks with similar challenges. Future work could expand the dataset to cover more phonetic variation and explore more sophisticated or hybrid architectures. The findings point toward more accurate and efficient AI-driven linguistic tools.