- The paper introduces a public dataset with 104 hours of spontaneous Mandarin-English dialogues to improve ASR for natural code-switching.
- It details rigorous data acquisition and transcription protocols from native speakers, ensuring high fidelity and ethical standards.
- Experimental results show Conformer models excel with this dataset while highlighting the ongoing challenges in code-switching ASR.
CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
Introduction
The paper introduces CS-Dialogue, a sizable dataset containing 104 hours of spontaneous Mandarin-English code-switching dialogues for advancing automatic speech recognition (ASR). This dataset addresses three primary challenges: the paucity of spontaneous data, limited dataset sizes, and incomplete dialogue transcriptions in existing resources, which together hinder the development of robust ASR models suitable for real-world applications.
Dataset Creation
Data Acquisition
The dataset comprises recordings from 200 native Chinese speakers with high English proficiency, spanning seven topics: personal matters, entertainment, technology, education, work, philosophy, and sports. Conversations were structured to progress from monolingual Mandarin through code-switching to monolingual English, yielding a balanced linguistic distribution. Ethical standards were observed throughout, with participants providing informed consent under established guidelines.
Annotation Process
The annotation process was rigorous, ensuring high transcription fidelity. Transcriptions adhered strictly to the actual spoken content, preserving disfluencies and regional accents. Non-lexical events were marked with dedicated symbols capturing acoustic phenomena, which enhances the dataset's utility for training noise-robust ASR systems.
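Transcripts annotated with non-lexical symbols usually need to be normalized before scoring ASR hypotheses against them. The sketch below illustrates this with hypothetical bracketed markers such as `[NOISE]`; the dataset's actual symbol inventory may differ.

```python
import re

# Hypothetical non-lexical markers; the dataset's real annotation
# symbols may use a different inventory or notation.
NON_LEXICAL = re.compile(r"\[(?:NOISE|LAUGHTER|BREATH|COUGH)\]")

def normalize_transcript(text: str) -> str:
    """Strip non-lexical event markers and collapse whitespace,
    e.g. before computing error rates against ASR hypotheses."""
    text = NON_LEXICAL.sub(" ", text)
    return " ".join(text.split())
```

For example, `normalize_transcript("我 昨天 [NOISE] 去了 meeting")` yields `"我 昨天 去了 meeting"`, leaving only the lexical content for evaluation.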
Dataset Overview
CS-Dialogue is distinguished by its capture of full-length dialogues rather than isolated sentences, facilitating the study of natural speech patterns and contextual dependencies. The dataset is partitioned into train, development, and test splits, with each split preserving a representative linguistic distribution. Speaker demographics further enhance the dataset's diversity, covering a broad range of ages and geographic regions across China.
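Splits for speech corpora are typically made speaker-disjoint so that no voice appears in both training and evaluation data. The paper does not publish its exact splitting script, so the following is only a minimal stdlib sketch of the general technique, with the ratio and seed chosen for illustration.

```python
import random
from collections import defaultdict

def speaker_disjoint_split(utterances, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split (speaker_id, utterance) pairs into train/dev/test so that
    no speaker appears in more than one split."""
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)  # deterministic shuffle
    n = len(speakers)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    groups = (speakers[:n_train],
              speakers[n_train:n_train + n_dev],
              speakers[n_train + n_dev:])
    # Keep the speaker id with each utterance so disjointness is checkable.
    return [[(s, u) for s in g for u in by_speaker[s]] for g in groups]
```

A real pipeline would additionally balance the splits for language ratio, topic, and demographics, which this sketch does not attempt.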
Experimental Evaluation
Baseline ASR Models
The paper evaluates several ASR models trained from scratch, including Transformer, Conformer, and Branchformer, and finds that Conformer models perform best, owing to their ability to capture both local and global acoustic context.
State-of-the-art pre-trained models, such as Whisper and Qwen2-Audio, were assessed both in zero-shot settings and after fine-tuning on CS-Dialogue. The fine-tuning process significantly enhanced model performance, particularly for larger models, reflecting the critical need for specialized datasets in training ASR systems for code-switching tasks.
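Code-switching ASR is commonly scored with a Mixed Error Rate (MER), which counts Mandarin errors per character and English errors per word. The sketch below shows the usual tokenization plus a standard Levenshtein distance; it is a general illustration of the metric, not the paper's exact scoring script.

```python
import re

def tokenize_mixed(text):
    """Split Mandarin into single characters and keep English words whole,
    the usual tokenization for Mixed Error Rate on code-switched text."""
    return [t.lower() for t in re.findall(r"[A-Za-z']+|[\u4e00-\u9fff]", text)]

def edit_distance(ref, hyp):
    """Standard Levenshtein distance over token sequences (rolling rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def mixed_error_rate(ref, hyp):
    """Token-level edit distance normalized by reference length."""
    r, h = tokenize_mixed(ref), tokenize_mixed(hyp)
    return edit_distance(r, h) / max(len(r), 1)
```

For example, against the reference "我想喝coffee", the hypothesis "我要喝coffee" substitutes one of four tokens, giving an MER of 0.25.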
Discussion
The experimental results underscore the complexity of code-switching in ASR even for robust pre-trained models, revealing substantial error rates and emphasizing the need for continued refinement in model architectures and training techniques. The heterogeneous performance across topics indicates potential for further enhancements through domain-specific adaptations.
Conclusion
CS-Dialogue provides an extensive, high-quality resource for advancing research into code-switching ASR. By filling existing dataset gaps, it enables the development of more sophisticated ASR systems capable of recognizing and processing natural code-switching speech patterns. Moreover, its release as an open-access dataset represents a significant step forward for the academic community, offering valuable benchmarks and fostering dialogue-oriented speech processing innovations.
Future work should focus on broadening the dataset's application by extending to other language pairs and improving the simulation of real-world recording environments, thereby further supporting multilingual and multicultural ASR applications.