
CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition

Published 26 Feb 2025 in cs.CL, cs.SD, and eess.AS (arXiv:2502.18913v2)

Abstract: Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and establish benchmark ASR performance using state-of-the-art models. Our experiments, using Transformer, Conformer, and Branchformer, demonstrate the challenges of code-switching ASR, and show that existing pre-trained models such as Whisper still have room for improvement. The CS-Dialogue dataset will be made freely available for all academic purposes.

Summary

  • The paper introduces a public dataset with 104 hours of spontaneous Mandarin-English dialogues to improve ASR for natural code-switching.
  • It details rigorous data acquisition and transcription protocols from native speakers, ensuring high fidelity and ethical standards.
  • Experimental results show Conformer models excel with this dataset while highlighting the ongoing challenges in code-switching ASR.


Introduction

The paper introduces CS-Dialogue, a sizable dataset containing 104 hours of spontaneous Mandarin-English code-switching dialogues for advancing automatic speech recognition (ASR). This dataset addresses three primary challenges: the paucity of spontaneous data, limited dataset sizes, and incomplete dialogue transcriptions in existing resources, which together hinder the development of robust ASR models suitable for real-world applications.

Dataset Creation

Data Acquisition

The dataset comprises recordings from 200 native Chinese speakers with high English proficiency, covering seven topics: personal topics, entertainment, technology, education, job, philosophy, and sports. Conversations were deliberately structured to progress through monolingual Mandarin, code-switching, and monolingual English phases, yielding a balanced linguistic distribution. Ethical considerations were carefully observed, with participants providing informed consent under established ethical guidelines.
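The three-phase structure above implies that each utterance can be coarsely assigned to one of the linguistic conditions. A minimal sketch of such a labeler, using the presence of CJK characters versus Latin letters as a rough proxy, is shown below; this helper is an illustration and not part of the paper's pipeline.

```python
import re

# Rough heuristics: CJK Unified Ideographs indicate Mandarin content,
# Latin letters indicate English content. (Assumed for illustration;
# the paper's own segmentation criteria may differ.)
CJK = re.compile(r'[\u4e00-\u9fff]')
LATIN = re.compile(r'[A-Za-z]')

def label_utterance(text: str) -> str:
    """Label an utterance as monolingual Mandarin, monolingual
    English, or code-switched, based on its character inventory."""
    has_zh = bool(CJK.search(text))
    has_en = bool(LATIN.search(text))
    if has_zh and has_en:
        return "code-switched"
    if has_zh:
        return "mandarin"
    if has_en:
        return "english"
    return "other"
```

Such a labeler is enough to verify that a dialogue indeed moves through the three intended phases, or to compute the per-condition duration statistics the paper reports.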

Annotation Process

The dataset's annotation process was rigorous, ensuring high transcription fidelity. Transcriptions strictly adhered to the actual spoken content, preserving disfluencies and local accents. Dedicated symbols mark non-lexical acoustic events, enhancing the dataset's utility for training noise-robust ASR systems.
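When scoring ASR hypotheses against such annotated references, the non-lexical event markers typically need to be stripped first. A minimal sketch is below; the bracketed tag format (e.g. `[NOISE]`, `[LAUGHTER]`) is an assumption for illustration, as the paper's exact symbol inventory is not reproduced here.

```python
import re

# Assumed tag format: uppercase event names in square brackets,
# e.g. "[NOISE]" or "[LAUGHTER]". The paper's actual symbols may differ.
EVENT_TAG = re.compile(r'\[[A-Z]+\]')

def strip_event_tags(transcript: str) -> str:
    """Remove bracketed non-lexical event markers and collapse the
    resulting whitespace, producing text suitable for error-rate
    scoring of text-only ASR hypotheses."""
    cleaned = EVENT_TAG.sub(' ', transcript)
    return re.sub(r'\s+', ' ', cleaned).strip()
```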

Dataset Overview

CS-Dialogue is distinguished by its capture of full-length dialogues rather than isolated sentences, facilitating the study of natural speech patterns and contextual dependencies. The dataset is categorized into train, development, and test splits, ensuring that each segment retains a representative linguistic distribution. Speaker demographics further enhance the dataset's diversity, providing a broad geographic and age representation from across China.
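For dialogue datasets like this one, splits are usually made speaker-disjoint so that no voice appears in both training and evaluation. The sketch below illustrates that idea under stated assumptions; it is a generic helper, not the paper's actual partitioning procedure, and the 80/10/10 ratios are illustrative defaults.

```python
import random

def speaker_disjoint_split(utterances, train=0.8, dev=0.1, seed=0):
    """Partition (speaker_id, utterance) pairs into train/dev/test so
    that no speaker appears in more than one split. Illustrative
    helper; the paper's actual split procedure may differ."""
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_train = int(len(speakers) * train)
    n_dev = int(len(speakers) * dev)
    split_of = {}
    for i, spk in enumerate(speakers):
        split_of[spk] = ("train" if i < n_train
                         else "dev" if i < n_train + n_dev
                         else "test")
    out = {"train": [], "dev": [], "test": []}
    for spk, utt in utterances:
        out[split_of[spk]].append((spk, utt))
    return out
```

Splitting by speaker rather than by utterance prevents the model from scoring well simply by memorizing voices it has already heard.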

Experimental Evaluation

Baseline ASR Models

The paper evaluates several ASR models trained from scratch, including Transformer, Conformer, and Branchformer, and finds that Conformer models perform best, owing to their ability to capture both local and global context.
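Mandarin-English code-switching ASR is commonly scored with a mixed error rate that treats each Mandarin character and each English word as one token, so that neither language's error dominates the metric. A minimal sketch of this style of metric is below; whether the paper uses exactly this formulation is an assumption.

```python
def tokenize_cs(text: str):
    """Tokenize mixed text: each CJK character becomes one token,
    contiguous alphanumeric (Latin) runs become word tokens."""
    tokens, word = [], []
    for ch in text:
        if '\u4e00' <= ch <= '\u9fff':
            if word:
                tokens.append(''.join(word)); word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch)
        else:
            if word:
                tokens.append(''.join(word)); word = []
    if word:
        tokens.append(''.join(word))
    return tokens

def edit_distance(ref, hyp):
    """Levenshtein distance over token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,       # deletion
                                     dp[j - 1] + 1,   # insertion
                                     prev + (r != h)) # substitution
    return dp[len(hyp)]

def mix_error_rate(ref: str, hyp: str) -> float:
    """Edit distance over mixed tokens, normalized by reference length."""
    ref_t, hyp_t = tokenize_cs(ref), tokenize_cs(hyp)
    return edit_distance(ref_t, hyp_t) / max(len(ref_t), 1)
```

For example, against the reference 我有meeting, the hypothesis 我有会议 substitutes one token and inserts another, giving an error rate of 2/3.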

Pre-trained Model Performance

State-of-the-art pre-trained models, such as Whisper and Qwen2-Audio, were assessed both in zero-shot settings and after fine-tuning on CS-Dialogue. The fine-tuning process significantly enhanced model performance, particularly for larger models, reflecting the critical need for specialized datasets in training ASR systems for code-switching tasks.

Discussion

The experimental results underscore the complexity of code-switching in ASR even for robust pre-trained models, revealing substantial error rates and emphasizing the need for continued refinement in model architectures and training techniques. The heterogeneous performance across topics indicates potential for further enhancements through domain-specific adaptations.

Conclusion

CS-Dialogue provides an extensive, high-quality resource for advancing research into code-switching ASR. By filling existing dataset gaps, it enables the development of more sophisticated ASR systems capable of recognizing and processing natural code-switching speech patterns. Moreover, its release as an open-access dataset represents a significant step forward for the academic community, offering valuable benchmarks and fostering dialogue-oriented speech processing innovations.


Future work should focus on broadening the dataset's application by extending to other language pairs and improving the simulation of real-world recording environments, thereby further supporting multilingual and multicultural ASR applications.
