Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations

Published 8 May 2025 in cs.CL and cs.AI | (2505.05056v1)

Abstract: This paper reports the construction of the Teochew-Wild, a speech corpus of the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew speech data from multiple speakers, covering both formal and colloquial expressions, with precise orthographic and pinyin annotations. Additionally, we provide supplementary text processing tools and resources to propel research and applications in speech tasks for this low-resource language, such as automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our knowledge, this is the first publicly available Teochew dataset with accurate orthographic annotations. We conduct experiments on the corpus, and the results validate its effectiveness in ASR and TTS tasks.


Summary

Overview of Teochew-Wild Speech Corpus

The paper "Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations" introduces a dataset aimed at narrowing the resource gap for Teochew, a dialect that is underrepresented in current speech processing research. The Teochew-Wild corpus provides 18.9 hours of in-the-wild Teochew speech from multiple speakers, covering both formal and colloquial expressions, with orthographic and pinyin annotations. The authors state that, to the best of their knowledge, it is the first publicly available Teochew dataset with accurate orthographic annotations, which makes it useful for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) applications.

Key Contributions and Claims

The dataset is a significant addition to the linguistic resources available for Teochew, covering a range of expressions from formal village-centric speech to casual colloquial dialogues. Alongside the corpus, the authors developed text processing tools for polyphonic character disambiguation, Mandarin-Teochew vocabulary mapping, and Grapheme-to-Phoneme (G2P) conversion. These are essential preprocessing steps for a tonal language like Teochew, where one character can have several readings and the written form does not spell out pronunciation. The authors also supplemented and refined the Teochew orthographic system to cover pronunciations that have no corresponding Chinese character.
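To make the role of G2P conversion and polyphonic disambiguation concrete, here is a minimal sketch of one common dictionary-based approach: greedy longest-match against a word lexicon, falling back to a per-character default reading. This is an illustration only, not the authors' tool. The function `g2p`, both lexicons, and the symbols in them (`A`, `B`, `a1`, `a2`, and so on) are hypothetical placeholders rather than real Teochew characters or pinyin.

```python
# Hypothetical lexicons: placeholder symbols, not real Teochew data.
# Word-level entries fix the reading of a polyphonic character in context.
WORD_LEXICON = {
    "AB": ["a2", "b1"],  # inside the word "AB", character "A" is read "a2"
}
# Character-level entries list candidate readings; the first is the default.
CHAR_LEXICON = {
    "A": ["a1", "a2"],   # polyphonic: two candidate readings
    "B": ["b1"],
    "C": ["c1"],
}


def g2p(text, word_lexicon, char_lexicon, max_word_len=4):
    """Convert text to a list of phonetic symbols.

    Tries the longest multi-character lexicon match first, so that
    context-dependent readings win over per-character defaults.
    """
    phonemes = []
    i = 0
    while i < len(text):
        # Try candidate words from longest to shortest (length >= 2).
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in word_lexicon:
                phonemes.extend(word_lexicon[chunk])
                i += length
                break
        else:
            # No word matched: fall back to the character's default reading.
            readings = char_lexicon.get(text[i])
            phonemes.append(readings[0] if readings else "<unk>")
            i += 1
    return phonemes


print(g2p("ABCA", WORD_LEXICON, CHAR_LEXICON))
```

In this toy input, the first `A` is resolved by the word entry `AB`, while the final `A` has no matching context and takes its default reading. A real system would likely need a much larger lexicon and possibly a statistical or neural disambiguator for cases the lexicon does not cover.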

Experimental Validation

To validate the corpus, the authors ran TTS and ASR experiments with both autoregressive (AR) and non-autoregressive (NAR) models. For speech synthesis, they highlight the superior performance of Tacotron2, reflected in its high Mean Opinion Score (MOS), and attribute it to the model's effective contextual representation. For ASR, models trained on Teochew-Wild reached competitive Character Error Rates (CER) and Word Error Rates (WER). Notably, fine-tuning the Whisper-medium model yielded promising results, which suggests that large pre-trained models adapt well to low-resource languages.
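For readers unfamiliar with the ASR metrics, the sketch below shows how CER and WER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over characters or whitespace-separated words respectively. It is a generic illustration under the assumption that words are already space-delimited. The summary does not describe the authors' evaluation scripts or their text normalization, and the helper names here are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference characters."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


print(cer("abcd", "abed"))    # one substitution over four characters
print(wer("a b c d", "a x c"))  # one substitution and one deletion over four words
```

Because Chinese-character text has no spaces between words, CER is usually the more natural metric for the orthographic transcript, while WER depends on how the text is segmented.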

Practical and Theoretical Implications

From a practical standpoint, the Teochew-Wild corpus supplies linguistic data that can support more robust and expressive speech synthesis for Teochew speakers, potentially aiding cultural preservation and revitalization efforts. The tools and processes developed alongside the corpus enable more precise linguistic analysis and help counter the misconception that Teochew and Hokkien are interchangeable.

Theoretically, this dataset opens avenues for further research into cross-dialectal speech transformations and synthesis, pushing current models to accommodate intricate tonal variations and polyphonic challenges intrinsic to the Teochew dialect. Additionally, the Teochew-Wild corpus serves as a benchmark for future datasets targeting low-resource languages, stressing the importance of diverse orthographic representations.

Future Directions in AI

The availability of the Teochew-Wild dataset exemplifies the necessity for targeted efforts in dataset construction for low-resource languages, particularly those with distinct dialectal variations. It sets a precedent for subsequent projects focusing on the linguistic intricacies of other underrepresented dialects. Advancements in AI, specifically in areas such as transfer learning and zero-shot learning, could leverage such datasets to enhance cross-linguistic model adaptability, thereby contributing to more universally applicable speech processing technologies.

In conclusion, the Teochew-Wild corpus is a meaningful addition to computational linguistic resources, underscoring both the practical and theoretical impact of dialect-specific datasets on the future of AI-driven speech processing.