THCHS-30 : A Free Chinese Speech Corpus

Published 7 Dec 2015 in cs.CL and cs.SD | (1512.01882v2)

Abstract: Speech data is crucially important for speech recognition research. Quite a few speech databases can be purchased at prices that are reasonable for most research institutes. However, for young people who are just starting research activities or who have just gained an initial interest in this direction, the cost of data is still an annoying barrier. We support the 'free data' movement in speech recognition: research institutes (particularly those supported by public funds) publish their data freely so that new researchers can obtain sufficient data to kick off their careers. In this paper, we follow this trend and release a free Chinese speech database, THCHS-30, that can be used to build a full-fledged Chinese speech recognition system. We report the baseline system established with this database, including its performance under highly noisy conditions.

Summary

The paper introduces THCHS-30, a free Chinese speech corpus with over 30 hours of data recorded from 50 speakers to support ASR research.
The authors detail a baseline HMM-DNN system using Kaldi that achieves a CER of 30.11% and a PER of 14.81%, with a denoising auto-encoder boosting performance in noisy conditions.
The release of THCHS-30 democratizes ASR research by providing comprehensive resources, enhancing reproducibility, and advancing noise-robust modeling techniques.

Overview of "THCHS-30: A Free Chinese Speech Corpus"

The paper "THCHS-30: A Free Chinese Speech Corpus" by Dong Wang and Xuewei Zhang addresses a significant barrier in the field of Automatic Speech Recognition (ASR): access to large, high-quality speech datasets. The authors contribute to the 'free data' movement by releasing THCHS-30, a Chinese speech database designed to facilitate ASR research and innovation, particularly for resource-limited researchers and institutions.

Contribution Highlights

THCHS-30 is one of the first free-to-access Chinese corpora intended to support the construction of complete Chinese ASR systems. It comprises over 30 hours of speech data recorded from 50 participants. Accompanied by supporting resources such as lexicons, language models (LMs), and training recipes, THCHS-30 offers a complete toolkit for building a large vocabulary continuous speech recognition system. The corpus emphasizes diversity in phone coverage, which makes it useful for developing and testing robust speech recognition models.

Baseline System and Results

The authors present a baseline ASR system developed using the THCHS-30 corpus with the Kaldi toolkit, employing a hidden Markov model-deep neural network (HMM-DNN) architecture. The initial results indicate a character error rate (CER) of 30.11% and a phone error rate (PER) of 14.81% on clean test data. However, performance significantly deteriorated under noisy conditions, a common challenge in ASR systems. Importantly, the application of a denoising auto-encoder (DAE) improved recognition accuracy under noise, showcasing a practical solution to the noise-related challenges in ASR.
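Both CER and PER are edit-distance metrics: the number of substitutions, deletions, and insertions needed to turn the recognizer output into the reference, divided by the reference length. The following is a minimal sketch of that computation, not the scoring script used in the paper; the function names `edit_distance` and `error_rate` and the example sentence are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit errors divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# CER: tokens are characters, so a Chinese string can be used directly.
# Hypothetical example: one substituted character and one deleted character.
cer = error_rate(["今天天气很好"], ["今天天汽很"])

# PER: tokens are phones, passed as lists (illustrative phone labels).
per = error_rate([["j", "in", "t", "ian"]], [["j", "in", "d", "ian"]])

print(f"CER = {cer:.2%}, PER = {per:.2%}")
```

Because the denominator is the reference length, the rate can exceed 100% when the hypothesis contains many insertions.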

Implications and Future Directions

This paper's release of THCHS-30 offers a pivotal resource for stimulating Chinese speech recognition research and lowers the entry barrier for researchers lacking substantial funding. The availability of a free corpus with comprehensive resources is expected to enhance reproducibility and comparability across research outputs, facilitating more objective evaluations of different models and techniques.

Furthermore, the use of a DAE reflects an emerging direction in noise-robust ASR that merits further exploration. The promising results suggest that continued advances in noise-reduction methods could substantially improve the practical deployment of ASR systems across diverse environments.
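To make the DAE idea concrete: a denoising auto-encoder is trained to map noise-corrupted feature vectors back to their clean counterparts, and the trained network is then applied to noisy input before recognition. The sketch below is a generic one-hidden-layer version in NumPy trained on synthetic data. It is not the architecture or configuration used in the paper, which this summary does not specify; the layer size, learning rate, noise level, and the names `train_dae` and `denoise` are all assumptions made for illustration.

```python
import numpy as np


def train_dae(noisy, clean, hidden=8, lr=0.1, epochs=300, seed=0):
    """Fit a one-hidden-layer DAE with full-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    n_dims = clean.shape[1]
    w1 = rng.normal(0.0, 0.1, (n_dims, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, n_dims))
    b2 = np.zeros(n_dims)
    losses = []
    for _ in range(epochs):
        h = np.tanh(noisy @ w1 + b1)   # encoder
        out = h @ w2 + b2              # linear decoder
        err = out - clean              # target is the CLEAN frame
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_out = 2.0 * err / err.size
        g_h = (g_out @ w2.T) * (1.0 - h ** 2)
        w2 -= lr * (h.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        w1 -= lr * (noisy.T @ g_h)
        b1 -= lr * g_h.sum(axis=0)
    return (w1, b1, w2, b2), losses


def denoise(params, noisy):
    """Apply a trained DAE to noisy feature frames."""
    w1, b1, w2, b2 = params
    return np.tanh(noisy @ w1 + b1) @ w2 + b2


# Synthetic stand-in for acoustic features: 500 frames, 8 dimensions,
# generated from 3 latent factors, then corrupted with additive noise.
rng = np.random.default_rng(1)
clean = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 8)) / np.sqrt(3)
noisy = clean + 0.5 * rng.normal(size=clean.shape)

params, losses = train_dae(noisy, clean)
print(f"training MSE: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

In a real system the inputs would be acoustic feature frames, typically with neighboring frames spliced in as context, and the denoised output would feed the acoustic model in place of the raw noisy features.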

Speculation on Future Developments

Looking forward, resources like THCHS-30 may lead to innovations that incorporate novel machine learning techniques, such as more advanced neural architectures or unsupervised learning paradigms, to enhance model performance. Additionally, there could be a push towards real-time data collection and annotation methods that reduce the cost of creating large-scale datasets. Building on DAEs, future work may integrate more sophisticated noise-specific modeling or explore adversarial approaches for robustness across varied real-world conditions.

Overall, THCHS-30 advances the democratization of resources crucial to the field of ASR and stands as a testament to a shared commitment to collaborative progress in speech technology research.
