- The paper introduces a novel self-supervised contrastive model that segments phonemes from raw audio without relying on annotated data.
- The methodology leverages convolutional neural networks and peak detection to achieve F1-scores of 83.71 on TIMIT and 76.31 on Buckeye datasets.
- Experimental results reveal that incorporating additional unlabeled data improves cross-lingual performance, suggesting the model’s adaptability across languages.
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
The paper presents a novel approach to unsupervised phoneme boundary detection based on self-supervised contrastive learning. The proposed model is a convolutional neural network that operates directly on raw audio waveforms. Trained with a Noise-Contrastive Estimation (NCE) objective, it learns to identify spectral changes within the audio signal and thereby detect phoneme boundaries. This approach bypasses the need for manually annotated data, such as phonetic transcriptions or target boundaries, enabling a fully unsupervised training process.
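The waveform encoder described above can be sketched as a stack of strided 1-D convolutions. The layer widths, kernel sizes, and strides below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class WaveEncoder(nn.Module):
    """Strided 1-D convolutions over raw audio; hyperparameters here are
    illustrative, not the paper's exact configuration."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=4, stride=2), nn.ReLU(),
        )

    def forward(self, wav):              # wav: (batch, samples)
        z = self.net(wav.unsqueeze(1))   # (batch, dim, frames)
        return z.transpose(1, 2)         # (batch, frames, dim)

enc = WaveEncoder()
frames = enc(torch.randn(2, 16000))     # one second of 16 kHz audio
```

Each output frame summarizes a short window of the waveform, so phoneme boundaries can later be read off as points where consecutive frame embeddings change abruptly.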
Methodology
The architecture employs a series of convolution operations designed to capture the spectral features of speech. Unlike many previous phoneme-segmentation models, it omits a context network, as empirical tests showed that including one degraded performance. The training objective is to distinguish pairs of adjacent audio frames from randomly chosen distractor frames, a strategy drawn from successful self-supervised methods in other domains such as natural language processing and computer vision.
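The adjacent-frame-versus-distractor objective can be sketched as a softmax over similarities, with the true successor as the target. The cosine similarity and uniform distractor sampling below are assumptions about the details, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def nce_loss(frames, n_distractors=5):
    """Score each frame against its immediate successor (positive) and
    randomly sampled frames (distractors). Similarity measure and sampling
    scheme are illustrative assumptions."""
    z = F.normalize(frames, dim=-1)                    # (T, dim), cosine space
    anchors, positives = z[:-1], z[1:]                 # adjacent frame pairs
    pos = (anchors * positives).sum(-1, keepdim=True)  # (T-1, 1)
    idx = torch.randint(len(z), (len(anchors), n_distractors))
    neg = torch.einsum('td,tkd->tk', anchors, z[idx])  # (T-1, K)
    logits = torch.cat([pos, neg], dim=1)
    # the positive pair sits in column 0 of every row
    return F.cross_entropy(logits, torch.zeros(len(anchors), dtype=torch.long))

loss = nce_loss(torch.randn(50, 16))
```

Minimizing this loss pushes embeddings of adjacent frames together, so within a phoneme successive frames stay similar while a boundary produces a sharp drop in similarity.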
At inference time, the model's outputs are passed through a peak-detection algorithm to produce the final phoneme boundary predictions. Performance was evaluated on the TIMIT and Buckeye datasets, where the model was benchmarked against existing unsupervised methods and achieved state-of-the-art results.
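The peak-detection step can be illustrated with a simple local-maximum rule over a frame-wise dissimilarity curve. This is a simplified stand-in for the paper's peak detector, and the prominence threshold is an assumed parameter:

```python
import numpy as np

def detect_boundaries(scores, prominence=0.1):
    """Pick sufficiently prominent local maxima of a frame-wise
    dissimilarity curve as boundary candidates (simplified stand-in
    for the paper's peak-detection step)."""
    peaks = []
    for t in range(1, len(scores) - 1):
        if scores[t] > scores[t - 1] and scores[t] >= scores[t + 1] \
                and scores[t] - min(scores[t - 1], scores[t + 1]) >= prominence:
            peaks.append(t)
    return peaks

scores = np.array([0.1, 0.2, 0.9, 0.3, 0.2, 0.8, 0.1])
print(detect_boundaries(scores))   # peaks at frames 2 and 5
```

The detected frame indices are then mapped back to time stamps using the encoder's effective stride.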
Experimental Results
The experiments underscore the efficacy of the self-supervised model. It outperformed existing unsupervised baselines on the TIMIT and Buckeye datasets, as measured by precision, recall, F1-score, and R-value. Notably, the proposed method achieved F1-scores of 83.71 on TIMIT and 76.31 on Buckeye, surpassing the benchmarked unsupervised models.
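The boundary metrics used here follow standard definitions: a predicted boundary counts as a hit if it falls within a small tolerance of a reference boundary, and the R-value combines hit rate and over-segmentation. A minimal sketch (the greedy one-to-one matching is an assumption about tie-breaking, not taken from the paper):

```python
def boundary_metrics(ref, hyp, tol=0.02):
    """Precision/recall/F1/R-value for boundary detection with a
    +/- tol-second tolerance (standard definitions; greedy matching
    is an illustrative assumption)."""
    hyp = list(hyp)
    hits = 0
    for b in ref:
        match = next((h for h in hyp if abs(h - b) <= tol), None)
        if match is not None:
            hits += 1
            hyp.remove(match)          # each prediction matches at most once
    precision = hits / (hits + len(hyp)) if (hits + len(hyp)) else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # R-value penalizes both misses and over-segmentation
    os = recall / precision - 1 if precision else 0.0
    r1 = ((1 - recall) ** 2 + os ** 2) ** 0.5
    r2 = (-os + recall - 1) / 2 ** 0.5
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return precision, recall, f1, r_value
```

For example, `boundary_metrics([0.10, 0.55], [0.11, 0.54])` returns perfect scores, since both predictions land within the 20 ms tolerance.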
An interesting facet of the research was the exploration of the impact of additional unlabeled training data. Incorporating supplementary data from the Librispeech corpus yielded only marginal gains when training and test data came from matched distributions, but a notable improvement when they came from different distributions or languages. This cross-lingual evaluation suggests that the learned representations are not language-specific, highlighting the model's potential to generalize across languages.
Implications and Future Work
The paper presents important contributions to the field of unsupervised speech processing. The reliance on self-supervised learning frameworks demonstrates the potential for phoneme segmentation without extensive annotation. Such approaches can drastically lower the barriers for phoneme segmentation tasks in low-resource languages or underrepresented domains.
Future directions proposed include the exploration of a semi-supervised setup, where a small amount of labeled data could potentially refine the model's performance further. Additionally, applying this unsupervised methodology to real-world speech recognition tasks or adapting the model to operate under variable acoustic conditions might enhance its usability in diverse research and application settings.
The work illustrates the possibilities that self-supervised learning frameworks offer for phonetic studies, presenting a robust paradigm for unsupervised segmentation that warrants further exploration. The documented success across multiple datasets and languages points toward further research in unsupervised ASR systems, potentially impacting a broad spectrum of speech technology applications.