
Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

Published 26 Oct 2020 in cs.CL (arXiv:2010.13826v1)

Abstract: Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.

Citations (57)

Summary

  • The paper introduces a novel framework that integrates self-supervised pretraining for speech and language models to address limitations in traditional SLU systems.
  • It leverages both supervised and unsupervised training, using components like wav2vec and BERT to significantly improve word error rates and intent-slot accuracy on the ATIS dataset.
  • The study demonstrates robust performance in noisy environments, paving the way for more adaptable and resource-efficient spoken language understanding systems.

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

The paper by Cheng-I Lai et al. introduces a novel framework for semi-supervised Spoken Language Understanding (SLU) that leverages the advancements in self-supervised learning for both speech and language models. This approach addresses several limitations commonly found in existing SLU systems: the reliance on oracle text inputs, the focus on intent prediction to the exclusion of slot values, and the dependency on large in-house datasets for training.

Framework and Methodology

The proposed framework is structured around pretrained end-to-end (E2E) Automatic Speech Recognition (ASR) models and self-supervised language models, such as BERT. The framework incorporates two semi-supervised training paradigms for the ASR component:

  1. Supervised Pretraining: Pretraining the ASR model on transcribed speech to predict subword sequences.
  2. Unsupervised Pretraining: Adopting self-supervised speech representations such as wav2vec, which allow pretraining on untranscribed audio data.

The study evaluates SLU models on two critical criteria: robustness to environmental noise and end-to-end semantic evaluation. These aspects ensure that the model can effectively work in realistic settings where noise and limited labeled data are prevalent challenges.
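To make the E2E semantics evaluation criterion concrete, one common realization is a micro-averaged F1 score over (slot, value) pairs predicted directly from speech, compared against the gold semantic frame. The sketch below is illustrative only; the exact metric, slot names, and values are assumptions for the example, not the paper's implementation.

```python
def slot_f1(gold_pairs, pred_pairs):
    """Micro F1 over (slot, value) pairs -- an illustrative stand-in
    for end-to-end semantics evaluation (not the paper's exact metric)."""
    gold, pred = set(gold_pairs), set(pred_pairs)
    tp = len(gold & pred)                 # pairs recovered exactly
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ATIS-style frame: two of three gold pairs recovered, no spurious pairs.
gold = [("fromloc", "boston"), ("toloc", "denver"), ("intent", "flight")]
pred = [("fromloc", "boston"), ("intent", "flight")]
print(round(slot_f1(gold, pred), 2))  # 0.8
```

Because the pairs are matched on exact values, an ASR substitution such as "boston" to "bostun" counts as both a missed gold pair and a spurious prediction, which is what makes this an end-to-end measure rather than a text-only one.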

Experimental Results

The evaluation on the ATIS dataset demonstrates that the framework, using speech inputs, achieves semantic-understanding performance comparable to systems operating on oracle text inputs. Specifically, significant improvements in word error rate (WER) and intent classification/slot labeling accuracy are observed relative to previous SLU techniques. These improvements are achieved despite the presence of environmental noise, showcasing the robustness of the proposed method. Furthermore, the model trained with noise augmentation maintains high performance, indicating its resilience in practical use cases.
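For reference, WER is the word-level Levenshtein distance between the ASR hypothesis and the reference transcript, normalized by reference length. A minimal sketch (the example sentences are invented, not drawn from ATIS):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words (Levenshtein distance).
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution or match
                dp[i - 1][j] + 1,                           # deletion
                dp[i][j - 1] + 1,                           # insertion
            )
    return dp[len(r)][len(h)] / len(r)

# One substitution ("flights" -> "flight") and one deletion ("to")
# over a 7-word reference: WER = 2/7.
print(wer("show me flights from boston to denver",
          "show me flight from boston denver"))
```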

Implications and Future Directions

The proposed framework successfully integrates self-supervised pretraining into the SLU pipeline, highlighting the potential for reducing reliance on extensive labeled datasets. This presents a significant step towards more adaptable and accessible SLU systems, particularly for resource-limited languages where labeled data is scarce. The study also suggests exploring cross-lingual SLU frameworks and creating more comprehensive benchmarks that extend beyond controlled datasets like ATIS.

Future research could expand the framework's applications across various domains and explore its integration with multi-modal systems that combine audio with other data forms. Additionally, advancements in self-supervised learning could further enhance the SLU capabilities by improving the generalizability and transferability of pretrained models across different languages and tasks.

In conclusion, the paper presents a substantial contribution to SLU methodologies by addressing critical limitations and demonstrating the efficacy of integrating self-supervised pretraining in enhancing SLU performance under semi-supervised settings. This approach is pivotal in bridging the gap between ASR and NLU, fostering the development of more robust and versatile spoken language systems.
