Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Published 8 Nov 2023 in cs.CL, cs.SD, and eess.AS | (2311.04534v2)

Abstract: Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld

Summary

  • The paper introduces Smoothed Label Distillation (SLD) as an alternative to loss masking for more effectively capturing dependencies in speech token modeling.
  • It employs a KL divergence loss with smoothed labels to mitigate discretization noise and enhance autoregressive modeling in ASR.
  • Experiments on the LibriSpeech corpus reveal that SLD reduces word error rates, outperforming traditional methods.

Insights into Loss Masking Elimination in Decoder-Only Transformers for Discrete-Token-Based ASR

This paper examines how training objectives affect discrete-token-based automatic speech recognition (ASR) systems. The researchers critically assess the Loss Masking strategy and propose an alternative, Smoothed Label Distillation (SLD), to better capture dependencies among speech tokens.

Recent unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM share a single vocabulary and decoder-only Transformer architecture across speech and text, representing speech as discrete tokens. These models, however, rely on Loss Masking for the ASR task: positions holding input speech tokens are excluded from the training loss, so the dependencies among speech tokens are never modeled, potentially discarding information needed for robust ASR.
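The effect of Loss Masking can be illustrated with a toy next-token loss. This is a minimal pure-Python sketch, not the models' actual implementation; the function names are invented here for illustration:

```python
import math

def cross_entropy(log_probs, target):
    # Negative log-likelihood of the target token.
    return -log_probs[target]

def loss_masked_ce(log_probs_seq, targets, is_speech):
    # Loss Masking: average cross-entropy over text positions only.
    # Positions holding speech tokens are dropped from the loss, so the
    # model is never trained to predict the next speech token.
    per_pos = [cross_entropy(lp, t)
               for lp, t, speech in zip(log_probs_seq, targets, is_speech)
               if not speech]
    return sum(per_pos) / max(len(per_pos), 1)
```

Removing the mask (i.e., keeping all positions in the sum) recovers the naive multimodal cross-entropy baseline the authors compare against.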

Key Findings

  • Autoregressive Modeling of Speech Tokens: The authors investigate modeling speech tokens autoregressively, akin to text. Despite its theoretical appeal, naive application of cross-entropy loss on speech tokens fails to consistently outperform the Loss Masking strategy.
  • Introduction of Smoothed Label Distillation (SLD): The principal contribution of this study, SLD, involves incorporating a Kullback-Leibler (KL) divergence loss guided by smoothed labels to model speech tokens more effectively. This method counters the discretization noise disadvantages intrinsic to converting continuous speech signals into discrete tokens.
  • Numerical Validation: In experiments on the LibriSpeech corpus, models trained with SLD show a marked reduction in word error rate (WER) compared to those using conventional Loss Masking or naive multimodal cross-entropy loss, and the improvement holds across different speech discretization methods.
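The SLD objective applied at speech-token positions can be sketched as a KL divergence against a smoothed target distribution. This is a simplified scalar version assuming a uniform label-smoothing scheme; the paper's exact target distribution may be constructed differently:

```python
import math

def sld_loss(log_probs, target, vocab_size, eps=0.1):
    # KL(q || p) between a smoothed label distribution q and the model's
    # predicted distribution p (given as log-probabilities). Here q places
    # 1 - eps on the ground-truth speech token and spreads eps uniformly
    # over the remaining tokens -- one plausible smoothing scheme, used
    # for illustration only.
    kl = 0.0
    for v in range(vocab_size):
        q = (1.0 - eps) if v == target else eps / (vocab_size - 1)
        kl += q * (math.log(q) - log_probs[v])
    return kl
```

Unlike a one-hot cross-entropy target, the smoothed distribution never asks the model to assign probability 1 to a single speech token, which is the mechanism by which SLD tolerates discretization noise.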

Implications

This research underscores the potential for more precise training objectives in discrete-token-based ASR systems. By minimizing overconfidence typically introduced by pure cross-entropy methods, SLD enhances the model's generalization capabilities and performance robustness.

More broadly, shifting from conventional methods to SLD opens a path to optimizing similar decoder-only Transformer models across diverse speech processing tasks, potentially advancing applications such as speech-to-text translation and text-to-speech synthesis.

Future Directions

The insights presented in this paper suggest several avenues for future work. Further investigation could examine how different speech representation learning methods behave when used as discrete tokenizers. Moreover, cross-dataset analysis would clarify SLD's effectiveness across diverse linguistic settings and noise conditions.

The introduction of SLD marks a step toward refining model training in ASR, offering an appealing option for researchers and practitioners seeking more nuanced and efficient speech-token modeling strategies. As the field evolves, these methods may be adapted and extended to achieve greater performance in multi-modal learning environments.
