
Understanding Knowledge Distillation in Non-autoregressive Machine Translation

Published 7 Nov 2019 in cs.CL (arXiv:1911.02727v3)

Abstract: Non-autoregressive machine translation (NAT) systems predict a sequence of output tokens in parallel, achieving substantial improvements in generation speed compared to autoregressive models. Existing NAT models usually rely on the technique of knowledge distillation, which creates the training data from a pretrained autoregressive model for better performance. Knowledge distillation is empirically useful, leading to large gains in accuracy for NAT models, but the reason for this success has, as of yet, been unclear. In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial to NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. We achieve the state-of-the-art performance for the NAT-based models, and close the gap with the autoregressive baseline on WMT14 En-De benchmark.

Citations (213)

Summary

  • The paper shows that knowledge distillation reduces the complexity of the training data, helping non-autoregressive students model the variation in target outputs.
  • It reveals that token-level distillation improves accuracy on individual word predictions, while sequence-level distillation better preserves consistency across the whole output sequence.
  • Experiments on synthetic and real datasets expose a strong correlation between an NAT model's capacity and the optimal complexity of its distilled training data.

An Academic Overview of "Understanding Knowledge Distillation in Non-autoregressive Machine Translation"

The academic paper titled "Understanding Knowledge Distillation in Non-autoregressive Machine Translation" provides an in-depth investigation into the application of Knowledge Distillation (KD) within the domain of non-autoregressive models for machine translation (MT). This research aims to elucidate the role of KD in enhancing the performance of non-autoregressive MT models by simplifying the data distribution the models need to learn.

Core Contributions

The paper examines how KD bridges the performance gap between autoregressive and non-autoregressive translation models. Non-autoregressive models are competitive in inference speed but typically lag behind their autoregressive counterparts in translation quality. Rather than treating distillation as a black-box trick, the authors systematically analyze the now-standard practice of distilling knowledge from an autoregressive teacher model into a non-autoregressive student, and propose techniques for adjusting the complexity of the distilled data to match the student's capacity.
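
In practice, sequence-level KD for NAT amounts to re-labeling the training corpus with the teacher's decoded outputs. The sketch below illustrates this with a hypothetical toy teacher (a deterministic lookup table standing in for a real autoregressive Transformer), not the paper's actual pipeline:

```python
# Sequence-level knowledge distillation, sketched on a toy corpus.
# The "teacher" is a stand-in: it always emits its single most
# probable translation, whereas human references vary among synonyms.

def teacher_translate(source):
    # Hypothetical deterministic teacher (top-1 decode per word).
    table = {"hallo": "hello", "welt": "world", "danke": "thanks"}
    return [table[w] for w in source]

def distill_corpus(corpus):
    # Replace each (source, reference) pair by (source, teacher output).
    return [(src, teacher_translate(src)) for src, _ in corpus]

corpus = [
    (["hallo", "welt"], ["hi", "world"]),       # reference uses "hi"
    (["hallo", "welt"], ["hello", "planet"]),   # reference uses "planet"
    (["danke"], ["thank", "you"]),
]

distilled = distill_corpus(corpus)
for src, tgt in distilled:
    print(src, "->", tgt)
```

Note that after distillation both occurrences of `["hallo", "welt"]` map to the same target: the targets become a (near-)deterministic function of the source, removing the multimodality that parallel NAT decoding struggles with.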

Theoretical Foundations

Grounded in Bayesian decision theory, the paper formulates translation as a structured prediction task and distinguishes two loss functions for sequence prediction: a sequence-level loss, which scores entire output sequences, and a token-level loss, which scores each position independently. This distinction underpins the analysis of how KD affects model performance: distilling against the teacher's outputs reduces the conditional complexity of the targets, and the gains appear chiefly under the loss that matches the granularity of the distillation.
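
The two granularities can be made concrete numerically. The sketch below uses a made-up teacher joint distribution `q` over length-2 sequences and a position-factorized student `p_pos` (as NAT models are); token-level distillation matches the teacher's per-position marginals, while the mode approximation of sequence-level distillation trains on the teacher's single most probable sequence:

```python
import numpy as np

# Toy setting: sequences of length 2 over a vocabulary of 2 tokens.
# q[y1, y2] is the teacher's joint distribution over whole sequences;
# the student factorizes over positions, as NAT models do.

q = np.array([[0.45, 0.05],   # teacher mass concentrated on (0,0) and (1,1)
              [0.05, 0.45]])

p_pos = np.array([[0.6, 0.4],  # student's per-position distributions
                  [0.6, 0.4]])

# Token-level distillation: cross-entropy against per-position marginals.
q_marg1 = q.sum(axis=1)            # teacher marginal over first token
q_marg2 = q.sum(axis=0)            # teacher marginal over second token
token_loss = -(q_marg1 @ np.log(p_pos[0]) + q_marg2 @ np.log(p_pos[1]))

# Sequence-level distillation (mode approximation used in practice):
# train only on the teacher's single most probable sequence.
y_hat = np.unravel_index(q.argmax(), q.shape)
seq_loss = -(np.log(p_pos[0][y_hat[0]]) + np.log(p_pos[1][y_hat[1]]))

print("token-level loss:", token_loss)
print("sequence-level loss:", seq_loss)
```

The joint `q` here is deliberately multimodal (mass split between two incompatible sequences) while its per-position marginals are uniform, which is exactly the regime where a factorized student benefits from training on the teacher's mode rather than on the full distribution.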

Experimental Findings

To substantiate these claims, the authors run controlled experiments using a Hidden Markov Model (HMM) to generate synthetic datasets. Comparing students trained on sequence-level versus token-level distilled labels, they find that:

  • Models trained on token-level distilled data achieve the best token-level accuracy, underscoring their suitability for per-token prediction.
  • Conversely, sequence-level distilled data yields the best sequence-level accuracy, suggesting that consistency across the whole output is what this form of distillation preserves.

These results indicate that KD can improve different evaluation metrics by reducing the complexity and uncertainty of the target distributions the student must model.
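
The central finding — distillation reduces the conditional complexity of the target data — can be illustrated on synthetic data. The sketch below uses a made-up source-conditioned target distribution rather than the paper's actual HMM setup, and measures complexity as the empirical conditional entropy H(y|x) in bits:

```python
import random
from collections import Counter, defaultdict
from math import log2

random.seed(0)

# Synthetic "corpus": each source word has several valid translations,
# sampled with these made-up probabilities -- a multimodal target side.
modes = {
    "katze": [("cat", 0.5), ("kitty", 0.3), ("feline", 0.2)],
    "hund":  [("dog", 0.7), ("hound", 0.3)],
}

def sample_target(src):
    words, probs = zip(*modes[src])
    return random.choices(words, weights=probs)[0]

real = [(s, sample_target(s)) for s in ["katze", "hund"] * 5000]

# "Distilled" data: a deterministic teacher always emits its mode.
distilled = [(s, max(modes[s], key=lambda wp: wp[1])[0]) for s, _ in real]

def cond_entropy(pairs):
    # Empirical H(y | x) in bits over (source, target) pairs.
    by_src = defaultdict(Counter)
    for s, t in pairs:
        by_src[s][t] += 1
    h, n = 0.0, len(pairs)
    for s, cnt in by_src.items():
        total = sum(cnt.values())
        h_s = -sum(c / total * log2(c / total) for c in cnt.values())
        h += total / n * h_s
    return h

print("H(y|x) real     :", round(cond_entropy(real), 3))
print("H(y|x) distilled:", round(cond_entropy(distilled), 3))
```

The distilled corpus has conditional entropy exactly zero here because the toy teacher is deterministic; real distilled corpora retain some entropy, and the paper's observation is that the right amount of residual complexity depends on the capacity of the NAT student.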

Practical Implications

The integration of KD into non-autoregressive translation models significantly improves their performance, making them viable alternatives in scenarios that favor fast inference without a severe compromise on translation quality. This advance brings non-autoregressive models closer to deployment in real-time translation applications that demand low latency.

Theoretical Implications and Future Directions

The analysis reinforces the premise that, through KD, non-autoregressive models can effectively approximate complicated, high-capacity autoregressive models. This opens several new avenues for research, including optimizing KD techniques for other structured prediction tasks and exploring KD's potential beyond translation, for example in summarization and open-ended text generation.

Future research could explore enhancing the distillation techniques further, optimizing the trade-off between speed and accuracy, and extending the proposed methodology to wider classes of neural architectures.

In summary, the paper provides a comprehensive exploration into how KD can be adeptly used within MT frameworks. It lays down a critical foundation for subsequent research to expand upon, fostering the development of efficient, high-performing non-autoregressive translation systems.
