- The paper introduces a cyclical annealing strategy for VAEs that mitigates KL vanishing by cycling the β parameter to reinforce latent code learning.
- It employs multiple cycles of increasing β from zero to one, allowing previous latent representations to serve as warm restarts in subsequent training phases.
- Experimental results on NLP tasks show improved perplexity, response diversity, and classification accuracy compared to monotonic annealing.
Cyclical Annealing Schedule: A Systematic Approach to Mitigating KL Vanishing in Variational Autoencoders
The paper proposes a novel cyclical annealing schedule to address the KL vanishing problem often encountered when training Variational Autoencoders (VAEs) for NLP. The authors analyze the underlying causes of KL vanishing and propose cycling the β parameter that governs the trade-off between the reconstruction and KL divergence terms in the VAE objective.
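The β-weighted objective referred to above can be sketched as follows. This is an illustrative NumPy implementation; the function name `beta_vae_loss` and the choice of a diagonal-Gaussian posterior with a standard-normal prior are assumptions for the sketch, not notation from the paper:

```python
import numpy as np

def beta_vae_loss(recon_nll, mu, logvar, beta):
    """Negative beta-ELBO: reconstruction NLL + beta * KL(q(z|x) || N(0, I)).

    mu, logvar: parameters of a diagonal-Gaussian posterior q(z|x)
    (illustrative assumption). For a standard-normal prior the KL has the
    closed form 0.5 * sum(exp(logvar) + mu^2 - logvar - 1).
    """
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - logvar - 1.0)
    return recon_nll + beta * kl, kl
```

With β = 0 the objective reduces to a plain autoencoder reconstruction loss; with β = 1 it is the standard negative ELBO.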
Problem Background and Motivation
VAEs are powerful generative models that encode data into a latent space, facilitating tasks like language modeling and dialog generation. However, when VAEs employ auto-regressive decoders, they suffer from a prevalent failure mode known as KL vanishing: training collapses into a degenerate solution in which the KL term drops to zero and the decoder ignores the latent variable, forfeiting much of the model's expressiveness.
The traditional remedy involves monotonic KL annealing, gradually increasing β from zero to one. Although this method partially addresses KL vanishing, the theoretical justification is limited, and the problem persists under certain conditions.
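The monotonic schedule described here is typically a simple linear warm-up. A minimal sketch (the name `beta_monotonic` and the `warmup_steps` parameter are illustrative, not from the paper):

```python
def beta_monotonic(step, warmup_steps):
    """Monotonic annealing: beta rises linearly from 0 to 1 over the first
    `warmup_steps` training steps, then stays at 1 for the rest of training."""
    return min(1.0, step / warmup_steps)
```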
Cyclical Annealing Schedule
The authors propose a cyclical annealing schedule for β, which consists of multiple cycles of linearly increasing β from zero to one, followed by a period where β remains at one. This process repeats over several cycles during training. The cyclical nature allows the VAE to leverage the latent codes progressively, utilizing informative representations learned in previous cycles as warm restarts in subsequent cycles.
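The cycle structure above can be sketched as a schedule function. The parameters (`n_cycles` for the number of cycles, `ratio` for the fraction of each cycle spent increasing β) follow the paper's description of repeated linear ramps, but the exact implementation details here are illustrative:

```python
import math

def beta_cyclical(step, total_steps, n_cycles=4, ratio=0.5):
    """Cyclical annealing: training is split into `n_cycles` equal cycles.
    Within each cycle, beta ramps linearly from 0 to 1 over the first
    `ratio` fraction of the cycle, then is held at 1 until the cycle ends."""
    cycle_len = total_steps / n_cycles
    # Position within the current cycle, as a fraction in [0, 1).
    tau = (step % math.ceil(cycle_len)) / cycle_len
    return min(1.0, tau / ratio)
```

Resetting β to 0 at the start of each cycle gives the decoder a fresh chance to exploit the latent codes learned under full regularization in earlier cycles.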
This method is based on the observation that KL vanishing is often caused by the lack of meaningful latent codes at the early stages of optimization. By periodically resetting β to zero, the cyclical schedule allows the VAE to relearn and reinforce the importance of the latent space during each cycle.
Experimental Results
Empirical results demonstrate that the cyclical schedule significantly mitigates KL vanishing across several NLP tasks: language modeling on the Penn Treebank dataset, dialog response generation with Conditional VAEs, and unsupervised language pre-training. The cyclical schedule not only improves the ELBO and reduces reconstruction error but also maintains a higher KL divergence, indicating more structured use of the latent space.
For language modeling, cyclical annealing improves perplexity and makes training more robust. In dialog generation, it yields more diverse responses, outperforming the monotonic schedule on both lexical-overlap and embedding-based measures. In unsupervised pre-training, models trained with cyclical annealing achieve higher classification accuracy when fine-tuned on labeled data, demonstrating the quality of the learned latent representations.
Theoretical Implications and Future Directions
The cyclical schedule offers a principled remedy for KL vanishing by systematically alternating between phases of weak and strong regularization on the latent space. It also sheds light on how the KL term balances mutual information against marginal KL during VAE training. More broadly, cyclical annealing could extend to other domains and models where balancing reconstruction fidelity against latent space utilization is critical.
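The mutual-information view mentioned above can be made concrete with a standard decomposition from the VAE literature (due to Hoffman and Johnson): averaged over a dataset of $N$ points, the KL term of the ELBO splits into a mutual-information part and a marginal-KL part, where $q(\mathbf{z}) = \frac{1}{N}\sum_{i} q(\mathbf{z}\mid\mathbf{x}_i)$ denotes the aggregated posterior:

```latex
\frac{1}{N}\sum_{i=1}^{N} \mathrm{KL}\big(q(\mathbf{z}\mid\mathbf{x}_i)\,\|\,p(\mathbf{z})\big)
  = \underbrace{I_q(\mathbf{x};\mathbf{z})}_{\text{mutual information}}
  + \underbrace{\mathrm{KL}\big(q(\mathbf{z})\,\|\,p(\mathbf{z})\big)}_{\text{marginal KL}}
```

Driving the left-hand side to zero (KL vanishing) necessarily kills the mutual information between data and latent code, which is why a schedule that keeps the KL term alive matters.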
One potential avenue for future research could involve exploring adaptive algorithms that dynamically adjust β cycles based on the model's learning state or performance metrics, potentially improving both computational efficiency and convergence stability. Further studies could also investigate the impact of cyclical annealing in semi-supervised or multi-modal settings, where leveraging rich latent representations is crucial.
Overall, this cyclical annealing schedule offers a methodical approach to enhance VAE training, effectively addressing challenges in leveraging the latent space and opening new directions for improving generative modeling in NLP and beyond.