
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Published 16 Dec 2015 in stat.ML | (1512.05287v5)

Abstract: Recurrent neural networks (RNNs) stand at the forefront of many recent developments in deep learning. Yet a major difficulty with these models is their tendency to overfit, with dropout shown to fail when applied to recurrent layers. Recent results at the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation of common deep learning techniques such as dropout. This grounding of dropout in approximate Bayesian inference suggests an extension of the theoretical results, offering insights into the use of dropout with RNN models. We apply this new variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks. The new approach outperforms existing techniques, and to the best of our knowledge improves on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.

Citations (1,606)

Summary

  • The paper introduces a Bayesian approach to dropout in RNNs by treating weights as random variables and employing variational inference.
  • It applies consistent dropout masks across time steps in LSTM and GRU models, reducing test perplexity in language modeling tasks.
  • Empirical results on Penn Treebank and film review datasets demonstrate significant performance improvements in mitigating overfitting.

Introduction

The paper "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" by Yarin Gal and Zoubin Ghahramani rigorously examines the application of dropout to recurrent neural networks (RNNs), specifically Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models. The authors address overfitting in RNNs, which remains a critical challenge despite the empirical success of dropout in feedforward networks. Contrary to earlier empirical wisdom that dropout should not be applied to recurrent connections, the paper uses a Bayesian perspective to justify dropout within RNNs, proposing a novel dropout variant that applies the same masks across time steps and also regularizes the recurrent connections.

Theoretical Framework and Methodology

The authors build upon recent advances that interpret common deep learning techniques through the lens of Bayesian inference. According to this perspective, dropout can be construed as a form of approximate variational inference. Consequently, the theoretical grounding extends RNNs into a probabilistic model framework, referred to as Variational RNNs.

The proposed method treats the weights in RNNs as random variables and approximates the posterior distribution over them with a specific variational distribution: a mixture of two Gaussians with small variances, one component's mean fixed at zero (which recovers dropout). Under this approximation, dropout is applied uniformly across recurrent layers, input layers, and output layers using the same dropout mask at each time step. The variational parameters are optimized by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior distribution over the weights.
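The mask-sharing idea can be illustrated with a minimal numpy sketch of a vanilla tanh RNN. The function name, toy dimensions, and dropout rate below are illustrative, not taken from the paper; the key point is that the input and hidden masks are sampled once per sequence and reused at every time step, rather than resampled each step as in naive dropout.

```python
import numpy as np

def variational_rnn_forward(x_seq, W_x, W_h, b, p_drop, rng):
    """Run a tanh RNN over a sequence, reusing one dropout mask per
    connection type for every time step (the variational dropout scheme)."""
    h = np.zeros(W_h.shape[0])
    keep = 1.0 - p_drop
    # Sample the masks ONCE per sequence, not once per time step.
    mask_x = rng.binomial(1, keep, size=x_seq.shape[1]) / keep
    mask_h = rng.binomial(1, keep, size=h.shape[0]) / keep
    outputs = []
    for x_t in x_seq:
        # The same units are dropped at every step of the sequence.
        h = np.tanh(W_x @ (x_t * mask_x) + W_h @ (h * mask_h) + b)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))        # 5 time steps, 3 input features
W_x = rng.normal(size=(4, 3)) * 0.1
W_h = rng.normal(size=(4, 4)) * 0.1
b = np.zeros(4)
out = variational_rnn_forward(x_seq, W_x, W_h, b, p_drop=0.25, rng=rng)
print(out.shape)  # (5, 4): one hidden state per time step
```

In an LSTM or GRU the same principle applies: each weight matrix (input, recurrent, and, in the paper, also the embedding) gets its own mask, fixed for the duration of the sequence.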

Empirical Validation

The implementation of this new dropout variant is evaluated on standard tasks in language modeling using the Penn Treebank dataset and on sentiment analysis tasks. Key results from their experiments include:

  • Language Modeling: The proposed dropout method achieved a test perplexity of 73.4 on the Penn Treebank dataset using LSTM models, a significant improvement over the 78.4 perplexity obtained with the dropout scheme of Zaremba et al. Averaging the predictions of several such LSTMs further reduced perplexity to 68.7.
  • Sentiment Analysis: On the Cornell film reviews corpus, the Variational LSTM and GRU models demonstrated superior performance in a setting with limited labeled data. Notably, the test error reduction for Variational LSTM indicates that the proposed dropout method effectively mitigates overfitting.
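Under the Bayesian reading, one natural form of model averaging is Monte Carlo dropout at test time: keep dropout active and average the predictive distributions of several stochastic forward passes. The sketch below shows the averaging step for a single softmax layer; the function name and toy sizes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_dropout_predict(x, W, p_drop, n_samples, rng):
    """Approximate the predictive distribution by averaging the outputs
    of several stochastic forward passes, each with a fresh dropout mask."""
    keep = 1.0 - p_drop
    probs = np.zeros(W.shape[0])
    for _ in range(n_samples):
        mask = rng.binomial(1, keep, size=x.shape) / keep
        probs += softmax(W @ (x * mask))
    return probs / n_samples

rng = np.random.default_rng(1)
x = rng.normal(size=6)
W = rng.normal(size=(3, 6))
p = mc_dropout_predict(x, W, p_drop=0.5, n_samples=1000, rng=rng)
print(p)  # averaged class probabilities; still sums to 1
```

Averaging probabilities (rather than logits) keeps the result a proper distribution, which is what perplexity evaluation requires.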

Observations and Implications

The paper underscores several critical observations:

  1. Uniform Mask Dropout Effectiveness: The consistent application of a single dropout mask across recurrent layers at every time step significantly contributes to model stability and regularization.
  2. Parameter Regularization: The impact of regularizing embeddings, in conjunction with recurrent layers, plays a vital role in reducing overfitting, as demonstrated in both language modeling and sentiment analysis results.
  3. Weight Decay Synergy: A notable finding is that weight decay remains important even when dropout is used, a point often neglected in conventional dropout practice.

Future Directions

This method paves the way for further advancements, particularly in the robust estimation of uncertainty in RNN models. Understanding uncertainty is imperative for tasks that require reliable decision-making under uncertain conditions, such as natural language understanding and control tasks in reinforcement learning.
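A simple proxy for such uncertainty estimates is the spread of predictions across stochastic dropout passes: the sample mean approximates the predictive mean and the sample standard deviation reflects model uncertainty. The sketch below uses a single linear layer for brevity; names and sizes are illustrative.

```python
import numpy as np

def mc_dropout_uncertainty(x, W, p_drop, n_samples, rng):
    """Estimate predictive mean and standard deviation from stochastic
    forward passes, a crude proxy for model uncertainty."""
    keep = 1.0 - p_drop
    preds = []
    for _ in range(n_samples):
        mask = rng.binomial(1, keep, size=x.shape) / keep
        preds.append(W @ (x * mask))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

rng = np.random.default_rng(2)
x = rng.normal(size=8)
W = rng.normal(size=(1, 8))
mean, std = mc_dropout_uncertainty(x, W, p_drop=0.5, n_samples=2000, rng=rng)
print(mean.shape, std.shape)  # (1,) (1,): one mean and spread per output
```

In a full Variational RNN the same procedure would be run over whole sequences, with the per-sequence masks of the method above.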

The Variational RNN framework also provides a promising direction for extending dropout applications to other complex sequence models, potentially improving robustness and generalization capabilities in a broader range of applications, including speech recognition and time series forecasting.

Conclusion

The theoretically motivated approach proposed in this paper not only provides a structured method to circumvent the limitations of dropout in RNNs but also contributes significantly to the practical applications of deep learning in sequence-based tasks. The combination of Bayesian principles with deep learning techniques demonstrates an effective pathway for enhancing model performance and generalizability. It establishes a foundation for future research endeavors to build upon, aiming to further leverage the potential of Bayesian inference in deep learning.

By evaluating and refining such theoretically sound methods, the paper extends the capabilities of RNNs, ensuring their applicability in more diverse and complex real-world scenarios.
