
Noisy Natural Gradient as Variational Inference

Published 6 Dec 2017 in cs.LG and stat.ML | arXiv:1712.02390v2

Abstract: Variational Bayesian neural nets combine the flexibility of deep learning with Bayesian uncertainty estimation. Unfortunately, there is a tradeoff between cheap but simple variational families (e.g., fully factorized) or expensive and complicated inference procedures. We show that natural gradient ascent with adaptive weight noise implicitly fits a variational posterior to maximize the evidence lower bound (ELBO). This insight allows us to train full-covariance, fully factorized, or matrix-variate Gaussian variational posteriors using noisy versions of natural gradient, Adam, and K-FAC, respectively, making it possible to scale up to modern-size ConvNets. On standard regression benchmarks, our noisy K-FAC algorithm makes better predictions and matches Hamiltonian Monte Carlo's predictive variances better than existing methods. Its improved uncertainty estimates lead to more efficient exploration in active learning, and intrinsic motivation for reinforcement learning.


Summary

  • The paper introduces a novel approach that uses adaptive noisy natural gradient methods to approximate variational posteriors in Bayesian Neural Networks.
  • It demonstrates enhanced performance on regression and classification tasks by employing optimizers like Noisy K-FAC, yielding improved RMSE, log-likelihood, and accuracy.
  • The study bridges natural gradient descent with variational inference, offering scalable uncertainty estimation with practical applications in active learning and reinforcement learning.


This paper explores an innovative approach to training Bayesian Neural Networks (BNNs) using Noisy Natural Gradient (NNG) methods, providing an efficient mechanism to integrate variational inference into deep learning tasks. It establishes a conceptual bridge between natural gradient methods and variational inference, presenting NNG as an effective optimization strategy for BNNs with Gaussian variational posteriors.

Overview

The central contribution is the insight that natural gradient ascent with adaptive weight noise implicitly fits a variational posterior that maximizes the evidence lower bound (ELBO). This sidesteps the prohibitive cost of maintaining a full-covariance posterior explicitly: instead, noisy versions of approximate natural-gradient optimizers (Adam, which uses a diagonal curvature approximation, and K-FAC, which uses a Kronecker-factored one) update the variational parameters.
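The objective being maximized is the standard ELBO; writing q_θ for the Gaussian variational posterior over weights w and p(w) for the prior (the symbols here are the conventional ones, not necessarily the paper's exact notation):

```latex
\mathcal{L}(\theta) = \mathbb{E}_{w \sim q_\theta}\!\left[\log p(\mathcal{D} \mid w)\right] - \mathrm{KL}\!\left(q_\theta(w) \,\|\, p(w)\right)
```

The paper's observation is that natural-gradient updates on the network weights, with noise whose scale adapts to the (damped) Fisher information, correspond to natural-gradient ascent on this objective over the parameters θ of q.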

Methodology and Results

The authors take existing adaptive optimizers, Adam and K-FAC, and derive noisy variants, which they term Noisy Adam and Noisy K-FAC. Each noisy optimizer corresponds to a particular posterior structure: Noisy Adam fits a fully factorized Gaussian posterior, while Noisy K-FAC fits a matrix-variate Gaussian posterior.
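To make the adaptive-weight-noise idea concrete, here is a minimal NumPy sketch on a toy linear regression. It follows the spirit of Noisy Adam (sample weights with per-coordinate variance inversely proportional to a damped diagonal Fisher estimate, then precondition the mean update by that same damped estimate rather than by sqrt of it, as in standard Adam); the hyperparameter names and values are illustrative, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: N examples, 3 weights to recover.
N = 100
X = rng.normal(size=(N, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=N)

mu = np.zeros(3)          # variational posterior mean
m = np.zeros(3)           # first-moment (gradient) estimate
f = np.ones(3)            # diagonal Fisher estimate (EMA of squared gradients)
beta1, beta2 = 0.9, 0.999
lr, kl_weight, damping = 0.05, 1.0, 1e-3

for t in range(2000):
    # Adaptive weight noise: posterior variance shrinks where curvature f is large.
    sigma2 = kl_weight / (N * (f + damping))
    w = mu + np.sqrt(sigma2) * rng.normal(size=3)

    # Gradient of the average squared-error loss at the sampled weights.
    g = X.T @ (X @ w - y) / N

    m = beta1 * m + (1 - beta1) * g
    f = beta2 * f + (1 - beta2) * g * g
    # Precondition by the damped Fisher (not its square root, unlike plain Adam).
    mu -= lr * m / (f + damping)

print(np.round(mu, 2))  # should land near true_w = [1.0, -2.0, 0.5]
```

The key difference from ordinary Adam is that the same damped second-moment estimate both scales the injected weight noise and preconditions the mean update, which is what ties the optimizer to a Gaussian variational posterior.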

  • Regression Benchmarks: The paper demonstrates superior predictive performance using the noisy K-FAC method on standard regression datasets by comparing test RMSE and log-likelihood against popular methods like Bayes By Backprop (BBB) and Hamiltonian Monte Carlo (HMC).
  • Classification Tasks: On the CIFAR-10 benchmark using a modified VGG16 network, the Noisy K-FAC method yielded higher accuracy compared to traditional SGD and K-FAC, maintaining competitive results even with data augmentation and batch normalization.
  • Uncertainty Estimation: Improved uncertainty estimates were evidenced by better alignment with HMC's predictive variances, corroborated by Pearson correlation analyses on further regression datasets.
  • Applications in Active Learning and Reinforcement Learning: The improved estimation capability allowed more efficient exploration in active learning setups and yielded better intrinsic motivation measures in reinforcement learning scenarios.
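The uncertainty-estimation comparison above is quantified with Pearson correlation between a method's per-example predictive variances and HMC's. A minimal illustration of that metric (the variance arrays here are synthetic placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Reference predictive variances (playing the role of HMC's "ground truth")
hmc_var = rng.uniform(0.1, 1.0, size=50)
# A well-aligned approximate method's variances: reference plus small noise
approx_var = hmc_var + 0.05 * rng.normal(size=50)

# Pearson correlation: near 1.0 means the method ranks/scales uncertainty like HMC.
r = np.corrcoef(hmc_var, approx_var)[0, 1]
print(round(r, 3))
```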

Implications and Future Work

This research has significant implications for scalable and efficient training of BNNs. By linking natural gradient for point estimation (NGPE) with natural gradient for variational inference (NGVI), the authors provide a robust framework in which adaptive weight noise achieves fast convergence while capturing complex weight correlations at tractable computational cost. The alignment of variational posteriors with natural gradient methodologies opens new paths for exploring adaptive dynamics in more complex model architectures and in tasks requiring refined uncertainty estimates.

As future work, the framework can be extended to other variational posterior families, to different neural architectures, and to larger-scale datasets than those tested. In particular, given the intrinsic damping property observed, further exploration of damping strategies could yield additional performance gains.

In summary, the paper delivers a comprehensive strategy to incorporate variational inference into neural network training efficiently. This innovation in handling Bayesian uncertainties not only advances theoretical understanding but also injects practical advantages into deep learning methodologies.
