
Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks

Published 29 Apr 2024 in stat.ML and cs.LG | (2404.19157v1)

Abstract: Large neural networks trained on large datasets have become the dominant paradigm in machine learning. These systems rely on maximum likelihood point estimates of their parameters, precluding them from expressing model uncertainty. This may result in overconfident predictions and it prevents the use of deep learning models for sequential decision making. This thesis develops scalable methods to equip neural networks with model uncertainty. In particular, we leverage the linearised Laplace approximation to equip pre-trained neural networks with the uncertainty estimates provided by their tangent linear models. This turns the problem of Bayesian inference in neural networks into one of Bayesian inference in conjugate Gaussian-linear models. Alas, the cost of this remains cubic in either the number of network parameters or in the number of observations times output dimensions. By assumption, neither are tractable. We address this intractability by using stochastic gradient descent (SGD) -- the workhorse algorithm of deep learning -- to perform posterior sampling in linear models and their convex duals: Gaussian processes. With this, we turn back to linearised neural networks, finding the linearised Laplace approximation to present a number of incompatibilities with modern deep learning practices -- namely, stochastic optimisation, early stopping and normalisation layers -- when used for hyperparameter learning. We resolve these and construct a sample-based EM algorithm for scalable hyperparameter learning with linearised neural networks. We apply the above methods to perform linearised neural network inference with ResNet-50 (25M parameters) trained on Imagenet (1.2M observations and 1000 output dimensions). Additionally, we apply our methods to estimate uncertainty for 3d tomographic reconstructions obtained with the deep image prior network.


Summary

  • The paper presents advancements in making Bayesian inference scalable for large deep learning models and datasets, addressing traditional limitations in model selection and uncertainty estimation.
  • Key methodological contributions include applying stochastic gradient descent (specifically introducing Stochastic Dual Descent) to Gaussian Processes and developing a scalable Linearised Laplace Approximation for deep neural networks.
  • The research demonstrates the practical application of these scalable methods for uncertainty quantification in large-scale tasks like image classification and sequential decision making such as Bayesian optimisation.


This thesis, entitled Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks, authored by Javier Antorán Cabiscol, addresses two significant challenges in modern machine learning: model selection and uncertainty estimation, especially in the context of deep learning models. The work systematically advances the application of Bayesian methodologies, traditionally hindered by scalability and complexity issues, particularly for large neural networks and datasets.

Overview of Contributions

  1. Stochastic Gradient Descent for Gaussian Processes:
  • The thesis demonstrates how stochastic gradient descent (SGD), widely successful in deep learning, can also be applied effectively to Gaussian processes (GPs). This approach circumvents the cubic computational cost of exact GP inference, making these powerful models applicable to large-scale datasets.
  • A novel optimisation approach named Stochastic Dual Descent (SDD) is introduced. SDD improves the performance of SGD on Gaussian process inference by optimising a better-conditioned dual objective.
    • The associated numerical results confirm that SDD outperforms traditional methods like conjugate gradients (CG) and sparse variational GP approaches in both wall-clock time and predictive performance.
  2. Scalable Linearised Laplace Approximation:
    • The Laplace approximation, initially introduced for neural networks by MacKay, is revisited and extended to scale with modern deep learning practices. Antorán identifies key incompatibilities of the classical Laplace approximation with state-of-the-art deep learning techniques, such as stochastic optimisation, early stopping, and normalisation layers, and resolves these through innovative computational strategies.
    • The thesis provides a systematic approach to hyperparameter selection using marginal likelihood maximisation, alleviating issues previously encountered due to non-convergence of network training or normalisation-induced scale indeterminacy.
    • The work proposes a sample-based Expectation-Maximisation algorithm to compute model evidence and posterior estimates efficiently, using methods developed in previous chapters.
  3. Application to Uncertainty Quantification and Sequential Decision Making:
    • The thesis does not merely present theoretical advancements: the methods are demonstrated on tasks requiring uncertainty estimation, such as image classification on datasets of a scale previously considered out of reach for Bayesian techniques.
    • For tasks such as Bayesian optimisation, the approach makes GPs competitive with deep learning methods on large problems, a setting where GPs have not traditionally excelled.
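To make the first contribution concrete, the following minimal sketch (function names and hyperparameters are this summary's own, not the thesis's) runs stochastic gradient descent on the dual objective of GP regression, L(alpha) = 0.5 * alpha^T (K + sigma^2 I) alpha - alpha^T y, whose minimiser alpha* = (K + sigma^2 I)^{-1} y yields the GP posterior mean. It deliberately omits the momentum, iterate averaging, and careful step-size choices that make the thesis's SDD method fast in practice.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel matrix between two sets of inputs.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sgd_dual_gp(X, y, noise=1.0, lr=0.05, batch=10, steps=5000, seed=0):
    """Sketch: SGD on the dual objective of GP regression,
    L(alpha) = 0.5 * alpha^T (K + noise*I) alpha - alpha^T y.
    The minimiser solves (K + noise*I) alpha = y, so the posterior
    mean at test inputs X* is rbf_kernel(X*, X) @ alpha."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = rbf_kernel(X, X)
    alpha = np.zeros(n)
    for _ in range(steps):
        idx = rng.choice(n, size=batch, replace=False)
        # Gradient of the dual objective on a random block of coordinates:
        # ((K + noise*I) alpha - y) restricted to idx.
        grad = K[idx] @ alpha + noise * alpha[idx] - y[idx]
        alpha[idx] -= lr * grad
    return alpha
```

Note that each step touches only a batch of kernel rows, so no O(n^3) factorisation (and no full n x n solve) is ever required; the thesis's dual formulation is what makes this iteration well conditioned.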

Theoretical and Practical Implications

The theoretical contributions advance how models express uncertainty and how they are selected in a data-driven manner. They revitalise classic Bayesian approaches by making them feasible at modern scale, addressing both overconfidence in model predictions and computational bottlenecks.

Practically, this opens new doors for deploying uncertainty-aware models in real-world scenarios where large-scale, intricate datasets are common, ranging from adaptive experimental design in scientific research to the robust deployment of AI systems in safety-critical applications.
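The linearised Laplace idea discussed above can be illustrated at toy scale: treat the network's Jacobian at the trained parameters as fixed features of a conjugate Gaussian-linear model and read predictive variances off the resulting Gaussian posterior. The sketch below is this summary's simplification, not the thesis's implementation: it uses a tiny numpy MLP, a finite-difference Jacobian in place of automatic differentiation, and a generalised Gauss-Newton-style posterior precision.

```python
import numpy as np

def mlp(x, params, h=8):
    # Tiny one-hidden-layer network on 1-d inputs; params is a flat vector
    # of length 3*h + 1 holding W1, b1, w2, b2.
    W1 = params[:h].reshape(h, 1)
    b1 = params[h:2 * h]
    w2 = params[2 * h:3 * h]
    b2 = params[3 * h]
    return np.tanh(x @ W1.T + b1) @ w2 + b2

def jacobian(x, params, eps=1e-5):
    # Central finite-difference Jacobian of network outputs w.r.t. parameters
    # (an autodiff stand-in for illustration only).
    J = np.zeros((len(x), len(params)))
    for i in range(len(params)):
        d = np.zeros_like(params)
        d[i] = eps
        J[:, i] = (mlp(x, params + d) - mlp(x, params - d)) / (2 * eps)
    return J

def laplace_predictive_variance(X_train, X_test, params, noise=0.1, prior=1.0):
    """Linearised Laplace sketch: the Jacobian rows act as fixed features of
    a conjugate Gaussian-linear model; return predictive variances."""
    J = jacobian(X_train, params)
    # Posterior precision of the tangent linear model
    # (Gauss-Newton curvature plus isotropic Gaussian prior).
    P = J.T @ J / noise**2 + np.eye(J.shape[1]) / prior
    Sigma = np.linalg.inv(P)
    Jt = jacobian(X_test, params)
    # Predictive variance at each test point: J(x) Sigma J(x)^T.
    return np.einsum("ij,jk,ik->i", Jt, Sigma, Jt)
```

The explicit inverse here is exactly the cubic-in-parameters cost the abstract calls intractable at scale; the thesis replaces it with the SGD-based posterior sampling and sample-based EM machinery summarised above.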

Speculation on Future Developments

Antorán's work hints at a convergence of deep learning and Bayesian inference methodologies. Future developments may include further fusion of these domains, such as integrating the developed scalable methods into more complex, hierarchical models, or exploring their synergies with emerging methodologies like probabilistic programming and reinforcement learning frameworks.

In summary, this thesis provides substantial contributions to advancing the applicability of Bayesian methods within the domain of deep learning. Its innovations in stochastic optimisation for GPs and linearisation techniques for deep networks stand to influence ongoing research and practical implementations profoundly.
