Entropy-MCMC: Sampling from Flat Basins with Ease

Published 9 Oct 2023 in cs.LG and stat.ML | arXiv:2310.05401v5

Abstract: Bayesian deep learning relies on the quality of posterior distribution estimation. However, the posterior of deep neural networks is highly multi-modal, with local modes exhibiting varying generalization performance. Given a practical sampling budget, targeting the original posterior can lead to suboptimal performance, as some samples may become trapped in "bad" modes and suffer from overfitting. Leveraging the observation that "good" modes with low generalization error often reside in flat basins of the energy landscape, we propose to bias sampling on the posterior toward these flat regions. Specifically, we introduce an auxiliary guiding variable, whose stationary distribution resembles a smoothed posterior free from sharp modes, to lead the MCMC sampler to flat basins. By integrating this guiding variable with the model parameter, we create a simple joint distribution that enables efficient sampling with minimal computational overhead. We prove the convergence of our method and further show that it converges faster than several existing flatness-aware methods in the strongly convex setting. Empirical results demonstrate that our method can successfully sample from flat basins of the posterior, and outperforms all compared baselines on multiple benchmarks including classification, calibration, and out-of-distribution detection.
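
The joint-distribution construction described in the abstract can be illustrated with a small toy sketch. The code below is not the authors' implementation: it runs plain Langevin dynamics on an assumed joint density p(theta, theta_a) proportional to exp(-U(theta) - ||theta - theta_a||^2 / (2*eta)), in which the auxiliary variable theta_a marginally follows a Gaussian-smoothed version of the posterior and pulls the chain toward broad basins. The 1-D energy `grad_U`, the coupling scale `eta`, the step size, and all names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's released code): Langevin sampling
# of a joint density p(theta, theta_a) ∝ exp(-U(theta) - ||theta - theta_a||^2 / (2*eta)).
# The coupling term lets the auxiliary variable theta_a act as a "guide" living on a
# smoothed landscape, biasing theta toward flat basins.

def grad_U(theta):
    """Gradient of a toy energy U(x) = 0.5*x^2 - 3*exp(-50*(x-2)^2):
    a broad (flat) basin at 0 plus a sharp, narrow dip near 2."""
    sharp = -3.0 * np.exp(-50.0 * (theta - 2.0) ** 2)
    d_sharp = sharp * (-100.0 * (theta - 2.0))   # chain rule on the Gaussian bump
    return theta + d_sharp

def entropy_mcmc_sketch(steps=5000, lr=1e-3, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array([2.0])    # model parameter, started inside the sharp mode
    theta_a = np.array([2.0])  # auxiliary guiding variable
    noise_scale = np.sqrt(2.0 * lr)
    samples = []
    for _ in range(steps):
        # Langevin step for theta: posterior gradient plus Gaussian coupling to theta_a.
        g_theta = grad_U(theta) + (theta - theta_a) / eta
        theta = theta - lr * g_theta + noise_scale * rng.standard_normal(theta.shape)
        # Langevin step for theta_a: only the coupling term (the smoothed landscape).
        g_a = (theta_a - theta) / eta
        theta_a = theta_a - lr * g_a + noise_scale * rng.standard_normal(theta_a.shape)
        samples.append(theta.copy())
    return np.array(samples)

if __name__ == "__main__":
    draws = entropy_mcmc_sketch()
    # Measure how much time the chain spends in the broad basin around 0.
    print("fraction of samples with |theta| < 1:", np.mean(np.abs(draws) < 1.0))
```

In the full method the same idea is applied to neural-network parameters with stochastic gradients; the sketch above only shows the structure of the coupled updates, in which both variables cost one gradient-sized step each.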
