Enhancing Transfer Learning with Flexible Nonparametric Posterior Sampling
Abstract: Transfer learning has recently shown significant performance gains across a variety of tasks involving deep neural networks. In these transfer learning scenarios, the prior distribution over downstream model parameters becomes crucial for Bayesian model averaging (BMA). While previous works have proposed priors over neural network parameters centered around the pre-trained solution, such strategies have limitations when there is a distribution shift between the upstream and downstream data. This paper introduces nonparametric transfer learning (NPTL), a flexible posterior sampling method that addresses the distribution shift issue within the context of nonparametric learning. Nonparametric learning (NPL) is a recent approach that employs a nonparametric prior for posterior sampling and efficiently accounts for model misspecification, making it well suited to transfer learning settings in which the upstream and downstream tasks may differ in distribution. Through extensive empirical validation, we demonstrate that our approach surpasses other baselines in BMA performance.
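To make the recipe concrete, below is a minimal sketch of NPL-style posterior sampling followed by BMA. It is not the paper's implementation: a tiny logistic regression stands in for the fine-tuned network, plain gradient descent stands in for the fine-tuning optimizer, and `alpha` (the Dirichlet concentration), `fit_weighted`, and `theta_pretrained` are illustrative names introduced here. Only the overall structure follows the method described above: each posterior sample minimizes a Dirichlet-weighted bootstrap loss starting from the pre-trained solution, and predictions are averaged across samples.

```python
# Hypothetical sketch of NPL posterior sampling + Bayesian model averaging.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted(X, y, w, theta0, lr=0.1, steps=500):
    """Minimize the Dirichlet-weighted log loss from the pre-trained
    parameters theta0; the result is one NPL posterior sample."""
    theta = theta0.copy()
    for _ in range(steps):
        p = sigmoid(X @ theta)
        theta -= lr * (X.T @ (w * (p - y)))  # grad of sum_i w_i * logloss_i
    return theta

# Toy downstream data and a placeholder "pre-trained" solution.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (sigmoid(X @ rng.normal(size=d)) > rng.uniform(size=n)).astype(float)
theta_pretrained = np.zeros(d)  # stands in for the upstream solution

# NPL posterior sampling: each sample re-weights the downstream data with
# Dirichlet(alpha, ..., alpha) weights (scaled to mean 1) and re-fits
# starting from the pre-trained parameters.
J, alpha = 10, 1.0
posterior = [
    fit_weighted(X, y, n * rng.dirichlet(np.full(n, alpha)), theta_pretrained)
    for _ in range(J)
]

# BMA: average the predictive probabilities over the posterior samples.
X_test = rng.normal(size=(20, d))
bma_pred = np.mean([sigmoid(X_test @ th) for th in posterior], axis=0)
print(bma_pred[:5])
```

Because each bootstrap fit is independent, the J posterior samples can be drawn in parallel; averaging predictive probabilities (rather than parameters) is what makes this Bayesian model averaging.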