
Parallel Trust-Region Approaches in Neural Network Training: Beyond Traditional Methods

Published 21 Dec 2023 in math.NA, cs.LG, and cs.NA (arXiv:2312.13677v1)

Abstract: We propose to train neural networks (NNs) using a novel variant of the "Additively Preconditioned Trust-region Strategy" (APTS). The proposed method is based on a parallelizable additive domain-decomposition approach applied to the neural network's parameters. Built upon the trust-region (TR) framework, the APTS method ensures global convergence towards a minimizer. Moreover, it eliminates the need for computationally expensive hyperparameter tuning, as the TR algorithm automatically determines the step size in each iteration. We demonstrate the capabilities, strengths, and limitations of the proposed APTS training method through a series of numerical experiments, including a comparison with widely used training methods such as SGD, Adam, LBFGS, and the standard TR method.

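To make the abstract's two ingredients concrete, here is a minimal sketch of one APTS-style iteration: the parameter vector is split into disjoint blocks (subdomains), each block computes an independent, parallelizable local TR step, the block steps are recombined additively, and a standard TR acceptance ratio decides the step and adapts the radius. This is an illustration under simplifying assumptions, not the paper's implementation: the quadratic toy objective, the Cauchy-point local solver, and all names and constants (`apts_iteration`, `local_tr_step`, `eta`, the shrink/grow factors) are hypothetical.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Quadratic toy loss 0.5*||Xw - y||^2 and its gradient (stand-in for a NN loss)."""
    r = X @ w - y
    return 0.5 * r @ r, X.T @ r

def local_tr_step(g, radius):
    """Cauchy-like local solve: steepest descent clipped to the TR radius."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    return -min(radius, norm) * g / norm  # step never exceeds `radius` in length

def apts_iteration(w, X, y, n_subdomains=4, radius=1.0,
                   eta=0.1, shrink=0.5, grow=2.0):
    """One additive domain-decomposition TR update (illustrative constants)."""
    f, g = loss_and_grad(w, X, y)
    step = np.zeros_like(w)
    for block in np.array_split(np.arange(w.size), n_subdomains):
        # Restrict the gradient to this subdomain; the local solves are
        # independent of each other, hence parallelizable.
        g_loc = np.zeros_like(g)
        g_loc[block] = g[block]
        step += local_tr_step(g_loc, radius)  # additive recombination
    f_trial, _ = loss_and_grad(w + step, X, y)
    pred = -g @ step  # first-order predicted decrease of the model
    rho = (f - f_trial) / pred if pred > 0 else -np.inf
    if rho >= eta:  # accept the trial step and possibly grow the radius
        return w + step, min(grow * radius, 10.0)
    return w, shrink * radius  # reject the step and shrink the radius

# Tiny usage example on synthetic least-squares data.
rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(64, 16)), rng.normal(size=16)
y = X @ w_true
w, radius = np.zeros(16), 1.0
for _ in range(50):
    w, radius = apts_iteration(w, X, y)
print("final loss:", loss_and_grad(w, X, y)[0])
```

Note how the acceptance-ratio test replaces a hand-tuned learning rate: the radius grows when the model predicts the actual decrease well and shrinks otherwise, which is the mechanism behind the abstract's claim that the TR framework removes step-size tuning.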
