Parallel Trust-Region Approaches in Neural Network Training: Beyond Traditional Methods
Abstract: We propose to train neural networks (NNs) using a novel variant of the "Additively Preconditioned Trust-region Strategy" (APTS). The proposed method is based on a parallelizable additive domain decomposition approach applied to the neural network's parameters. Built upon the trust-region (TR) framework, the APTS method ensures global convergence towards a minimizer. Moreover, it eliminates the need for computationally expensive hyperparameter tuning, as the TR algorithm automatically determines the step size in each iteration. We demonstrate the capabilities, strengths, and limitations of the proposed APTS training method through a series of numerical experiments. The presented numerical study includes a comparison with widely used training methods such as SGD, Adam, LBFGS, and the standard TR method.
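To make the idea concrete, below is a minimal, self-contained sketch of one APTS-style iteration in PyTorch. It is an assumption-laden illustration, not the authors' implementation: the names (`apts_step`, `loss_fn`, `blocks`) are invented for this example, the model is a toy linear least-squares problem, and each subdomain's "local solve" is reduced to a single trust-region-clipped gradient step, whereas the actual method solves full local TR subproblems on a decomposed network.

```python
import torch

def loss_fn(w, X, y):
    # Toy least-squares objective standing in for a network loss (illustration only).
    return ((X @ w - y) ** 2).mean()

def apts_step(w, X, y, blocks, radius):
    """One APTS-style iteration (hypothetical sketch, not the paper's algorithm).

    Each parameter block ("subdomain") contributes an additive correction
    limited by the current trust-region radius; a standard TR ratio test
    then accepts or rejects the combined step and adapts the radius.
    """
    loss = loss_fn(w, X, y)
    (grad,) = torch.autograd.grad(loss, w)
    trial = w.detach().clone()
    for idx in blocks:  # in practice, the local solves run in parallel
        g = grad[idx]
        gn = float(g.norm())
        # Local "solve" reduced to one steepest-descent step clipped to the radius.
        trial[idx] -= g * min(1.0, radius / (gn + 1e-12))
    trial.requires_grad_(True)
    # TR acceptance test: actual vs. first-order predicted reduction.
    predicted = float(grad @ (w.detach() - trial.detach()))
    actual = float(loss) - float(loss_fn(trial, X, y))
    rho = actual / max(predicted, 1e-12)
    if rho > 0.75:
        return trial, 2.0 * radius   # good model agreement: accept, enlarge radius
    if rho > 0.1:
        return trial, radius         # acceptable step: accept, keep radius
    return w, 0.5 * radius           # poor step: reject, shrink radius

# Usage: ten weights split into two subdomains.
torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randn(64)
w = torch.zeros(10, requires_grad=True)
blocks = [torch.arange(0, 5), torch.arange(5, 10)]
radius = 0.1
for _ in range(50):
    w, radius = apts_step(w, X, y, blocks, radius)
print(f"final loss: {loss_fn(w, X, y).item():.4f}")
```

The design point the sketch preserves is the one the abstract highlights: no learning rate is tuned, because the step size is governed entirely by the TR radius update (the ratio test) rather than a hand-chosen hyperparameter.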