Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives
Abstract: When training large models such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible due to their computational cost. Consequently, second-order optimization methods commonly bypass the computation of the Hessian by using only first-order information, such as gradients with respect to the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives onto well-chosen subspaces that are relevant for optimization. Namely, for a given partition of the set of parameters, it is possible to compute tensors that can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets in the partition remains small. We then propose an optimization method exploiting these tensors at orders 2 and 3, with several interesting properties: it outputs a learning rate per subset of parameters, which can be used for hyperparameter tuning; it takes into account long-range interactions between the layers of the trained neural network, which is usually not the case in similar methods (e.g., K-FAC); and the optimization trajectory is invariant under affine layer-wise reparameterization. Code available at https://github.com/p-wol/GroupedNewton/ .
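To make the core idea concrete, here is a minimal sketch of a "reduced" Newton step on a toy quadratic loss: the parameters are split into subsets, the gradient and Hessian are projected onto one descent direction per subset, and solving the small reduced Newton system yields one learning rate per subset. All names (`subsets`, `H_red`, `lr`, the matrix `A`) and the 2-subset quadratic setup are illustrative assumptions, not the paper's actual implementation; in practice the reduced quantities would be obtained via automatic differentiation rather than an explicit Hessian.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta on 4 parameters,
# split into S = 2 subsets ("layers"): {0, 1} and {2, 3}.
# A is symmetric, strictly diagonally dominant, hence positive definite.
A = np.array([[4.0, 1.0, 0.5, 0.0],
              [1.0, 3.0, 0.0, 0.5],
              [0.5, 0.0, 2.0, 1.0],
              [0.0, 0.5, 1.0, 5.0]])
loss = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

theta = np.array([1.0, -2.0, 0.5, 1.5])
subsets = [np.array([0, 1]), np.array([2, 3])]  # hypothetical partition

# Descent direction u = gradient; column s of U is u masked to subset s,
# so phi(t) = L(theta - U @ t) is the loss as a function of one step
# size t_s per subset.
g = grad(theta)
U = np.zeros((4, len(subsets)))
for s, idx in enumerate(subsets):
    U[idx, s] = g[idx]

# Reduced gradient and Hessian of phi at t = 0 (analytic here; in
# general obtainable with S Hessian-vector products). Off-diagonal
# entries of H_red capture cross-subset (long-range) interactions.
g_red = U.T @ g       # shape (S,)
H_red = U.T @ A @ U   # shape (S, S)

# Newton step in the reduced space: one learning rate per subset.
lr = np.linalg.solve(H_red, g_red)
theta_new = theta - U @ lr
print("per-subset learning rates:", lr)
print("loss before/after:", loss(theta), loss(theta_new))
```

Since the toy loss is a positive-definite quadratic, this reduced step exactly minimizes the loss over the 2-dimensional subspace spanned by the per-subset directions, so the loss is guaranteed to decrease; on a real network the same reduced system would only be a local model, refined in the paper by order-3 information.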