
Adapting Newton's Method to Neural Networks through a Summary of Higher-Order Derivatives

Published 6 Dec 2023 in cs.LG and math.OC (arXiv:2312.03885v3)

Abstract: When training large models, such as neural networks, the full derivatives of order 2 and beyond are usually inaccessible, due to their computational cost. This is why, among the second-order optimization methods, it is very common to bypass the computation of the Hessian by using first-order information, such as the gradient of the parameters (e.g., quasi-Newton methods) or the activations (e.g., K-FAC). In this paper, we focus on the exact and explicit computation of projections of the Hessian and higher-order derivatives on well-chosen subspaces, which are relevant for optimization. Namely, for a given partition of the set of parameters, it is possible to compute tensors which can be seen as "higher-order derivatives according to the partition", at a reasonable cost as long as the number of subsets of the partition remains small. Then, we propose an optimization method exploiting these tensors at order 2 and 3 with several interesting properties, including: it outputs a learning rate per subset of parameters, which can be used for hyperparameter tuning; it takes into account long-range interactions between the layers of the trained neural network, which is usually not the case in similar methods (e.g., K-FAC); the trajectory of the optimization is invariant under affine layer-wise reparameterization. Code available at https://github.com/p-wol/GroupedNewton/ .
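The order-2 version of the idea described in the abstract can be sketched with JAX: restrict the gradient to each subset of a parameter partition, project the Hessian onto those per-group directions via Hessian-vector products, and solve the resulting small linear system to obtain one learning rate per group. The two-group toy loss below is an illustrative assumption, not the paper's implementation (see the linked GroupedNewton repository for that).

```python
import jax
import jax.numpy as jnp

# Toy loss over a two-group partition of the parameters (w, b); assumed for
# illustration only -- any differentiable loss and partition would do.
def loss(params):
    w, b = params
    return jnp.sum((w * 2.0 - 1.0) ** 2) + jnp.sum((b + 0.5) ** 2)

params = (jnp.ones(3), jnp.ones(2))
g = jax.grad(loss)(params)

# Direction u_s for group s: the gradient restricted to that group, zero elsewhere.
def group_dir(i):
    return tuple(gi if j == i else jnp.zeros_like(gi) for j, gi in enumerate(g))

dirs = [group_dir(0), group_dir(1)]

def inner(x, y):
    return sum(jnp.vdot(a, b) for a, b in zip(x, y))

# Summary gradient: gbar[s] = <g, u_s> = ||g_s||^2.
gbar = jnp.array([inner(g, u) for u in dirs])

# Hessian-vector product via forward-over-reverse autodiff (no full Hessian).
def hvp(u):
    return jax.jvp(jax.grad(loss), (params,), (u,))[1]

# Summary Hessian: Hbar[i, j] = u_i^T H u_j, an S x S matrix (S = number of groups).
Hbar = jnp.array([[inner(ui, hvp(uj)) for uj in dirs] for ui in dirs])

# One learning rate per group: solve Hbar @ eta = gbar (Newton step in the subspace).
eta = jnp.linalg.solve(Hbar, gbar)

# Update each group with its own learning rate.
new_params = tuple(p - eta[i] * gi for i, (p, gi) in enumerate(zip(params, g)))
```

Because the toy loss is quadratic, this grouped step lands exactly on the minimizer in one update; on a real network the paper additionally uses order-3 information to regularize the step.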

References (25)
  1. Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.
  2. Cauchy, A.-L. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus hebdomadaires des séances de l'Académie des sciences, Paris, 25:536–538.
  3. Dangel, F. J. (2023). Backpropagation beyond the gradient. PhD thesis, Universität Tübingen.
  4. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028. PMLR.
  5. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations.
  6. Practical Optimization. Academic Press, San Diego.
  7. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pages 293–299. IEEE.
  8. Simplifying neural nets by discovering flat minima. Advances in Neural Information Processing Systems, 7.
  9. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR.
  10. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
  11. Optimal brain damage. Advances in Neural Information Processing Systems, 2.
  12. Second order properties of error surfaces: learning time and generalization. Advances in Neural Information Processing Systems, 3.
  13. Linear and Nonlinear Programming. Springer, fourth edition.
  14. Martens, J. (2010). Deep learning via Hessian-free optimization. In International Conference on Machine Learning. PMLR.
  15. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417. PMLR.
  16. Nesterov, Y. (2003). Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media.
  17. Nesterov, Y. (2021). Superfast second-order methods for unconstrained convex optimization. Journal of Optimization Theory and Applications, 191:1–30.
  18. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205.
  19. Numerical Optimization. Springer.
  20. Ollivier, Y. (2015). Riemannian metrics for neural networks I: feedforward networks. arXiv preprint arXiv:1303.0818.
  21. Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160.
  22. Empirical analysis of the Hessian of over-parametrized neural networks. In International Conference on Learning Representations.
  23. A second-order learning algorithm for multilayer networks based on block Hessian matrix. Neural Networks, 11(9):1607–1622.
  24. Sketched Newton–Raphson. SIAM Journal on Optimization, 32(3):1555–1583.
  25. Are all layers created equal? The Journal of Machine Learning Research, 23(1):2930–2957.

Authors (1)