PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

Published 27 May 2025 in math.OC, cs.LG, and stat.ML | arXiv:2505.21799v2

Abstract: The ever-growing scale of deep learning models and datasets underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and LLMs, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in LLM pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and LLM pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.

Summary

  • The paper introduces PolarGrad, a new matrix-gradient optimizer that employs polar decomposition and nuclear norm scaling to improve convergence and stability.
  • It presents a comprehensive theoretical analysis contrasting curvature and gradient anisotropy preconditioning strategies in deep learning training.
  • Experiments on matrix quadratic and logistic regression demonstrate PolarGrad's superior performance over traditional optimizers like Adam and Muon.

PolarGrad: A New Class of Matrix-Gradient Optimizers

Introduction to Matrix-Gradient Optimization

The ever-increasing scale of deep learning models and datasets highlights the importance of efficient optimization methods. While Adam and AdamW are the de facto optimizers for training neural networks, structure-aware preconditioned optimizers such as Shampoo and Muon promise faster convergence by exploiting the matrix structure of gradients. The paper introduces a unifying framework for analyzing such matrix-aware preconditioned methods and, building on it, proposes PolarGrad, a new class of methods based on the polar decomposition of matrix-valued gradients. The paper provides a theoretical analysis of preconditioning strategies and implements the new methods with efficient numerical polar decomposition algorithms.

Theoretical Foundations

The paper differentiates between two types of preconditioning strategies for optimization methods. Curvature anisotropy preconditioning reduces the condition number of the Hessian, while gradient anisotropy preconditioning reduces the condition number of the gradient matrix itself. Adam's diagonal preconditioning primarily addresses curvature anisotropy, whereas Muon's orthogonalized gradient method addresses gradient anisotropy by replacing each matrix gradient with a semi-orthogonal matrix.
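The effect of gradient orthogonalization on gradient anisotropy can be illustrated with a small NumPy sketch (the matrix sizes and singular values below are illustrative choices, not values from the paper):

```python
import numpy as np

# Build a gradient-like matrix with a large spread of singular values,
# i.e. high "gradient anisotropy".
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))
G = U @ np.diag([100.0, 1.0, 0.01]) @ V.T

# Orthogonalization keeps only the polar factor of G, which replaces
# all singular values with ones, so the update direction has
# condition number 1.
Us, s, Vt = np.linalg.svd(G, full_matrices=False)
polar_factor = Us @ Vt

print(np.linalg.cond(G))             # large: singular values span 100 to 0.01
print(np.linalg.cond(polar_factor))  # ~1.0: perfectly conditioned direction
```

The semi-orthogonal polar factor preserves the "direction" of the gradient while discarding its scale anisotropy, which is the mechanism the paper credits for Muon's behavior.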

PolarGrad Methodology

PolarGrad combines the polar decomposition of the matrix gradient with nuclear norm scaling to improve directional updates and convergence rates. The main distinction from Muon is the nuclear norm scaling term, which improves convergence speed and stability, particularly in the later stages of optimization. PolarGrad admits momentum-first and polar-first EMA momentum variants, depending on whether the momentum average is accumulated before or after the polar decomposition, with each variant offering different trade-offs.
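A single step of the momentum-first variant might look like the following sketch (the function name, momentum convention, and exact scaling here are assumptions for illustration, not the paper's reference implementation):

```python
import numpy as np

def polargrad_step(W, G, M, lr=0.1, beta=0.9):
    """One hypothetical momentum-first PolarGrad step (a sketch).

    The EMA momentum M is formed on the raw gradient first; the update
    direction is then the polar factor of M, scaled by its nuclear norm.
    The nuclear norm scaling is what distinguishes this from a plain
    Muon-style orthogonalized step.
    """
    M = beta * M + (1 - beta) * G            # momentum-first EMA on the gradient
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    polar = U @ Vt                           # semi-orthogonal polar factor of M
    nuc = s.sum()                            # nuclear norm ||M||_* = sum of singular values
    W = W - lr * nuc * polar                 # nuclear-norm-scaled orthogonal update
    return W, M
```

A polar-first variant would instead orthogonalize each gradient before accumulating the momentum average; the paper discusses the trade-offs between the two orderings.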

Numerical Algorithms: The paper advocates advanced polar decomposition algorithms such as QDWH and ZOLO-PD for efficient computation; unlike the Newton--Schulz iteration used in Muon, these methods reduce the reliance on coefficient tuning.

Figure 1: Losses, residual and gradient condition numbers of matrix quadratic regression.
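For contrast with QDWH and ZOLO-PD, the Newton--Schulz approach can be sketched with the classical cubic iteration below (Muon uses specially tuned polynomial coefficients; the coefficients 1.5 and -0.5 here are the textbook choice, shown only to convey the idea):

```python
import numpy as np

def newton_schulz_polar(G, steps=10):
    """Approximate the polar factor of G by the classical cubic
    Newton--Schulz iteration X <- 1.5*X - 0.5*X X^T X (a sketch).

    Dividing by the Frobenius norm puts all singular values in (0, 1],
    inside the iteration's convergence region (0, sqrt(3)); each step
    then pushes every singular value toward 1.
    """
    X = G / np.linalg.norm(G)                # Frobenius normalization
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# The iterates converge to a semi-orthogonal matrix with X^T X ~ I.
G = np.array([[2.0, 0.0], [0.0, 1.0]])
X = newton_schulz_polar(G, steps=10)
```

Because convergence speed depends on how the singular values are mapped at each step, the polynomial coefficients become tuning parameters; direct polar decomposition algorithms like QDWH avoid that tuning.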

Numerical Experiments

The results demonstrate that PolarGrad outperforms Adam and Muon across several tasks, such as matrix quadratic regression and matrix logistic regression, in terms of convergence speed and stability. The experiments highlight the impact of nuclear norm scaling and of efficient polar decomposition algorithms in driving faster convergence.

Figure 2: Gradient nuclear norms of matrix quadratic regression.

Implications and Future Work

This new class of matrix optimization algorithms delivers significant improvements in convergence and stability, making PolarGrad a promising candidate for large-scale model training. Future work will focus on more optimized implementations and on extending the approach to other model architectures, including multi-modal models and mixture-of-experts (MoE) models.

Conclusion

The paper provides a unifying preconditioning perspective on gradient methods and introduces PolarGrad as a new optimization technique. Grounded in polar decomposition coupled with nuclear norm scaling, PolarGrad offers a practical approach to training large-scale models with matrix-valued gradients. The results show its potential to mitigate the training instabilities commonly encountered with optimizers such as Adam(W), especially in LLM pre-training, positioning PolarGrad as a competitive choice for large-scale neural network training.
