Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-Sided and Two-Sided Preconditioning
Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$ measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, so our convergence rate is analogous to the optimal rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ of SGD in the ideal case $\|\nabla f(X)\|_*=\Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.
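For intuition, below is a minimal NumPy sketch of one AdamW-style Shampoo step on a matrix parameter of size $(m, n)$, covering both the one-sided and two-sided preconditioning regimes that the analysis unifies. Here "AdamW-style" is taken to mean exponential-moving-average Kronecker-factor accumulators plus decoupled weight decay; the EMA coefficient, epsilon damping, and exact placement of weight decay are illustrative assumptions, not the paper's precise pseudocode.

```python
# A minimal sketch of an AdamW-style Shampoo step (assumed form, not the
# paper's exact algorithm) for a matrix parameter X of shape (m, n).
import numpy as np

def _matrix_power(A, p, eps):
    """Symmetric matrix power A^p via eigendecomposition, with damping."""
    w, V = np.linalg.eigh(A)
    w = np.maximum(w, 0.0) + eps  # guard tiny/negative eigenvalues
    return (V * w**p) @ V.T       # V diag(w^p) V^T

def shampoo_step(X, G, L, R, lr=1e-3, beta2=0.999, eps=1e-8,
                 weight_decay=1e-2, one_sided=False):
    """One AdamW-style Shampoo update.

    X : (m, n) parameter matrix
    G : (m, n) stochastic gradient at X
    L : (m, m) left preconditioner accumulator  (EMA of G @ G.T)
    R : (n, n) right preconditioner accumulator (EMA of G.T @ G)
    """
    # EMA of the left Kronecker factor; classical Shampoo uses a plain sum,
    # and the EMA is the Adam(W)-style modification assumed here.
    L = beta2 * L + (1 - beta2) * (G @ G.T)
    if one_sided:
        # One-sided preconditioning: single factor, exponent -1/2.
        P = _matrix_power(L, -0.5, eps) @ G
    else:
        R = beta2 * R + (1 - beta2) * (G.T @ G)
        # Two-sided preconditioning: exponent -1/4 on each factor.
        P = _matrix_power(L, -0.25, eps) @ G @ _matrix_power(R, -0.25, eps)
    # Decoupled weight decay, as in AdamW.
    X = X - lr * (P + weight_decay * X)
    return X, L, R
```

In the one-sided case the single factor takes the exponent $-1/2$, while the two-sided case splits it as $-1/4$ on each factor, which is the standard split in classical Shampoo; the unified analysis in the paper covers both choices under the nuclear-norm measure above.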