Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-Sided and Two-Sided Preconditioning
Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$ measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, so our convergence rate is analogous to the optimal rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ of SGD in the ideal case $\|\nabla f(X)\|_*=\Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.
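For intuition, below is a minimal NumPy sketch of one AdamW-style Shampoo step on a matrix parameter of size $(m, n)$, covering both the one-sided and two-sided preconditioning regimes that the analysis unifies. Here "AdamW-style" is taken to mean exponential-moving-average Kronecker-factor accumulators plus decoupled weight decay; the EMA coefficient, epsilon damping, and exact placement of weight decay are illustrative assumptions, not the paper's precise pseudocode.

```python
# A minimal sketch of an AdamW-style Shampoo step (assumed form, not the
# paper's exact algorithm) for a matrix parameter X of shape (m, n).
import numpy as np

def _matrix_power(A, p, eps):
    """Symmetric matrix power A^p via eigendecomposition, with damping."""
    w, V = np.linalg.eigh(A)
    w = np.maximum(w, 0.0) + eps  # guard tiny/negative eigenvalues
    return (V * w**p) @ V.T       # V diag(w^p) V^T

def shampoo_step(X, G, L, R, lr=1e-3, beta2=0.999, eps=1e-8,
                 weight_decay=1e-2, one_sided=False):
    """One AdamW-style Shampoo update.

    X : (m, n) parameter matrix
    G : (m, n) stochastic gradient at X
    L : (m, m) left preconditioner accumulator  (EMA of G @ G.T)
    R : (n, n) right preconditioner accumulator (EMA of G.T @ G)
    """
    # EMA of the left Kronecker factor; classical Shampoo uses a plain sum,
    # and the EMA is the Adam(W)-style modification assumed here.
    L = beta2 * L + (1 - beta2) * (G @ G.T)
    if one_sided:
        # One-sided preconditioning: single factor, exponent -1/2.
        P = _matrix_power(L, -0.5, eps) @ G
    else:
        R = beta2 * R + (1 - beta2) * (G.T @ G)
        # Two-sided preconditioning: exponent -1/4 on each factor.
        P = _matrix_power(L, -0.25, eps) @ G @ _matrix_power(R, -0.25, eps)
    # Decoupled weight decay, as in AdamW.
    X = X - lr * (P + weight_decay * X)
    return X, L, R
```

In the one-sided case the single factor takes the exponent $-1/2$, while the two-sided case splits it as $-1/4$ on each factor, which is the standard split in classical Shampoo; the unified analysis in the paper covers both choices under the nuclear-norm measure above.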