
Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning

Published 12 Jan 2026 in math.OC and cs.LG | (2601.07326v1)

Abstract: This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K} E\left[\|\nabla f(X_k)\|_*\right]\leq O\!\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$ measured in the nuclear norm, where $K$ is the number of iterations, $(m,n)$ is the size of the matrix parameter, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(X)\|_F \leq \|\nabla f(X)\|_* \leq \sqrt{m+n}\,\|\nabla f(X)\|_F$, supporting that our convergence rate can be considered analogous to the optimal SGD convergence rate $\frac{1}{K}\sum_{k=1}^{K} E\left[\|\nabla f(X_k)\|_F\right]\leq O\!\left(\frac{C}{K^{1/4}}\right)$ in the ideal case $\|\nabla f(X)\|_* = \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$.
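To make the two-sided preconditioning concrete, the following is a minimal NumPy sketch of one Shampoo-style update step with AdamW-style decoupled weight decay. It is an illustration under common conventions for classical Shampoo, not the paper's exact algorithm; the hyperparameter names (`lr`, `beta`, `eps`, `wd`), the exponential-moving-average accumulation, and the inverse fourth-root choice are all assumptions.

```python
import numpy as np

def shampoo_step(X, G, L, R, lr=1e-3, beta=0.99, eps=1e-8, wd=1e-2):
    """One Shampoo-style update on an m x n matrix parameter X given
    gradient G, with left/right preconditioner statistics L (m x m)
    and R (n x n). Hedged sketch; details are illustrative."""
    # Accumulate second-moment statistics on each side of the gradient.
    L = beta * L + (1 - beta) * (G @ G.T)   # left statistic  (m x m)
    R = beta * R + (1 - beta) * (G.T @ G)   # right statistic (n x n)

    def inv_root(M, p=4):
        # Inverse p-th root of a symmetric PSD matrix via eigendecomposition,
        # with eps regularization for numerical stability.
        w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return V @ np.diag(np.clip(w, eps, None) ** (-1.0 / p)) @ V.T

    # Two-sided preconditioning: L^{-1/4} G R^{-1/4}.
    precond_G = inv_root(L) @ G @ inv_root(R)

    # AdamW-style decoupled weight decay: applied directly to X,
    # not folded into the gradient before preconditioning.
    X = X - lr * (precond_G + wd * X)
    return X, L, R
```

Dropping one of the two `inv_root` factors (e.g. preconditioning only on the left) gives the one-sided variant that the paper's analysis covers in the same framework.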


Authors (3)
