
On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Published 17 May 2025 in cs.LG and math.OC | arXiv:2505.11840v1

Abstract: As the default optimizer for training LLMs, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not yet theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}E\left[\|\nabla f(x^k)\|_1\right]\leq O\left(\frac{\sqrt{d}\,C}{K^{1/4}}\right)$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal{N}(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1=\Theta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both support that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^{K}E\left[\|\nabla f(x^k)\|_2\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD.
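
The Gaussian claim in the abstract is easy to check numerically. Below is a minimal NumPy sketch (not from the paper's code; the sample count, dimensions, and seed are arbitrary choices for illustration) that draws vectors with i.i.d. $\mathcal{N}(0,1)$ entries and compares the empirical ratio $E[\|g\|_1]/E[\|g\|_2]$ against $\sqrt{2d/\pi}$:

```python
import numpy as np

# Sanity check of the abstract's claim: for g with i.i.d. N(0,1) entries,
#   E[||g||_1] >= sqrt(2d/pi) * E[||g||_2],
# i.e. the l1 norm is Theta(sqrt(d)) times the l2 norm.
rng = np.random.default_rng(0)          # fixed seed, arbitrary choice

for d in [10, 100, 1000, 10000]:        # illustrative dimensions
    g = rng.standard_normal((1000, d))  # 1000 Gaussian sample vectors
    l1 = np.abs(g).sum(axis=1)          # ||g||_1 per sample
    l2 = np.sqrt((g ** 2).sum(axis=1))  # ||g||_2 per sample
    ratio = l1.mean() / l2.mean()       # empirical E||g||_1 / E||g||_2
    print(f"d={d:6d}  ratio={ratio:8.2f}  "
          f"sqrt(2d/pi)={np.sqrt(2 * d / np.pi):8.2f}")
```

The ratio matches $\sqrt{2d/\pi}$ closely because $E[\|g\|_1]=d\sqrt{2/\pi}$ exactly, while $E[\|g\|_2]$ concentrates near $\sqrt{d}$ for large $d$. This $\Theta(\sqrt{d})$ gap between the two norms is what makes the paper's $\ell_1$-norm bound comparable to the classical $\ell_2$-norm rate of SGD.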


