
On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Published 17 May 2025 in cs.LG and math.OC | arXiv:2505.11840v1

Abstract: As the default optimizer for training LLMs, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not yet theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}E\left[\|\nabla f(x^k)\|_1\right]\leq O\left(\frac{\sqrt{d}\,C}{K^{1/4}}\right)$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the Gaussian distribution $\mathcal{N}(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1=\Theta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both support that our convergence rate can be considered analogous to the optimal $\frac{1}{K}\sum_{k=1}^{K}E\left[\|\nabla f(x^k)\|_2\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD.
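
The Gaussian claim in the abstract is easy to check numerically. Below is a minimal NumPy sketch (not from the paper's code; the sample count, dimensions, and seed are arbitrary choices for illustration) that draws vectors with i.i.d. $\mathcal{N}(0,1)$ entries and compares the empirical ratio $E[\|g\|_1]/E[\|g\|_2]$ against $\sqrt{2d/\pi}$:

```python
import numpy as np

# Sanity check of the abstract's claim: for g with i.i.d. N(0,1) entries,
#   E[||g||_1] >= sqrt(2d/pi) * E[||g||_2],
# i.e. the l1 norm is Theta(sqrt(d)) times the l2 norm.
rng = np.random.default_rng(0)          # fixed seed, arbitrary choice

for d in [10, 100, 1000, 10000]:        # illustrative dimensions
    g = rng.standard_normal((1000, d))  # 1000 Gaussian sample vectors
    l1 = np.abs(g).sum(axis=1)          # ||g||_1 per sample
    l2 = np.sqrt((g ** 2).sum(axis=1))  # ||g||_2 per sample
    ratio = l1.mean() / l2.mean()       # empirical E||g||_1 / E||g||_2
    print(f"d={d:6d}  ratio={ratio:8.2f}  "
          f"sqrt(2d/pi)={np.sqrt(2 * d / np.pi):8.2f}")
```

The ratio matches $\sqrt{2d/\pi}$ closely because $E[\|g\|_1]=d\sqrt{2/\pi}$ exactly, while $E[\|g\|_2]$ concentrates near $\sqrt{d}$ for large $d$. This $\Theta(\sqrt{d})$ gap between the two norms is what makes the paper's $\ell_1$-norm bound comparable to the classical $\ell_2$-norm rate of SGD.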


