AdamW-Style Shampoo Optimizer

Updated 19 January 2026
  • The optimizer combines matrix-based second moment tracking with decoupled weight decay to achieve rigorous convergence under nonconvex objectives.
  • It adapts between one-sided and two-sided preconditioning via tunable exponents, unifying elements of classical Shampoo and AdamW methods.
  • Empirical and theoretical results demonstrate improved optimization speed and stability for large-scale neural networks despite higher computational costs.

The AdamW-style Shampoo optimizer is an adaptive stochastic optimization algorithm that extends the original Shampoo method by combining matrix-based second-moment preconditioning with an AdamW-style decoupled weight decay. This approach leverages both one-sided and two-sided preconditioning schemes for tensor and matrix-valued parameters, and establishes rigorous convergence rates under nonconvex objectives. The optimizer has achieved empirical success in large-scale neural network training, securing first place in the external tuning track of the AlgoPerf neural network training algorithm competition (Li et al., 12 Jan 2026).

1. Mathematical Formulation and Algorithmic Structure

The AdamW-style Shampoo algorithm targets problems of the form

$$\min_{X \in \mathbb{R}^{m \times n}} f(X) = \mathbb{E}_{\xi \sim P}\left[f(X;\xi)\right]$$

where $X_k \in \mathbb{R}^{m \times n}$ denotes the parameter matrix at iteration $k$.

The core update for each iteration $k$ incorporates:

  • An exponential moving average of first moments $M_k$.
  • Two matrix-valued second-moment accumulators $L_k \in \mathbb{R}^{m \times m}$ and $R_k \in \mathbb{R}^{n \times n}$.
  • Tunable preconditioning exponents $p, q > 0$ such that $1/p + 1/q = 1$, generalizing between fully two-sided ($p = q = 2$) and one-sided ($p = 1, q = \infty$ or $p = \infty, q = 1$) schemes.
  • Decoupled weight decay, following the AdamW paradigm.

Update equations:

  1. $G_k \leftarrow \nabla f(X_k; \xi_k)$ (stochastic gradient).
  2. $M_k \leftarrow \beta_1 M_{k-1} + (1-\beta_1) G_k$ (first moment, momentum).
  3. $L_k \leftarrow \beta_2 L_{k-1} + (1-\beta_2) G_k G_k^\top$, $R_k \leftarrow \beta_2 R_{k-1} + (1-\beta_2) G_k^\top G_k$ (second moments).
  4. $L_{k,\epsilon} \leftarrow L_k + \epsilon I_m$, $R_{k,\epsilon} \leftarrow R_k + \epsilon I_n$ (damping for invertibility).
  5. $X_{k+1} \leftarrow (1 - \lambda) X_k - \eta\, L_{k,\epsilon}^{-1/(2p)} M_k R_{k,\epsilon}^{-1/(2q)}$.

Parameters include stepsize $\eta$, momentum $\beta_1$, second-moment decay $\beta_2$, weight decay $\lambda$, damping $\epsilon$, and exponents $p, q$.

The algorithm reduces to classical Shampoo when $p = q = 2$ and to one-sided preconditioning in the respective limiting cases. The decoupled weight decay term $(1 - \lambda)$ is applied independently of the adaptive preconditioning steps (Li et al., 12 Jan 2026, Gupta et al., 2018).
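The update equations above can be sketched in NumPy as follows; the function and parameter names here are illustrative, not taken from the paper's reference implementation:

```python
import numpy as np

def matrix_power(S, exponent):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 1e-30, None)  # floor tiny/negative eigenvalues from round-off
    return (V * w**exponent) @ V.T

def shampoo_adamw_step(X, G, M, L, R, *, eta=1e-3, beta1=0.9, beta2=0.999,
                       lam=1e-2, eps=1e-8, p=2.0, q=2.0):
    """One step: X <- (1 - lam) X - eta * L_eps^{-1/(2p)} M R_eps^{-1/(2q)}."""
    m, n = X.shape
    M = beta1 * M + (1 - beta1) * G              # first moment (momentum)
    L = beta2 * L + (1 - beta2) * (G @ G.T)      # left second moment, m x m
    R = beta2 * R + (1 - beta2) * (G.T @ G)      # right second moment, n x n
    L_pre = matrix_power(L + eps * np.eye(m), -1.0 / (2 * p))
    R_pre = matrix_power(R + eps * np.eye(n), -1.0 / (2 * q))
    X_new = (1 - lam) * X - eta * (L_pre @ M @ R_pre)  # decoupled weight decay
    return X_new, M, L, R
```

With the default $p = q = 2$ this applies the classical Shampoo $-1/4$ roots; passing `q=np.inf` drives the right exponent to zero, so the right preconditioner becomes the identity and the update is one-sided.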

2. Matrix Norms and Their Relationships

AdamW-style Shampoo measures convergence using various matrix norms:

  • Frobenius norm: $\|A\|_F = (\sum_{i,j} A_{ij}^2)^{1/2}$.
  • Nuclear norm: $\|A\|_* = \sum_{i=1}^r \sigma_i(A)$ (sum of singular values).
  • Spectral norm: $\|A\|_{\mathrm{op}} = \max_i \sigma_i(A)$.

A standard relationship holds:

$$\|A\|_F \leq \|A\|_* \leq \sqrt{r}\,\|A\|_F \leq \sqrt{m+n}\,\|A\|_F$$

for $A \in \mathbb{R}^{m \times n}$, $r = \min\{m, n\}$.

This implies that nuclear-norm convergence rates translate to analogous rates in Frobenius norm up to a factor of $\sqrt{m+n}$ (Li et al., 12 Jan 2026).
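These inequalities are straightforward to verify numerically; a small NumPy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
for m, n in [(5, 3), (8, 8), (2, 10)]:
    A = rng.standard_normal((m, n))
    fro = np.linalg.norm(A, 'fro')   # Frobenius norm
    nuc = np.linalg.norm(A, 'nuc')   # nuclear norm (sum of singular values)
    r = min(m, n)
    assert fro <= nuc + 1e-12                       # ||A||_F <= ||A||_*
    assert nuc <= np.sqrt(r) * fro + 1e-12          # ||A||_* <= sqrt(r) ||A||_F
    assert nuc <= np.sqrt(m + n) * fro + 1e-12      # looser bound used in the rate
```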

3. Convergence Guarantees

AdamW-style Shampoo achieves the following convergence guarantee for the average nuclear norm of the gradient:

$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_*\right] \leq O\!\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$$

where $K$ is the number of iterations and $C = \max\{\sigma^2, L(f(X_1) - f^*)\}$ with $f^*$ the infimum of $f$.

Under the conditions:

  • $f$ is $L$-smooth in Frobenius norm,
  • unbiased stochastic gradients with variance bounded by $\sigma^2$,
  • all preconditioners satisfy $L_{k,\epsilon} \succeq \epsilon I_m$ and $R_{k,\epsilon} \succeq \epsilon I_n$ for all $k$,

the optimizer matches the optimal rate $O(C/K^{1/4})$ (in Frobenius norm) of stochastic gradient descent, up to the explicit $\sqrt{m+n}$ factor in the nuclear-norm bound (Li et al., 12 Jan 2026).

The analysis crucially exploits:

  • $L$-smoothness to control function descent,
  • Hölder’s (Schatten-$p$) inequality for bounding inner products in terms of the nuclear norm,
  • matrix Cauchy–Schwarz inequalities for update stability,
  • careful control over the accumulation of preconditioners.

In the ideal case where $\|\nabla f(X)\|_* = \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$, this factor is tight, so the practical convergence rate parallels that of SGD.

4. One-Sided vs. Two-Sided Preconditioning

AdamW-style Shampoo unifies one-sided and two-sided preconditioning under a common algebraic formulation parameterized by the choice of exponents $p$ and $q$:

  • Two-sided: standard Shampoo ($p = q = 2$): $X_{k+1} = X_k - \eta\, L_{k,\epsilon}^{-1/4} M_k R_{k,\epsilon}^{-1/4}$.
  • One-sided: for $p = 1$, $q = \infty$ (or vice versa), one of the preconditioners reduces to the identity, yielding a “row-wise” or “column-wise” form.

This parameterization enables flexible adaptation to the structure of the problem and, in particular, permits targeted preconditioning along selected tensor modes for higher-order tensors. Such flexibility allows the optimizer to interpolate between full-matrix and separate-mode updates (Li et al., 12 Jan 2026, Gupta et al., 2018).
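The role of the exponent pair $(p, q)$ can be illustrated with a small NumPy sketch; the helper names are ours, and the one-sided limit is realized by replacing the corresponding preconditioner with the identity:

```python
import numpy as np

def precondition(M, L_eps, R_eps, p, q):
    """Apply L_eps^{-1/(2p)} M R_eps^{-1/(2q)} via eigendecompositions."""
    def fpow(S, e):
        w, V = np.linalg.eigh(S)
        return (V * np.clip(w, 1e-30, None)**e) @ V.T
    # An infinite exponent sends -1/(2q) to zero, i.e. an identity preconditioner.
    left = fpow(L_eps, -1.0 / (2 * p)) if np.isfinite(p) else np.eye(M.shape[0])
    right = fpow(R_eps, -1.0 / (2 * q)) if np.isfinite(q) else np.eye(M.shape[1])
    return left @ M @ right

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 3))
M = G.copy()
L_eps = G @ G.T + 1e-6 * np.eye(4)
R_eps = G.T @ G + 1e-6 * np.eye(3)

two_sided = precondition(M, L_eps, R_eps, p=2, q=2)       # classical Shampoo: -1/4 roots
one_sided = precondition(M, L_eps, R_eps, p=1, q=np.inf)  # left-only: L_eps^{-1/2} M
```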

5. Implementation and Complexity

For an $m \times n$ parameter matrix:

  • Memory requirements: storage of two dense $m \times m$ and $n \times n$ accumulators for $L_k$ and $R_k$, plus additional memory for the momentum $M_k$ and the original parameters $X_k$.
  • Computation per step: updating the accumulators costs $O(m^2 n + n^2 m)$ flops, matrix roots/inverse roots cost $O(m^3 + n^3)$ (typically amortized over several steps), and the core update is performed via matrix multiplications.
  • Decoupled weight decay: implemented by scaling $(1 - \lambda) X_k$ before the preconditioned gradient step, with no interaction between the decay and the adaptive accumulators (Li et al., 12 Jan 2026, Gupta et al., 2018).

Relative to AdamW, Shampoo and its AdamW-style variant have increased complexity due to full-matrix second-moment tracking, but their per-step overhead remains comparable for typical (sub-1000 dimension) deep learning layers.
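The memory gap can be made concrete with a back-of-the-envelope calculation, assuming float32 state and the accumulators listed above (a rough illustration that ignores framework overhead):

```python
def shampoo_state_bytes(m, n, dtype_bytes=4):
    """Optimizer state for one m x n layer: M (m x n), L (m x m), R (n x n)."""
    return dtype_bytes * (m * n + m * m + n * n)

def adamw_state_bytes(m, n, dtype_bytes=4):
    """AdamW state for one m x n layer: element-wise first and second moments."""
    return dtype_bytes * 2 * m * n

# For a square 1024 x 1024 layer the full-matrix accumulators cost 12 MiB
# versus AdamW's 8 MiB; the gap widens for tall or wide layers.
m, n = 1024, 1024
print(shampoo_state_bytes(m, n) / 2**20, "MiB vs", adamw_state_bytes(m, n) / 2**20, "MiB")
```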

6. Extensions and Empirical Performance

Recent work has explored the interaction of Shampoo-style preconditioning with first-moment adaptation, most notably through connections to Adafactor and variants like SOAP, which runs Adam in the eigenbasis of Shampoo’s preconditioners. SOAP achieves further empirical improvements in large-batch LLM pretraining but also highlights the importance of fresh eigendecomposition for high-frequency preconditioner updates (Vyas et al., 2024).

The AdamW-style Shampoo optimizer won the external tuning track of AlgoPerf, evidencing its effectiveness in practical neural network training (Li et al., 12 Jan 2026). The addition of decoupled weight decay is particularly important as it preserves the separation between regularization and adaptivity, unlike classical (coupled) $\ell_2$ regularization.

  • Classical Shampoo: Employs only second-moment modes for preconditioning without built-in weight decay; regularization can be appended either in a coupled or decoupled fashion (Gupta et al., 2018).
  • SGD and AdamW: Track only diagonal or element-wise moments, which limits invariance to parameter scaling and reduces sharpness adaptation compared to the full-matrix approach.
  • SOAP (Vyas et al., 2024): Provides a formal bridge between Shampoo and Adafactor/Adam, demonstrating that adaptive diagonal moment tracking in a rotated basis yields improved robustness for eigendecomposition frequency—suggesting further directions for adaptive preconditioned optimizers.

Empirical results consistently show that second-order geometry and sophisticated preconditioning (as in Shampoo and AdamW-style Shampoo) can substantially improve optimization speed and stability for large-scale, ill-conditioned modern neural networks, provided the additional memory and computational requirements are managed appropriately.
