AdamW-Style Shampoo Optimizer

Updated 19 January 2026
  • The optimizer combines matrix-based second moment tracking with decoupled weight decay to achieve rigorous convergence under nonconvex objectives.
  • It adapts between one-sided and two-sided preconditioning via tunable exponents, unifying elements of classical Shampoo and AdamW methods.
  • Empirical and theoretical results demonstrate improved optimization speed and stability for large-scale neural networks despite higher computational costs.

The AdamW-style Shampoo optimizer is an adaptive stochastic optimization algorithm that extends the original Shampoo method by combining matrix-based second-moment preconditioning with an AdamW-style decoupled weight decay. This approach leverages both one-sided and two-sided preconditioning schemes for tensor and matrix-valued parameters, and establishes rigorous convergence rates under nonconvex objectives. The optimizer has achieved empirical success in large-scale neural network training, securing first place in the external tuning track of the AlgoPerf neural network training algorithm competition (Li et al., 12 Jan 2026).

1. Mathematical Formulation and Algorithmic Structure

The AdamW-style Shampoo algorithm targets problems of the form

$$\min_{X \in \mathbb{R}^{m \times n}} f(X) = \mathbb{E}_{\xi \sim P}\left[f(X;\xi)\right]$$

where $X_k \in \mathbb{R}^{m \times n}$ denotes the parameter matrix at iteration $k$.

The core update for each iteration $k$ incorporates:

  • An exponential moving average of first moments $M_k$.
  • Two matrix-valued second-moment accumulators $L_k \in \mathbb{R}^{m \times m}$ and $R_k \in \mathbb{R}^{n \times n}$.
  • Tunable preconditioning exponents $p, q > 0$ such that $1/p + 1/q = 1$, generalizing between fully two-sided ($p = q = 2$) and one-sided ($p = 1, q = \infty$ or $p = \infty, q = 1$) schemes.
  • Decoupled weight decay, following the AdamW paradigm.

Update equations:

  1. $G_k \leftarrow \nabla f(X_k; \xi_k)$ (stochastic gradient).
  2. $M_k \leftarrow \beta_1 M_{k-1} + (1-\beta_1) G_k$ (first moment, momentum).
  3. $L_k \leftarrow \beta_2 L_{k-1} + (1-\beta_2) G_k G_k^\top$, $R_k \leftarrow \beta_2 R_{k-1} + (1-\beta_2) G_k^\top G_k$ (second moments).
  4. $L_{k,\epsilon} \leftarrow L_k + \epsilon I_m$, $R_{k,\epsilon} \leftarrow R_k + \epsilon I_n$ (damping for invertibility).
  5. $X_{k+1} \leftarrow (1 - \lambda) X_k - \eta\, L_{k,\epsilon}^{-1/(2p)} M_k R_{k,\epsilon}^{-1/(2q)}$.

Parameters include stepsize $\eta$, momentum $\beta_1$, second-moment decay $\beta_2$, weight decay $\lambda$, damping $\epsilon$, and exponents $p, q$.

The algorithm reduces to classical Shampoo when $p = q = 2$ and to one-sided preconditioning in the respective limiting cases. The decoupled weight decay term $(1 - \lambda)$ is applied independently of the adaptive preconditioning steps (Li et al., 12 Jan 2026, Gupta et al., 2018).
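The update equations above can be sketched in NumPy as follows; the function and parameter names here are illustrative, not taken from the paper's reference implementation:

```python
import numpy as np

def matrix_power(S, exponent):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 1e-30, None)  # floor tiny/negative eigenvalues from round-off
    return (V * w**exponent) @ V.T

def shampoo_adamw_step(X, G, M, L, R, *, eta=1e-3, beta1=0.9, beta2=0.999,
                       lam=1e-2, eps=1e-8, p=2.0, q=2.0):
    """One step: X <- (1 - lam) X - eta * L_eps^{-1/(2p)} M R_eps^{-1/(2q)}."""
    m, n = X.shape
    M = beta1 * M + (1 - beta1) * G              # first moment (momentum)
    L = beta2 * L + (1 - beta2) * (G @ G.T)      # left second moment, m x m
    R = beta2 * R + (1 - beta2) * (G.T @ G)      # right second moment, n x n
    L_pre = matrix_power(L + eps * np.eye(m), -1.0 / (2 * p))
    R_pre = matrix_power(R + eps * np.eye(n), -1.0 / (2 * q))
    X_new = (1 - lam) * X - eta * (L_pre @ M @ R_pre)  # decoupled weight decay
    return X_new, M, L, R
```

With the default $p = q = 2$ this applies the classical Shampoo $-1/4$ roots; passing `q=np.inf` drives the right exponent to zero, so the right preconditioner becomes the identity and the update is one-sided.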

2. Matrix Norms and Their Relationships

AdamW-style Shampoo measures convergence using various matrix norms:

  • Frobenius norm: $\|A\|_F = (\sum_{i,j} A_{ij}^2)^{1/2}$.
  • Nuclear norm: $\|A\|_* = \sum_{i=1}^r \sigma_i(A)$ (sum of singular values).
  • Spectral norm: $\|A\|_{\mathrm{op}} = \max_i \sigma_i(A)$.

A standard relationship holds:

$$\|A\|_F \leq \|A\|_* \leq \sqrt{r}\,\|A\|_F \leq \sqrt{m+n}\,\|A\|_F$$

for $A \in \mathbb{R}^{m \times n}$, $r = \min\{m, n\}$.

This implies that nuclear-norm convergence rates translate to analogous rates in Frobenius norm up to a factor of $\sqrt{m+n}$ (Li et al., 12 Jan 2026).
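These inequalities are straightforward to verify numerically; a small NumPy check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
for m, n in [(5, 3), (8, 8), (2, 10)]:
    A = rng.standard_normal((m, n))
    fro = np.linalg.norm(A, 'fro')   # Frobenius norm
    nuc = np.linalg.norm(A, 'nuc')   # nuclear norm (sum of singular values)
    r = min(m, n)
    assert fro <= nuc + 1e-12                       # ||A||_F <= ||A||_*
    assert nuc <= np.sqrt(r) * fro + 1e-12          # ||A||_* <= sqrt(r) ||A||_F
    assert nuc <= np.sqrt(m + n) * fro + 1e-12      # looser bound used in the rate
```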

3. Convergence Guarantees

AdamW-style Shampoo achieves the following convergence guarantee for the average nuclear norm of the gradient:

$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_*\right] \leq O\!\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$$

where $K$ is the number of iterations and $C = \max\{\sigma^2, L(f(X_1) - f^*)\}$ with $f^*$ the infimum of $f$.

Under the conditions:

  • $f$ is $L$-smooth in Frobenius norm,
  • unbiased stochastic gradients with variance bounded by $\sigma^2$,
  • all preconditioners satisfy $L_{k,\epsilon} \succeq \epsilon I_m$ and $R_{k,\epsilon} \succeq \epsilon I_n$ for all $k$,

the optimizer matches the optimal rate $O(C/K^{1/4})$ (in Frobenius norm) of stochastic gradient descent, up to the explicit $\sqrt{m+n}$ factor in the nuclear-norm bound (Li et al., 12 Jan 2026).

The analysis crucially exploits:

  • $L$-smoothness to control function descent,
  • Hölder’s (Schatten-$p$) inequality for bounding inner products in terms of the nuclear norm,
  • matrix Cauchy–Schwarz inequalities for update stability,
  • careful control over the accumulation of preconditioners.

In the ideal case where $\|\nabla f(X)\|_* = \Theta(\sqrt{m+n})\,\|\nabla f(X)\|_F$, this factor is tight, so the practical convergence rate parallels that of SGD.

4. One-Sided vs. Two-Sided Preconditioning

AdamW-style Shampoo unifies one-sided and two-sided preconditioning under a common algebraic formulation parameterized by the choice of exponents $p$ and $q$:

  • Two-sided: standard Shampoo ($p = q = 2$): $X_{k+1} = X_k - \eta\, L_{k,\epsilon}^{-1/4} M_k R_{k,\epsilon}^{-1/4}$.
  • One-sided: for $p = 1$, $q = \infty$ (or vice versa), one of the preconditioners reduces to the identity, yielding a “row-wise” or “column-wise” form.

This parameterization enables flexible adaptation to the structure of the problem and, in particular, permits targeted preconditioning along selected tensor modes for higher-order tensors. Such flexibility allows the optimizer to interpolate between full-matrix and separate-mode updates (Li et al., 12 Jan 2026, Gupta et al., 2018).
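The role of the exponent pair $(p, q)$ can be illustrated with a small NumPy sketch; the helper names are ours, and the one-sided limit is realized by replacing the corresponding preconditioner with the identity:

```python
import numpy as np

def precondition(M, L_eps, R_eps, p, q):
    """Apply L_eps^{-1/(2p)} M R_eps^{-1/(2q)} via eigendecompositions."""
    def fpow(S, e):
        w, V = np.linalg.eigh(S)
        return (V * np.clip(w, 1e-30, None)**e) @ V.T
    # An infinite exponent sends -1/(2q) to zero, i.e. an identity preconditioner.
    left = fpow(L_eps, -1.0 / (2 * p)) if np.isfinite(p) else np.eye(M.shape[0])
    right = fpow(R_eps, -1.0 / (2 * q)) if np.isfinite(q) else np.eye(M.shape[1])
    return left @ M @ right

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 3))
M = G.copy()
L_eps = G @ G.T + 1e-6 * np.eye(4)
R_eps = G.T @ G + 1e-6 * np.eye(3)

two_sided = precondition(M, L_eps, R_eps, p=2, q=2)       # classical Shampoo: -1/4 roots
one_sided = precondition(M, L_eps, R_eps, p=1, q=np.inf)  # left-only: L_eps^{-1/2} M
```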

5. Implementation and Complexity

For an $m \times n$ parameter matrix:

  • Memory requirements: storage of two dense $m \times m$ and $n \times n$ accumulators for $L_k$ and $R_k$, plus additional memory for the momentum $M_k$ and the original parameters $X_k$.
  • Computation per step: updating the accumulators costs $O(m^2 n + n^2 m)$ flops, matrix roots/inverse roots cost $O(m^3 + n^3)$ (typically amortized over several steps), and the core update is performed via matrix multiplications.
  • Decoupled weight decay: implemented by scaling $(1 - \lambda) X_k$ before the preconditioned gradient step, with no interaction between the decay and the adaptive accumulators (Li et al., 12 Jan 2026, Gupta et al., 2018).

Relative to AdamW, Shampoo and its AdamW-style variant have increased complexity due to full-matrix second-moment tracking, but their per-step overhead remains comparable for typical (sub-1000 dimension) deep learning layers.
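The memory gap can be made concrete with a back-of-the-envelope calculation, assuming float32 state and the accumulators listed above (a rough illustration that ignores framework overhead):

```python
def shampoo_state_bytes(m, n, dtype_bytes=4):
    """Optimizer state for one m x n layer: M (m x n), L (m x m), R (n x n)."""
    return dtype_bytes * (m * n + m * m + n * n)

def adamw_state_bytes(m, n, dtype_bytes=4):
    """AdamW state for one m x n layer: element-wise first and second moments."""
    return dtype_bytes * 2 * m * n

# For a square 1024 x 1024 layer the full-matrix accumulators cost 12 MiB
# versus AdamW's 8 MiB; the gap widens for tall or wide layers.
m, n = 1024, 1024
print(shampoo_state_bytes(m, n) / 2**20, "MiB vs", adamw_state_bytes(m, n) / 2**20, "MiB")
```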

6. Extensions and Empirical Performance

Recent work has explored the interaction of Shampoo-style preconditioning with first-moment adaptation, most notably through connections to Adafactor and variants like SOAP, which runs Adam in the eigenbasis of Shampoo’s preconditioners. SOAP achieves further empirical improvements in large-batch LLM pretraining but also highlights the importance of fresh eigendecomposition for high-frequency preconditioner updates (Vyas et al., 2024).

The AdamW-style Shampoo optimizer won the external tuning track of AlgoPerf, evidencing its effectiveness in practical neural network training (Li et al., 12 Jan 2026). The addition of decoupled weight decay is particularly important as it preserves the separation between regularization and adaptivity, unlike classical (coupled) $\ell_2$ regularization.

  • Classical Shampoo: Employs only second-moment modes for preconditioning without built-in weight decay; regularization can be appended either in a coupled or decoupled fashion (Gupta et al., 2018).
  • SGD and AdamW: Track only diagonal or element-wise moments, which limits invariance to parameter scaling and reduces sharpness adaptation compared to the full-matrix approach.
  • SOAP (Vyas et al., 2024): Provides a formal bridge between Shampoo and Adafactor/Adam, demonstrating that adaptive diagonal moment tracking in a rotated basis yields improved robustness for eigendecomposition frequency—suggesting further directions for adaptive preconditioned optimizers.

Empirical results consistently show that second-order geometry and sophisticated preconditioning (as in Shampoo and AdamW-style Shampoo) can substantially improve optimization speed and stability for large-scale, ill-conditioned modern neural networks, provided the additional memory and computational requirements are managed appropriately.
