
AdamW Timescale Framework

Updated 19 January 2026
  • AdamW Timescale is a conceptual framework that interprets the weight decay hyperparameter as controlling an exponential moving average (EMA) timescale, defining memory and regularization characteristics.
  • The framework provides explicit scaling rules, indicating that weight decay should decrease inversely with dataset size and increase linearly with model width under μP scaling.
  • Empirical studies on models like ResNet-18, ViT, and NanoGPT validate that maintaining a constant EMA timescale leads to robust generalization and consistent convergence across varying training conditions.

AdamW timescale designates the conceptual and practical framework linking the weight decay hyperparameter in AdamW (decoupled weight decay Adam) to an underlying exponential moving average (EMA) timescale, providing explicit scaling rules for optimizer configuration across model and data scales. Underlying this framework are key theoretical, algorithmic, and empirical insights into adaptive optimization, scale-freeness, and robust generalization in deep learning.

1. AdamW as an EMA and the Definition of Timescale

AdamW performs parameter updates as

$$w_t = (1 - \eta\lambda)\,w_{t-1} - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

where $w_t$ is the parameter vector at step $t$, $\eta$ the learning rate, $\lambda$ the decoupled weight decay, and $\hat m_t, \hat v_t$ are bias-corrected first and second moment EMAs of the stochastic gradients. This update is structurally equivalent to an ordinary EMA:

$$\mathrm{EMA}_t = (1 - 1/\tau)\,\mathrm{EMA}_{t-1} + (1/\tau)\,q_t$$

with EMA timescale $\tau_\mathrm{iter} = 1/(\eta\lambda)$. Interpreting the parameter update as an EMA over negative gradient steps yields a direct correspondence: $\lambda$ tunes the “memory” of the EMA process—large $\lambda$ means short memory and aggressive decay, small $\lambda$ means long memory and gentle regularization (Wang et al., 2024).

Converting to epochs, for a dataset with $M$ minibatches per epoch,

$$\tau_\mathrm{epoch} = \frac{1}{\eta\lambda M}$$

which allows timescale analysis in units natural to model and dataset scaling.
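The hyperparameter-to-timescale mapping can be checked numerically. The settings below ($\eta$, $\lambda$, $M$) are illustrative, not taken from the cited papers; the sketch also confirms that, with zero gradients, AdamW contracts weights to roughly $1/e$ of their value after $\tau_\mathrm{iter}$ steps:

```python
import math

# Hypothetical settings: learning rate, decoupled weight decay, minibatches/epoch.
eta, lam, M = 1e-3, 0.1, 500

tau_iter = 1.0 / (eta * lam)   # EMA timescale in iterations: 1/(eta*lam) ≈ 10000
tau_epoch = tau_iter / M       # the same timescale in epochs ≈ 20

# With zero gradients, AdamW shrinks weights geometrically: w_t = (1 - eta*lam)^t w_0,
# so after tau_iter steps the weight has decayed by roughly a factor of 1/e.
w = 1.0
for _ in range(round(tau_iter)):
    w *= (1 - eta * lam)

print(tau_iter, tau_epoch)     # ≈ 10000 iterations, ≈ 20 epochs
print(w, math.exp(-1))         # both ≈ 0.368
```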

2. Timescale–Weight Decay Mapping and Scaling Laws

There exists a one-to-one mapping between the EMA timescale and the AdamW weight decay hyperparameter for a fixed learning rate:

$$\lambda = \frac{1}{\eta\,\tau_\mathrm{iter}}$$

$$\lambda = \frac{1}{\eta M\,\tau_\mathrm{epoch}}$$

Setting an optimal $\tau_\mathrm{epoch}^*$ (in the “sweet spot” of 1–5 epochs, empirically stable across problem scales) gives explicit scaling rules:

  • Increasing the dataset size ($M \uparrow$) while keeping $\tau_\mathrm{epoch}$ fixed requires $\lambda \propto 1/M$
  • Under μP learning-rate scaling ($\eta \propto 1/\text{fan-in}$), keeping $\tau_\mathrm{iter}$ fixed implies $\lambda \propto \text{fan-in}$, i.e., $\lambda$ should increase linearly with model width (Wang et al., 2024)

This formalism underlies robust optimizer transfer across dataset and model sizes.
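As a sketch, the mapping above can be wrapped in a small helper; the baseline values ($\eta = 10^{-3}$, $M = 1000$, $\tau_\mathrm{epoch}^* = 2$) are hypothetical:

```python
def weight_decay_for_timescale(eta: float, M: int, tau_epoch: float) -> float:
    """Invert tau_epoch = 1/(eta*lam*M) to obtain the AdamW weight decay."""
    return 1.0 / (eta * M * tau_epoch)

# Hypothetical baseline: eta=1e-3, 1000 minibatches/epoch, target timescale 2 epochs.
lam = weight_decay_for_timescale(1e-3, 1000, 2.0)      # 0.5
# Doubling the dataset (M -> 2M) at fixed eta and tau_epoch halves lambda:
lam_2x = weight_decay_for_timescale(1e-3, 2000, 2.0)   # 0.25
```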

3. Proximal Perspective and Timescale Robustness

AdamW can be derived as a first-order approximation to a diagonal proximal step for the composite objective $F(x) = f(x) + \tfrac{\lambda}{2}\|x\|^2$. The AdamProx rule,

$$x_t = (I + \lambda\eta_t I)^{-1}\,(x_{t-1} - \eta_t p_t)$$

with $p_t = \alpha\,\hat m_t/(\sqrt{\hat v_t}+\epsilon)$, Taylor-expands for small $\lambda\eta_t$ to the AdamW update. The proximal step effectively decouples the regularization “pull” toward zero from the noisy direction of the adaptive gradient, conferring robustness against heterogeneity in gradient magnitudes—i.e., differing timescales for parameter drift across layers. AdamW maintains a consistent contraction rate for all coordinates, even with vanishing or exploding gradients (Zhuang et al., 2022).
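The claimed first-order equivalence can be verified numerically. The moment estimates below are random stand-ins, not outputs of a real training run; the two updates should agree up to $O((\lambda\eta)^2)$ terms:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, lam, eps = 1e-3, 0.1, 1e-8
x = rng.normal(size=5)
m_hat = rng.normal(size=5)          # stand-in bias-corrected first moment
v_hat = rng.random(5) + 0.1         # stand-in bias-corrected second moment
p = m_hat / (np.sqrt(v_hat) + eps)  # adaptive update direction

# Proximal (AdamProx) step: exact shrinkage toward zero after the gradient step.
x_prox = (x - eta * p) / (1 + lam * eta)
# AdamW step: first-order Taylor expansion of the proximal step in lam*eta.
x_adamw = (1 - eta * lam) * x - eta * p

# The two differ only at second order in lam*eta, so they nearly coincide here.
print(np.max(np.abs(x_prox - x_adamw)))
```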

4. Scale-Freeness and Invariance Across Timescales

AdamW is scale-free: its iterates are invariant when the per-coordinate gradients are rescaled by any fixed positive vector. This property is absent in, for example, Adam-$\ell_2$, where the inhomogeneous regularizer term entangles gradient and parameter scales. Scale-freeness is operationally a type of automatic diagonal preconditioning: by neutralizing disparities in local gradient scales, the effective convergence timescales of all parameters are equalized, even in highly ill-conditioned or deep architectures (Zhuang et al., 2022). This is critical for consistent optimization when batch normalization is absent or depth-induced scaling pathologies are present.
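A minimal numerical check of scale-freeness, under the assumption $\epsilon = 0$ (which isolates the invariance): rescaling every gradient coordinate by a fixed positive factor leaves the AdamW iterates unchanged, because $\hat m_t$ scales by $c$ while $\sqrt{\hat v_t}$ also scales by $c$.

```python
import numpy as np

def adamw_steps(grads, eta=1e-3, lam=0.1, b1=0.9, b2=0.999, eps=0.0):
    """Run AdamW on a fixed gradient sequence; eps=0 isolates scale-freeness."""
    x = np.ones_like(grads[0])
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)            # bias-corrected first moment
        v_hat = v / (1 - b2**t)            # bias-corrected second moment
        x = (1 - eta * lam) * x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

rng = np.random.default_rng(1)
grads = [rng.normal(size=4) for _ in range(50)]
c = np.array([1e-3, 1.0, 10.0, 1e4])       # fixed per-coordinate rescaling

x_base = adamw_steps(grads)
x_scaled = adamw_steps([c * g for g in grads])
# Scale-free: the rescaled gradient stream produces the same iterates.
print(np.max(np.abs(x_base - x_scaled)))   # ~0 (up to floating point)
```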

5. Empirical Evidence and Prescriptive Rules

Empirical studies with ResNet-18 and ViT on CIFAR-10 and ImageNet, and NanoGPT on OpenWebText, demonstrate:

  • Optimal $\tau_\mathrm{epoch}$ (best test accuracy) lies in the range $1$–$8$ epochs, invariant across large variations in dataset size
  • Optimal $\lambda$ drops sharply as dataset size increases when $\tau_\mathrm{epoch}$ is held constant
  • μP model scaling (varying network width) with $\lambda \propto \text{fan-in}$ keeps learning curves and optimal learning rates synchronized across widths (Wang et al., 2024)

Further, large-scale LLM pretraining runs (Llama 1 & 2, Stable-LM) operate with initial/final $\tau_\mathrm{epoch} \approx 0.1$–$3$, confirming the practical invariance across scales.

Table: AdamW Weight Decay–Timescale Relations

| Parameter | Formula | Scaling Rule |
| --- | --- | --- |
| $\tau_\mathrm{iter}$ | $1/(\eta\lambda)$ | Hold constant across architectures |
| $\tau_\mathrm{epoch}$ | $1/(\eta\lambda M)$ | Hold constant across datasets |
| $\lambda$ | $1/(\eta M \tau_\mathrm{epoch})$ | $\propto 1/M$; $\propto$ model fan-in |

6. Theoretical Convergence and Practical Timescales

The AdamW theoretical convergence rate,

$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\big[\|\nabla f(x^k)\|_1\big] \leq O\!\left(\frac{\sqrt{d}\,C}{K^{1/4}}\right)$$

holds under standard assumptions ($L$-smoothness, unbiased gradients, bounded variance, and $\ell_\infty$-confinement of iterates). Under a Gaussian-gradient model, $\mathbb{E}\big[\|\nabla f(x)\|_1\big] \asymp \sqrt{d}\,\mathbb{E}\big[\|\nabla f(x)\|_2\big]$, making the $\ell_1$ convergence directly analogous to the optimal $O(C/K^{1/4})$ SGD bound in the $\ell_2$ norm (Li et al., 17 May 2025).

Empirical validation shows this scaling holds in practice—gradient norms during training satisfy $\|\nabla f(x)\|_1/\|\nabla f(x)\|_2 \approx \sqrt{d}$ for ResNet-50 and GPT-2, and training loss decays smoothly at this theoretical rate (Li et al., 17 May 2025).
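The Gaussian-gradient norm relation is easy to reproduce on synthetic gradients. Note that for i.i.d. Gaussian coordinates the exact proportionality constant is $\sqrt{2/\pi} \approx 0.798$, consistent with the $\asymp \sqrt{d}$ statement up to a constant:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10_000, 1_000_000):
    g = rng.normal(size=d)                       # synthetic Gaussian "gradient"
    ratio = np.abs(g).sum() / np.linalg.norm(g)  # ||g||_1 / ||g||_2
    # E|g_i| / sqrt(E g_i^2) = sqrt(2/pi), so ratio/sqrt(d) -> 0.798 as d grows.
    print(d, ratio / np.sqrt(d))
```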

7. Practical Guidance and Implications

AdamW timescale analysis prescribes that practitioners:

  • Fix an EMA timescale $\tau_\mathrm{epoch}$ in the interval $[1, N_\mathrm{epochs}]$
  • For a given learning rate and batch configuration, set $\lambda = 1/(\eta M \tau_\mathrm{epoch}^*)$
  • Halve $\lambda$ when doubling the dataset size at fixed $\eta$; increase $\lambda$ linearly with model width under μP scaling
  • Monitor test accuracy as a function of $\tau_\mathrm{epoch}$ (not $\lambda$), exploiting the empirical invariance for transferability between training regimes
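These prescriptions can be combined into one helper; the function name and baseline values below are hypothetical, a sketch of the transfer rules rather than a published recipe:

```python
def adamw_config(eta_base, width_base, width, M, tau_epoch):
    """Hypothetical helper: pick (eta, lambda) under muP width scaling
    while holding the EMA timescale tau_epoch fixed."""
    eta = eta_base * width_base / width    # muP: eta scales as 1/fan-in
    lam = 1.0 / (eta * M * tau_epoch)      # fixed tau_epoch => lam ∝ width, ∝ 1/M
    return eta, lam

eta1, lam1 = adamw_config(1e-3, 256, 256, 1000, 2.0)   # baseline width
eta2, lam2 = adamw_config(1e-3, 256, 512, 1000, 2.0)   # double the width
# lam doubles as eta halves, so tau_iter = 1/(eta*lam) is unchanged across widths.
```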

A plausible implication is that AdamW’s timescale framing unifies adaptive regularization choice across scales, mitigating the need for hand-tuning and supporting robust, predictable optimization behavior in modern deep networks (Wang et al., 2024, Zhuang et al., 2022).
