
AdamW Timescale Framework

Updated 19 January 2026
  • AdamW Timescale is a conceptual framework that interprets the weight decay hyperparameter as controlling an exponential moving average (EMA) timescale, defining memory and regularization characteristics.
  • The framework provides explicit scaling rules, indicating that weight decay should decrease inversely with dataset size and increase linearly with model width under μP scaling.
  • Empirical studies on models like ResNet-18, ViT, and NanoGPT validate that maintaining a constant EMA timescale leads to robust generalization and consistent convergence across varying training conditions.

AdamW timescale designates the conceptual and practical framework linking the weight decay hyperparameter in AdamW (decoupled weight decay Adam) to an underlying exponential moving average (EMA) timescale, providing explicit scaling rules for optimizer configuration across model and data scales. Underlying this framework are key theoretical, algorithmic, and empirical insights into adaptive optimization, scale-freeness, and robust generalization in deep learning.

1. AdamW as an EMA and the Definition of Timescale

AdamW performs parameter updates as

$$w_t = (1 - \eta\lambda)\,w_{t-1} - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

where $w_t$ is the parameter vector at step $t$, $\eta$ the learning rate, $\lambda$ the decoupled weight decay, and $\hat m_t, \hat v_t$ are bias-corrected first and second moment EMAs of the stochastic gradients. This update is structurally equivalent to an ordinary EMA:

$$\mathrm{EMA}_t = (1 - 1/\tau)\,\mathrm{EMA}_{t-1} + (1/\tau)\,q_t$$

with EMA timescale $\tau_\mathrm{iter} = 1/(\eta\lambda)$. Interpreting the parameter update as an EMA over negative gradient steps yields a direct correspondence: $\lambda$ tunes the “memory” of the EMA process—large $\lambda$ means short memory and aggressive decay, small $\lambda$ means long memory and gentle regularization (Wang et al., 2024).

Converting to epochs, for a dataset with $M$ minibatches per epoch,

$$\tau_\mathrm{epoch} = \frac{1}{\eta\lambda M}$$

which allows timescale analysis in units natural to model and dataset scaling.
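The hyperparameter-to-timescale mapping can be checked numerically. The settings below ($\eta$, $\lambda$, $M$) are illustrative, not taken from the cited papers; the sketch also confirms that, with zero gradients, AdamW contracts weights to roughly $1/e$ of their value after $\tau_\mathrm{iter}$ steps:

```python
import math

# Hypothetical settings: learning rate, decoupled weight decay, minibatches/epoch.
eta, lam, M = 1e-3, 0.1, 500

tau_iter = 1.0 / (eta * lam)   # EMA timescale in iterations: 1/(eta*lam) ≈ 10000
tau_epoch = tau_iter / M       # the same timescale in epochs ≈ 20

# With zero gradients, AdamW shrinks weights geometrically: w_t = (1 - eta*lam)^t w_0,
# so after tau_iter steps the weight has decayed by roughly a factor of 1/e.
w = 1.0
for _ in range(round(tau_iter)):
    w *= (1 - eta * lam)

print(tau_iter, tau_epoch)     # ≈ 10000 iterations, ≈ 20 epochs
print(w, math.exp(-1))         # both ≈ 0.368
```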

2. Timescale–Weight Decay Mapping and Scaling Laws

There exists a one-to-one mapping between the EMA timescale and the AdamW weight decay hyperparameter for a fixed learning rate:

$$\lambda = \frac{1}{\eta\,\tau_\mathrm{iter}}$$

$$\lambda = \frac{1}{\eta M\,\tau_\mathrm{epoch}}$$

Setting an optimal $\tau_\mathrm{epoch}^*$ (in the “sweet spot” of 1–5 epochs, empirically stable across problem scales) gives explicit scaling rules:

  • Increasing the dataset size ($M \uparrow$) while keeping $\tau_\mathrm{epoch}$ fixed requires $\lambda \propto 1/M$
  • Under μP learning-rate scaling ($\eta \propto 1/\text{fan-in}$), keeping $\tau_\mathrm{iter}$ fixed implies $\lambda \propto \text{fan-in}$, i.e., $\lambda$ should increase linearly with model width (Wang et al., 2024)

This formalism underlies robust optimizer transfer across dataset and model sizes.
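As a sketch, the mapping above can be wrapped in a small helper; the baseline values ($\eta = 10^{-3}$, $M = 1000$, $\tau_\mathrm{epoch}^* = 2$) are hypothetical:

```python
def weight_decay_for_timescale(eta: float, M: int, tau_epoch: float) -> float:
    """Invert tau_epoch = 1/(eta*lam*M) to obtain the AdamW weight decay."""
    return 1.0 / (eta * M * tau_epoch)

# Hypothetical baseline: eta=1e-3, 1000 minibatches/epoch, target timescale 2 epochs.
lam = weight_decay_for_timescale(1e-3, 1000, 2.0)      # 0.5
# Doubling the dataset (M -> 2M) at fixed eta and tau_epoch halves lambda:
lam_2x = weight_decay_for_timescale(1e-3, 2000, 2.0)   # 0.25
```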

3. Proximal Perspective and Timescale Robustness

AdamW can be derived as a first-order approximation to a diagonal proximal step for the composite objective $F(x) = f(x) + \tfrac{\lambda}{2}\|x\|^2$. The AdamProx rule,

$$x_t = (I + \lambda\eta_t I)^{-1}\,(x_{t-1} - \eta_t p_t)$$

with $p_t = \alpha\,\hat m_t/(\sqrt{\hat v_t}+\epsilon)$, Taylor-expands for small $\lambda\eta_t$ to the AdamW update. The proximal step effectively decouples the regularization “pull” toward zero from the noisy direction of the adaptive gradient, conferring robustness against heterogeneity in gradient magnitudes—i.e., differing timescales for parameter drift across layers. AdamW maintains a consistent contraction rate for all coordinates, even with vanishing or exploding gradients (Zhuang et al., 2022).
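The claimed first-order equivalence can be verified numerically. The moment estimates below are random stand-ins, not outputs of a real training run; the two updates should agree up to $O((\lambda\eta)^2)$ terms:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, lam, eps = 1e-3, 0.1, 1e-8
x = rng.normal(size=5)
m_hat = rng.normal(size=5)          # stand-in bias-corrected first moment
v_hat = rng.random(5) + 0.1         # stand-in bias-corrected second moment
p = m_hat / (np.sqrt(v_hat) + eps)  # adaptive update direction

# Proximal (AdamProx) step: exact shrinkage toward zero after the gradient step.
x_prox = (x - eta * p) / (1 + lam * eta)
# AdamW step: first-order Taylor expansion of the proximal step in lam*eta.
x_adamw = (1 - eta * lam) * x - eta * p

# The two differ only at second order in lam*eta, so they nearly coincide here.
print(np.max(np.abs(x_prox - x_adamw)))
```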

4. Scale-Freeness and Invariance Across Timescales

AdamW is scale-free: its iterates are invariant when the per-coordinate gradients are rescaled by any fixed positive vector. This property is absent in, for example, Adam-$\ell_2$, where the inhomogeneous regularizer term entangles gradient and parameter scales. Scale-freeness is operationally a type of automatic diagonal preconditioning: by neutralizing disparities in local gradient scales, the effective convergence timescales of all parameters are equalized, even in highly ill-conditioned or deep architectures (Zhuang et al., 2022). This is critical for consistent optimization when batch normalization is absent or depth-induced scaling pathologies are present.
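A minimal numerical check of scale-freeness, under the assumption $\epsilon = 0$ (which isolates the invariance): rescaling every gradient coordinate by a fixed positive factor leaves the AdamW iterates unchanged, because $\hat m_t$ scales by $c$ while $\sqrt{\hat v_t}$ also scales by $c$.

```python
import numpy as np

def adamw_steps(grads, eta=1e-3, lam=0.1, b1=0.9, b2=0.999, eps=0.0):
    """Run AdamW on a fixed gradient sequence; eps=0 isolates scale-freeness."""
    x = np.ones_like(grads[0])
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        m_hat = m / (1 - b1**t)            # bias-corrected first moment
        v_hat = v / (1 - b2**t)            # bias-corrected second moment
        x = (1 - eta * lam) * x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

rng = np.random.default_rng(1)
grads = [rng.normal(size=4) for _ in range(50)]
c = np.array([1e-3, 1.0, 10.0, 1e4])       # fixed per-coordinate rescaling

x_base = adamw_steps(grads)
x_scaled = adamw_steps([c * g for g in grads])
# Scale-free: the rescaled gradient stream produces the same iterates.
print(np.max(np.abs(x_base - x_scaled)))   # ~0 (up to floating point)
```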

5. Empirical Evidence and Prescriptive Rules

Empirical studies with ResNet-18 and ViT on CIFAR-10 and ImageNet, and NanoGPT on OpenWebText, demonstrate:

  • Optimal $\tau_\mathrm{epoch}$ (best test accuracy) lies in the range $1$–$8$ epochs, invariant across large variations in dataset size
  • Optimal $\lambda$ drops sharply as dataset size increases when $\tau_\mathrm{epoch}$ is held constant
  • μP model scaling (varying network width) with $\lambda \propto \text{fan-in}$ keeps learning curves and optimal learning rates synchronized across widths (Wang et al., 2024)

Further, large-scale LLM pretraining runs (Llama 1 & 2, Stable-LM) operate with initial/final $\tau_\mathrm{epoch} \approx 0.1$–$3$, confirming the practical invariance across scales.

Table: AdamW Weight Decay–Timescale Relations

| Parameter | Formula | Scaling Rule |
| --- | --- | --- |
| $\tau_\mathrm{iter}$ | $1/(\eta\lambda)$ | Hold constant across architectures |
| $\tau_\mathrm{epoch}$ | $1/(\eta\lambda M)$ | Hold constant across datasets |
| $\lambda$ | $1/(\eta M \tau_\mathrm{epoch})$ | $\propto 1/M$; $\propto$ model fan-in |

6. Theoretical Convergence and Practical Timescales

The AdamW theoretical convergence rate,

$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\big[\|\nabla f(x^k)\|_1\big] \leq O\!\left(\frac{\sqrt{d}\,C}{K^{1/4}}\right)$$

holds under standard assumptions ($L$-smoothness, unbiased gradients, bounded variance, and $\ell_\infty$-confinement of iterates). Under a Gaussian-gradient model, $\mathbb{E}\big[\|\nabla f(x)\|_1\big] \asymp \sqrt{d}\,\mathbb{E}\big[\|\nabla f(x)\|_2\big]$, making the $\ell_1$ convergence directly analogous to the optimal $O(C/K^{1/4})$ SGD bound in the $\ell_2$ norm (Li et al., 17 May 2025).

Empirical validation shows this scaling holds in practice—gradient norms during training satisfy $\|\nabla f(x)\|_1/\|\nabla f(x)\|_2 \approx \sqrt{d}$ for ResNet-50 and GPT-2, and training loss decays smoothly at this theoretical rate (Li et al., 17 May 2025).
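The Gaussian-gradient norm relation is easy to reproduce on synthetic gradients. Note that for i.i.d. Gaussian coordinates the exact proportionality constant is $\sqrt{2/\pi} \approx 0.798$, consistent with the $\asymp \sqrt{d}$ statement up to a constant:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10_000, 1_000_000):
    g = rng.normal(size=d)                       # synthetic Gaussian "gradient"
    ratio = np.abs(g).sum() / np.linalg.norm(g)  # ||g||_1 / ||g||_2
    # E|g_i| / sqrt(E g_i^2) = sqrt(2/pi), so ratio/sqrt(d) -> 0.798 as d grows.
    print(d, ratio / np.sqrt(d))
```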

7. Practical Guidance and Implications

AdamW timescale analysis prescribes that practitioners:

  • Fix an EMA timescale $\tau_\mathrm{epoch}$ in the interval $[1, N_\mathrm{epochs}]$
  • For a given learning rate and batch configuration, set $\lambda = 1/(\eta M \tau_\mathrm{epoch}^*)$
  • Halve $\lambda$ when doubling the dataset size at fixed $\eta$; increase $\lambda$ linearly with model width under μP scaling
  • Monitor test accuracy as a function of $\tau_\mathrm{epoch}$ (not $\lambda$), exploiting the empirical invariance for transferability between training regimes
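These prescriptions can be combined into one helper; the function name and baseline values below are hypothetical, a sketch of the transfer rules rather than a published recipe:

```python
def adamw_config(eta_base, width_base, width, M, tau_epoch):
    """Hypothetical helper: pick (eta, lambda) under muP width scaling
    while holding the EMA timescale tau_epoch fixed."""
    eta = eta_base * width_base / width    # muP: eta scales as 1/fan-in
    lam = 1.0 / (eta * M * tau_epoch)      # fixed tau_epoch => lam ∝ width, ∝ 1/M
    return eta, lam

eta1, lam1 = adamw_config(1e-3, 256, 256, 1000, 2.0)   # baseline width
eta2, lam2 = adamw_config(1e-3, 256, 512, 1000, 2.0)   # double the width
# lam doubles as eta halves, so tau_iter = 1/(eta*lam) is unchanged across widths.
```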

A plausible implication is that AdamW’s timescale framing unifies adaptive regularization choice across scales, mitigating the need for hand-tuning and supporting robust, predictable optimization behavior in modern deep networks (Wang et al., 2024, Zhuang et al., 2022).
