AdamW Timescale Framework
- AdamW Timescale is a conceptual framework that interprets the weight decay hyperparameter as controlling an exponential moving average (EMA) timescale, which determines the optimizer's memory and the strength of its regularization.
- The framework provides explicit scaling rules, indicating that weight decay should decrease inversely with dataset size and increase linearly with model width under μP scaling.
- Empirical studies on models like ResNet-18, ViT, and NanoGPT validate that maintaining a constant EMA timescale leads to robust generalization and consistent convergence across varying training conditions.
AdamW timescale designates the conceptual and practical framework linking the weight decay hyperparameter in AdamW (decoupled weight decay Adam) to an underlying exponential moving average (EMA) timescale, providing explicit scaling rules for optimizer configuration across model and data scales. Underlying this framework are key theoretical, algorithmic, and empirical insights into adaptive optimization, scale-freeness, and robust generalization in deep learning.
1. AdamW as an EMA and the Definition of Timescale
AdamW performs parameter updates as
$$\theta_t = (1 - \eta\lambda)\,\theta_{t-1} - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$
where $\theta_t$ is the parameter vector at step $t$, $\eta$ the learning rate, $\lambda$ the decoupled weight decay, and $\hat m_t$, $\hat v_t$ are bias-corrected first and second moment EMAs of the stochastic gradients. This update is structurally equivalent to an ordinary EMA:
$$\theta_t = (1 - \eta\lambda)\,\theta_{t-1} + \eta\lambda\left(-\frac{1}{\lambda}\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}\right),$$
with EMA timescale $\tau_{\text{iter}} = 1/(\eta\lambda)$. Interpreting the parameter update as an EMA over scaled negative gradient steps yields a direct correspondence: the product $\eta\lambda$ tunes the "memory" of the EMA process; large $\eta\lambda$ means short memory and aggressive decay, small $\eta\lambda$ means long memory and gentle regularization (Wang et al., 2024).
Converting to epochs, for a dataset with $M$ minibatches per epoch,
$$\tau_{\text{epoch}} = \frac{\tau_{\text{iter}}}{M} = \frac{1}{\eta\lambda M},$$
which allows timescale analysis in units natural to model and dataset scaling.
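The correspondence is easy to make concrete in code. Below is a minimal NumPy sketch (the variable names and the example configuration are illustrative, not from the source) that performs one AdamW step written in its EMA form and reports the implied timescales.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, eta=1e-3, lam=0.1,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step, rearranged to expose the EMA structure."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2         # second-moment EMA
    update = (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
    # Standard form:  theta <- (1 - eta*lam) * theta - eta * update
    # EMA form:       theta <- (1 - eta*lam) * theta + eta*lam * target
    target = -update / lam                        # the quantity the EMA averages
    theta = (1 - eta * lam) * theta + eta * lam * target
    return theta, m, v

rng = np.random.default_rng(0)
theta, m, v = adamw_step(np.ones(4), np.zeros(4), np.zeros(4),
                         rng.normal(size=4), t=1)

eta, lam = 1e-3, 0.1
M = 1000                                 # minibatches per epoch (assumed)
tau_iter = 1 / (eta * lam)               # timescale in iterations: 10,000
tau_epoch = tau_iter / M                 # timescale in epochs: 10
print(f"tau_iter = {tau_iter:.0f} steps, tau_epoch = {tau_epoch:.1f} epochs")
```

The EMA form is algebraically identical to the standard AdamW step; the rewrite only makes visible that $\eta\lambda$ plays the role of the EMA coefficient.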
2. Timescale–Weight Decay Mapping and Scaling Laws
There exists a one-to-one mapping between the EMA timescale and the AdamW weight decay hyperparameter for a fixed learning rate:
$$\lambda = \frac{1}{\eta\,\tau_{\text{iter}}} = \frac{1}{\eta\,\tau_{\text{epoch}}\,M}.$$
Setting an optimal $\tau_{\text{epoch}}$ (in the "sweet spot" of 1–5 epochs, empirically stable across problem scales) gives explicit scaling rules, checked numerically in the sketch below:
- Increasing dataset size $N$ while keeping $\tau_{\text{epoch}}$ fixed requires $\lambda \propto 1/N$, since $M \propto N$ at fixed batch size
- Under μP learning rate scaling ($\eta \propto 1/\text{width}$), keeping $\tau_{\text{iter}}$ fixed implies $\lambda \propto \text{width}$, i.e., $\lambda$ should increase linearly with model width (Wang et al., 2024)
This formalism underlies robust optimizer transfer across dataset and model sizes.
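As a numerical sanity check on the mapping, the following sketch (the helper name `weight_decay_for_timescale` and all configuration numbers are hypothetical) derives $\lambda$ from a target $\tau_{\text{epoch}}$ and verifies both transfer rules.

```python
def weight_decay_for_timescale(eta, tau_epoch, dataset_size, batch_size):
    """Solve lambda = 1 / (eta * tau_epoch * M) for M = N / B."""
    m = dataset_size / batch_size             # minibatches per epoch
    return 1.0 / (eta * tau_epoch * m)

base = weight_decay_for_timescale(eta=1e-3, tau_epoch=2.0,
                                  dataset_size=50_000, batch_size=128)

# Rule 1: doubling the dataset at fixed tau_epoch halves lambda.
doubled = weight_decay_for_timescale(eta=1e-3, tau_epoch=2.0,
                                     dataset_size=100_000, batch_size=128)
assert abs(doubled - base / 2) < 1e-12

# Rule 2: under muP, eta ~ 1/width, so fixing tau forces lambda ~ width.
half_lr = weight_decay_for_timescale(eta=0.5e-3, tau_epoch=2.0,
                                     dataset_size=50_000, batch_size=128)
assert abs(half_lr - 2 * base) < 1e-12

print(base, doubled, half_lr)   # 1.28, 0.64, 2.56
```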
3. Proximal Perspective and Timescale Robustness
AdamW can be derived as a first-order approximation to a diagonal proximal step for the composite objective $f(\theta) + \frac{\lambda}{2}\|\theta\|_2^2$. The AdamProx rule,
$$\theta_t = \frac{1}{1 + \eta\lambda}\left(\theta_{t-1} - \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}\right),$$
with $\frac{1}{1+\eta\lambda} = 1 - \eta\lambda + O\!\left((\eta\lambda)^2\right)$, Taylor expands for small $\eta\lambda$ to the AdamW update. The proximal step effectively decouples the regularization "pull" toward zero from the noisy direction of the adaptive gradient, conferring robustness against heterogeneity in gradient magnitudes, i.e., differing timescales for parameter drift across layers. AdamW maintains a consistent contraction rate for all coordinates, even with vanishing or exploding gradients (Zhuang et al., 2022).
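A quick numeric check, assuming the proximal form stated above, shows the two updates agree to second order in the step size:

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.5])
update = np.array([0.3, -0.1, 0.2])    # stands in for m_hat / (sqrt(v_hat) + eps)
eta, lam = 1e-3, 0.1

theta_prox  = (theta - eta * update) / (1 + eta * lam)   # proximal (AdamProx) step
theta_adamw = (1 - eta * lam) * theta - eta * update     # AdamW step

# Difference is O((eta*lam)^2) + O(eta^2 * lam), since 1/(1+x) = 1 - x + O(x^2).
print(np.max(np.abs(theta_prox - theta_adamw)))          # tiny, ~1e-8
```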
4. Scale-Freeness and Invariance Across Timescales
AdamW is scale-free: its iterates are invariant when the per-coordinate gradients are rescaled by any fixed positive vector. This property is absent in, for example, Adam-$\ell_2$ (Adam applied to the $\ell_2$-regularized objective), where the regularizer's gradient $\lambda\theta$ enters the moment estimates and entangles gradient and parameter scales. Scale-freeness is operationally a type of automatic diagonal preconditioning: by neutralizing disparities in local gradient scales, the effective convergence timescales of all parameters are equalized, even in highly ill-conditioned or deep architectures (Zhuang et al., 2022). This is critical for consistent optimization when batch normalization is absent or depth-induced scaling pathologies are present.
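The contrast can be demonstrated with a simplified single-tensor loop (a toy sketch, not a library implementation): rescaling gradients coordinate-wise leaves AdamW's trajectory essentially unchanged, while the Adam-$\ell_2$ trajectory shifts.

```python
import numpy as np

def run(opt, grads, scale, eta=1e-2, lam=0.1, b1=0.9, b2=0.999, eps=1e-12):
    """Toy optimizer loop; `opt` is 'adamw' or 'adam_l2'."""
    theta = np.ones(3)
    m, v = np.zeros(3), np.zeros(3)
    for t, g in enumerate(grads, start=1):
        g = g * scale                        # fixed per-coordinate rescaling
        if opt == "adam_l2":
            g = g + lam * theta              # l2 penalty enters the moments
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        step = (m / (1 - b1**t)) / (np.sqrt(v / (1 - b2**t)) + eps)
        if opt == "adamw":
            theta = (1 - eta * lam) * theta - eta * step   # decoupled decay
        else:
            theta = theta - eta * step
    return theta

rng = np.random.default_rng(0)
grads = [rng.normal(size=3) for _ in range(200)]
for opt in ("adamw", "adam_l2"):
    a = run(opt, grads, scale=np.ones(3))
    b = run(opt, grads, scale=np.array([1e3, 1.0, 1e-3]))
    print(opt, np.max(np.abs(a - b)))   # ~0 for adamw; O(1) gap for adam_l2
```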
5. Empirical Evidence and Prescriptive Rules
Empirical studies with ResNet-18 and ViT on CIFAR-10 and ImageNet, and NanoGPT on OpenWebText, demonstrate:
- Optimal $\tau_{\text{epoch}}$ (best test accuracy) lies in the range 1–5 epochs, invariant across large variations in dataset size
- Optimal $\lambda$ drops sharply as dataset size increases when the learning rate is held constant, consistent with $\lambda \propto 1/N$
- μP model scaling (varying network width) with $\lambda \propto \text{width}$ keeps learning curves and optimal learning rates synchronized across widths (Wang et al., 2024)
Further, large-scale LLM pretraining runs (Llama 1 & 2, StableLM) operate with implied initial/final EMA timescales in this same regime, confirming the practical invariance across scales.
Table: AdamW Weight Decay–Timescale Relations
| Parameter | Formula | Scaling Rule |
|---|---|---|
| $\tau_{\text{iter}}$ | $1/(\eta\lambda)$ | Hold constant across architectures |
| $\tau_{\text{epoch}}$ | $1/(\eta\lambda M)$ | Hold constant across datasets |
| $\lambda$ | $1/(\eta\,\tau_{\text{iter}})$ | $\lambda \propto 1/N$ at fixed $\tau_{\text{epoch}}$; $\lambda \propto$ model fan-in under μP |
6. Theoretical Convergence and Practical Timescales
The AdamW theoretical convergence rate,
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\,\|\nabla f(\theta_t)\|_1 = O\!\left(\frac{\sqrt{d}\,C}{T^{1/4}}\right),$$
where $C$ collects the smoothness and gradient-variance constants, holds under standard assumptions ($L$-smoothness, unbiased gradients, bounded variance, and $\ell_\infty$-norm confinement of the iterates). Under a Gaussian-gradient model, $\|\nabla f\|_1 = \Theta\!\left(\sqrt{d}\,\|\nabla f\|_2\right)$, making the convergence directly analogous to the optimal SGD bound stated in the $\ell_2$ norm (Li et al., 17 May 2025).
Empirical validation shows this scaling holds in practice: gradient norms during training satisfy $\|\nabla f\|_1 = \Theta(\sqrt{d}\,\|\nabla f\|_2)$ for ResNet50 and GPT2, and training loss decays smoothly at this theoretical rate (Li et al., 17 May 2025).
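The Gaussian-gradient heuristic is easy to verify directly. The sketch below samples $g \sim \mathcal{N}(0, I_d)$ and confirms $\|g\|_1 / \|g\|_2 \approx \sqrt{2d/\pi}$, which is $\Theta(\sqrt{d})$ (since $\mathbb{E}|g_i| = \sqrt{2/\pi}$ while $\|g\|_2 \approx \sqrt{d}$).

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 1_000, 100_000):
    g = rng.normal(size=d)
    ratio = np.linalg.norm(g, 1) / np.linalg.norm(g, 2)
    print(d, ratio, np.sqrt(2 * d / np.pi))   # ratio tracks sqrt(2d/pi)
```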
7. Practical Guidance and Implications
AdamW timescale analysis prescribes that practitioners:
- Fix an EMA timescale $\tau_{\text{epoch}}$ in the interval of roughly 1–5 epochs
- For a given learning rate and batch configuration, set $\lambda = 1/(\eta\,\tau_{\text{epoch}}\,M)$ (wired up in the sketch after this list)
- Halve $\lambda$ when doubling the dataset size at fixed $\tau_{\text{epoch}}$ and $\eta$; increase $\lambda$ linearly with model width under μP scaling
- Monitor test accuracy as a function of $\tau_{\text{epoch}}$ (not $\lambda$), exploiting empirical invariance for transferability between training regimes
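In PyTorch terms, the prescription reduces to choosing $\tau_{\text{epoch}}$ first and deriving `weight_decay` from it. A minimal sketch follows, with a hypothetical stand-in model and assumed data configuration; PyTorch's AdamW applies decoupled decay as $\theta \leftarrow (1 - \eta\lambda)\theta$, matching the mapping above.

```python
import torch

model = torch.nn.Linear(512, 512)          # stand-in model (hypothetical)
eta = 3e-4
tau_epoch = 2.0                             # inside the 1-5 epoch sweet spot
dataset_size, batch_size = 1_000_000, 256   # assumed data configuration
M = dataset_size / batch_size               # minibatches per epoch

weight_decay = 1.0 / (eta * tau_epoch * M)  # lambda = 1 / (eta * tau * M)
optimizer = torch.optim.AdamW(model.parameters(), lr=eta,
                              weight_decay=weight_decay)
print(f"weight_decay = {weight_decay:.3f}") # halves if dataset_size doubles
```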
A plausible implication is that AdamW’s timescale framing unifies adaptive regularization choice across scales, mitigating the need for hand-tuning and supporting robust, predictable optimization behavior in modern deep networks (Wang et al., 2024, Zhuang et al., 2022).