Logarithmic-Time Weight-Decay
- This work introduces logarithmic-time weight-decay, in which a sublinearly decaying L2-regularization coefficient aligns with information-theoretic and optimization principles to improve training efficiency.
- It details reciprocal (1/t) and logarithmic decay schedules that, with a warmup phase, prevent singularities and maintain stability in gradient updates.
- Empirical studies show that this approach yields 10–15% compute savings and robust performance across scales, especially in training language models.
Logarithmic-time weight-decay refers to a class of weight-decay schedules in which the $\ell_2$-regularization coefficient $\lambda_t$ decays sublinearly, typically either as $1/t$ or as a negative power of the logarithm of the training step, i.e., $\lambda_t \propto (\log t)^{-\gamma}$. Such schedules are motivated both by statistical properties of language and by optimization-theoretic considerations, and have demonstrated empirical efficiency gains in large-scale deep learning—especially in LLM training—relative to conventional constant-coefficient schemes (Ferbach et al., 5 Feb 2026, Richemond et al., 2019).
1. Standard Weight-Decay in Deep Nets
Standard practice with modern deep learning optimizers such as AdamW is to maintain an exponential moving average of past gradients ($m_t$) and of their squares ($v_t$), and to apply a decoupled weight-decay term with fixed strength $\lambda$ at every step. The canonical AdamW update, omitting bias corrections, is:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$

$$\theta_{t+1} = \theta_t - \eta_t \left( \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda \theta_t \right),$$

where $\eta_t$ is a scheduled learning rate, $\eta_{\max}$ its peak, and $\epsilon$ a numerical stabilizer. Across large-scale training tasks, $(\beta_1, \beta_2, \lambda)$ are almost always held constant (e.g., $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\lambda = 0.1$) (Ferbach et al., 5 Feb 2026).
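The decoupled update described above can be sketched for a single scalar parameter in plain Python (a minimal sketch, omitting bias corrections as in the text; the default hyperparameter values are conventional choices, not values from the cited work):

```python
from math import sqrt

def adamw_step(theta, g, m, v, eta, lam, beta1=0.9, beta2=0.95, eps=1e-8):
    """One decoupled-weight-decay (AdamW-style) step on a scalar parameter."""
    m = beta1 * m + (1 - beta1) * g       # EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g   # EMA of squared gradients
    # Weight decay is applied directly to theta, decoupled from the gradient term.
    theta = theta - eta * (m / (sqrt(v) + eps) + lam * theta)
    return theta, m, v
```

With a log-time schedule, the only change is that `lam` is recomputed from the step count before each call rather than held fixed.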
2. Logarithmic-Time Weight-Decay Schedules
Logarithmic-time (log-time) weight-decay schedules replace the fixed $\lambda$ with a function $\lambda_t$ that decays slowly over the course of training. The most common instantiation is the reciprocal rule:

$$\lambda_t = \frac{\hat{\lambda}}{t},$$

where $\hat{\lambda}$ is a scale-invariant hyperparameter. To avoid singularities at early steps ($t \to 0$), a "warmup" offset $t_0 > 0$ is introduced ($T$ = total training iterations), giving:

$$\lambda_t = \frac{\hat{\lambda}}{t + t_0}, \qquad 0 \le t \le T.$$

This schedule corresponds to an exponential decay per unit of log-time, i.e., under the substitution $s = \log(t + t_0)$ one has $\lambda(s) = \hat{\lambda} e^{-s}$, and thus it implements a constant attenuation in logarithmic time (Ferbach et al., 5 Feb 2026). An alternative, closely connected to statistical-physics models of neural loss landscapes, is:

$$\lambda_t = \lambda_0 \left( \log\!\left(1 + \frac{t}{\tau}\right) \right)^{-\gamma},$$

with $\gamma$ and $\tau$ as hyperparameters (Richemond et al., 2019).
| Schedule Type | Formula for $\lambda_t$ | Hyperparameters |
|---|---|---|
| Reciprocal ("1/t") | $\lambda_t = \hat{\lambda}/(t + t_0)$ | $\hat{\lambda}$, $t_0$ |
| Logarithmic | $\lambda_t = \lambda_0 \left(\log(1 + t/\tau)\right)^{-\gamma}$ | $\lambda_0$, $\tau$, $\gamma$ |
| Power-law | $\lambda_t = \lambda_0\, t^{-\alpha}$ | $\lambda_0$, $\alpha$ |
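The three schedule families can be written as small Python helpers (argument names follow the table; the numeric values in the usage below are illustrative, not tuned settings from the cited papers):

```python
from math import log

def reciprocal_wd(t, lam_hat, t0):
    """1/t schedule with warmup offset: lambda_t = lam_hat / (t + t0)."""
    return lam_hat / (t + t0)

def logarithmic_wd(t, lam0, tau, gamma):
    """Logarithmic schedule: lambda_t = lam0 * (log(1 + t/tau))**(-gamma)."""
    return lam0 * log(1.0 + t / tau) ** (-gamma)

def powerlaw_wd(t, lam0, alpha):
    """Power-law schedule: lambda_t = lam0 * t**(-alpha)."""
    return lam0 * t ** (-alpha)
```

All three decay monotonically; the reciprocal rule is the only one that needs no extra shape hyperparameter beyond the warmup offset.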
Both styles achieve a gradual reduction in regularization, matching the decaying information gain and improved signal-to-noise ratio as learning progresses.
3. Theoretical Motivation: Information-Theoretic and Optimization Perspectives
(a) Power-law Context Memory in Language
Information-theoretic studies (Shannon 1951, Hilberg 1990, Takahira 2016) establish that reductions in per-token uncertainty from increasing context length $n$ follow a power law, i.e., the per-token entropy reduction scales as $n^{-\beta}$ with $0 < \beta < 1$. This implies each additional token yields diminishing information, with an effective context horizon that grows sublinearly. Consequently, regularization early in training should be much stronger than late, aligning with the $1/t$ decay (Ferbach et al., 5 Feb 2026).
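The diminishing per-token gain is easy to illustrate numerically (the exponent `beta = 0.5` here is purely illustrative, not a value reported in the cited studies):

```python
def marginal_gain(n, beta=0.5):
    """Per-token uncertainty reduction at context length n under a power law n**(-beta)."""
    return n ** (-beta)

# Marginal gains at geometrically spaced context lengths: each factor-of-10
# increase in context shrinks the per-token gain by a factor of 10**(-beta).
gains = [marginal_gain(n) for n in (1, 10, 100, 1000)]
```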
(b) Complexity-Matched Weight-Decay
From a statistical-physics viewpoint, the loss surface of heavily overparameterized deep nets exhibits isotropic Gaussian properties with critical-point complexity $\Sigma$, a quadratic form in the normalized energy (loss) $\epsilon$ and the average Hessian eigenvalue $\bar{\lambda}$. Complexity gradient descent prescribes that both the energy and the penalty decay in lockstep, yielding a "matched" schedule for weight-decay. This justifies choosing the form of $\lambda_t$ to mirror that of the learning rate, typically via logarithmic or power-law schedules (Richemond et al., 2019).
(c) Riemann-Sum Attenuation
Log-time schedules for $\lambda_t$ ensure that each geometric window of training steps contributes a constant attenuation to the parameter update, as opposed to the exponentially growing attenuation of a fixed $\lambda$, thus better matching the power-law decay in information content (Ferbach et al., 5 Feb 2026).
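The constant per-window attenuation can be checked directly: under $\lambda_t = \hat{\lambda}/t$, the total decay accumulated over each geometric window $[2^k, 2^{k+1})$ is approximately $\hat{\lambda} \ln 2$, independent of $k$ (a small numeric check, with $\hat{\lambda} = 1$ for simplicity):

```python
from math import log

lam_hat = 1.0
window_sums = []
for k in range(5, 10):
    lo, hi = 2 ** k, 2 ** (k + 1)
    # Total weight-decay "mass" contributed over the geometric window [2^k, 2^(k+1)).
    window_sums.append(sum(lam_hat / t for t in range(lo, hi)))
# Each sum is close to lam_hat * ln(2) ~ 0.693, regardless of k; a fixed lambda
# would instead contribute a total that doubles with every window.
```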
4. Stability Properties and Coupling to Momentum
Logarithmic-time weight-decay, when used alone, introduces no novel instabilities to AdamW or SGD. The simple reciprocal rule requires no damping or adjustment for numerical stability. When pairing log-time decay of $\lambda_t$ with time-varying momentum hyperparameters $(\beta_1, \beta_2)$—as in the "ADANA" optimizer—exact coordination and shared warmup offsets ($t_0$) across schedules are required to maintain stability and prevent the gradient-momentum balance from being overwhelmed. However, for weight-decay alone, these constraints are relaxed (Ferbach et al., 5 Feb 2026).
5. Empirical Effects and Efficiency Gains
Implementing 1/t weight-decay within AdamW for decoder-only transformers ("Enoki" models) trained on the FineWeb corpus (Chinchilla regime: 20 tokens/parameter) yields consistent empirical gains:
- Optimal $\hat{\lambda}$ is scale-invariant: a single $\hat{\lambda}$ (determined by a small-scale sweep, e.g., on a 6-head model) is near-optimal across the full range of model sizes tested.
- Compute-efficiency improvement: for identical validation loss, $10$–$15\%$ compute savings over fixed-weight-decay AdamW, quantified as

$$\Delta C = \frac{C_{\text{const}} - C_{\text{log}}}{C_{\text{const}}},$$

where $C_{\text{const}}$ is the compute required with constant $\lambda$ and $C_{\text{log}}$ with log-time decay.
- Efficiency gains increase slightly with model scale.
- Robustness: the same warmup offset $t_0$ suffices, no hyperparameter re-tuning is required across model sizes, and the gains persist under optimizer replacement (e.g., with AdEMAMix, savings on the order of $7\%$) (Ferbach et al., 5 Feb 2026).
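The savings metric $\Delta C$ is straightforward to compute; a worked example follows (the compute budgets here are made-up placeholders to show the arithmetic, not figures from the paper):

```python
def compute_savings(c_const, c_log):
    """Fractional compute saved by log-time decay at matched validation loss."""
    return (c_const - c_log) / c_const

# Hypothetical FLOP budgets reaching the same validation loss:
# constant lambda needs 1.0e21 FLOPs, log-time decay needs 0.88e21.
savings = compute_savings(c_const=1.0e21, c_log=0.88e21)  # -> 0.12, i.e., 12%
```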
6. Implementation and Practical Guidelines
To deploy logarithmic-time weight-decay:
- Use decoupled WD (AdamW-type), not inline "L2 regularization."
- Schedule: $\lambda_t = \hat{\lambda}/(t + t_0)$ with warmup offset $t_0 > 0$.
- Choose $\hat{\lambda}$ on a small model, retain this value across scales.
- Retain conventional values for $(\beta_1, \beta_2, \epsilon)$, batch size, and learning-rate schedule.
- If time-varying $(\beta_1, \beta_2)$ are employed, all schedules must share the same $t_0$.
- The warmup offset $t_0$ is essential to avoid instability at $t = 0$ (Ferbach et al., 5 Feb 2026).
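Putting these guidelines together, the reciprocal schedule drops into a decoupled-weight-decay loop by recomputing $\lambda_t$ at every step (a minimal sketch; `grad_fn` stands in for the actual gradient computation, and the decay term here replaces the full AdamW machinery for brevity):

```python
def reciprocal_lambda(t, lam_hat, t0):
    """1/t weight decay with warmup offset t0 (t0 > 0 avoids the t = 0 singularity)."""
    return lam_hat / (t + t0)

def train(w, grad_fn, T, eta, lam_hat, t0):
    for t in range(T):
        lam_t = reciprocal_lambda(t, lam_hat, t0)
        # Decoupled weight decay: shrink w separately from the gradient step.
        w = w - eta * grad_fn(w) - eta * lam_t * w
    return w
```

With a zero gradient, the loop reduces to pure multiplicative decay whose per-step strength fades as $1/(t + t_0)$, so the parameter shrinks quickly at first and only slowly later in training.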
For SGD-style training, matched logarithmic or power-law scheduling is easily implemented:
```python
from math import log

# Matched logarithmic schedules for the learning rate and weight decay.
# Start at t = 1: at t = 0, log(1 + 0/tau) = 0 and the negative power diverges.
for t in range(1, T + 1):
    decay = log(1.0 + t / tau) ** (-gamma)
    lr_t = eta0 * decay
    lambda_t = lambda0 * decay
    g = gradient_on_loss(w_t) + 2.0 * lambda_t * w_t  # inline L2 gradient
    w_t = w_t - lr_t * g
```
7. Summary and Outlook
Logarithmic-time weight-decay—with representative formula $\lambda_t = \hat{\lambda}/(t + t_0)$—aligns the optimizer's regularization schedule with the power-law memory structure of natural language and the dynamics predicted by complexity-based loss landscape analysis. It is a computationally effective, scalable, minimally invasive modification requiring only a single transferable hyperparameter, and consistently provides $10$–$15\%$ reductions in training compute across transformer scales. This scheme requires no novel stabilization measures, can be incorporated as a drop-in to AdamW and related optimizers, and serves as a robust baseline for further advances in adaptive weight-decay scheduling (Ferbach et al., 5 Feb 2026, Richemond et al., 2019).