
Logarithmic-Time Weight-Decay

Updated 6 February 2026
  • The topic introduces logarithmic-time weight-decay, where sublinear L2 regularization decay aligns with information-theoretic and optimization principles to improve training efficiency.
  • It details reciprocal (1/t) and logarithmic decay schedules that, with a warmup phase, prevent singularities and maintain stability in gradient updates.
  • Empirical studies show that this approach yields 10–15% compute savings and robust performance across scales, especially in training language models.

Logarithmic-time weight-decay refers to a class of weight-decay schedules in which the $L_2$-regularization coefficient decays sublinearly, typically either as $1/t$ or as a negative power of the logarithm of the training step, i.e., $\lambda(t) \propto [\log(1 + t/\tau)]^{-\gamma}$. Such schedules are motivated both by statistical properties of language and by optimization-theoretic considerations, and have demonstrated empirical efficiency gains in large-scale deep learning, especially in LLM training, relative to conventional constant-coefficient schemes (Ferbach et al., 5 Feb 2026; Richemond et al., 2019).

1. Standard Weight-Decay in Deep Nets

Standard practice with modern deep learning optimizers such as AdamW is to maintain exponential moving averages of past gradients (with coefficient $\beta_1$) and their squares ($\beta_2$), and to apply a decoupled $L_2$ weight-decay term with fixed strength $\lambda$ at every step. The canonical AdamW update, omitting bias corrections, is

$$
\begin{aligned}
m_{t+1} &= \beta_1 m_t + (1 - \beta_1)\, g_{t+1} \\
v_{t+1} &= \beta_2 v_t + (1 - \beta_2)\, g_{t+1}^2 \\
\theta_{t+1} &= \theta_t - \gamma(t)\left[\gamma^* \frac{m_{t+1}}{\sqrt{v_{t+1} + \epsilon}} + \lambda \theta_t\right]
\end{aligned}
$$

where $\gamma(t)$ is a scheduled learning rate, $\gamma^*$ its peak, and $\epsilon$ a numerical stabilizer. Across large-scale training tasks, $(\beta_1, \beta_2, \lambda)$ are almost always held constant (e.g., $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\lambda = 10^{-2}$) (Ferbach et al., 5 Feb 2026).
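The decoupled update above can be sketched in a few lines of NumPy (a minimal sketch; the function name `adamw_step` and its default hyperparameters are illustrative, not from the papers):

```python
import numpy as np

def adamw_step(theta, m, v, g, lr, lr_peak=1.0,
               beta1=0.9, beta2=0.999, lam=1e-2, eps=1e-8):
    """One AdamW-style update with decoupled weight decay (bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment EMA
    # decoupled decay: lam * theta is applied outside the adaptive rescaling
    theta = theta - lr * (lr_peak * m / np.sqrt(v + eps) + lam * theta)
    return theta, m, v
```

A log-time schedule then amounts to passing a step-dependent `lam` instead of the constant default.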

2. Logarithmic-Time Weight-Decay Schedules

Logarithmic-time (log-time) weight-decay schedules replace the fixed $\lambda$ with a function that decays slowly over the course of training. The most common instantiation is the reciprocal rule

$$\lambda_t = \frac{\omega}{t}$$

where $\omega$ is a scale-invariant hyperparameter. To avoid a singularity at early steps ($\lambda_0 \to \infty$), a warmup offset of $t_0 \sim T/10$ is introduced ($T$ = total training iterations), giving

$$\lambda_t = \frac{\omega}{t + T/10}.$$

This schedule corresponds to an exponential decay per unit of log-time: under the substitution $\tau = \log t$, $1/t \sim e^{-\tau}$, so the rule implements a constant attenuation rate in logarithmic time (Ferbach et al., 5 Feb 2026). An alternative, closely connected to statistical-physics models of neural loss landscapes, is

$$\lambda(t) = \lambda_0\,[\log(1 + t/\tau)]^{-\gamma}$$

with $\gamma > 0$ and $\tau > 0$ as hyperparameters (Richemond et al., 2019).

| Schedule Type | Formula for $\lambda_t$ | Hyperparameters |
|---|---|---|
| Reciprocal ("1/t") | $\omega/(t + T/10)$ | $\omega$, $T$ |
| Logarithmic | $\lambda_0\,[\log(1 + t/\tau)]^{-\gamma}$ | $\lambda_0$, $\gamma$, $\tau$ |
| Power-law | $\lambda_0/(1 + t)^{\gamma}$ | $\lambda_0$, $\gamma$ |

All three schedules achieve a gradual reduction in regularization, matching the decaying information gain and improving signal-to-noise ratio as learning progresses.
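For concreteness, the three schedules in the table can be written as follows (a sketch; the function names and default constants are illustrative):

```python
from math import log

def reciprocal_wd(t, omega=4.0, T=10_000):
    """1/t rule with warmup offset t0 = T/10."""
    return omega / (t + T / 10)

def logarithmic_wd(t, lam0=1e-2, tau=100.0, gamma=1.0):
    """Valid for t >= 1, since log(1 + t/tau) vanishes at t = 0."""
    return lam0 * log(1 + t / tau) ** (-gamma)

def powerlaw_wd(t, lam0=1e-2, gamma=0.5):
    return lam0 / (1 + t) ** gamma
```

All three decrease monotonically, so regularization weakens as training progresses.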

3. Theoretical Motivation: Information-Theoretic and Optimization Perspectives

(a) Power-law Context Memory in Language

Information-theoretic studies (Shannon 1951; Hilberg 1990; Takahira 2016) establish that the reduction in per-token uncertainty from increasing context length $t$ follows a power law, i.e., $\propto t^{\beta - 1}$ with $\beta \approx 0.88$. This implies each additional token yields diminishing information, with an effective context horizon that grows sublinearly. Consequently, regularization early in training should be much stronger than late in training, aligning with the $1/t$ decay (Ferbach et al., 5 Feb 2026).

(b) Complexity-Matched Weight-Decay

From a statistical-physics viewpoint, the loss surface of heavily overparameterized deep nets exhibits isotropic Gaussian properties with critical-point complexity $\Sigma$, a quadratic form in the normalized energy (loss) $\epsilon$ and the average Hessian eigenvalue $\bar{\lambda}$. Complexity gradient descent prescribes that both the energy and the penalty decay in lockstep, yielding a "matched" schedule for weight decay. This justifies choosing the form of $\lambda(t)$ to mirror that of the learning rate, typically via logarithmic or power-law schedules (Richemond et al., 2019).

(c) Riemann-Sum Attenuation

Log-time schedules for $\lambda_t$ ensure that each geometric window $[T/2, T]$ contributes a constant attenuation to the parameter update, as opposed to the exponential attenuation of a fixed $\lambda$, thus better matching the power-law decay in information content (Ferbach et al., 5 Feb 2026).
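This constant-attenuation property is easy to check numerically: with $\lambda_t = \omega/t$, the summed decay over each doubling window $[T/2, T)$ is approximately $\omega \log 2$, independent of $T$ (a sketch; the value $\omega = 4$ follows the sweep result quoted later in the article):

```python
from math import log

omega = 4.0

def window_mass(T):
    """Total decay contributed over the geometric window [T/2, T)."""
    return sum(omega / t for t in range(T // 2, T))

# Each doubling window contributes the same mass, ~ omega * log(2) ≈ 2.77,
# whereas a constant lambda would contribute mass proportional to T/2.
masses = [window_mass(T) for T in (1_000, 10_000, 100_000)]
```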

4. Stability Properties and Coupling to Momentum

Logarithmic-time weight-decay, when used alone, introduces no novel instabilities into AdamW or SGD: the simple reciprocal rule $\lambda_t \leftarrow \omega/t$ requires no damping or adjustment for numerical stability. When pairing log-time decay of $\lambda$ with time-varying momentum hyperparameters $(\beta_1, \beta_2)$, as in the "ADANA" optimizer, exact coordination and shared warmup offsets ($t_0 = T/10$) across schedules are required to maintain stability and prevent the gradient-momentum balance from being overwhelmed. For weight decay alone, however, these constraints are relaxed (Ferbach et al., 5 Feb 2026).

5. Empirical Effects and Efficiency Gains

Implementing $1/t$ weight-decay within AdamW for decoder-only transformers ("Enoki" models) trained on the FineWeb corpus (Chinchilla regime: 20 tokens per parameter) yields consistent empirical gains:

  • Optimal $\omega$ is scale-invariant: a single $\omega \approx 4$ (determined by a small-scale sweep, e.g., on a 6-head model) is near-optimal for model sizes from $45\,\mathrm{M}$ to $2.6\,\mathrm{B}$ parameters.
  • Compute-efficiency improvement: for identical validation loss, 10–15% compute savings over fixed-weight-decay AdamW, quantified as

$$\text{eff} = \frac{C_\text{base} - C_\text{new}}{C_\text{new}}$$

where $C_\text{base}$ is the compute required by constant $\lambda$ and $C_\text{new}$ that required by log-time decay.

  • Efficiency gains increase slightly with model scale: from $\sim 8\%$ at $45\,\mathrm{M}$ parameters to $\sim 12\%$ at $2.6\,\mathrm{B}$.
  • Robustness: the same warmup offset $T/10$ suffices, no hyperparameter re-tuning is required across model sizes, and the gains persist under optimizer replacement (e.g., AdEMAMix: 7–12% savings) (Ferbach et al., 5 Feb 2026).
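As a worked example of the efficiency metric (the FLOP counts below are hypothetical, chosen only to illustrate the formula):

```python
def eff(c_base, c_new):
    """Relative compute saved: (C_base - C_new) / C_new."""
    return (c_base - c_new) / c_new

# hypothetical: the constant-lambda baseline needs 1.12e21 FLOPs, while the
# 1/t schedule needs 1.00e21 FLOPs to reach the same validation loss
savings = eff(1.12e21, 1.00e21)  # ≈ 0.12, i.e. 12% savings
```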

6. Implementation and Practical Guidelines

To deploy logarithmic-time weight-decay:

  • Use decoupled weight decay (AdamW-type), not inline "L2 regularization."
  • Schedule: $\lambda_t = \omega/(t + t_0)$, preferring $t_0 \sim T/10$.
  • Choose $\omega$ on a small model and retain this value across scales.
  • Retain conventional values for $(\beta_1, \beta_2, \epsilon)$, batch size, and learning-rate schedule.
  • If time-varying $\beta_1, \beta_2$ are employed, all schedules must share the same $t_0$.
  • The warmup offset $t_0$ is essential to avoid instability as $t \to 0$ (Ferbach et al., 5 Feb 2026).
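Under these guidelines, refreshing the decoupled coefficient each step might look like the following sketch (the `param_groups` list of dicts mimics the interface of common optimizer implementations such as torch's AdamW; the names are illustrative):

```python
def lambda_schedule(t, omega=4.0, T=10_000):
    """Reciprocal weight decay with warmup offset t0 = T/10."""
    t0 = T / 10
    return omega / (t + t0)

# refresh the decoupled weight-decay coefficient before every optimizer step
param_groups = [{"weight_decay": 0.0}]
for t in range(100):
    for group in param_groups:
        group["weight_decay"] = lambda_schedule(t)
    # optimizer.step() would go here
```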

For SGD-style training, matched logarithmic or power-law scheduling is easily implemented:

from math import log

t0 = T // 10  # warmup offset; without it, log(1 + t/tau) vanishes at t = 0
for t in range(T):
    lr_t = eta0 * log(1 + (t + t0) / tau) ** (-gamma)         # matched LR schedule
    lambda_t = lambda0 * log(1 + (t + t0) / tau) ** (-gamma)  # log-time weight decay
    g = gradient_on_loss(w_t) + 2 * lambda_t * w_t            # L2-regularized gradient
    w_t = w_t - lr_t * g

7. Summary and Outlook

Logarithmic-time weight-decay, with representative formula $\lambda_t = \omega/(t + T/10)$, aligns the optimizer's regularization schedule with the power-law memory structure of natural language and with the dynamics predicted by complexity-based loss-landscape analysis. It is a computationally effective, scalable, minimally invasive modification requiring only a single transferable hyperparameter, and consistently provides 10–15% reductions in training compute across transformer scales. The scheme requires no novel stabilization measures, can be incorporated as a drop-in to AdamW and related optimizers, and serves as a robust baseline for further advances in adaptive weight-decay scheduling (Ferbach et al., 5 Feb 2026; Richemond et al., 2019).
