
Logarithmic-Time Weight-Decay

Updated 6 February 2026
  • The topic introduces logarithmic-time weight-decay, where sublinear L2 regularization decay aligns with information-theoretic and optimization principles to improve training efficiency.
  • It details reciprocal (1/t) and logarithmic decay schedules that, with a warmup phase, prevent singularities and maintain stability in gradient updates.
  • Empirical studies show that this approach yields 10–15% compute savings and robust performance across scales, especially in training language models.

Logarithmic-time weight-decay refers to a class of weight-decay schedules in which the $L_2$-regularization coefficient decays sublinearly, typically either as $1/t$ or as a negative power of the logarithm of the training step, i.e., $\lambda(t) \propto [\log(1 + t/\tau)]^{-\gamma}$. Such schedules are motivated both by statistical properties of language and by optimization-theoretic considerations, and have demonstrated empirical efficiency gains in large-scale deep learning, especially in LLM training, relative to conventional constant-coefficient schemes (Ferbach et al., 5 Feb 2026; Richemond et al., 2019).

1. Standard Weight-Decay in Deep Nets

Standard practice with modern deep learning optimizers such as AdamW is to maintain exponential moving averages of past gradients (with coefficient $\beta_1$) and their squares ($\beta_2$), and to apply a decoupled $L_2$ weight-decay term with fixed strength $\lambda$ at every step. The canonical AdamW update, omitting bias corrections, is

$$
\begin{aligned}
m_{t+1} &= \beta_1 m_t + (1 - \beta_1)\, g_{t+1} \\
v_{t+1} &= \beta_2 v_t + (1 - \beta_2)\, g_{t+1}^2 \\
\theta_{t+1} &= \theta_t - \gamma(t)\left[\gamma^* \frac{m_{t+1}}{\sqrt{v_{t+1} + \epsilon}} + \lambda \theta_t\right]
\end{aligned}
$$

where $\gamma(t)$ is a scheduled learning rate, $\gamma^*$ its peak, and $\epsilon$ a numerical stabilizer. Across large-scale training tasks, $(\beta_1, \beta_2, \lambda)$ are almost always held constant (e.g., $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\lambda = 10^{-2}$) (Ferbach et al., 5 Feb 2026).
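The decoupled update above can be sketched in a few lines of NumPy (a minimal sketch; the function name `adamw_step` and its default hyperparameters are illustrative, not from the papers):

```python
import numpy as np

def adamw_step(theta, m, v, g, lr, lr_peak=1.0,
               beta1=0.9, beta2=0.999, lam=1e-2, eps=1e-8):
    """One AdamW-style update with decoupled weight decay (bias correction omitted)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment EMA
    # decoupled decay: lam * theta is applied outside the adaptive rescaling
    theta = theta - lr * (lr_peak * m / np.sqrt(v + eps) + lam * theta)
    return theta, m, v
```

A log-time schedule then amounts to passing a step-dependent `lam` instead of the constant default.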

2. Logarithmic-Time Weight-Decay Schedules

Logarithmic-time (log-time) weight-decay schedules replace the fixed $\lambda$ with a function that decays slowly over the course of training. The most common instantiation is the reciprocal rule

$$\lambda_t = \frac{\omega}{t}$$

where $\omega$ is a scale-invariant hyperparameter. To avoid a singularity at early steps ($\lambda_0 \to \infty$), a warmup offset of $t_0 \sim T/10$ is introduced ($T$ = total training iterations), giving

$$\lambda_t = \frac{\omega}{t + T/10}.$$

This schedule corresponds to an exponential decay per unit of log-time: under the substitution $\tau = \log t$, $1/t \sim e^{-\tau}$, so the rule implements a constant attenuation rate in logarithmic time (Ferbach et al., 5 Feb 2026). An alternative, closely connected to statistical-physics models of neural loss landscapes, is

$$\lambda(t) = \lambda_0\,[\log(1 + t/\tau)]^{-\gamma}$$

with $\gamma > 0$ and $\tau > 0$ as hyperparameters (Richemond et al., 2019).

| Schedule Type | Formula for $\lambda_t$ | Hyperparameters |
|---|---|---|
| Reciprocal ("1/t") | $\omega/(t + T/10)$ | $\omega$, $T$ |
| Logarithmic | $\lambda_0\,[\log(1 + t/\tau)]^{-\gamma}$ | $\lambda_0$, $\gamma$, $\tau$ |
| Power-law | $\lambda_0/(1 + t)^{\gamma}$ | $\lambda_0$, $\gamma$ |

All three schedules achieve a gradual reduction in regularization, matching the decaying information gain and improving signal-to-noise ratio as learning progresses.
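For concreteness, the three schedules in the table can be written as follows (a sketch; the function names and default constants are illustrative):

```python
from math import log

def reciprocal_wd(t, omega=4.0, T=10_000):
    """1/t rule with warmup offset t0 = T/10."""
    return omega / (t + T / 10)

def logarithmic_wd(t, lam0=1e-2, tau=100.0, gamma=1.0):
    """Valid for t >= 1, since log(1 + t/tau) vanishes at t = 0."""
    return lam0 * log(1 + t / tau) ** (-gamma)

def powerlaw_wd(t, lam0=1e-2, gamma=0.5):
    return lam0 / (1 + t) ** gamma
```

All three decrease monotonically, so regularization weakens as training progresses.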

3. Theoretical Motivation: Information-Theoretic and Optimization Perspectives

(a) Power-law Context Memory in Language

Information-theoretic studies (Shannon 1951; Hilberg 1990; Takahira 2016) establish that the reduction in per-token uncertainty from increasing context length $t$ follows a power law, i.e., $\propto t^{\beta - 1}$ with $\beta \approx 0.88$. This implies each additional token yields diminishing information, with an effective context horizon that grows sublinearly. Consequently, regularization early in training should be much stronger than late in training, aligning with the $1/t$ decay (Ferbach et al., 5 Feb 2026).

(b) Complexity-Matched Weight-Decay

From a statistical-physics viewpoint, the loss surface of heavily overparameterized deep nets exhibits isotropic Gaussian properties with critical-point complexity $\Sigma$, a quadratic form in the normalized energy (loss) $\epsilon$ and the average Hessian eigenvalue $\bar{\lambda}$. Complexity gradient descent prescribes that both the energy and the penalty decay in lockstep, yielding a "matched" schedule for weight decay. This justifies choosing the form of $\lambda(t)$ to mirror that of the learning rate, typically via logarithmic or power-law schedules (Richemond et al., 2019).

(c) Riemann-Sum Attenuation

Log-time schedules for $\lambda_t$ ensure that each geometric window $[T/2, T]$ contributes a constant attenuation to the parameter update, as opposed to the exponential attenuation of a fixed $\lambda$, thus better matching the power-law decay in information content (Ferbach et al., 5 Feb 2026).
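This constant-attenuation property is easy to check numerically: with $\lambda_t = \omega/t$, the summed decay over each doubling window $[T/2, T)$ is approximately $\omega \log 2$, independent of $T$ (a sketch; the value $\omega = 4$ follows the sweep result quoted later in the article):

```python
from math import log

omega = 4.0

def window_mass(T):
    """Total decay contributed over the geometric window [T/2, T)."""
    return sum(omega / t for t in range(T // 2, T))

# Each doubling window contributes the same mass, ~ omega * log(2) ≈ 2.77,
# whereas a constant lambda would contribute mass proportional to T/2.
masses = [window_mass(T) for T in (1_000, 10_000, 100_000)]
```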

4. Stability Properties and Coupling to Momentum

Logarithmic-time weight-decay, when used alone, introduces no novel instabilities into AdamW or SGD: the simple reciprocal rule $\lambda_t \leftarrow \omega/t$ requires no damping or adjustment for numerical stability. When pairing log-time decay of $\lambda$ with time-varying momentum hyperparameters $(\beta_1, \beta_2)$, as in the "ADANA" optimizer, exact coordination and shared warmup offsets ($t_0 = T/10$) across schedules are required to maintain stability and prevent the gradient-momentum balance from being overwhelmed. For weight decay alone, however, these constraints are relaxed (Ferbach et al., 5 Feb 2026).

5. Empirical Effects and Efficiency Gains

Implementing $1/t$ weight-decay within AdamW for decoder-only transformers ("Enoki" models) trained on the FineWeb corpus (Chinchilla regime: 20 tokens per parameter) yields consistent empirical gains:

  • Optimal $\omega$ is scale-invariant: a single $\omega \approx 4$ (determined by a small-scale sweep, e.g., on a 6-head model) is near-optimal for model sizes from $45\,\mathrm{M}$ to $2.6\,\mathrm{B}$ parameters.
  • Compute-efficiency improvement: for identical validation loss, 10–15% compute savings over fixed-weight-decay AdamW, quantified as

$$\text{eff} = \frac{C_\text{base} - C_\text{new}}{C_\text{new}}$$

where $C_\text{base}$ is the compute required by constant $\lambda$ and $C_\text{new}$ that required by log-time decay.

  • Efficiency gains increase slightly with model scale: from $\sim 8\%$ at $45\,\mathrm{M}$ parameters to $\sim 12\%$ at $2.6\,\mathrm{B}$.
  • Robustness: the same warmup offset $T/10$ suffices, no hyperparameter re-tuning is required across model sizes, and the gains persist under optimizer replacement (e.g., AdEMAMix: 7–12% savings) (Ferbach et al., 5 Feb 2026).
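As a worked example of the efficiency metric (the FLOP counts below are hypothetical, chosen only to illustrate the formula):

```python
def eff(c_base, c_new):
    """Relative compute saved: (C_base - C_new) / C_new."""
    return (c_base - c_new) / c_new

# hypothetical: the constant-lambda baseline needs 1.12e21 FLOPs, while the
# 1/t schedule needs 1.00e21 FLOPs to reach the same validation loss
savings = eff(1.12e21, 1.00e21)  # ≈ 0.12, i.e. 12% savings
```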

6. Implementation and Practical Guidelines

To deploy logarithmic-time weight-decay:

  • Use decoupled weight decay (AdamW-type), not inline "L2 regularization."
  • Schedule: $\lambda_t = \omega/(t + t_0)$, preferring $t_0 \sim T/10$.
  • Choose $\omega$ on a small model and retain this value across scales.
  • Retain conventional values for $(\beta_1, \beta_2, \epsilon)$, batch size, and learning-rate schedule.
  • If time-varying $\beta_1, \beta_2$ are employed, all schedules must share the same $t_0$.
  • The warmup offset $t_0$ is essential to avoid instability as $t \to 0$ (Ferbach et al., 5 Feb 2026).
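Under these guidelines, refreshing the decoupled coefficient each step might look like the following sketch (the `param_groups` list of dicts mimics the interface of common optimizer implementations such as torch's AdamW; the names are illustrative):

```python
def lambda_schedule(t, omega=4.0, T=10_000):
    """Reciprocal weight decay with warmup offset t0 = T/10."""
    t0 = T / 10
    return omega / (t + t0)

# refresh the decoupled weight-decay coefficient before every optimizer step
param_groups = [{"weight_decay": 0.0}]
for t in range(100):
    for group in param_groups:
        group["weight_decay"] = lambda_schedule(t)
    # optimizer.step() would go here
```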

For SGD-style training, matched logarithmic or power-law scheduling is easily implemented:

from math import log

t0 = T // 10  # warmup offset; without it, log(1 + t/tau) vanishes at t = 0
for t in range(T):
    lr_t = eta0 * log(1 + (t + t0) / tau) ** (-gamma)         # matched LR schedule
    lambda_t = lambda0 * log(1 + (t + t0) / tau) ** (-gamma)  # log-time weight decay
    g = gradient_on_loss(w_t) + 2 * lambda_t * w_t            # L2-regularized gradient
    w_t = w_t - lr_t * g

7. Summary and Outlook

Logarithmic-time weight-decay, with representative formula $\lambda_t = \omega/(t + T/10)$, aligns the optimizer's regularization schedule with the power-law memory structure of natural language and with the dynamics predicted by complexity-based loss-landscape analysis. It is a computationally effective, scalable, minimally invasive modification requiring only a single transferable hyperparameter, and consistently provides 10–15% reductions in training compute across transformer scales. The scheme requires no novel stabilization measures, can be incorporated as a drop-in to AdamW and related optimizers, and serves as a robust baseline for further advances in adaptive weight-decay scheduling (Ferbach et al., 5 Feb 2026; Richemond et al., 2019).
