Damped-Cosine Learning Rate Schedule

Updated 5 February 2026

The damped-cosine learning rate schedule is a variation of cosine annealing that adds a polynomial $k$-decay term to reshape the learning rate decay profile during training.
It lowers the learning rate through the mid-training phase and is intended to produce a sharper reduction towards the end, which can improve optimization on vision benchmarks.
The approach adds negligible computational cost but requires careful tuning of the hyperparameter $k$, whose best value differs between shallow and deep network architectures.

A damped-cosine learning rate schedule is a modification of the standard cosine learning rate annealing strategy that adds a single “k-decay” term. The term applies a polynomially controlled dampening to the decay profile, with the aim of improving neural network training by giving finer control over how quickly the learning rate falls during optimization. The entire schedule is governed by one additional hyperparameter $k$, is analytically simple to implement, and incurs negligible computational cost. The damped-cosine schedule and its properties have been analyzed and evaluated on several vision benchmarks (Zhang et al., 2020).

1. Mathematical Formulation

The canonical monotonic cosine learning rate (LR) decay is given by

$\eta(t) = \frac{1}{2} (\eta_0 - \eta_e) \left[1 + \cos\left(\frac{\pi t}{T}\right)\right] + \eta_e$

for $t \in [0, T]$, where $\eta_0$ is the initial learning rate, $\eta_e$ the final value, and $T$ the total number of steps.

The damped-cosine modification introduces an additive $k$-decay term:

$\eta'(t) = \frac{1}{2} (\eta_0 - \eta_e) \left[1 + \cos\left(\frac{\pi t}{T}\right)\right] + \eta_e + (\eta_0 - \eta_e)\left(\frac{t^k}{T^k} - \frac{t}{T}\right)$

Equivalently, factoring out $(\eta_0 - \eta_e)$:

$\eta'(t) = (\eta_0 - \eta_e)\left[ \frac{1}{2}(1 + \cos\left(\frac{\pi t}{T}\right)) + \frac{t^k}{T^k} - \frac{t}{T} \right] + \eta_e$

For $k = 1$, the $k$-decay term vanishes and the schedule reduces to standard cosine decay. For $k > 1$, the term is negative for all $0 < t < T$, so the learning rate curve is "damped" below the cosine curve; it is described as exhibiting a flatter slope mid-training and a sharper reduction toward the end.
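The two formulas above can be evaluated directly. The following is a minimal sketch in plain Python, assuming the additive form exactly as written in this section; the function names and the numeric values of `T`, `eta0`, and `etae` are illustrative choices, not taken from the source.

```python
import math

def cosine_lr(t, T, eta0, etae):
    """Standard cosine decay from eta0 (at t = 0) to etae (at t = T)."""
    return 0.5 * (eta0 - etae) * (1.0 + math.cos(math.pi * t / T)) + etae

def damped_cosine_lr(t, T, eta0, etae, k):
    """Cosine decay plus the additive k-decay term (t/T)**k - t/T."""
    x = t / T
    delta = x ** k - x  # zero at x = 0 and x = 1; negative in between for k > 1
    return cosine_lr(t, T, eta0, etae) + (eta0 - etae) * delta

# Illustrative values only (assumptions, not results from the source).
T, eta0, etae = 1000, 0.1, 0.001
for t in (0, 250, 500, 750, 1000):
    print(t, cosine_lr(t, T, eta0, etae), damped_cosine_lr(t, T, eta0, etae, k=1.5))
```

With `k=1.0` the two functions agree at every step, and for any `k` both return `eta0` at `t = 0` and `etae` at `t = T`.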

2. Properties and Hyperparameter Effects

The $k$-decay term, defined as $\delta(t) = \frac{t^k}{T^k} - \frac{t}{T}$, changes the slope and curvature of the learning rate curve:

At $t=0$ and $t=T$, $\delta = 0$, ensuring the endpoints remain fixed.
$\frac{d\delta}{dt} = \frac{k t^{k-1}}{T^k} - \frac{1}{T}$. For $k > 1$, the derivative is negative at low $t$ (further suppressing the LR early) and becomes positive once $t > T k^{-1/(k-1)}$, which alters the slope of the schedule in the tail.

Numerically, increasing $k$ dampens the LR during mid-training and steepens its decay near the end, raising the rate of change (ROC) in the final epochs. The authors argue that a larger ROC at the end correlates with sharper training loss reduction.
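The endpoint and derivative properties of $\delta(t)$ can be checked numerically. The sketch below assumes $k > 1$ and uses hypothetical helper names; the crossover step follows from setting $d\delta/dt = 0$, which gives $t^* = T k^{-1/(k-1)}$.

```python
def k_decay_term(t, T, k):
    """delta(t) = (t/T)**k - t/T."""
    x = t / T
    return x ** k - x

def k_decay_slope(t, T, k):
    """d(delta)/dt = k * t**(k-1) / T**k - 1/T."""
    return k * t ** (k - 1) / T ** k - 1.0 / T

def slope_sign_change(T, k):
    """Step where d(delta)/dt crosses zero; only meaningful for k > 1."""
    return T * k ** (-1.0 / (k - 1.0))

T = 1000  # illustrative step count (an assumption)
for k in (1.5, 2.0, 3.0):
    t_star = slope_sign_change(T, k)
    print(k, t_star, k_decay_slope(t_star, T, k))
```

For $k = 2$ the crossover falls at $t^* = T/2$; larger $k$ moves it later in training.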

Varying $k$ systematically changes the learning rate dynamics:

For shallow nets, larger $k$ is sometimes advantageous, while deeper nets require smaller $k$ to avoid excessively low LRs during most of training.
There exists a critical value $k_\text{opt}$ where test error is minimized; beyond this, further increases degrade performance due to over-dampening.

Recommended Ranges

Network/Vision Task	Optimal $k$ (observed)	Recommendation
CIFAR-10 Wide ResNet-28-10	Up to $k = 1.5$	$k \in [1.0, 2.0]$
CIFAR-100 Wide ResNet-28-10	$k = 5.0$	—
ResNet-101 (deeper)	$k_\text{opt} \approx 3$	—
ResNet-47 (shallower)	$k_\text{opt} \approx 7$	—

In practice, a grid search on a held-out set is advocated, e.g. $k \in \{1.0, 1.5, 2.0, \dots\}$, with $k \approx 1.5$ serving as a default (Zhang et al., 2020).
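A grid search over $k$ can be organized as a small loop around whatever training routine is in use. The sketch below is one possible arrangement: `select_k` is a hypothetical helper, and `toy_validation_error` is a stand-in for a real train-and-evaluate run (its numbers are made up and are not results from the source).

```python
from typing import Callable, Dict, Iterable, Tuple

def select_k(candidates: Iterable[float],
             validation_error: Callable[[float], float]) -> Tuple[float, Dict[float, float]]:
    """Evaluate each candidate k and return the one with the lowest held-out error."""
    results = {k: validation_error(k) for k in candidates}
    best_k = min(results, key=results.get)
    return best_k, results

def toy_validation_error(k: float) -> float:
    """Placeholder: replace with a full training run that returns held-out error."""
    return (k - 1.5) ** 2 + 4.0

best_k, results = select_k([1.0, 1.5, 2.0, 3.0], toy_validation_error)
print(best_k, results)
```

Each call to `validation_error` is assumed to train with a fresh model and the same total step count $T$, so that only $k$ varies between runs.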

3. Implementation and Practical Considerations

The damped-cosine schedule requires only a minor code change relative to the base cosine schedule. In PyTorch, this means extending the per-step learning rate lambda by one power term and a subtraction. An example, adapted from the original source (the model and step counts are placeholders):

import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def damped_cosine(k, total_steps, eta0, etae):
    def lr_lambda(step):
        t = float(step)
        T = float(total_steps)
        c = 0.5 * (1 + math.cos(math.pi * t / T))   # cosine part
        delta = (t ** k) / (T ** k) - (t / T)       # k-decay term
        # LambdaLR multiplies the optimizer's base lr (eta0) by this value,
        # so return eta'(t) / eta0 rather than eta'(t) itself.
        return ((eta0 - etae) * (c + delta) + etae) / eta0
    return lr_lambda

eta0, etae = 0.1, 0.001
model = torch.nn.Linear(10, 1)       # placeholder model
num_batches, epochs = 100, 10        # placeholder values

optimizer = torch.optim.SGD(model.parameters(), lr=eta0, momentum=0.9)
scheduler = LambdaLR(optimizer, lr_lambda=damped_cosine(
                               k=1.5,
                               total_steps=num_batches * epochs,
                               eta0=eta0, etae=etae))

Computational overhead is negligible: the schedule adds only a few scalar operations per step and does not measurably affect training run time.

The method assumes that the total number of steps $T$ is known in advance, because the extra polynomial term is defined in terms of $t/T$.
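Because $T$ is fixed up front, the whole curve can be tabulated before training starts. The sketch below does this in plain Python, without PyTorch, using the multiplier convention $\eta'(t)/\eta_0$ that multiplicative schedulers expect; the function names and numeric values are assumptions for illustration.

```python
import math

def damped_cosine_multiplier(step, total_steps, eta0, etae, k):
    """eta'(t) / eta0: the factor applied to the base learning rate at a given step."""
    x = step / total_steps
    c = 0.5 * (1.0 + math.cos(math.pi * x))
    delta = x ** k - x
    return ((eta0 - etae) * (c + delta) + etae) / eta0

def schedule_values(total_steps, eta0, etae, k):
    """Learning rate at every step from 0 to total_steps inclusive."""
    return [eta0 * damped_cosine_multiplier(s, total_steps, eta0, etae, k)
            for s in range(total_steps + 1)]

values = schedule_values(1000, 0.1, 0.001, 1.5)  # illustrative settings
lowest = min(values)
print(values[0], values[-1], lowest, values.index(lowest))
```

With these illustrative settings the additive form, evaluated as written, dips below $\eta_e$ shortly before $t = T$, so inspecting the tabulated curve for a chosen $k$ before training seems prudent.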

4. Empirical Performance and Benchmark Results

Empirical evaluation on standard vision datasets shows that adding the $k$-decay term can improve test performance, especially on top of polynomial learning rate schedules (POL); with a cosine base (COS) the change is marginal and in one case slightly negative. Results from Zhang et al. (2020), reported as test error (%); parenthesized values give the change in accuracy, in percentage points, relative to the same schedule without $k$-decay (↑ better, ↓ worse):

Dataset/Model	Baseline Sched.	COS (error %)	COS + k-decay (error %)	POL (error %)	POL + k-decay (error %)
CIFAR-10 (WideResNet-28-10, 200ep)	StepDecay	3.68	3.82 (↓0.14)	—	3.59 (↑0.36)
CIFAR-100 (WideResNet-28-10)	StepDecay	18.68	18.44 (↑0.24)	—	18.43 (↑0.99)
ImageNet (ResNet-50, 90ep)	POL	—	—	24.36	23.11 (↑1.25)

In many cases, the largest and most consistent gains were observed when $k$-decay was added to polynomial learning rate schedules, suggesting the extra term is better matched to polynomial decay shapes than to the cosine shape.
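For the rows where both the base error and the error with $k$-decay are listed, the change can be recomputed directly from the table. The short sketch below uses only those table values; the dictionary keys and the helper name are illustrative.

```python
# (base error %, error % with k-decay), copied from the table above.
reported = {
    "CIFAR-10 / COS": (3.68, 3.82),
    "CIFAR-100 / COS": (18.68, 18.44),
    "ImageNet / POL": (24.36, 23.11),
}

def error_change(base, with_k_decay):
    """Change in test error in percentage points; negative means k-decay helped."""
    return round(with_k_decay - base, 2)

changes = {name: error_change(*pair) for name, pair in reported.items()}
for name, change in changes.items():
    print(name, change)
```

A positive change means the error rose, which is the case only for the CIFAR-10 cosine row.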

5. Limitations and Failure Cases

Several limitations are notable:

With a cosine base schedule, the extra polynomial term can be suboptimal; on CIFAR-10, COS + k-decay slightly worsened accuracy.
Excessively large $k$ over-suppresses the learning rate during mid-training, causing learning to stagnate until the steeper tail end, which often cannot recover the lost performance.
The range of near-optimal $k$ narrows for deeper networks, requiring more precise tuning.
The analytic form assumes $T$ is known and fixed; schedules involving restarts or indefinite training may require problem-specific re-derivation of the polynomial term.

6. Context, Significance, and Future Prospects

The damped-cosine or $k$-decay learning rate schedule illustrates how sensitive DNN optimization is to the precise form of LR decay. The empirical correlation between a larger ROC at the end of the schedule and sharper loss reduction motivates further mechanistic study. The method is notable for its generality (applicable to various monotonic LR schedules) and computational simplicity (no change to backpropagation and no extra trainable parameters).

A plausible implication is that slightly more elaborate parametric control of learning rate decay, via higher-order polynomial terms, may become standard when tuning large-scale vision models, especially where analytic schedules are favored over learned or data-driven LR adaptation. However, care must be taken to match the $k$-decay augmentation to the character of the base schedule, especially in scenarios involving non-monotonic or restart-based LR schedules (Zhang et al., 2020).

References

Zhang et al. (2020). kDecay: Just adding k-decay items on Learning-Rate Schedule to improve Neural Networks.
