
Damped-Cosine Learning Rate Schedule

Updated 5 February 2026
  • The damped-cosine learning rate schedule is a variation of cosine annealing that adds a polynomial k-decay term to reshape the learning rate decay profile during training.
  • It lowers the learning rate during mid-training and triggers a sharper reduction toward the end, which can improve optimization on vision benchmarks.
  • The approach is computationally cheap but requires careful tuning of the hyperparameter k to balance performance across shallow and deep network architectures.

A damped-cosine learning rate schedule is a modification of the standard cosine annealing strategy that adds a single “k-decay” term. The term applies a polynomially controlled dampening to the decay profile, with the aim of improving training by giving finer control over how quickly the learning rate falls during optimization. The entire schedule is governed by one additional hyperparameter $k$, is analytically simple to implement, and incurs negligible computational cost. The schedule and its properties have been analyzed and evaluated empirically on several vision benchmarks (Zhang et al., 2020).

1. Mathematical Formulation

The canonical monotonic cosine learning rate (LR) decay is given by

$$\eta_0(t) = \frac{1}{2} (\eta_0 - \eta_e) \left[1 + \cos\left(\frac{\pi t}{T}\right)\right] + \eta_e$$

for $t \in [0, T]$, where $\eta_0$ is the initial learning rate, $\eta_e$ the final value, and $T$ the total number of steps.

The damped-cosine modification introduces an additive $k$-decay term:

$$\eta'(t) = \frac{1}{2} (\eta_0 - \eta_e) \left[1 + \cos\left(\frac{\pi t}{T}\right)\right] + \eta_e + (\eta_0 - \eta_e)\left(\frac{t^k}{T^k} - \frac{t}{T}\right)$$

Or, factoring out $(\eta_0 - \eta_e)$:

$$\eta'(t) = (\eta_0 - \eta_e)\left[ \frac{1}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right) + \frac{t^k}{T^k} - \frac{t}{T} \right] + \eta_e$$

For $k = 1$, the $k$-decay term vanishes and the schedule reduces to standard cosine decay. For $k > 1$, the learning rate curve is "damped," exhibiting a flatter slope mid-training and a sharper reduction toward the end.
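
The two formulas above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names `cosine_lr` and `damped_cosine_lr` and the numeric settings are illustrative assumptions, and the additive term is implemented exactly as written in this section.

```python
import math


def cosine_lr(t, T, eta0, etae):
    """Standard cosine decay from eta0 (t = 0) to etae (t = T)."""
    return 0.5 * (eta0 - etae) * (1.0 + math.cos(math.pi * t / T)) + etae


def damped_cosine_lr(t, T, eta0, etae, k):
    """Cosine decay plus the additive k-decay term (t/T)**k - t/T."""
    delta = (t / T) ** k - t / T
    return cosine_lr(t, T, eta0, etae) + (eta0 - etae) * delta


if __name__ == "__main__":
    T, eta0, etae = 1000, 0.1, 0.001  # illustrative values, not from the source
    for t in (0, 250, 500, 750, 1000):
        base = cosine_lr(t, T, eta0, etae)
        damped = damped_cosine_lr(t, T, eta0, etae, k=1.5)
        print(f"t={t:4d}  cosine={base:.5f}  damped(k=1.5)={damped:.5f}")
```

With $k = 1$ the two functions agree at every step, and both hit $\eta_0$ at $t = 0$ and $\eta_e$ at $t = T$. Evaluated as written, the additive term can pull the value below $\eta_e$ late in training (for example at $t = 0.9T$ with these settings), so it is worth inspecting the curve for a chosen $k$ before training.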

2. Properties and Hyperparameter Effects

The $k$-decay term, defined as $\delta(t) = \frac{t^k}{T^k} - \frac{t}{T}$, modifies the higher-order derivatives of the learning rate curve:

  • At $t = 0$ and $t = T$, $\delta = 0$, so the endpoints remain fixed.
  • $d\delta/dt = k t^{k-1} / T^k - 1/T$. For $k > 1$ and small $t$, the derivative is negative, so the term pulls the LR below the cosine baseline early on; for larger $t$ it becomes positive as $\delta$ returns to zero at $t = T$, which is where the term reshapes the tail of the curve.
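
As a worked step under the definition of $\delta$ above (a sketch, not taken from the source), setting the derivative to zero locates the point where the term stops lowering the LR and starts returning to zero. The symbol $t^{*}$ is introduced here for illustration only.

```latex
\frac{d\delta}{dt} = \frac{k\,t^{k-1}}{T^{k}} - \frac{1}{T} = 0
\quad\Longrightarrow\quad
t^{*} = T\,k^{-1/(k-1)}, \qquad k > 1 .

% Example: k = 2
t^{*} = \frac{T}{2}, \qquad
\delta(t^{*}) = \frac{1}{4} - \frac{1}{2} = -\frac{1}{4} .

% Example: k = 1.5
t^{*} = \frac{4}{9}\,T, \qquad
\delta(t^{*}) = \frac{8}{27} - \frac{12}{27} = -\frac{4}{27} \approx -0.148 .
```

So for $k = 2$ the term subtracts at most a quarter of $(\eta_0 - \eta_e)$ from the cosine baseline, and it does so at the midpoint of training.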

Numerically, increasing $k$ dampens the LR during mid-training and steepens its decay near the end, raising the rate of change (ROC) in the final epochs. The authors argue that a larger ROC at the end correlates with sharper training loss reduction.

Varying $k$ systematically changes the learning rate dynamics:

  • For shallow nets, a larger $k$ is sometimes advantageous, while deeper nets require a smaller $k$ to avoid excessively low LRs during most of training.
  • There exists a critical value $k_\text{opt}$ where test error is minimized; beyond this, further increases degrade performance due to over-dampening.

| Network / Vision Task | Optimal $k$ (observed) | Recommendation |
|---|---|---|
| CIFAR-10, Wide ResNet-28-10 | Up to $k = 1.5$ | $k \in [1.0, 2.0]$ |
| CIFAR-100, Wide ResNet-28-10 | $k = 5.0$ | |
| ResNet-101 (deeper) | $k_\text{opt} \approx 3$ | |
| ResNet-47 (shallower) | $k_\text{opt} \approx 7$ | |

In practice, a grid search over $k$ on a held-out set is advocated, e.g. $k \in \{1.0, 1.5, 2.0, \dots\}$, with $k \approx 1.5$ serving as a default (Zhang et al., 2020).
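
The grid search can be sketched as a small selection loop. This is only an illustration: `select_k` and `toy_validation_error` are hypothetical names, and the toy error function is made up so the example runs; in real use it would be replaced by a routine that trains with the given $k$ and returns the held-out error.

```python
def select_k(candidates, validation_error):
    """Return (best_k, errors), where best_k minimizes validation_error(k)."""
    errors = {k: validation_error(k) for k in candidates}
    best_k = min(errors, key=errors.get)
    return best_k, errors


if __name__ == "__main__":
    # Stand-in for "train with this k, then measure held-out error".
    # The quadratic below is invented purely for demonstration.
    def toy_validation_error(k):
        return 4.0 + (k - 1.5) ** 2

    best_k, errors = select_k([1.0, 1.5, 2.0, 3.0], toy_validation_error)
    print("best k:", best_k)
    print("errors:", errors)
```

Because each candidate requires a full training run, the grid is usually kept small and centered on the default.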

3. Implementation and Practical Considerations

The damped-cosine schedule requires only a minor code change relative to the base cosine schedule. In PyTorch, this means extending the per-step learning rate lambda with one power term and one subtraction. An example adapted from the original source:

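import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def damped_cosine(k, total_steps, eta0, etae):
    """Build a LambdaLR factor: the damped-cosine LR divided by the base LR eta0."""
    def lr_lambda(step):
        t = float(step)
        T = float(total_steps)
        c = 0.5 * (1 + math.cos(math.pi * t / T))   # cosine part, goes from 1 to 0
        delta = (t ** k) / (T ** k) - (t / T)       # k-decay term, 0 at t=0 and t=T
        # LambdaLR multiplies the base LR (eta0) by the returned value.
        return ((eta0 - etae) * (c + delta) + etae) / eta0
    return lr_lambda

eta0, etae = 0.1, 0.001
# model, num_batches, and epochs are assumed to be defined by the training script.
optimizer = torch.optim.SGD(model.parameters(), lr=eta0, momentum=0.9)
scheduler = LambdaLR(optimizer, lr_lambda=damped_cosine(
                               k=1.5,
                               total_steps=num_batches * epochs,
                               eta0=eta0, etae=etae))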

Computational overhead is negligible: the schedule adds only a few scalar operations per optimizer step, which is insignificant next to a forward and backward pass.

The method assumes that the total number of steps $T$ is known in advance, which the analytic form of the extra polynomial term requires.
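
PyTorch's `LambdaLR` multiplies the optimizer's initial learning rate by whatever the lambda returns, so the lambda has to return $\eta'(t)/\eta_0$ rather than $\eta'(t)$ itself, and $T$ has to be computed before the scheduler is built. The sketch below checks both points in plain Python; `total_steps` and `lr_factor` are hypothetical helper names, and the epoch and batch counts are illustrative.

```python
import math


def total_steps(epochs, batches_per_epoch):
    """T is fixed before training starts: one scheduler step per batch."""
    return epochs * batches_per_epoch


def lr_factor(step, T, eta0, etae, k):
    """Multiplier applied to the base LR eta0, i.e. eta'(step) / eta0."""
    c = 0.5 * (1.0 + math.cos(math.pi * step / T))
    delta = (step / T) ** k - step / T
    return ((eta0 - etae) * (c + delta) + etae) / eta0


if __name__ == "__main__":
    T = total_steps(epochs=200, batches_per_epoch=391)  # illustrative numbers
    print("T =", T)
    print("factor at step 0:", lr_factor(0, T, 0.1, 0.001, 1.5))
    print("factor at step T:", lr_factor(T, T, 0.1, 0.001, 1.5))
```

The factor starts at 1 (the LR equals $\eta_0$) and ends at $\eta_e / \eta_0$ (the LR equals $\eta_e$), which is a quick sanity check for any scheduler lambda of this kind.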

4. Empirical Performance and Benchmark Results

Empirical evaluation on standard vision datasets shows that the $k$-decay term can improve test performance, especially when added to polynomial learning rate schedules (POL); with a cosine base (COS), the improvements are sometimes marginal or even slightly negative. Results from Zhang et al. (2020):

| Dataset / Model | Baseline Sched. | COS (error %) | COS + k-decay | POL (error %) | POL + k-decay |
|---|---|---|---|---|---|
| CIFAR-10 (WideResNet-28-10, 200 ep) | StepDecay | 3.68 | 3.82 (↑0.14%) | | 3.59 (↑0.36%) |
| CIFAR-100 (WideResNet-28-10) | StepDecay | 18.68 | 18.44 (↑0.24%) | | 18.43 (↑0.99%) |
| ImageNet (ResNet-50, 90 ep) | POL | | | 24.36 | 23.11 (↑1.25%) |
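
The arrows in the table mark the size of a change, not its direction, so it helps to recompute the signed differences. The snippet below is a small arithmetic check using only the pairs for which the table reports both values; it assumes the ImageNet row is read as a POL baseline followed by POL + k-decay, and the name `error_change` is illustrative.

```python
# (dataset, schedule) -> (error % without k-decay, error % with k-decay).
# Only pairs where the table reports both numbers are included.
reported = {
    ("CIFAR-10", "COS"): (3.68, 3.82),
    ("CIFAR-100", "COS"): (18.68, 18.44),
    ("ImageNet", "POL"): (24.36, 23.11),
}


def error_change(before, after):
    """Signed change in error (percentage points); negative means improvement."""
    return round(after - before, 2)


changes = {key: error_change(*pair) for key, pair in reported.items()}
for key, change in changes.items():
    print(key, f"{change:+.2f}")
```

Read this way, the CIFAR-10 cosine result is an increase in error of 0.14 points, while the CIFAR-100 cosine and ImageNet polynomial results are reductions of 0.24 and 1.25 points.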

In many cases, the largest and most consistent gains were observed when $k$-decay was added to polynomial learning rate schedules, suggesting the extra term is better matched to polynomial decay shapes than to the cosine shape.

5. Limitations and Failure Cases

Several limitations are notable:

  • With a cosine base schedule, the extra polynomial term can be suboptimal; on CIFAR-10, COS + k-decay slightly worsened accuracy.
  • An excessively large $k$ over-suppresses the learning rate during mid-training, so learning stagnates until the steep tail, which often cannot recover the lost performance.
  • The range of near-optimal $k$ narrows for deeper networks, requiring more precise tuning.
  • The analytic form assumes $T$ is known and fixed; schedules involving restarts or indefinite training may require problem-specific re-derivation of the polynomial term.

6. Context, Significance, and Future Prospects

The damped-cosine, or $k$-decay, learning rate schedule illustrates how sensitive DNN optimization is to the precise form of LR decay. The empirical correlation between a larger ROC at the end of the schedule and sharper loss reduction motivates further mechanistic study. The method is notable for its generality (applicable to various monotonic LR schedules) and computational simplicity (no impact on backpropagation and no extra parameters).

A plausible implication is that slightly more elaborate parametric control of learning rate decay, via higher-order polynomial terms, may become standard in tuning regimes for large-scale vision models, especially where analytic schedules are favored over learned or data-driven LR adaptation. However, care must be taken to match the $k$-decay augmentation to the character of the base schedule, especially in scenarios involving non-monotonic or restart-based LR schedules (Zhang et al., 2020).
