Damped-Cosine Learning Rate Schedule
- The damped-cosine learning rate schedule is a variation of cosine annealing that adds a polynomial k-decay term to reshape the learning rate decay profile during training.
- It dampens the learning rate in the mid-training phase and triggers a sharper reduction towards the end, which can improve optimization on vision benchmarks.
- The approach is computationally cheap, but the k hyperparameter requires careful tuning to balance performance across shallow and deep network architectures.
A damped-cosine learning rate schedule is a modification of the standard cosine learning rate annealing strategy, augmented by the inclusion of a single “k-decay” term. This approach introduces a polynomially controlled dampening to the decay profile with the aim of improving neural network training performance by more finely tuning the rate of learning rate reduction throughout the optimization process. The entire schedule is governed by one additional hyperparameter and is analytically simple to implement, incurring negligible computational cost. The damped-cosine schedule and its properties have been rigorously analyzed and empirically validated in multiple vision benchmarks (Zhang et al., 2020).
1. Mathematical Formulation
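The canonical monotonic cosine learning rate (LR) decay is given by

$$\eta_t = \eta_e + \frac{1}{2}(\eta_0 - \eta_e)\left(1 + \cos\frac{\pi t}{T}\right)$$

for $0 \le t \le T$, where $\eta_0$ is the initial learning rate, $\eta_e$ the final value, and $T$ the total number of steps.

The damped-cosine modification introduces an additive $k$-decay term:

$$\eta_t = \eta_e + \frac{1}{2}(\eta_0 - \eta_e)\left(1 + \cos\frac{\pi t}{T}\right) + (\eta_0 - \eta_e)\left(\frac{t^k}{T^k} - \frac{t}{T}\right)$$

Or, factoring $(\eta_0 - \eta_e)$:

$$\eta_t = \eta_e + (\eta_0 - \eta_e)\left[\frac{1}{2}\left(1 + \cos\frac{\pi t}{T}\right) + \frac{t^k}{T^k} - \frac{t}{T}\right]$$

For $k = 1$, the $k$-decay term vanishes; the schedule reduces to standard cosine decay. For $k > 1$, the learning rate curve is "damped," exhibiting a flatter slope mid-training and a sharper reduction toward the end.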
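As a minimal sketch, assuming the additive form $\eta_t = \eta_e + (\eta_0 - \eta_e)\left[\tfrac{1}{2}(1 + \cos(\pi t/T)) + (t/T)^k - t/T\right]$, the schedule can be evaluated in plain Python to compare it with standard cosine decay. The function and argument names here are illustrative and do not come from the source.

```python
import math


def damped_cosine_lr(t, T, eta0, etae, k):
    """Learning rate at step t for the additive damped-cosine form (illustrative)."""
    s = t / T                                   # training progress in [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * s))
    delta = s ** k - s                          # k-decay term, zero at s = 0 and s = 1
    return etae + (eta0 - etae) * (cosine + delta)


def cosine_lr(t, T, eta0, etae):
    """Standard cosine decay, for comparison."""
    return etae + 0.5 * (eta0 - etae) * (1.0 + math.cos(math.pi * t / T))


if __name__ == "__main__":
    T, eta0, etae = 1000, 0.1, 0.001
    for t in (0, 250, 500, 750, 1000):
        print(t, cosine_lr(t, T, eta0, etae),
              damped_cosine_lr(t, T, eta0, etae, k=1.5))
```

With `k=1` the two functions agree at every step, and for any `k` both start at `eta0` and end at `etae`.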
2. Properties and Hyperparameter Effects
The $k$-decay term, defined as $\Delta_k(t) = \frac{t^k}{T^k} - \frac{t}{T}$, modifies the higher-order derivatives of the learning rate curve:
- At $t = 0$ and $t = T$, $\Delta_k(t) = 0$, ensuring the endpoints remain fixed.
- $\frac{d\Delta_k}{dt} = \frac{k\,t^{k-1}}{T^k} - \frac{1}{T}$. For low $t$, the derivative is negative (further suppressing LR early); for high $t$, it becomes positive (producing a steeper drop in the tail).
Numerically, increasing $k$ dampens the LR during mid-training and steepens its decay near the end, raising the rate of change (ROC) in the final epochs. The authors argue that a larger ROC at the end correlates with sharper training loss reduction.
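The point where the derivative of the $k$-decay term changes sign can be located explicitly. Assuming $\Delta_k(t) = (t/T)^k - t/T$ and writing $s = t/T$, setting the derivative to zero gives $s^* = k^{-1/(k-1)}$ for $k > 1$. The sketch below computes this crossover and evaluates the slope on either side; the helper names are hypothetical.

```python
def kdecay_derivative(s, k):
    """d(Delta_k)/ds in normalized time s = t / T, for Delta_k = s**k - s."""
    return k * s ** (k - 1) - 1.0


def crossover_point(k):
    """Normalized time s* at which the derivative changes sign (requires k > 1)."""
    if k <= 1:
        raise ValueError("crossover is only defined for k > 1")
    return k ** (-1.0 / (k - 1.0))


if __name__ == "__main__":
    for k in (1.5, 2.0, 3.0):
        s_star = crossover_point(k)
        before = kdecay_derivative(0.5 * s_star, k)        # sampled before s*
        after = kdecay_derivative(0.5 * (1 + s_star), k)   # sampled after s*
        print(f"k={k}: s*={s_star:.4f}, slope before={before:+.3f}, slope after={after:+.3f}")
```

For example, $k = 2$ places the crossover exactly at mid-training ($s^* = 1/2$).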
Varying $k$ systematically changes the learning rate dynamics:
- For shallow nets, larger $k$ is sometimes advantageous, while deeper nets require smaller $k$ to avoid excessively low LRs during most of training.
- There exists a critical value of $k$ where test error is minimized; beyond this, further increases degrade performance due to over-dampening.
Recommended Ranges
| Network/Vision Task | Optimal $k$ (observed) | Recommendation |
|---|---|---|
| CIFAR-10 Wide ResNet-28-10 | Up to | |
| CIFAR-100 Wide ResNet-28-10 | ||
| ResNet-101 (deeper) | ||
| ResNet-47 (shallower) | | |
In practice, a grid search on a held-out set is advocated, e.g. , with serving as a default (Zhang et al., 2020).
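A grid search over $k$ can be sketched as follows. The candidate values and the `evaluate` callable are placeholders chosen for illustration, not values taken from the source; in practice `evaluate(k)` would train with the given $k$ and return the held-out error.

```python
from typing import Callable, Dict, Iterable, Tuple


def grid_search_k(candidates: Iterable[float],
                  evaluate: Callable[[float], float]) -> Tuple[float, Dict[float, float]]:
    """Return the k with the lowest held-out error, plus all recorded scores."""
    scores = {k: evaluate(k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    # Stand-in for a real train-and-validate run (assumption for illustration only).
    def fake_validation_error(k: float) -> float:
        return (k - 2.0) ** 2 + 5.0

    best_k, scores = grid_search_k([1.0, 1.5, 2.0, 3.0], fake_validation_error)
    print(best_k, scores)
```

Including $k = 1$ among the candidates keeps the plain cosine schedule in the comparison as a baseline.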
3. Implementation and Practical Considerations
The damped-cosine schedule requires only minor code modification relative to the base cosine schedule. In PyTorch, this involves extending the per-step learning rate lambda by one power term and a subtraction. An explicit example (as from the original source):
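```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR


def damped_cosine(k, total_steps, eta0, etae):
    """Return a LambdaLR multiplier for the damped-cosine schedule."""
    def lr_lambda(step):
        t = float(step)
        T = float(total_steps)
        # Standard cosine factor: 1 at t = 0, 0 at t = T.
        c = 0.5 * (1 + math.cos(math.pi * t / T))
        # k-decay term: zero at t = 0 and t = T.
        delta = (t ** k) / (T ** k) - (t / T)
        # LambdaLR multiplies the base lr (eta0) by this value,
        # so the target lr is expressed as a fraction of eta0.
        return (etae + (eta0 - etae) * (c + delta)) / eta0
    return lr_lambda


# `model`, `num_batches`, and `epochs` are assumed to be defined by the training script.
eta0, etae = 0.1, 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=eta0, momentum=0.9)
scheduler = LambdaLR(optimizer, lr_lambda=damped_cosine(
    k=1.5, total_steps=num_batches * epochs, eta0=eta0, etae=etae))
```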
The method assumes that the total number of steps $T$ is known in advance, since the analytic form of the extra polynomial term depends on it.
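Because the schedule is a closed-form function of the step, it can be tabulated and inspected before training. The sketch below (function names are illustrative) evaluates the additive form at every step and reports the minimum. With the additive term taken as $(t/T)^k - t/T$, values for $k > 1$ can fall below $\eta_e$ shortly before the final step, so a floor at $\eta_e$ is shown as an optional safeguard; this clamp is an assumption of the sketch and is not described in the source.

```python
import math


def damped_cosine_lr(t, T, eta0, etae, k):
    """Additive damped-cosine learning rate at step t (illustrative)."""
    s = t / T
    cosine = 0.5 * (1.0 + math.cos(math.pi * s))
    return etae + (eta0 - etae) * (cosine + s ** k - s)


def tabulate(T, eta0, etae, k, floor=None):
    """Evaluate the schedule at steps 0..T; optionally clamp from below."""
    lrs = [damped_cosine_lr(t, T, eta0, etae, k) for t in range(T + 1)]
    if floor is not None:
        # Safeguard added for this sketch; not part of the source method.
        lrs = [max(lr, floor) for lr in lrs]
    return lrs


if __name__ == "__main__":
    T, eta0, etae, k = 1000, 0.1, 0.001, 1.5
    raw = tabulate(T, eta0, etae, k)
    clamped = tabulate(T, eta0, etae, k, floor=etae)
    print("min raw lr:", min(raw))
    print("min clamped lr:", min(clamped))
```

Checking the tabulated values this way is cheap and makes it easier to spot a $k$ that pushes the learning rate too low for a given $\eta_0$ and $\eta_e$.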
4. Empirical Performance and Benchmark Results
Empirical evaluation on standard vision datasets shows that the damped-cosine schedule can improve test performance, especially when applied to polynomial learning rate schedules (POL), though improvements with cosine bases are sometimes marginal or even slightly negative. Results from (Zhang et al., 2020):
| Dataset/Model | Baseline Sched. | COS (error %) | COS + k-decay | POL (error %) | POL + k-decay |
|---|---|---|---|---|---|
| CIFAR-10 (WideResNet-28-10, 200ep) | StepDecay | 3.68 | 3.82 (↓0.14%) | — | 3.59 (↑0.36%) |
| CIFAR-100 (WideResNet-28-10) | StepDecay | 18.68 | 18.44 (↑0.24%) | — | 18.43 (↑0.99%) |
| ImageNet (ResNet-50, 90ep) | POL | — | — | 24.36 | 23.11 (↑1.25%) |
In many cases, the largest and most consistent gains were observed when $k$-decay was added to polynomial learning rate schedules, suggesting the extra term is best matched to monotonic decay shapes.
5. Limitations and Failure Cases
Several limitations are notable:
- For non-monotonic base schedules (e.g., cosine), the monotonic extra polynomial term can be suboptimal; on CIFAR-10, COS + k-decay slightly worsened accuracy.
- Excessively large $k$ over-suppresses the learning rate during mid-training, causing learning to stagnate until a steeper tail end, which often cannot recover lost performance.
- The optimal range of $k$ narrows for deeper networks, requiring more precise tuning.
- The analytic form assumes $T$ is known and fixed; schedules involving restarts or indefinite training may require problem-specific re-derivation of the polynomial term.
6. Context, Significance, and Future Prospects
The damped-cosine or $k$-decay learning rate schedule exemplifies the sensitivity of DNN optimization to the precise form of LR decay. The empirical correlation between larger ROC at the end of the schedule and sharper loss reduction motivates further mechanistic study. The method is notable for its generality (applicable to various monotonic LR schedules) and computational simplicity (no impact on backpropagation or extra parameters).
A plausible implication is that slightly more elaborate parametric control of learning rate decay, via higher-order polynomial terms, may become standard in tuning regimes for large-scale vision models, especially where analytic schedules are favored over learned or data-driven LR adaptation. However, care must be taken to match the $k$-decay augmentation to the character of the base schedule, especially in scenarios involving non-monotonic or restart-based LR schedules (Zhang et al., 2020).