Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Published 13 Aug 2022 in cs.LG and math.OC | (2208.06677v5)

Abstract: In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing gradient at the extrapolation point. Then, Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $\mathcal{O}(\epsilon^{-3.5})$ stochastic gradient complexity on the non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a large range of minibatch size, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has been used in multiple popular deep learning frameworks or projects.


Citations (116)


Summary

The paper introduces an adaptive optimization algorithm that reformulates Nesterov momentum to improve deep model training speed with minimal overhead.
It converges faster and needs fewer training epochs than AdamW and LAMB on tasks such as image classification and NLP benchmarks.
Empirical and ablation studies validate Adan’s robustness and efficiency across diverse deep learning architectures and large-batch settings.


Introduction

The paper introduces Adan, an adaptive optimization algorithm for training deep neural networks (DNNs) that reformulates standard Nesterov momentum to improve optimization efficiency. By integrating Nesterov acceleration into adaptive gradient methods, Adan aims to improve convergence speed and robustness across architectures and training conditions. It targets two problems: the need to pick a different optimizer for each kind of DNN architecture through repeated trials, and the difficulty of large-batch training.

Algorithm Overview

Adan modifies the usual adaptive gradient update with a new Nesterov momentum estimation (NME), which avoids the extra overhead of computing the gradient at the extrapolation point. The algorithm uses the following update rules:

⟨ Initialization ⟩
theta_0, eta, beta_1, beta_2, beta_3, epsilon
m_0 = g_0
v_0 = 0
n_0 = g_0^2

⟨ Main Iteration ⟩
while k < K:
    g_k = compute_gradient(theta_k)
    m_k = (1 - beta_1) * m_{k-1} + beta_1 * g_k
    v_k = (1 - beta_2) * v_{k-1} + beta_2 * (g_k - g_{k-1})
    n_k = (1 - beta_3) * n_{k-1} + beta_3 * (g_k + (1 - beta_2) * (g_k - g_{k-1}))^2
    theta_{k+1} = theta_k - eta * (m_k + (1 - beta_2) * v_k) / (sqrt(n_k) + epsilon)
    k = k + 1
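
To make the roles of m_k, v_k, and n_k concrete, here is a minimal pure-Python sketch of an Adan-style iteration on a list of floats. It is an illustration under stated assumptions, not the released implementation: the step combines the moments as m_k + (1 - beta_2) * v_k, v starts at zero, and the function name `adan_minimize`, the default coefficients, and the toy objective are all hypothetical choices. Weight decay, bias correction, and restarts are left out.

```python
import math


def adan_minimize(grad_fn, theta0, lr=0.1, betas=(0.02, 0.08, 0.01),
                  eps=1e-8, steps=200):
    """Run an Adan-style iteration (illustrative sketch, not the official code).

    grad_fn maps a list of floats to its gradient (a list of the same length).
    """
    b1, b2, b3 = betas
    theta = list(theta0)
    g_prev = grad_fn(theta)
    m = list(g_prev)                 # m_0 = g_0
    v = [0.0] * len(theta)           # v_0 = 0 (assumed initialization)
    n = [g * g for g in g_prev]      # n_0 = g_0^2
    for _ in range(steps):
        g = grad_fn(theta)
        for i in range(len(theta)):
            diff = g[i] - g_prev[i]                  # gradient difference
            m[i] = (1 - b1) * m[i] + b1 * g[i]       # first moment
            v[i] = (1 - b2) * v[i] + b2 * diff       # moment of the difference
            n[i] = (1 - b3) * n[i] + b3 * (g[i] + (1 - b2) * diff) ** 2
            # Assumed step direction: m_k + (1 - beta_2) * v_k
            theta[i] -= lr * (m[i] + (1 - b2) * v[i]) / (math.sqrt(n[i]) + eps)
        g_prev = g
    return theta


def quadratic_grad(theta):
    """Gradient of the toy objective f(theta) = 0.5 * sum(theta_i^2)."""
    return list(theta)
```

Calling `adan_minimize(quadratic_grad, [3.0, -2.0], steps=10)` moves each coordinate toward zero by roughly `lr` per step at first, because the ratio of the moment estimate to `sqrt(n)` starts close to the sign of the gradient.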

Theoretical Insights

Adan's stochastic gradient complexity for finding an $\epsilon$-approximate first-order stationary point matches the known lower bounds for nonconvex optimization under both the Lipschitz-gradient and Lipschitz-Hessian settings; the abstract quotes the $\mathcal{O}(\epsilon^{-3.5})$ bound. The analysis indicates that Adan can converge faster than existing algorithms such as Adam and AMSGrad, mainly because it exploits the "look ahead" property of Nesterov acceleration and its dynamic regularization effect.
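
The "look ahead" property can be checked numerically in the plain, non-adaptive case. The sketch below compares a vanilla Nesterov loop, which evaluates the gradient at an extrapolated point, with a reformulated loop that only evaluates the gradient at its current iterate and adds a gradient-difference term. The momentum convention m = (1 - beta) * m + g, the function names, and the toy objective are assumptions chosen for illustration. Under that convention, the extrapolation points of the first loop coincide with the iterates of the second.

```python
def nesterov_lookahead(grad, theta0, eta, beta, steps):
    """Vanilla Nesterov: the gradient is taken at the extrapolated point."""
    theta, m = theta0, 0.0
    lookahead_points = []
    for _ in range(steps):
        point = theta - eta * (1 - beta) * m   # extrapolation point
        lookahead_points.append(point)
        g = grad(point)
        m = (1 - beta) * m + g
        theta = theta - eta * m
    return lookahead_points


def nesterov_reformulated(grad, theta0, eta, beta, steps):
    """Reformulation: the gradient is taken at the current iterate only."""
    theta, m, g_prev = theta0, 0.0, 0.0
    iterates = []
    for _ in range(steps):
        iterates.append(theta)
        g = grad(theta)
        # The gradient difference stands in for the look-ahead correction.
        m = (1 - beta) * m + g + (1 - beta) * (g - g_prev)
        theta = theta - eta * m
        g_prev = g
    return iterates


def cubic_grad(x):
    """Gradient of the toy objective f(x) = x**4 / 4."""
    return x ** 3


a = nesterov_lookahead(cubic_grad, 1.0, eta=0.01, beta=0.1, steps=50)
b = nesterov_reformulated(cubic_grad, 1.0, eta=0.01, beta=0.1, steps=50)
max_gap = max(abs(x - y) for x, y in zip(a, b))  # expected to be at rounding level
```

The two trajectories agree up to floating-point rounding, which is why the second loop can replace the first without forming a separate extrapolation point. Adan builds its adaptive moments on this kind of gradient-difference term.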

Empirical Performance

Adan is empirically validated across a variety of tasks, including image classification (CNNs, ViTs), NLP tasks (LSTMs, BERT), and reinforcement learning (PPO on MuJoCo). Adan consistently outperforms state-of-the-art optimizers on these tasks.

Image Classification: Adan converges faster, reaching comparable or higher accuracy in fewer training epochs than AdamW and LAMB.
NLP Tasks: It demonstrates robust performance on GLUE benchmarks with BERT and maintains lower perplexity on sequence tasks with Transformer models.
Reinforcement Learning: Adan enhances performance metrics when used in conjunction with PPO, indicating better stability and efficiency.

Ablation Studies

Ablation studies show that Adan is robust to variations in its hyperparameters (e.g., momentum coefficients), indicating the potential for minimal tuning effort in practical settings. The algorithm also maintains strong performance across different training settings and large-batch scenarios.

Figures

Figure 1: Training and test curves of various optimizers on the ImageNet dataset. The training loss is larger because of its stronger data augmentation.

Figure 2: Comparison of PPO and our PPO-Adan on several RL games simulated by MuJoCo. Here PPO-Adan simply replaces the Adam optimizer in PPO with our Adan and does not change others.

Figure 3: Effect of the momentum coefficients $(\beta_1,\beta_2,\beta_3)$ on the top-1 accuracy (%) of Adan on ViT-B under the MAE training framework (800 pretraining and 100 fine-tuning epochs on ImageNet).

Conclusion

Adan's integration of Nesterov acceleration with adaptive gradient estimation offers a powerful optimization framework for deep learning models, capable of handling diverse architectures and large-batch regimes. Its theoretical soundness and empirical robustness make it a versatile optimizer that can alleviate the need to carefully select different algorithms for different models or conditions.