
Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness-Aware Minimization

Published 24 Feb 2024 in cs.CV and cs.LG | (2403.08821v1)

Abstract: Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability. However, SAM calculates the gradient twice in each optimization step, thereby doubling the computation costs compared to stochastic gradient descent (SGD). In this paper, we propose a simple yet efficient sampling method to significantly accelerate SAM. Concretely, we discover that the gradient of SAM is a combination of the gradient of SGD and the Projection of the Second-order gradient matrix onto the First-order gradient (PSF). PSF exhibits a gradually increasing frequency of change during the training process. To leverage this observation, we propose an adaptive sampling method based on the variation of PSF, and we reuse the sampled PSF for non-sampling iterations. Extensive empirical results illustrate that the proposed method achieved state-of-the-art accuracies comparable to SAM on diverse network architectures.


Summary

  • The paper introduces vSAM, which adaptively samples the Projection of the Second-order gradient matrix onto the First-order gradient (PSF) to reduce redundant gradient computations in Sharpness-Aware Minimization.
  • The paper demonstrates that vSAM attains comparable or superior accuracy on models like ResNet-18 and WideResNet by leveraging a dynamic sampling strategy.
  • The paper’s method yields significant speed-ups, reducing training time by approximately 40% while maintaining robust model generalization.


Introduction

The paper introduces an approach to accelerate Sharpness-Aware Minimization (SAM), an optimizer that balances training loss against loss sharpness to improve model generalization. SAM traditionally suffers from computational inefficiency because it must compute gradients twice per optimization step. The paper proposes an efficient sampling method that adapts to the variation of the Projection of the Second-order gradient matrix onto the First-order gradient (PSF) during optimization. The result is a new variant, variation-based SAM (vSAM), that achieves comparable accuracy with a significant reduction in training time.

SAM and Gradient Decomposition

SAM aims to find parameter values corresponding to flat minima, which are associated with better generalization performance. However, the standard SAM algorithm computes gradients twice to determine the perturbation, increasing its time complexity compared to traditional Stochastic Gradient Descent (SGD). The authors show that the SAM gradient decomposes into the SGD gradient plus the PSF. The primary innovation is recognizing that the PSF's rate of change can be exploited to skip unnecessary computations without sacrificing accuracy (Figure 1).

Figure 1: Accuracy vs. training speed of SGD, SAM, LookSAM, ESAM, SAF, and vSAM (ours). Each connected line represents a method training WideResNet-28-10 and PyramidNet-110 models on CIFAR-100. vSAM substantially accelerates training with almost no reduction in accuracy.
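The decomposition can be made concrete with a first-order Taylor expansion of SAM's perturbed gradient (a sketch in our own notation, with $\rho$ the perturbation radius, $L$ the loss, and $w$ the parameters):

```latex
g_{\text{SAM}}
  \;=\; \nabla L\!\left(w + \rho\,\frac{\nabla L(w)}{\lVert \nabla L(w) \rVert}\right)
  \;\approx\; \underbrace{\nabla L(w)}_{g_{\text{SGD}}}
  \;+\; \underbrace{\frac{\rho}{\lVert \nabla L(w) \rVert}\,\nabla^{2} L(w)\,\nabla L(w)}_{\text{PSF}}
```

The second term is the Hessian applied to the normalized first-order gradient, i.e. the projection of the second-order gradient matrix onto the first-order gradient, which is why the residual $g_{\text{SAM}} - g_{\text{SGD}}$ is called the PSF.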

Adaptive Sampling Strategy

A pivotal aspect of vSAM is its dynamic sampling strategy based on the variation of the PSF. Because the PSF changes at varying rates throughout training, the authors sample it less frequently when it changes slowly and more frequently when it changes rapidly. This adaptive strategy yields significant computational savings, delivering roughly a 40% acceleration in training speed without compromising model generalization.
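One way to realize such a rule is to track the recent L2-norms of the PSF and shrink the sampling interval as their variation grows. The function below is a minimal sketch of this idea; the names, window size, and thresholds are our illustrative choices, not the paper's exact rule.

```python
import numpy as np

def update_sampling_interval(psf_norm_history, base_interval=8,
                             min_interval=1, max_interval=16):
    """Choose how many iterations to wait before recomputing the PSF.

    High recent variance in the PSF's L2-norm means the PSF is changing
    quickly, so we sample more often; low variance lets the cached PSF
    be reused for longer. Illustrative, not the authors' exact policy.
    """
    if len(psf_norm_history) < 2:
        return min_interval  # not enough history: sample every step
    recent = np.asarray(psf_norm_history[-8:])
    # Coefficient of variation: a scale-free measure of how fast the PSF moves.
    cv = recent.std() / (abs(recent.mean()) + 1e-12)
    # Shrink the interval as variation grows, clamped to sane bounds.
    interval = int(base_interval / (1.0 + 10.0 * cv))
    return max(min_interval, min(max_interval, interval))
```

With a flat history the interval stays at its base value, while a rapidly fluctuating PSF norm drives the interval down to sampling every iteration.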

Implementation Details

The vSAM algorithm involves:

  1. Calculating the SGD gradient.
  2. Evaluating whether a fresh PSF computation is needed, based on the PSF's historical variance and its magnitude relative to the SGD gradient.
  3. If a PSF calculation is deemed necessary, computing it and updating the model parameters with the full SAM gradient.
  4. Reusing the last computed PSF in iterations where recalculation is unnecessary.
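The steps above can be sketched as a single training step. This is an illustrative NumPy reconstruction, not the authors' code: `grad_fn` stands in for a backward pass, and a fixed `interval` stands in for vSAM's adaptive sampling decision.

```python
import numpy as np

def vsam_step(w, grad_fn, state, rho=0.05, lr=0.1, interval=4):
    """One vSAM-style update on parameters w (sketch under our assumptions).

    grad_fn(w) returns the loss gradient; state caches the last PSF and
    an iteration counter standing in for the adaptive sampling rule.
    """
    g = grad_fn(w)                                   # 1. SGD gradient
    if state["t"] % interval == 0:                   # 2. sample the PSF this step?
        w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)
        g_sam = grad_fn(w_adv)                       # 3. full SAM gradient
        state["psf"] = g_sam - g                     #    cache PSF = g_SAM - g_SGD
    update = g + state["psf"]                        # 4. reuse cached PSF otherwise
    state["t"] += 1
    return w - lr * update

# Toy usage: minimize L(w) = 0.5 * ||w||^2, whose gradient is w itself.
grad_fn = lambda w: w
w = np.array([1.0, -2.0])
state = {"t": 0, "psf": np.zeros_like(w)}
for _ in range(50):
    w = vsam_step(w, grad_fn, state)
```

On this toy quadratic the PSF only needs the second forward/backward pass on one step in four, which is the source of vSAM's speed-up in the full method.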

This implementation leverages the stochastic nature of the variance in the PSF's L2-norm, effectively anticipating when full recalculations can be avoided (Figure 2).

Figure 2: ResNet-18.

Experimental Results

The experiments on the CIFAR-10 and CIFAR-100 datasets, using several architectures (ResNet-18, WideResNet-28-10, PyramidNet-110), demonstrate that vSAM achieves accuracy comparable to, or even better than, SAM while significantly reducing training time. Models trained with vSAM reached state-of-the-art accuracy at lower computational cost, validating the effectiveness of adaptive PSF sampling.

Implications and Future Directions

The proposed method strikes a balance between optimization efficiency and generalization performance, making vSAM a compelling choice for applications that require large-scale training under resource constraints. The adaptive sampling approach also provides a versatile framework that could be applied to other optimization problems where redundant computations can be skipped without harm.

Conclusions

The adaptive sampling of gradients as proposed in vSAM offers a promising enhancement over traditional SAM by strategically reducing computational overhead while maintaining model performance. Future work could explore extending this methodology to other types of gradients or regularization terms, thereby broadening the applicability of this approach in more generalized contexts across various deep learning architectures and tasks.
