AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

Published 23 Oct 2024 in cs.LG | (2410.17881v2)

Abstract: Training and fine-tuning LLMs come with challenges related to memory and computational requirements due to the increasing size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which involves introducing a parallel trainable low-rank matrix to the fixed pre-trained weights at each layer. However, these methods often fall short compared to the full-rank weight training approach, as they restrict the parameter search to a low-rank subspace. This limitation can disrupt training dynamics and require a full-rank warm start to mitigate the impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases, and asymptotically approaches rank one. Leveraging this, our approach involves adaptively reducing the rank of the gradients during Adam optimization steps, using an efficient online-updating low-rank projections rule. We further present a randomized SVD scheme for efficiently finding the projection matrix. Our technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, significantly reducing overall memory requirements during training compared to state-of-the-art methods while improving model performance in both pretraining and fine-tuning. Finally, we provide a convergence analysis of our method and demonstrate its merits for training and fine-tuning language and biological foundation models.

Summary

The paper proposes adaptive gradient-rank reduction, using a randomized SVD to find the projection, so that memory costs drop while full-parameter training dynamics are kept.
It integrates an online low-rank projection with optimizer state transformation, dynamically preserving essential gradient information.
Empirical results on GLUE and C4 datasets demonstrate that AdaRankGrad outperforms methods like LoRA in both accuracy and memory efficiency.

The paper "AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning" introduces a technique for reducing the memory cost of training and fine-tuning LLMs. Because of their size, these models place substantial memory and computational demands on hardware, mainly from storing the model weights and the optimizer states. The paper proposes adaptively reducing the rank of the gradients during optimization steps, specifically with Adam, which keeps two moment estimates per parameter and is therefore memory-intensive.
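To make the optimizer-state cost concrete, here is a small back-of-the-envelope sketch. It counts stored elements for a single m x n weight matrix under standard Adam and under a generic low-rank gradient projection of rank r. The function names and the example sizes (a 4096 x 4096 layer, rank 128) are hypothetical and are not figures from the paper; the low-rank count assumes the projection matrix is stored alongside the two projected moments.

```python
def adam_state_elements(m: int, n: int) -> int:
    """Standard Adam keeps a first and a second moment, each of shape (m, n)."""
    return 2 * m * n


def projected_adam_state_elements(m: int, n: int, r: int) -> int:
    """Assumed layout: an (m, r) projection matrix plus two (r, n) moments."""
    return m * r + 2 * r * n


if __name__ == "__main__":
    m, n, r = 4096, 4096, 128  # hypothetical layer size and rank
    full = adam_state_elements(m, n)
    low = projected_adam_state_elements(m, n, r)
    print(f"full Adam states:      {full:,} elements")
    print(f"projected Adam states: {low:,} elements")
    print(f"reduction factor:      {full / low:.2f}x")
```

Under these assumptions the state count falls from 33,554,432 to 1,572,864 elements for this one layer. A smaller rank shrinks it further, which is why a rank that adapts downward over training can save additional memory.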

Conceptual Framework and Strategy

Methods such as Low-Rank Adaptation (LoRA) reduce memory usage by adding trainable low-rank matrices alongside the frozen pre-trained weights. However, such methods restrict the parameter search to a low-rank subspace, which can disrupt training dynamics. AdaRankGrad instead keeps full-parameter training and exploits the fact that the rank of the estimated layer gradients decreases over training iterations. The authors prove that, as training progresses, these gradients asymptotically approach rank one, which gives the method its theoretical grounding.

Technical Implementation

AdaRankGrad applies an online low-rank projection to the gradients. The rank of the projection is adjusted dynamically so that the projected gradient preserves a predefined fraction of the gradient's information content. A randomized Singular Value Decomposition (SVD) scheme computes the projection matrix efficiently, which keeps the cost of updating the subspace low.
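As a rough illustration of the adaptive rank selection described above, the following NumPy sketch builds a projection with a basic randomized range finder, then keeps the smallest rank whose projected gradient retains a chosen fraction of the squared Frobenius norm. The function names, the `oversample` parameter, and the use of the squared Frobenius norm as the measure of "information content" are assumptions made for illustration; the paper's exact criterion and search procedure may differ.

```python
import numpy as np


def randomized_left_basis(G, rank, oversample=5, seed=0):
    """Approximate the top `rank` left singular vectors and values of G.

    Generic randomized range finder: sketch the column space with a
    Gaussian test matrix, orthonormalize, then solve a small SVD.
    """
    rng = np.random.default_rng(seed)
    m, n = G.shape
    k = min(rank + oversample, m, n)
    omega = rng.standard_normal((n, k))   # random test matrix
    Q, _ = np.linalg.qr(G @ omega)        # (m, k) orthonormal basis of the sketch
    B = Q.T @ G                           # (k, n) small matrix
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank]


def adaptive_projection(G, fraction, max_rank, seed=0):
    """Return (P, r): P has the fewest columns r <= max_rank such that
    ||P.T @ G||_F^2 >= fraction * ||G||_F^2 (assumed criterion).
    Falls back to max_rank if the threshold is not reached."""
    U, s = randomized_left_basis(G, max_rank, seed=seed)
    total = np.linalg.norm(G) ** 2
    if total == 0.0:
        return U[:, :1], 1
    captured = np.cumsum(s ** 2) / total  # energy kept by the first r columns
    r = min(int(np.searchsorted(captured, fraction)) + 1, len(s))
    return U[:, :r], r


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    G = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 48))  # rank-3 toy gradient
    P, r = adaptive_projection(G, fraction=0.95, max_rank=8)
    print("selected rank:", r, "projection shape:", P.shape)
```

Computing one decomposition at `max_rank` and reading the rank off the cumulative singular values avoids recomputing a projection for every candidate rank.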

The training process involves four key steps:

Adaptive Subspace Selection: Utilizing randomized range finding algorithms to determine an optimal low-rank projection for the gradient matrix efficiently.
Moments Subspaces Transformation: Transforming the optimizer states according to the updated projection subspaces.
Low-Rank Optimization: Continuously updating the model parameters with projected gradients, ensuring convergence within the adaptive subspace.
Full-Parameter Update: Mapping the low-rank update back to the full parameter space and applying it to the weights, so that all parameters are trained.
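The four steps above can be sketched for a single weight matrix as follows. This is a simplified illustration under stated assumptions, not the authors' implementation. It swaps in an exact SVD for the randomized scheme to keep the example short. The subspace is refreshed on a fixed schedule (`refresh_every`, a hypothetical parameter). The first moment is carried into the new subspace with the rotation `R = P_new.T @ P_old`, and the second moment uses the element-wise square of `R` as a simple non-negative heuristic. The paper's actual refresh rule and moment transformation may differ.

```python
import numpy as np


def select_subspace(G, fraction, max_rank):
    """Step 1 (stand-in): exact SVD; keep the smallest rank that retains
    `fraction` of the squared Frobenius norm of G, capped at max_rank."""
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    energy = np.cumsum(s ** 2)
    if energy[-1] == 0.0:
        return U[:, :1]
    r = int(np.searchsorted(energy / energy[-1], fraction)) + 1
    return U[:, :min(r, max_rank, len(s))]


def train(W, grad_fn, steps, lr=0.01, fraction=0.9, max_rank=8,
          refresh_every=20, betas=(0.9, 0.999), eps=1e-8):
    """Adam on projected gradients; all names here are illustrative."""
    b1, b2 = betas
    P = m = v = None
    for t in range(1, steps + 1):
        G = grad_fn(W)                               # full (m, n) gradient
        if (t - 1) % refresh_every == 0:
            P_new = select_subspace(G, fraction, max_rank)   # step 1
            if P is None:
                m = np.zeros((P_new.shape[1], W.shape[1]))
                v = np.zeros_like(m)
            else:
                R = P_new.T @ P                      # step 2: old -> new subspace
                m = R @ m
                v = (R ** 2) @ v                     # heuristic, stays non-negative
            P = P_new
        g = P.T @ G                                  # step 3: low-rank Adam
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        update = m_hat / (np.sqrt(v_hat) + eps)      # shape (r, n)
        W = W - lr * (P @ update)                    # step 4: full-parameter update
    return W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
    loss = lambda W: 0.5 * np.sum((W - target) ** 2)
    W0 = np.zeros_like(target)
    W1 = train(W0, lambda W: W - target, steps=100, lr=0.05)
    print("loss before:", loss(W0), "loss after:", loss(W1))
```

Only the `(m, r)` projection and the two `(r, n)` moments persist between steps, while the update applied to `W` still has the full `(m, n)` shape. That is the sense in which the low-rank optimization remains a full-parameter update.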

Empirical Evaluation and Results

The paper provides an extensive evaluation on the GLUE benchmark, showcasing improvements in model performance with reduced memory requirements, when compared to both LoRA and GaLore methods. Notably, AdaRankGrad achieves higher accuracy across various tasks while maintaining efficiency in memory usage. The authors also report substantial memory savings when pre-training LLaMA models on the C4 dataset, highlighting AdaRankGrad's applicability to large-scale LLMs.

Implications and Future Directions

AdaRankGrad's approach has implications for both the theoretical understanding and the practical implementation of LLM training. Because the rank adapts during training, memory is spent only where the gradient still carries information, which could help make training these large models feasible on consumer-grade hardware. Theoretically, the analysis offers a closer look at the gradient dynamics of LLMs and opens pathways for further work on low-rank approximation techniques.

Future research could explore extending the AdaRankGrad framework to other optimization methods beyond Adam and investigate alternative algorithms for subspace rank determination. Additionally, the integration of AdaRankGrad with quantization methods could further enhance its efficiency, facilitating more widespread accessibility and deployment of LLMs.

Overall, AdaRankGrad stands as a forward-thinking approach to optimizing LLM training, balancing the demands of computational efficiency and model performance.