APOLLO: SGD-like Memory, AdamW-level Performance

Published 6 Dec 2024 in cs.LG, cs.AI, and cs.PF | (2412.05270v4)

Abstract: LLMs are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.

Summary

  • The paper presents a novel low-rank optimizer that leverages random projections to approximate gradient scaling, significantly reducing memory overhead compared to AdamW.
  • The paper demonstrates that APOLLO achieves improved validation perplexity and approximately 3× throughput on LLaMA models across various rank configurations and hardware setups.
  • The paper shows that APOLLO-Mini, with tensor-wise rank-1 gradient scaling, enables LLM training on just 12GB memory without compromising accuracy in fine-tuning tasks.

The paper "APOLLO: SGD-like Memory, AdamW-level Performance" presents a novel optimization approach designed to enhance the memory efficiency of training LLMs while maintaining the performance level typically achieved with the AdamW optimizer. The focus is on addressing the significant memory overhead incurred by AdamW, which can limit scalability and throughput in large-scale model training.

Motivation and Challenges

Training LLMs such as GPT-3 or LLaMA with the AdamW optimizer requires substantial memory to maintain first- and second-moment estimates, roughly tripling the memory demand relative to the model parameters alone. This significantly increases resource costs, often necessitating higher-end GPUs or smaller batch sizes, which in turn hurts throughput and scalability. Previous efforts to reduce optimizer memory, such as SVD-based low-rank methods, either traded away optimizer performance or incurred high computational costs from the SVD itself.
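The "tripling" claim follows from simple arithmetic. A minimal sketch for a 7B-parameter model, assuming fp32 (4 bytes) for both the weights and the two AdamW moment buffers (an illustrative assumption, not a figure from the paper; mixed-precision setups shift the exact ratio):

```python
# Back-of-envelope memory for a 7B-parameter model, assuming fp32 weights
# and fp32 AdamW moment buffers (illustrative; real training recipes vary).
params = 7_000_000_000
bytes_per_value = 4  # fp32

weights_gb = params * bytes_per_value / 1e9            # 28 GB of weights
adamw_states_gb = 2 * params * bytes_per_value / 1e9   # m and v: 56 GB, i.e. 2x the weights

total_gb = weights_gb + adamw_states_gb                # 84 GB: 3x the weights alone
print(f"weights: {weights_gb:.0f} GB, AdamW states: {adamw_states_gb:.0f} GB, "
      f"total: {total_gb:.0f} GB")
```

Plain SGD keeps no per-parameter optimizer state, which is the memory target APOLLO aims for.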

Proposed Approach

APOLLO introduces a memory-efficient method that approximates gradient scaling for LLM optimization using random projections in a low-rank space. This approach effectively minimizes the optimizer state footprint without the need for expensive SVD operations:

  1. Gradient Scaling Simplification: APOLLO refines the element-wise learning rate updates of AdamW into a structured format suitable for channel-wise or tensor-wise application, reducing sensitivity to noise.
  2. Low-Rank Approximation: The optimizer state is approximated in a low-rank space utilizing random projection matrices, which successfully preserves the variance properties needed for effective learning rate adaptation.
  3. APOLLO-Mini Variant: For extreme memory efficiency, APOLLO-Mini employs tensor-wise gradient scaling within a rank-1 subspace, achieving similar performance to AdamW at a significantly reduced, SGD-like memory cost.

    Figure 1: Overview of the APOLLO optimizer with a focus on memory breakdown comparison against GaLore and the end-to-end training throughput on GPUs.
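The three steps above can be sketched as a single update rule. In this minimal sketch (variable names, the `state` layout, and the exact norm-ratio scaling are illustrative assumptions, not the authors' reference implementation), AdamW-style moments live only in an r-dimensional space obtained from a fixed random projection, and the resulting channel-wise scaling is applied to the full-rank gradient:

```python
import numpy as np

def apollo_like_step(W, G, state, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative APOLLO-style update for a weight matrix W with gradient G (m x n).

    A fixed random projection P (r x m, r << m) compresses the gradient; moment
    estimates are kept only in the r-dimensional space, and a per-channel scaling
    factor derived there rescales the raw full-rank gradient.
    """
    P = state["P"]                       # (r, m) random projection, sampled once
    R = P @ G                            # (r, n) low-rank gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * R
    state["v"] = beta2 * state["v"] + (1 - beta2) * R ** 2
    update = state["m"] / (np.sqrt(state["v"]) + eps)
    # Channel-wise scaling: how much an AdamW-style rule would rescale each
    # column, estimated in the low-rank space. APOLLO-Mini would collapse this
    # to a single tensor-wise scalar by using r = 1.
    s = np.linalg.norm(update, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    return W - lr * (G * s)              # scale the raw full-rank gradient

# Usage: rank-64 optimizer state for a 256 x 128 weight matrix.
rng = np.random.default_rng(0)
m, n, r = 256, 128, 64
state = {
    "P": rng.standard_normal((r, m)) / np.sqrt(r),  # variance-preserving scaling
    "m": np.zeros((r, n)),
    "v": np.zeros((r, n)),
}
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))
W_new = apollo_like_step(W, G, state)
```

Note the memory profile: the optimizer state is two r x n buffers instead of AdamW's two m x n buffers, and no SVD is ever computed because P is a pure random projection.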

Experimental Results

Pre-training Performance:

APOLLO demonstrates superior performance in pre-training LLaMA models of varying sizes on the C4 dataset. It delivers a notable reduction in validation perplexity compared to other low-rank and projection-based methods, such as GaLore and Fira, while significantly lowering memory usage.

  • LLaMA-7B Model: Achieved better validation perplexity than GaLore at its rank setting of 1024, while APOLLO operated satisfactorily at rank 256, or even rank 1 in the APOLLO-Mini configuration.
  • Throughput Improvements: On an 8×A100-80GB setup, APOLLO achieved around 3× the throughput of AdamW, as its reduced memory usage supports larger batch sizes.

    Figure 2: Comparison of Validation perplexity on LLaMA-7B.

Fine-tuning and System-Level Benefits:

In fine-tuning tasks, APOLLO maintains competitive accuracy, surpassing full-rank AdamW on commonsense reasoning and MMLU benchmarks with a negligible memory footprint for optimizer states. Combining APOLLO-Mini with weight quantization further allows LLaMA-7B pre-training using less than 12 GB of memory, democratizing access to LLM training on less capable hardware.

Figure 3: Validation perplexity of pretraining LLaMA-350M on the C4 dataset across different stages.

Conclusion

APOLLO provides a compelling solution for memory-efficient LLM training, achieving AdamW-level performance without the associated memory costs. The proposed methodology scales effectively across model sizes and tasks, proving its applicability and robustness. Future work could explore further reductions in computational costs in tandem with developments in hardware efficiency and low-precision computation techniques.
