ZeRO++: Extremely Efficient Collective Communication for Giant Model Training
Abstract: Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of LLMs on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at a scale that forces the per-GPU batch size to be small, ZeRO's effective throughput is limited by the high communication volume from gathering weights in the forward and backward passes and from averaging gradients. This paper introduces three communication volume reduction techniques, collectively referred to as ZeRO++, each targeting one of the communication collectives in ZeRO. The first is a block-quantization-based all-gather. The second is data remapping that trades off communication for more memory. The third is a novel all-to-all-based quantized gradient averaging paradigm that replaces the reduce-scatter collective and preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384-GPU scale.
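To make the block-quantization idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes symmetric INT8 quantization with one scale per block; the function names (`quantize_block`, `block_quantize`, `block_dequantize`), the block size, and the sample values are hypothetical choices for illustration. The point it shows is that giving each block its own scale keeps small values representable even when another block contains large outliers, which is one reason low-precision data can be communicated with limited error.

```python
from typing import List, Tuple


def quantize_block(block: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric quantization of one block using that block's own scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8 (an assumption of this sketch)
    max_abs = max((abs(x) for x in block), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return codes, scale


def block_quantize(values: List[float], block_size: int,
                   bits: int = 8) -> List[Tuple[List[int], float]]:
    """Split values into fixed-size blocks and quantize each independently."""
    return [quantize_block(values[i:i + block_size], bits)
            for i in range(0, len(values), block_size)]


def block_dequantize(blocks: List[Tuple[List[int], float]]) -> List[float]:
    """Recover approximate values from (codes, scale) pairs."""
    out: List[float] = []
    for codes, scale in blocks:
        out.extend(c * scale for c in codes)
    return out


def max_abs_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical weights: four small values followed by four large ones.
weights = [0.01, -0.02, 0.015, 0.005, 8.0, -7.5, 6.25, 7.0]

# One scale per block of four values versus one scale for the whole tensor.
per_block = block_dequantize(block_quantize(weights, block_size=4))
whole = block_dequantize(block_quantize(weights, block_size=len(weights)))

print("per-block error on small values:", max_abs_error(per_block[:4], weights[:4]))
print("single-scale error on small values:", max_abs_error(whole[:4], weights[:4]))
```

In this toy example the single-scale variant rounds every small value to zero, while the per-block variant keeps them. A real system would send the integer codes plus the per-block scales through the all-gather and dequantize on the receiving side; that communication step is not modeled here.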