DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining
Abstract: Communication scheduling, which enables all-reduce communications to be overlapped with backpropagation computations, has been shown to be effective in accelerating distributed training and is commonly adopted in popular distributed deep learning frameworks. However, two fundamental problems remain: (1) each all-reduce operation incurs excessive startup latency that grows with the number of workers; (2) training performance is sub-optimal because the all-reduce must complete before the dependent feed-forward computation of the next iteration can start, imposing a synchronization barrier. We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations, which can be overlapped with both backpropagation and feed-forward computations without extra communications. We further design a practical tensor fusion algorithm to improve the training performance. Experimental results with five popular models show that DeAR achieves up to 83% and 15% training speedup over state-of-the-art solutions on a 64-GPU cluster with 10Gb/s Ethernet and 100Gb/s InfiniBand interconnects, respectively.
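The core idea sketched in the abstract — splitting each all-reduce into a reduce-scatter followed by an all-gather, so the first half can overlap with backpropagation and the second half with the next iteration's feed-forward pass — can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes PyTorch's torch.distributed collectives (reduce_scatter_tensor and all_gather_into_tensor), and the helper names start_reduce_scatter / finish_all_gather are hypothetical.

```python
# Minimal sketch of decoupling all-reduce into reduce-scatter + all-gather.
# Assumes torch.distributed is already initialized and that each gradient's
# element count is divisible by the world size (padding omitted for brevity).
import torch
import torch.distributed as dist


def start_reduce_scatter(grad: torch.Tensor):
    """Issue the first half of all-reduce asynchronously during the backward pass."""
    world_size = dist.get_world_size()
    # Each rank keeps one reduced shard of the gradient.
    shard = torch.empty(grad.numel() // world_size,
                        dtype=grad.dtype, device=grad.device)
    handle = dist.reduce_scatter_tensor(shard, grad,
                                        op=dist.ReduceOp.SUM, async_op=True)
    return handle, shard


def finish_all_gather(handle, shard: torch.Tensor, grad: torch.Tensor):
    """Issue the second half lazily, overlapping with the next forward pass."""
    handle.wait()                      # reduce-scatter completed during backward
    shard /= dist.get_world_size()     # average the local shard
    # All-gather reassembles the full averaged gradient; it can be launched just
    # before the corresponding layer's parameters are needed in the forward pass.
    return dist.all_gather_into_tensor(grad, shard, async_op=True)
```

In this formulation, the reduce-scatter for a layer can be launched as soon as its gradient is produced during backpropagation, while the matching all-gather is deferred until just before that layer is used in the next forward pass, which is what removes the end-of-iteration synchronization barrier described above.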