Full-Stack Allreduce on Multi-Rail Networks
Abstract: High communication costs impede scalability in distributed systems, and large multimodal models such as Sora exacerbate the problem by demanding more bandwidth than any single current network can supply; existing network architectures do not close this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limits of large-scale networks by enabling collaborative data transfer across multiple networks. To this end, we propose Nezha, a system that integrates TCP, the in-network computing protocol SHARP, and the RDMA-based protocol GLEX. To maximize aggregate transfer rates, Nezha employs a load-balanced data allocation scheme driven by cost feedback, combined with exception handling to ensure reliable transmission. Experiments on a six-node cluster show that Nezha improves allreduce performance by 58% to 87% in homogeneous dual-rail configurations and delivers substantial acceleration in heterogeneous settings, contingent on the performance variance among the networks.
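To make the cost-feedback allocation idea concrete, the following is a minimal Python sketch, not Nezha's actual implementation: it splits an allreduce payload across rails inversely proportional to each rail's estimated per-byte cost, and refines those estimates from measured transfer times. All names (Rail, CostFeedbackBalancer, the initial cost values) are illustrative assumptions.

```python
class Rail:
    """One network rail (e.g., TCP, SHARP, or GLEX) with a running cost estimate."""
    def __init__(self, name: str, init_cost_per_byte: float):
        self.name = name
        self.cost_per_byte = init_cost_per_byte  # seconds per byte (assumed initial value)

class CostFeedbackBalancer:
    """Hypothetical balancer: allocate proportionally to 1/cost, update costs by feedback."""
    def __init__(self, rails, alpha: float = 0.5):
        self.rails = rails
        self.alpha = alpha  # smoothing factor for the cost-feedback update

    def split(self, total_bytes: int):
        """Give each rail a share of the payload inversely proportional to its cost."""
        weights = [1.0 / r.cost_per_byte for r in self.rails]
        scale = total_bytes / sum(weights)
        shares = [int(w * scale) for w in weights]
        shares[-1] += total_bytes - sum(shares)  # absorb integer rounding remainder
        return dict(zip((r.name for r in self.rails), shares))

    def report(self, rail: Rail, nbytes: int, elapsed_s: float):
        """Feed back a measured transfer time to refine the rail's per-byte cost."""
        observed = elapsed_s / max(nbytes, 1)
        rail.cost_per_byte = (self.alpha * observed
                              + (1.0 - self.alpha) * rail.cost_per_byte)

# Example: three rails; faster rails receive proportionally larger chunks.
rails = [Rail("tcp", 8e-9), Rail("sharp", 2e-9), Rail("glex", 1e-9)]
balancer = CostFeedbackBalancer(rails)
print(balancer.split(64 * 1024 * 1024))          # e.g. {'tcp': ..., 'sharp': ..., 'glex': ...}
balancer.report(rails[0], 8 * 1024 * 1024, 0.10)  # a slow TCP observation raises its cost
```

Under this scheme, a rail that keeps reporting slow transfers sees its cost estimate rise and its share of subsequent payloads shrink, which is one plausible way to realize the adaptive behavior the abstract describes for heterogeneous rails.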