MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization
Abstract: Mixture of Experts (MoE) is an advanced model architecture that combines multiple specialized expert models from various domains into a single supermodel. This approach allows the model to scale without significantly increasing the computational cost of training and inference, while maximizing model performance. However, current distributed training frameworks do not fully optimize communication, especially for large base models. This paper proposes MoNTA, a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and the training cluster's inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x improvement in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model on 16 A800 cards with an 8K sequence length yields a 13% improvement in overall latency. Project Page: https://github.com/EnflameTechnology/DeepSpeed.
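The abstract describes choosing a parallel strategy from the AllToAll communication volume and the cluster's intra-node versus inter-node bandwidth. The sketch below illustrates that idea with a toy cost model; the strategy names, bandwidth figures, and traffic-split fraction are illustrative assumptions, not MoNTA's actual algorithm.

```python
def comm_time(volume_bytes, bandwidth_gbps):
    """Seconds to move `volume_bytes` over a link of `bandwidth_gbps` GB/s."""
    return volume_bytes / (bandwidth_gbps * 1e9)

def pick_strategy(alltoall_volume, intra_bw=200.0, inter_bw=25.0,
                  intra_fraction=0.5):
    """Choose the expert-placement strategy with the lower modeled AllToAll cost.

    `intra_fraction` is the share of expert traffic that stays inside a node
    when experts are grouped per node (an assumption of this toy model).
    """
    # Strategy A: experts spread across nodes -> all traffic crosses the
    # slower inter-node link.
    cost_spread = comm_time(alltoall_volume, inter_bw)
    # Strategy B: experts grouped per node -> part of the traffic rides the
    # faster intra-node interconnect.
    cost_grouped = (comm_time(alltoall_volume * intra_fraction, intra_bw)
                    + comm_time(alltoall_volume * (1 - intra_fraction), inter_bw))
    if cost_grouped < cost_spread:
        return "group-experts-per-node", cost_grouped
    return "spread-experts", cost_spread

strategy, cost = pick_strategy(alltoall_volume=1 << 30)  # 1 GiB of tokens
print(strategy, round(cost, 4))
```

With the assumed bandwidths (200 GB/s intra-node, 25 GB/s inter-node), grouping experts per node wins because half of the traffic avoids the slow link; with a different topology or volume, the selector can flip to the other strategy.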