MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization
Abstract: Mixture of Experts (MoE) is an advanced model architecture that combines multiple specialized expert models from various domains into a single supermodel. This approach allows the model to scale without significantly increasing the computational cost of training and inference, while maximizing model performance. However, current distributed training frameworks do not fully optimize communication, especially for large base models. This paper proposes MoNTA, a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and the training cluster's inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x improvement in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model on 16 A800 cards with an 8K sequence length yields a 13% improvement in overall latency. Project Page: https://github.com/EnflameTechnology/DeepSpeed.
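The abstract describes choosing a parallel strategy from the AllToAll communication volume and the cluster's intra-node versus inter-node bandwidth. The sketch below illustrates that idea with a toy cost model; the strategy names, bandwidth figures, and traffic-split fraction are illustrative assumptions, not MoNTA's actual algorithm.

```python
def comm_time(volume_bytes, bandwidth_gbps):
    """Seconds to move `volume_bytes` over a link of `bandwidth_gbps` GB/s."""
    return volume_bytes / (bandwidth_gbps * 1e9)

def pick_strategy(alltoall_volume, intra_bw=200.0, inter_bw=25.0,
                  intra_fraction=0.5):
    """Choose the expert-placement strategy with the lower modeled AllToAll cost.

    `intra_fraction` is the share of expert traffic that stays inside a node
    when experts are grouped per node (an assumption of this toy model).
    """
    # Strategy A: experts spread across nodes -> all traffic crosses the
    # slower inter-node link.
    cost_spread = comm_time(alltoall_volume, inter_bw)
    # Strategy B: experts grouped per node -> part of the traffic rides the
    # faster intra-node interconnect.
    cost_grouped = (comm_time(alltoall_volume * intra_fraction, intra_bw)
                    + comm_time(alltoall_volume * (1 - intra_fraction), inter_bw))
    if cost_grouped < cost_spread:
        return "group-experts-per-node", cost_grouped
    return "spread-experts", cost_spread

strategy, cost = pick_strategy(alltoall_volume=1 << 30)  # 1 GiB of tokens
print(strategy, round(cost, 4))
```

With the assumed bandwidths (200 GB/s intra-node, 25 GB/s inter-node), grouping experts per node wins because half of the traffic avoids the slow link; with a different topology or volume, the selector can flip to the other strategy.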