- The paper presents a comprehensive methodology for efficiently training the 718-billion parameter Pangu Ultra MoE model on Ascend NPUs using simulation, expert parallelism, and memory optimizations.
- Key results demonstrate significant performance improvements, achieving a 30.0% Model Flops Utilization and 1.46 million Tokens Per Second, comparable to state-of-the-art models.
- This research provides practical guidance for scalable training of large sparse models on NPUs and theoretical insights into optimizing MoE architectures for future multi-trillion parameter models.
Pangu Ultra MoE: How to Train Large Sparse Mixture of Experts on Ascend NPUs
The paper "Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs" by the Huawei Pangu Team presents a comprehensive methodology for training large-scale sparse LLMs, specifically a Mixture of Experts (MoE) model, on Ascend Neural Processing Units (NPUs). As model sizes approach a trillion parameters, practical challenges arise in optimizing software and hardware to fully utilize networks at this scale. This study explores strategies for harnessing such model scales efficiently on Ascend NPUs, aiming to make full use of the available computing resources and to realize the hardware's theoretical performance in practice.
Key Objectives and Achievements
The authors address the challenges by designing the architecture and system configurations meticulously. Here are the primary objectives and achievements outlined in the paper:
- Model Configuration and Simulation: The study introduces a simulation-driven method for selecting optimal model configurations that align with Ascend NPUs’ capabilities. This approach allows quick exploration of the trade-offs among various model hyperparameters without the overhead of costly real-world experiments. The resulting model configuration, Pangu Ultra MoE, encompasses 718 billion parameters.
- Expert Parallelism and Communication Optimization: The research explores Expert Parallelism to address synchronization overhead during training across 6,000 NPUs. By implementing hierarchical communication strategies, the system effectively segregates intra-node and inter-node traffic, optimizing bandwidth utilization.
- Memory Efficiency Optimizations: Refined strategies for memory efficiency help mitigate the high memory utilization and parameter management overhead typical in large MoE models. Techniques such as fine-grained recomputation and tensor swapping are applied, keeping memory usage within manageable bounds during the training process.
- Load Balancing and Token Behavior Analysis: Load balancing is substantially improved through adaptive strategies that predict expert load and place tokens accordingly, avoiding severe bottlenecks. In addition, a careful analysis of model behavior during training yields insights that can inform future MoE implementations.
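The simulation-driven configuration search described above can be sketched in miniature. The cost model below is purely illustrative (the paper's actual simulator is not public): `estimate_cost`, its FFN-width rule, and the candidate grids are all hypothetical placeholders, used only to show how one can rank candidate MoE shapes against a parameter budget without running real training jobs.

```python
from itertools import product

def estimate_cost(hidden, n_experts, top_k, n_layers=61):
    """Toy cost model: rough total parameter count and per-token FLOPs.
    All formulas here are illustrative, not the paper's simulator."""
    ffn = 4 * hidden  # hypothetical expert FFN width
    # Total params: attention-like dense part + all expert FFNs.
    params = n_layers * (4 * hidden**2 + n_experts * 2 * hidden * ffn)
    # Only top_k experts are active per token in a sparse MoE.
    active = n_layers * (4 * hidden**2 + top_k * 2 * hidden * ffn)
    flops_per_token = 6 * active  # ~6 FLOPs per active param (fwd + bwd)
    return params, flops_per_token

def search(budget_params):
    """Pick the cheapest-per-token config that fits the parameter budget."""
    best = None
    for hidden, n_experts, top_k in product([4096, 8192], [64, 128, 256], [4, 8]):
        params, flops = estimate_cost(hidden, n_experts, top_k)
        if params <= budget_params and (best is None or flops < best[0]):
            best = (flops, hidden, n_experts, top_k)
    return best
```

A real simulator would additionally model memory capacity, interconnect bandwidth, and pipeline schedules, but the search loop has the same shape: enumerate, score, filter by hardware constraints.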
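The hierarchical communication idea, separating intra-node from inter-node traffic, can be illustrated with a small routing-plan sketch. This is a toy scheduler of my own construction, not the paper's implementation: instead of every rank sending one small message to every other rank, token destinations are first grouped by destination *node* (one large inter-node transfer per node), and the local rank is resolved inside the node afterwards.

```python
def plan_hierarchical_alltoall(token_dests, ranks_per_node):
    """Group per-token destination ranks into a two-phase schedule:
    phase 1 moves one aggregated message per destination node
    (slow inter-node links), phase 2 scatters within the node
    (fast intra-node links). Illustrative only."""
    inter = {}  # dest_node -> list of (token_index, dest_local_rank)
    for i, dest_rank in enumerate(token_dests):
        node, local = divmod(dest_rank, ranks_per_node)
        inter.setdefault(node, []).append((i, local))
    return inter
```

The payoff is that the number of messages crossing the slow inter-node fabric scales with the node count rather than the rank count, which is what makes bandwidth utilization tractable at thousands of NPUs.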
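The recomputation idea behind the memory optimizations trades compute for memory: instead of keeping an activation alive until the backward pass, keep only the (cheap) inputs and the function, and rerun the forward step when the value is needed. The class below is a minimal framework-free sketch of that principle, not the paper's fine-grained recomputation machinery.

```python
class Recompute:
    """Memory-for-compute trade: store inputs + fn instead of the
    activation; recompute the activation on demand. Toy sketch of
    activation recomputation, without a real autograd engine."""
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs  # cheap to retain
        self._cache = None                 # activation not stored

    def value(self):
        if self._cache is None:
            self._cache = self.fn(*self.inputs)  # recompute once, on demand
        return self._cache
```

In a real system this choice is made per operator, and is combined with tensor swapping (offloading tensors to host memory and prefetching them back), so that peak device memory stays within bounds.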
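A common way to make expert load predictable, which the adaptive load-balancing bullet above gestures at, is to cap how many tokens each expert may accept and reroute the overflow. The policy below (reroute to the least-loaded expert) is a hypothetical stand-in for the paper's actual strategy, shown only to make the capacity idea concrete.

```python
from collections import Counter

def route_with_capacity(top1_experts, n_experts, capacity):
    """Capacity-based routing sketch: each expert accepts at most
    `capacity` tokens; overflow is rerouted to the currently
    least-loaded expert. Not the paper's exact policy."""
    load = Counter()
    assignment = []
    for e in top1_experts:
        if load[e] >= capacity:  # expert full: pick least-loaded instead
            e = min(range(n_experts), key=lambda k: load[k])
        load[e] += 1
        assignment.append(e)
    return assignment, load
```

Even this crude policy turns a worst-case hotspot (every token routed to one expert) into a perfectly even load, at the cost of overriding some router decisions.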
Numerical Results and Comparative Evaluation
The paper reports significant improvements, including a Model Flops Utilization (MFU) of 30.0%, notably higher than the baseline, and a throughput of 1.46 million Tokens Per Second (TPS), a marked improvement over previously reported figures. The authors further show that Pangu Ultra MoE performs comparably to existing state-of-the-art models such as DeepSeek R1, particularly on medical benchmarks, demonstrating strong domain-specific performance.
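For readers unfamiliar with the metric, MFU is the ratio of the model FLOPs actually sustained to the aggregate peak FLOPs of the hardware. The helper below computes it; the numbers in the test are hypothetical round figures for illustration, not the paper's measured values.

```python
def model_flops_utilization(tokens_per_sec, flops_per_token,
                            n_devices, peak_flops_per_device):
    """MFU = achieved model-FLOPs throughput / aggregate peak FLOPs.
    flops_per_token is the model's forward+backward FLOPs per token."""
    achieved = tokens_per_sec * flops_per_token
    peak = n_devices * peak_flops_per_device
    return achieved / peak
```

Raising either the sustained token throughput or lowering communication/memory stalls (which inflate wall-clock time without adding model FLOPs) is what moves this number.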
Implications and Future Work
The implications of this research are manifold. Practically, the proposed methodologies support scalable and efficient training of expansive MoE models on NPUs, effectively bridging the gap between theoretical potential and practical implementation. Theoretically, the analyses underscore the importance of expert specialization and load balancing in training sparse models, fostering a more nuanced understanding of MoE architectures. As AI development advances, the paper lays foundational guidance for training even larger, multi-trillion parameter models with greater efficiency and resource optimization.
Conclusion
In summary, this paper presents a practical and effective approach to training large-scale sparse models on Ascend hardware, elaborating strategies that optimize both architecture selection and system processes. Through extensive system optimizations and rigorous simulation methodology, the work exemplifies a robust framework that could be adapted to other large-scale model training efforts, paving the way for future advances in AI capabilities built on hardware-specific configurations.