MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Published 3 Apr 2025 in cs.DC and cs.LG | (2504.02263v3)

Abstract: Mixture-of-Experts (MoE) showcases tremendous potential to scale LLMs with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.

Summary

  • The paper introduces MegaScale-Infer, a system that uses disaggregated expert parallelism to separate attention and FFN modules for optimized serving of Mixture-of-Experts models.
  • MegaScale-Infer employs a ping-pong pipeline parallelism strategy and a custom M2N communication library to minimize idle time and reduce communication overhead between modules.
  • Experimental results show MegaScale-Infer achieving up to 1.90× higher per-GPU throughput and 1.7× better throughput per dollar compared to state-of-the-art systems.

The paper "MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism" addresses critical challenges in the efficient large-scale serving of Mixture-of-Experts (MoE) models, focusing on improving GPU utilization and reducing operational costs. The MoE architecture dynamically routes each input token to a small subset of feed-forward networks (FFNs); this sparse activation shifts the FFNs from compute-intensive to memory-intensive during inference, hurting resource efficiency.

Key Contributions

  1. Disaggregated Expert Parallelism: MegaScale-Infer introduces a novel disaggregation of attention and FFN modules within each model layer. This separation allows for independent tuning and scaling strategies for each module, optimizing for the specific operational characteristics of memory-intensive attention and compute-intensive FFNs.
  2. Ping-Pong Pipeline Parallelism: The system implements a ping-pong pipeline strategy that splits each request batch into micro-batches which alternate between attention and FFN computation. While one micro-batch occupies the FFN nodes, another runs on the attention nodes, minimizing idle time, hiding communication overhead, and maximizing GPU throughput.
  3. Custom M2N Communication Library: A high-performance communication library is developed to facilitate efficient data flow between the disaggregated modules by eliminating unnecessary GPU-to-CPU data transfers and synchronization delays. The library significantly reduces communication latency and operational overhead.

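The alternation behind ping-pong pipeline parallelism can be illustrated with a minimal timeline simulation. This is an idealized sketch under simplifying assumptions (two micro-batches, one attention stage and one FFN stage per step, uniform step times), not the paper's implementation; the function name and schedule layout are hypothetical.

```python
def ping_pong_timeline(num_steps: int):
    """Idealized ping-pong schedule with two micro-batches (0 and 1).

    At each step, the attention nodes work on one micro-batch while the
    FFN nodes work on the other; the two micro-batches swap roles
    ("ping-pong") every step. Only the very first step leaves the FFN
    nodes idle, while they wait for the first attention output.
    """
    timeline = []
    for t in range(num_steps):
        attn_mb = t % 2  # attention alternates between micro-batches
        # FFN processes whatever attention finished at step t-1.
        ffn_mb = None if t == 0 else (t - 1) % 2
        timeline.append({"step": t, "attention": attn_mb, "ffn": ffn_mb})
    return timeline
```

After the one-step warmup, both module groups are busy at every step, which is how the design hides the token-dispatch communication behind computation.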
Experimental Results

MegaScale-Infer demonstrates substantial performance improvements over state-of-the-art LLM serving systems: up to 1.90× higher per-GPU throughput, and 1.7× better throughput per dollar in a heterogeneous deployment setup. The custom communication library itself delivers 4.2× higher throughput and a 68.2% reduction in latency compared to existing communication libraries such as NCCL.

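The M2N library's logical job is to move routed tokens from M attention nodes to N expert nodes. The grouping step before transmission can be sketched as follows; this is a simplified illustration assuming a contiguous expert-to-node layout and top-1 routing, and the function and parameter names are hypothetical rather than the library's API.

```python
def dispatch_tokens(token_expert_ids, num_expert_nodes, experts_per_node):
    """Group token indices by the expert node hosting their routed expert.

    token_expert_ids[i] is the expert chosen by the gate for token i.
    Assumes experts are placed on nodes contiguously, e.g. with
    experts_per_node=2: experts 0-1 on node 0, experts 2-3 on node 1.
    The returned buckets are what an M2N dispatch would send per node.
    """
    buckets = {n: [] for n in range(num_expert_nodes)}
    for token_id, expert_id in enumerate(token_expert_ids):
        node = expert_id // experts_per_node
        buckets[node].append(token_id)
    return buckets
```

In the real system this per-node grouping is followed by direct GPU-to-GPU transfers, which is where eliminating CPU copies and synchronization pays off.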
Discussion and Implications

The deployment of MegaScale-Infer highlights distinct advantages in serving large-scale MoE models more efficiently than existing methods. By separately optimizing for attention and FFN modules and utilizing heterogeneous deployment strategies, MegaScale-Infer demonstrates how architecture-specific optimizations can lead to significant cost-performance benefits. This contributes to the theoretical discourse on efficient deep learning model serving and presents practical improvements for AI applications reliant on MoE architectures.

Future developments in this area may leverage MegaScale-Infer's strategies to further refine resource allocation and model serving efficiency. Continuous advances in model parallelism and communication strategies promise even greater optimization potential, lending valuable insights into how AI infrastructure can scale effectively while managing cost constraints.
