
MegaScale-Infer: Scalable MoE Inference

Updated 22 January 2026
  • MegaScale-Infer is a high-efficiency inference system for large-scale MoE transformers, disaggregating attention and FFN modules to address memory- and communication-bound issues.
  • It employs a hybrid model-parallel and ping-pong micro-batch pipeline execution scheme to maximize throughput and hide communication latencies.
  • It integrates a high-performance M2N communication library that cuts dispatch latency; combined, these techniques increase per-GPU throughput by up to 1.90× over existing dense and sparse baselines.

MegaScale-Infer is a high-efficiency inference system targeting large-scale Mixture-of-Experts (MoE) transformer models. Its core innovation is the architectural disaggregation of computation and memory resources: attention and feed-forward expert (FFN) modules in each transformer layer are served by independently scalable, heterogeneous GPU pools. This explicit separation enables independent parallelism strategies and tailored hardware assignment to address the inference-time memory- and communication-bound bottlenecks induced by MoE sparsity. MegaScale-Infer introduces a hybrid model-parallel and pipeline execution scheme, augmented by a high-performance M2N (Many-to-N) communication layer optimized for low-latency, collective-free GPU interconnects. Empirical evaluations show that MegaScale-Infer achieves up to 1.90× higher per-GPU throughput relative to state-of-the-art dense and sparse model baselines for multi-billion parameter MoE transformers (Zhu et al., 3 Apr 2025).

1. Architectural Disaggregation and Module-Level Parallelism

MegaScale-Infer decomposes each transformer layer into two computationally distinct stages: the multi-head attention module (including query-key-value projections and KV cache updates) and the MoE FFN module (top-K gated expert selection and dense GEMMs). Each stage is mapped onto physically and logically distinct GPU node pools:

  • Attention nodes: Each attention module (and its KV cache) is replicated across $n_a$ attention GPUs, using a mixture of data-parallel and intra-node tensor-parallel strategies.
  • Expert nodes: Each MoE expert (FFN) instance is assigned to a dedicated expert GPU (expert-parallel + tensor-parallel).

By independently scaling $n_a$ (attention replicas) and $E$ (total experts), MegaScale-Infer can match the effective batch size per expert to the sustainable throughput of modern accelerator hardware, counterbalancing the low device utilization caused by MoE's sparse token-expert activation. The resulting flexibility fundamentally separates compute- and memory-scaling for attention and FFN modules, allowing heterogeneous deployment in resource-constrained and cost-optimized multi-GPU settings (Zhu et al., 3 Apr 2025).

Pseudocode for a single-layer inference:

for each micro-batch mb in 1...m:
    # on attention nodes: project Q/K/V, update KV cache, compute attention
    A_mb = Attention(X_mb; KV_cache)
    M2N_Send(A_mb)     # dispatch activations to expert nodes

    # on expert nodes: top-K gated expert FFNs
    E_mb = FFN_Experts(M2N_Recv(A_mb))
    M2N_Send(E_mb)     # gather expert outputs back

    # on attention nodes: residual combination
    X_mb = A_mb + M2N_Recv(E_mb)
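The loop above can be exercised as a runnable toy in plain Python: lists of floats stand in for tensors, stub functions stand in for the attention and expert modules, and the M2N transfers are elided since everything runs in one process. This is purely illustrative of the per-layer data flow, not an implementation of the system.

```python
# Toy single-process version of the disaggregated per-layer loop.
# attention_stub / ffn_experts_stub are hypothetical stand-ins.
def attention_stub(x):          # stand-in for Attention(X; KV_cache)
    return [v * 2.0 for v in x]

def ffn_experts_stub(a):        # stand-in for top-K gated expert FFNs
    return [v + 1.0 for v in a]

def layer(micro_batches):
    outputs = []
    for x_mb in micro_batches:
        a_mb = attention_stub(x_mb)      # on attention nodes
        e_mb = ffn_experts_stub(a_mb)    # "sent" to expert nodes and back
        # residual combination on attention nodes
        outputs.append([a + e for a, e in zip(a_mb, e_mb)])
    return outputs

out = layer([[1.0, 2.0], [3.0]])  # two micro-batches
```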

Let $tp_a$, $tp_e$ denote the tensor-parallel degrees for attention and expert nodes, respectively. Compute balancing is enforced by setting $n_a = \arg\min |T_a(b_a, tp_a) - T_e(b_e, tp_e)|$, with batch sizes $b_a$, $b_e$ and per-micro-batch compute times $T_a$, $T_e$ profiled as $T_a \approx k_1 b_a + k_2$, $T_e \approx k_3 b_e + k_4$. This permits explicit provisioning for asymmetries in hardware, communication topology, and activation sparsity.
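The balancing rule above can be sketched as a brute-force search over $n_a$. The cost constants $k_1$–$k_4$, the batch-splitting formulas, and the search bound below are hypothetical stand-ins for profiled values, not the paper's actual planner.

```python
# Illustrative sketch: choose the number of attention replicas n_a so
# that per-micro-batch attention and expert compute times are balanced,
# using the profiled linear cost models T_a ~ k1*b_a + k2 and
# T_e ~ k3*b_e + k4. All constants are made up for the example.
def balance_attention_replicas(B, m, E, top_k, k1, k2, k3, k4, max_na=64):
    best_na, best_gap = 1, float("inf")
    for cand in range(1, max_na + 1):
        b_a = B / (m * cand)          # tokens per attention replica per micro-batch
        b_e = top_k * B / (m * E)     # tokens routed to each expert per micro-batch
        T_a = k1 * b_a + k2           # profiled attention cost model
        T_e = k3 * b_e + k4           # profiled expert cost model
        gap = abs(T_a - T_e)
        if gap < best_gap:
            best_na, best_gap = cand, gap
    return best_na

# Example with made-up profiling constants:
n_a = balance_attention_replicas(B=4096, m=4, E=8, top_k=2,
                                 k1=0.02, k2=0.5, k3=0.01, k4=0.3)
```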

2. Ping–Pong Micro-Batch Pipeline Execution

MegaScale-Infer introduces "ping-pong" pipeline parallelism, which interleaves micro-batches between attention and expert nodes to maximize pipeline occupancy and hide communication latencies. The global batch of size $B$ is subdivided into $m$ micro-batches, each sequentially processed in an alternating pattern between the disaggregated compute pools.

Let:

  • $T_a$: attention module compute time per micro-batch
  • $T_e$: expert module compute time per micro-batch
  • $T_c$: one-way network transmission time per micro-batch
  • $T_f = \max(T_a, T_e)$: pipeline stage time

The core requirement for communication/compute overlap is:

  1. $T_a \approx T_e$
  2. $T_c < T_f$
  3. $m \geq 2(1 + T_c/T_f)$

Under this regime, the end-to-end time across $L$ layers is $T_{\text{total}} = T_a + T_e + 2T_c + (mL-1)T_f$. Empirical ablations demonstrate global throughput increases of $1.9\times$ moving from $m=1$ to $m=2$, and further gains of $1.10$–$1.38\times$ for higher micro-batch counts (Zhu et al., 3 Apr 2025).
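The timing model can be checked numerically. The per-micro-batch times below are hypothetical (milliseconds), chosen only to satisfy the overlap conditions; they are not measurements from the paper.

```python
import math

# Hypothetical per-micro-batch times (ms) for one layer.
T_a, T_e, T_c = 2.0, 2.2, 0.6
L = 32                       # number of transformer layers
T_f = max(T_a, T_e)          # pipeline stage time

# Condition 3: minimum micro-batch count for compute/communication overlap.
m_min = math.ceil(2 * (1 + T_c / T_f))

# End-to-end time under the overlap regime: T_a + T_e + 2*T_c + (m*L - 1)*T_f
T_total = T_a + T_e + 2 * T_c + (m_min * L - 1) * T_f
```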

3. High-Performance M2N Communication Layer

A key bottleneck in MoE inference is the dispatch and collection of sparse token activations between attention and expert compute pools. MegaScale-Infer employs the M2N GPU-to-GPU communication library, which eliminates several inefficiencies found in collective-oriented communication stacks (e.g., NCCL):

  • Direct GPU-to-GPU RDMA: Utilizes GPUDirect RDMA write operations with immediate verbs, avoiding intervening GPU-to-CPU copies.
  • Pre-registration of Communication Buffers: All necessary buffers are pinned and registered at startup, removing per-call group initialization overhead.
  • Kernel-free Send/Recv: No GPU kernels or synchronizations are required for data movement; senders post RDMA writes and receivers poll for completions asynchronously.
  • Congestion Control: The system assigns high-priority ACK queues and tunes congestion control parameters for non-uniform many-to-N traffic patterns.
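The kernel-free send/recv pattern can be modeled in ordinary Python, with threads and in-memory buffers standing in for RDMA verbs: buffers are allocated once up front, the sender writes directly into a pre-registered slot and raises a completion flag, and the receiver waits on flags instead of entering a collective call. Real M2N uses GPUDirect RDMA writes with immediate data; everything here is a simplified stand-in.

```python
import threading

# "Pre-registered" communication buffers, allocated once at startup.
N_SLOTS = 4
buffers = [bytearray(256 * 1024) for _ in range(N_SLOTS)]
ready = [threading.Event() for _ in range(N_SLOTS)]

def sender(payloads):
    for i, data in enumerate(payloads):
        buffers[i][: len(data)] = data   # RDMA-write analogue: direct memory write
        ready[i].set()                   # completion notification (immediate data)

def receiver(n):
    out = []
    for i in range(n):
        ready[i].wait()                  # poll-for-completion analogue
        out.append(bytes(buffers[i][:5]))
    return out

t = threading.Thread(target=sender, args=([b"hello"] * 2,))
t.start()
result = receiver(2)
t.join()
```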

In microbenchmarks (8×8 all-to-all, 256 KB messages), M2N reduces median latency by 68.2%, p99 latency by 92.9%, and increases throughput by 4.2× compared to NCCL (Zhu et al., 3 Apr 2025).

Table: Key Communication Efficiency Gains of M2N vs. NCCL

Metric           Improvement (M2N vs. NCCL)
Median latency   ↓ 68.2%
P99 latency      ↓ 92.9%
Throughput       ↑ 4.2×

Compared to NCCL's all-to-all (data volume $P \cdot S$ across $P$ GPUs), M2N restricts transmission to only the activated experts, reducing data movement by an approximate factor of $\mathrm{topK}/\#\mathrm{experts}$ per layer.
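A back-of-envelope estimate of this reduction factor, using illustrative shapes (a Mixtral-8×22B-style configuration with 8 experts, top-2 routing, fp16 activations); none of these constants come from the paper's measurements.

```python
# Per-GPU activation payload for one layer (illustrative shapes).
tokens, hidden, bytes_per_elem = 4096, 6144, 2   # fp16 activations
S = tokens * hidden * bytes_per_elem             # activation bytes per layer

n_experts, top_k = 8, 2
all_to_all_bytes = S                # collective moves every activation
reduction = top_k / n_experts       # topK / #experts approximation
m2n_bytes = S * reduction           # M2N sends only routed activations
```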

4. Empirical Performance Evaluation

MegaScale-Infer was evaluated on both homogeneous and heterogeneous GPU clusters with diverse model architectures, including Mixtral-8×22B and DBRX. Experiments utilized both A100-80GB NVLink clusters and mixed-resource clusters (H20 GPUs for attention, L40S for experts).

Key results:

  • Mixtral-8×22B, DBRX (single-node):
    • vLLM baseline: $U_0$
    • TensorRT-LLM: $1.28\times U_0$
    • MegaScale-Infer: $2.56\times U_0$
  • Scaled-MoE (multi-node):
    • vLLM: $0.26\times U_0$
    • TensorRT-LLM: $1.00\times U_0$
    • MegaScale-Infer: $1.90\times U_0$
  • Heterogeneous (cost-normalized throughput):
    • MegaScale-Infer: $3.24\times$ (attention on H20) and $1.86\times$ (experts on L40S) improvements relative to vLLM/TensorRT-LLM.

Ablation studies determined that optimal throughput is achieved when $T_a \approx T_e$ (balanced compute) and demonstrated that ping-pong micro-batching (with $m \geq 3$) is essential for hiding network and synchronization overhead.

5. Constraints and Future Directions

MegaScale-Infer’s effective deployment requires adequate inter-GPU network bandwidth such that $T_c < T_f$ holds; otherwise, the required number of micro-batches $m$ increases, potentially impacting GEMM efficiency in the expert pool due to reduced batch sizes. The current scheduling and resource allocation planner assumes static sequence lengths and workload stationarity; dynamic, highly variable request streams may necessitate an online, traffic-aware replanner. Expert redundancy for load balancing employs a greedy mechanism; finer-grained, model-aware predictors could improve balance and further increase throughput in practical workloads.

Potential future work includes extending attention-FFN disaggregation methods to the inference prefill phase, incorporating non-transformer modules (e.g., retrieval-augmented pipelines), and supporting more diverse hardware, such as FPGAs or domain-specific ASICs, in the disaggregated pools. This suggests that MegaScale-Infer’s architectural paradigm of module-level disaggregation and hybrid parallel/resource control may generalize to future heterogeneous inference platforms beyond current GPU-centric clusters (Zhu et al., 3 Apr 2025).

6. Context within the Landscape of Large-Scale Inference Systems

MegaScale-Infer is situated among a generation of inference systems addressing the shift in bottleneck profile introduced by MoE and sparse large-model serving. Prior works such as DeepSpeed Inference (Aminabadi et al., 2022) and EnergonAI (Du et al., 2022) developed tensor, pipeline, and expert-parallel strategies—often with all-to-all collectives and homogeneous model partitioning—but did not explicitly disaggregate attention and FFN at inference time. MegaScale-Infer’s two-stage separation and explicit communication/fusion design restore high GPU utilization specifically in the context of memory-intensive, sparse expert models, yielding up to $1.90\times$ speedup in per-GPU throughput.

In relation to software-defined memory hierarchies proposed for DLRM and recommendation models (Ardestani et al., 2021), MegaScale-Infer represents an orthogonal direction: instead of re-engineering host memory and persistency stacks, it restructures intra-model execution and cross-GPU data flow to fully exploit modern high-bandwidth, low-latency interconnects and meet the scaling needs of trillion-parameter MoE transformers.


MegaScale-Infer demonstrates that modular disaggregation, compute/communication overlap via micro-batch ping-pong pipelines, and optimized GPU-to-GPU communication libraries collectively address the GPU underutilization and cost challenges of sparse, mega-scale MoE inference. Its empirical and architectural contributions establish a new reference point for efficient and scalable serving of state-of-the-art sparse transformers at industrial scale (Zhu et al., 3 Apr 2025).
