
SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Published 29 Oct 2023 in cs.LG and cs.DC | (2310.18859v2)

Abstract: Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, realizing this benefit often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA-MoE ($\textbf{S}$parsity-$\textbf{i}$nspired $\textbf{D}$ata-$\textbf{A}$ware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference, with up to a $3.93\times$ throughput increase, up to $72\%$ latency reduction, and up to $80\%$ GPU memory saving, at as little as a $1\%$ performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at: https://github.com/timlee0212/SiDA-MoE.


Summary

  • The paper introduces a novel SiDA-MoE approach that dynamically offloads inactive experts to system RAM, reducing GPU memory usage by up to 80%.
  • It employs an offline-trained, data-aware hash function to pre-load active experts, significantly cutting inference latency by up to 72%.
  • The method integrates concurrent hash-building and inference threads, achieving a 3.93x increase in throughput compared to baseline methods.


Introduction

The paper "SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models" (2310.18859) addresses the challenge of efficiently serving large Mixture-of-Experts (MoE) models under constrained GPU memory conditions. MoE architectures have emerged as a compelling solution for enhancing model capacity without significantly increasing computational overhead, making them suitable for modern large-scale AI tasks. However, these architectures often suffer from inefficient GPU memory utilization due to the inactive status of many model parameters during inference. The paper introduces SiDA-MoE—a novel approach that leverages sparsity and data-awareness to optimize memory usage and improve inference efficiency.

Figure 1: Diagram Showcasing the Architecture of MoE-based Transformers. Within each MoE layer only a limited number of experts are activated for inference.
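
The routing behavior in Figure 1 can be made concrete with a minimal top-1 MoE layer in plain Python. The gate, experts, and centroid-based scoring below are toy stand-ins for illustration, not the paper's implementation:

```python
def top1_gate(logits):
    """Pick the single highest-scoring expert for a token (top-1 routing,
    as used in Switch Transformers)."""
    return max(range(len(logits)), key=logits.__getitem__)

def moe_layer(tokens, experts, gate):
    """Route each token to one expert; the remaining experts stay dormant,
    which is the sparsity SiDA-MoE exploits."""
    outputs, active = [], set()
    for x in tokens:
        e = top1_gate(gate(x))
        active.add(e)
        outputs.append(experts[e](x))
    return outputs, active

# Toy setup: four experts over scalar "tokens"; the gate scores experts by
# closeness to a fixed centroid. All names and numbers are illustrative.
centroids = [0.0, 1.0, 2.0, 3.0]
experts = [lambda x, e=e: x + e for e in range(4)]
gate = lambda x: [-abs(x - c) for c in centroids]

outputs, active = moe_layer([0.1, 0.9, 2.8], experts, gate)
print(sorted(active))  # [0, 1, 3] -- only 3 of the 4 experts ever fired
```

Even in this toy, each forward pass touches only one expert per token, so most expert parameters sit idle, which is the underutilization the paper targets.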

Architecture and Key Contributions

Sparse Expert Activation

A defining feature of MoE architectures is their sparse expert activation, where only a subset of experts is engaged during model inference. This characteristic inherently results in underutilized GPU memory, with dormant parameters occupying substantial space. SiDA-MoE mitigates this inefficiency by exploiting expert activation sparsity, dynamically offloading inactive experts to system RAM, thereby optimizing GPU memory usage.

Figure 2: GPU Memory Reduction Rate by SiDA-MoE for Switch Transformers Across Datasets. SiDA-MoE achieves over 60% and 80% reduction on SST2 and MRPC for Switch-base-128 and Switch-base-256, respectively.
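
The offloading scheme above can be sketched as a small cache that keeps every expert in host RAM and mirrors only a bounded, recently used subset in GPU memory. The names (`ExpertCache`, `fetch`, `gpu_budget`) are illustrative, and LRU eviction is a simplification of SiDA-MoE's prediction-driven policy:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of offloading: all experts live in system RAM, and a small
    LRU set of active experts is mirrored in GPU memory."""

    def __init__(self, experts, gpu_budget):
        self.host = dict(experts)            # every expert stays in host RAM
        self.gpu = OrderedDict()             # bounded subset resident on GPU
        self.gpu_budget = gpu_budget

    def fetch(self, expert_id):
        """Return an expert, loading it onto the 'GPU' and evicting the
        least-recently-used expert when the budget is exceeded."""
        if expert_id in self.gpu:
            self.gpu.move_to_end(expert_id)  # mark as recently used
        else:
            if len(self.gpu) >= self.gpu_budget:
                self.gpu.popitem(last=False) # evict LRU expert
            self.gpu[expert_id] = self.host[expert_id]
        return self.gpu[expert_id]

cache = ExpertCache({i: f"weights_{i}" for i in range(128)}, gpu_budget=4)
for eid in [3, 7, 3, 42, 99, 3, 7]:
    cache.fetch(eid)
print(sorted(cache.gpu))  # [3, 7, 42, 99] -- 4 of 128 experts on the GPU
```

The GPU-side structure holding only a handful of the 128 experts mirrors the memory-reduction behavior reported in Figure 2.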

Data-Aware Hash Function

To address expert activation patterns proactively, SiDA-MoE employs an offline-trained hash function that predicts active experts for incoming token batches before inference begins. This data-aware approach allows SiDA-MoE to preload necessary experts onto the GPU, facilitating efficient inference without interrupting the model's forward pass and significantly reducing inference latency.

Figure 3: Overview of SiDA-MoE. SiDA-MoE contains two threads, the inference and hash-building thread, that run concurrently.
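
SiDA-MoE trains its hash function offline; as a rough stand-in, the sketch below builds a frequency table from a recorded routing trace and uses it to predict which experts to preload. The table-based predictor and all names here are assumptions for illustration, not the paper's learned hash:

```python
from collections import Counter, defaultdict

def build_expert_hash(routing_trace):
    """Offline step: from recorded (token_id, expert_id) routings, keep the
    most frequent expert per token as its predicted-active expert."""
    counts = defaultdict(Counter)
    for token_id, expert_id in routing_trace:
        counts[token_id][expert_id] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

def predict_active_experts(token_batch, expert_hash, fallback=0):
    """Before the forward pass, predict which experts to preload on the GPU."""
    return {expert_hash.get(t, fallback) for t in token_batch}

# Hypothetical routing trace gathered during an offline profiling run.
trace = [(101, 3), (101, 3), (101, 5), (205, 7), (205, 7), (999, 1)]
expert_hash = build_expert_hash(trace)
print(predict_active_experts([101, 205, 205], expert_hash))  # {3, 7}
```

Because the prediction happens before the batch enters the model, the expert transfers can overlap with earlier computation instead of stalling the forward pass.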

Concurrent Hash-Building and Inference Threads

SiDA-MoE harnesses parallel processing through two concurrent threads: the hash-building thread and the inference thread. The hash-building thread constructs expert hash tables and stores activation patterns, while the inference thread processes batches using the current hash table's configuration. This parallelism ensures continuous operation and maximizes throughput.

Figure 4: Throughput of Different Methods for Switch Transformers Across Datasets. SiDA-MoE achieves outstanding throughput for large MoE models on all three datasets with various sentence lengths.
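
The two-thread design can be mimicked with Python's `threading` and a bounded queue: one thread produces per-batch expert predictions while the other consumes them. The queue, the modulo "hash", and the worker names are illustrative only, not the paper's API:

```python
import queue
import threading

predictions = queue.Queue(maxsize=4)   # hash tables ready for inference

def hash_builder(batches):
    """Producer: build a (toy) expert-prediction table for each batch."""
    for batch in batches:
        table = {tok: tok % 8 for tok in batch}   # stand-in for the learned hash
        predictions.put((batch, table))
    predictions.put(None)                          # signal completion

def inference_worker(results):
    """Consumer: 'preload' the predicted experts, then run the batch."""
    while (item := predictions.get()) is not None:
        batch, table = item
        active = set(table.values())
        results.append((len(batch), sorted(active)))

batches = [[1, 2, 9], [4, 12], [7, 15, 23]]
results = []
t1 = threading.Thread(target=hash_builder, args=(batches,))
t2 = threading.Thread(target=inference_worker, args=(results,))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # [(3, [1, 2]), (2, [4]), (3, [7])]
```

The bounded queue keeps the hash-building thread a few batches ahead of inference without unbounded memory growth, which is the overlap that lets prediction cost hide behind the forward pass.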

Experimental Results

The experimental evaluation demonstrates the superior efficiency of SiDA-MoE in terms of both GPU memory usage and model inference speed. SiDA-MoE reduces GPU memory usage by up to 80% on various datasets, including SST2, MRPC, and MultiRC, providing scalable improvements even for models with hundreds of billions of parameters. The approach achieves up to a 3.93x increase in throughput compared to baseline methods, while the latency reduction is as high as 72%, establishing SiDA-MoE as a robust solution for real-time applications with limited resources.

Figure 5: Throughput Efficiency Relative to GPU Memory Budget. SiDA-MoE's advantage is particularly pronounced in constrained GPU memory scenarios.

Conclusion

SiDA-MoE introduces a transformative method for deploying large MoE models efficiently under constrained memory conditions. By leveraging sparsity and data-awareness, SiDA-MoE optimizes both memory usage and inference performance, demonstrating significant reductions in latency and improvements in throughput. This research sets a precedent for future exploration in scalable AI model deployment and offers practical guidance for real-world applications requiring large-scale model inference. The implications of SiDA-MoE extend beyond theoretical contributions, suggesting avenues for enhanced hierarchical offloading strategies and improved hash-based expert activation techniques.
