MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache

Published 25 Jan 2024 in cs.LG and cs.PF (arXiv:2401.14361v3)

Abstract: This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea for MoE-Infinity is that on personal machines, which are often single-user environments, MoE-based LLMs typically operate with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity, meaning a small number of experts are frequently reused in generating tokens during the decode phase. Leveraging this idea, we design a sparsity-aware expert cache, which can trace the sparse activation of experts during inference and carefully select the trace that represents the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 3.1-16.7x per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed and BrainStorm across various MoE models (DeepSeek and Mixtral) when handling different LLM tasks. MoE-Infinity's source code is publicly available at https://github.com/EfficientMoE/MoE-Infinity

Citations (3)

Summary

  • The paper presents a novel offloading technique leveraging sequence-level expert activation tracing to efficiently manage memory constraints in MoE models.
  • It introduces activation-aware expert prefetching and caching that cut latency by 4-20X and lower deployment costs by over 8X.
  • Its design ensures robust performance under distribution shifts, offering a scalable solution for diverse and high-throughput MoE architectures.

Introduction

A recently developed system, MoE-Infinity, introduces a cost-efficient serving strategy for mixture-of-experts (MoE) models. It addresses a central challenge of MoE deployment: the models' large parameter counts incur substantial memory costs. MoE-Infinity navigates these memory constraints with an offloading technique distinct from those employed in existing systems.

Sequence-Level Expert Activation Tracing

Most current offloading systems target dense models and lack the granularity to meet MoEs' specific requirements. MoE-Infinity addresses this by introducing sequence-level expert activation tracing, which preserves the sparse activation and temporal locality characteristics inherent to MoEs. Through this granular tracing, coupled with an optimized construction algorithm for the Expert Activation Matrix Collection (EAMC), the system captures a range of activation patterns across sequences and uses them to predict and prefetch the necessary experts.
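To make the tracing idea concrete, here is a minimal Python sketch of how per-sequence expert activation matrices might be recorded and kept in a bounded EAMC. The class name `ExpertTracer`, the capacity parameter, and the squared-distance replacement rule are illustrative assumptions, not the paper's actual implementation.

```python
class ExpertTracer:
    """Records one Expert Activation Matrix (EAM) per sequence and keeps a
    bounded collection of representative traces (the EAMC)."""

    def __init__(self, num_layers, num_experts, eamc_capacity=8):
        self.num_layers = num_layers
        self.num_experts = num_experts
        self.eamc_capacity = eamc_capacity
        self.eamc = []  # Expert Activation Matrix Collection
        self._reset_current()

    def _reset_current(self):
        # One activation-count matrix (layers x experts) per sequence.
        self.current = [[0] * self.num_experts for _ in range(self.num_layers)]

    def record(self, layer, expert):
        # Called each time the router activates an expert for this sequence.
        self.current[layer][expert] += 1

    def end_sequence(self):
        # Keep the finished trace; once the EAMC is full, replace the stored
        # trace nearest to it so the collection stays diverse and bounded.
        if len(self.eamc) < self.eamc_capacity:
            self.eamc.append(self.current)
        else:
            nearest = min(range(len(self.eamc)),
                          key=lambda i: self._distance(self.eamc[i], self.current))
            self.eamc[nearest] = self.current
        self._reset_current()

    def _distance(self, a, b):
        # Squared element-wise distance between two activation matrices.
        return sum((x - y) ** 2
                   for ra, rb in zip(a, b)
                   for x, y in zip(ra, rb))

    def match(self):
        # Return the stored trace most similar to the current partial trace,
        # to be used for predicting which experts will activate next.
        if not self.eamc:
            return None
        return min(self.eamc, key=lambda m: self._distance(m, self.current))
```

At inference time, the partial EAM of the in-flight sequence is matched against the EAMC, and the best-matching historical trace drives the prefetching and caching decisions described below.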

Activation-Aware Expert Prefetching and Caching

The system employs activation-aware methodologies in both prefetching and caching of offloaded parameters to reduce latency. For prefetching, MoE-Infinity weighs the likelihood of an expert's activation against the proximity of its layer to the currently executing layer, enabling efficient prefetching decisions. Caching follows the same principle: the system assigns priority to experts based on their past activation ratios and their position within the MoE layers, boosting the cache hit ratio and reducing execution latency.
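The scoring described above can be sketched as follows, assuming a matched trace from the EAMC is available. The exact priority formula (past activation ratio multiplied by an inverse layer distance) and the function names are hypothetical; the actual system may combine these signals differently.

```python
def prefetch_priority(matched_trace, current_layer, num_layers):
    """Rank (layer, expert) prefetch candidates: experts that the matched
    trace activates often, in layers just ahead of the current one, first."""
    candidates = []
    total = sum(sum(row) for row in matched_trace) or 1
    for layer in range(current_layer + 1, num_layers):
        for expert, count in enumerate(matched_trace[layer]):
            if count == 0:
                continue
            likelihood = count / total                 # past activation ratio
            proximity = 1.0 / (layer - current_layer)  # closer layers first
            candidates.append((likelihood * proximity, layer, expert))
    candidates.sort(reverse=True)
    return [(layer, expert) for _, layer, expert in candidates]


def evict_victim(cached, matched_trace, current_layer):
    """Pick the cached (layer, expert) entry with the lowest priority to
    evict, mirroring the prefetch scoring."""
    def score(entry):
        layer, expert = entry
        # Layers behind the current one are next needed on the following
        # token, so distance wraps around the layer stack.
        dist = (layer - current_layer) % len(matched_trace) or len(matched_trace)
        return matched_trace[layer][expert] / dist
    return min(cached, key=score)
```

Using one score for both decisions keeps the cache and the prefetcher consistent: an expert worth prefetching is, by the same measure, an expert worth keeping resident.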

Evaluation and Performance

MoE-Infinity outperforms numerous existing systems in both latency and cost. With extensive experimental validation across varied conditions, including different model sizes, batch sizes, and dataset characteristics, the system demonstrates substantial latency reduction (4-20X) and a significant decrease in deployment costs (over 8X). Furthermore, its design elements, notably the EAMC and the continuous refinement of prefetching priorities, sustain robust performance even under distribution shifts in serving scenarios. MoE-Infinity's efficiency is further enhanced on multi-GPU servers and is expected to improve with upcoming serving hardware featuring higher inter-memory bandwidth.

Conclusion

In summary, MoE-Infinity is a carefully designed response to the memory-intensive demands of MoE models. By tailoring offloading, prefetching, and caching strategies specifically to MoE structures, it paves the way for high-throughput, cost-effective AI services. Its strength lies in its ability to adapt to varying MoE architectures and serving workloads, making it a promising platform for future work on MoE-optimized deployment systems.
