
SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Published 29 Oct 2023 in cs.LG and cs.DC | (2310.18859v2)

Abstract: Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, realizing this benefit often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA-MoE ($\textbf{S}$parsity-$\textbf{i}$nspired $\textbf{D}$ata-$\textbf{A}$ware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference, with up to a $3.93\times$ throughput increase, up to $72\%$ latency reduction, and up to $80\%$ GPU memory saving, at as little as a $1\%$ performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at: https://github.com/timlee0212/SiDA-MoE.


Summary

  • The paper introduces a novel SiDA-MoE approach that dynamically offloads inactive experts to system RAM, reducing GPU memory usage by up to 80%.
  • It employs an offline-trained, data-aware hash function to pre-load active experts, significantly cutting inference latency by up to 72%.
  • The method integrates concurrent hash-building and inference threads, achieving a 3.93x increase in throughput compared to baseline methods.


Introduction

The paper "SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models" (2310.18859) addresses the challenge of efficiently serving large Mixture-of-Experts (MoE) models under constrained GPU memory conditions. MoE architectures have emerged as a compelling solution for enhancing model capacity without significantly increasing computational overhead, making them suitable for modern large-scale AI tasks. However, these architectures often suffer from inefficient GPU memory utilization due to the inactive status of many model parameters during inference. The paper introduces SiDA-MoE—a novel approach that leverages sparsity and data-awareness to optimize memory usage and improve inference efficiency.

Figure 1: Diagram Showcasing the Architecture of MoE-based Transformers. Within each MoE layer only a limited number of experts are activated for inference.
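
The routing behavior in Figure 1 can be made concrete with a minimal top-1 MoE layer in plain Python. The gate, experts, and centroid-based scoring below are toy stand-ins for illustration, not the paper's implementation:

```python
def top1_gate(logits):
    """Pick the single highest-scoring expert for a token (top-1 routing,
    as used in Switch Transformers)."""
    return max(range(len(logits)), key=logits.__getitem__)

def moe_layer(tokens, experts, gate):
    """Route each token to one expert; the remaining experts stay dormant,
    which is the sparsity SiDA-MoE exploits."""
    outputs, active = [], set()
    for x in tokens:
        e = top1_gate(gate(x))
        active.add(e)
        outputs.append(experts[e](x))
    return outputs, active

# Toy setup: four experts over scalar "tokens"; the gate scores experts by
# closeness to a fixed centroid. All names and numbers are illustrative.
centroids = [0.0, 1.0, 2.0, 3.0]
experts = [lambda x, e=e: x + e for e in range(4)]
gate = lambda x: [-abs(x - c) for c in centroids]

outputs, active = moe_layer([0.1, 0.9, 2.8], experts, gate)
print(sorted(active))  # [0, 1, 3] -- only 3 of the 4 experts ever fired
```

Even in this toy, each forward pass touches only one expert per token, so most expert parameters sit idle, which is the underutilization the paper targets.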

Architecture and Key Contributions

Sparse Expert Activation

A defining feature of MoE architectures is their sparse expert activation, where only a subset of experts is engaged during model inference. This characteristic inherently results in underutilized GPU memory, with dormant parameters occupying substantial space. SiDA-MoE mitigates this inefficiency by exploiting expert activation sparsity, dynamically offloading inactive experts to system RAM, thereby optimizing GPU memory usage.

Figure 2: GPU Memory Reduction Rate by SiDA-MoE for Switch Transformers Across Datasets. SiDA-MoE achieves over 60% and 80% reduction on SST2 and MRPC for Switch-base-128 and Switch-base-256, respectively.
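
The offloading scheme above can be sketched as a small cache that keeps every expert in host RAM and mirrors only a bounded, recently used subset in GPU memory. The names (`ExpertCache`, `fetch`, `gpu_budget`) are illustrative, and LRU eviction is a simplification of SiDA-MoE's prediction-driven policy:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy model of offloading: all experts live in system RAM, and a small
    LRU set of active experts is mirrored in GPU memory."""

    def __init__(self, experts, gpu_budget):
        self.host = dict(experts)            # every expert stays in host RAM
        self.gpu = OrderedDict()             # bounded subset resident on GPU
        self.gpu_budget = gpu_budget

    def fetch(self, expert_id):
        """Return an expert, loading it onto the 'GPU' and evicting the
        least-recently-used expert when the budget is exceeded."""
        if expert_id in self.gpu:
            self.gpu.move_to_end(expert_id)  # mark as recently used
        else:
            if len(self.gpu) >= self.gpu_budget:
                self.gpu.popitem(last=False) # evict LRU expert
            self.gpu[expert_id] = self.host[expert_id]
        return self.gpu[expert_id]

cache = ExpertCache({i: f"weights_{i}" for i in range(128)}, gpu_budget=4)
for eid in [3, 7, 3, 42, 99, 3, 7]:
    cache.fetch(eid)
print(sorted(cache.gpu))  # [3, 7, 42, 99] -- 4 of 128 experts on the GPU
```

The GPU-side structure holding only a handful of the 128 experts mirrors the memory-reduction behavior reported in Figure 2.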

Data-Aware Hash Function

To address expert activation patterns proactively, SiDA-MoE employs an offline-trained hash function that predicts active experts for incoming token batches before inference begins. This data-aware approach allows SiDA-MoE to preload necessary experts onto the GPU, facilitating efficient inference without interrupting the model's forward pass and significantly reducing inference latency.

Figure 3: Overview of SiDA-MoE. SiDA-MoE contains two threads, the inference and hash-building thread, that run concurrently.
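
SiDA-MoE trains its hash function offline; as a rough stand-in, the sketch below builds a frequency table from a recorded routing trace and uses it to predict which experts to preload. The table-based predictor and all names here are assumptions for illustration, not the paper's learned hash:

```python
from collections import Counter, defaultdict

def build_expert_hash(routing_trace):
    """Offline step: from recorded (token_id, expert_id) routings, keep the
    most frequent expert per token as its predicted-active expert."""
    counts = defaultdict(Counter)
    for token_id, expert_id in routing_trace:
        counts[token_id][expert_id] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}

def predict_active_experts(token_batch, expert_hash, fallback=0):
    """Before the forward pass, predict which experts to preload on the GPU."""
    return {expert_hash.get(t, fallback) for t in token_batch}

# Hypothetical routing trace gathered during an offline profiling run.
trace = [(101, 3), (101, 3), (101, 5), (205, 7), (205, 7), (999, 1)]
expert_hash = build_expert_hash(trace)
print(predict_active_experts([101, 205, 205], expert_hash))  # {3, 7}
```

Because the prediction happens before the batch enters the model, the expert transfers can overlap with earlier computation instead of stalling the forward pass.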

Concurrent Hash-Building and Inference Threads

SiDA-MoE harnesses parallel processing through two concurrent threads: the hash-building thread and the inference thread. The hash-building thread constructs expert hash tables and stores activation patterns, while the inference thread processes batches using the current hash table's configuration. This parallelism ensures continuous operation and maximizes throughput.

Figure 4: Throughput of Different Methods for Switch Transformers Across Datasets. SiDA-MoE achieves outstanding throughput for large MoE models on all three datasets with various sentence lengths.
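
The two-thread design can be mimicked with Python's `threading` and a bounded queue: one thread produces per-batch expert predictions while the other consumes them. The queue, the modulo "hash", and the worker names are illustrative only, not the paper's API:

```python
import queue
import threading

predictions = queue.Queue(maxsize=4)   # hash tables ready for inference

def hash_builder(batches):
    """Producer: build a (toy) expert-prediction table for each batch."""
    for batch in batches:
        table = {tok: tok % 8 for tok in batch}   # stand-in for the learned hash
        predictions.put((batch, table))
    predictions.put(None)                          # signal completion

def inference_worker(results):
    """Consumer: 'preload' the predicted experts, then run the batch."""
    while (item := predictions.get()) is not None:
        batch, table = item
        active = set(table.values())
        results.append((len(batch), sorted(active)))

batches = [[1, 2, 9], [4, 12], [7, 15, 23]]
results = []
t1 = threading.Thread(target=hash_builder, args=(batches,))
t2 = threading.Thread(target=inference_worker, args=(results,))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # [(3, [1, 2]), (2, [4]), (3, [7])]
```

The bounded queue keeps the hash-building thread a few batches ahead of inference without unbounded memory growth, which is the overlap that lets prediction cost hide behind the forward pass.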

Experimental Results

The experimental evaluation demonstrates the superior efficiency of SiDA-MoE in terms of both GPU memory usage and model inference speed. SiDA-MoE reduces GPU memory usage by up to 80% on various datasets, including SST2, MRPC, and MultiRC, providing scalable improvements even for models with hundreds of billions of parameters. The approach achieves up to a 3.93x increase in throughput compared to baseline methods, while the latency reduction is as high as 72%, establishing SiDA-MoE as a robust solution for real-time applications with limited resources.

Figure 5: Throughput Efficiency Relative to GPU Memory Budget. SiDA-MoE's advantage is particularly pronounced in constrained GPU memory scenarios.

Conclusion

SiDA-MoE introduces a transformative method for deploying large MoE models efficiently under constrained memory conditions. By leveraging sparsity and data-awareness, SiDA-MoE optimizes both memory usage and inference performance, demonstrating significant reductions in latency and improvements in throughput. This research sets a precedent for future exploration in scalable AI model deployment and offers practical guidance for real-world applications requiring large-scale model inference. The implications of SiDA-MoE extend beyond theoretical contributions, suggesting avenues for enhanced hierarchical offloading strategies and improved hash-based expert activation techniques.
