
LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design

Published 28 May 2024 in cs.AI (arXiv:2405.17741v1)

Abstract: Recent literature has found that an effective method to customize or further improve LLMs is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this paper, we analyze the fine-grained costs of the dynamic adapters and find that the fragmented CUDA kernel calls are the root cause. Therefore, we propose LoRA-Switch, a system-algorithm co-designed architecture for efficient dynamic adapters. Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism. It switches the LoRA adapters and weights for each token and merges them into the backbone for inference. For efficiency, this switching is implemented with an optimized CUDA kernel, which fuses the merging operations for all LoRA adapters at once. Based on experiments with popular open-source LLMs on common benchmarks, our approach has demonstrated similar accuracy improvement as existing dynamic adapters, while reducing the decoding latency by more than 2.4 times.

References (42)
  1. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024).
  2. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC]
  3. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. ArXiv abs/1803.05457 (2018).
  4. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG]
  5. LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin. arXiv:2312.09979 [cs.CL]
  6. Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models. arXiv:2403.03432 [cs.CL]
  7. Higher Layers Need More LoRA Experts. arXiv:2402.08562 [cs.CL]
  8. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.10256836
  9. Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning. arXiv:2312.12379 [cs.CV]
  10. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  11. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751 [cs.LG]
  12. LoRA: Low-Rank Adaptation of Large Language Models. ArXiv abs/2106.09685 (2021). https://api.semanticscholar.org/CorpusID:235458009
  13. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
  14. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  15. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
  16. MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts. arXiv:2404.15159 [cs.CL]
  17. Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190 [cs.CL]
  18. SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification. https://huggingface.co/Open-Orca/SlimOrca
  19. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 3214–3252. https://doi.org/10.18653/v1/2022.acl-long.229
  20. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv:2310.18339 (2023).
  21. GPT Understands, Too. arXiv:2103.10385 [cs.CL]
  22. Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs.LG]
  23. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
  24. MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models. arXiv:2402.12851 [cs.CL]
  25. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS ’24). ACM. https://doi.org/10.1145/3620666.3651335
  26. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP.
  27. OpenChat. 2023. ShareGPT4 Dataset. Hugging Face Datasets. https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/blob/main/sharegpt_clean.json Accessed: 2024-05-11.
  28. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641 (2019).
  29. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations.
  30. Snowflake AI Research Team. 2024. Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open. https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/. Accessed on April 26, 2024.
  31. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4149–4158. https://doi.org/10.18653/v1/N19-1421
  32. The Mosaic Research Team. 2024. Introducing dbrx: A New State-of-the-Art Open LLM. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm. Accessed on April 26, 2024.
  33. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
  34. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv abs/2307.09288 (2023). https://api.semanticscholar.org/CorpusID:259950998
  35. Magicoder: Source Code Is All You Need. arXiv preprint arXiv:2312.02120 (2023).
  36. Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks. arXiv:2401.02731 [cs.AI]
  37. Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks. arXiv preprint arXiv:2401.02731 (2024).
  38. xAI. 2024. Open release of grok-1. https://x.ai/blog/grok-os
  39. MoRAL: MoE Augmented LoRA for LLMs’ Lifelong Learning. arXiv preprint arXiv:2402.11260 (2024).
  40. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309.12284 (2023).
  41. HellaSwag: Can a Machine Really Finish Your Sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  42. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023).

Summary

  • The paper presents LoRA-Switch, which introduces token-wise routing to reduce dynamic adapter overhead and improve inference efficiency.
  • The design fuses adapter switching into a single SGMM kernel call, consolidating fragmented CUDA operations and cutting decoding latency by more than 2.4x.
  • The implementation achieves comparable accuracy to standard methods while lowering peak memory usage and streamlining dynamic fine-tuning.

LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design

Introduction

This paper proposes LoRA-Switch, a system-algorithm co-designed architecture for improving the efficiency of dynamic adapters in LLMs. Dynamic adapters, such as Low-Rank Adapters (LoRA) combined with Mixture-of-Experts (MoE) structures, are an effective way to fine-tune LLMs, but they incur significant inference latency because of fragmented CUDA kernel calls. LoRA-Switch addresses this with a token-wise routing mechanism and an optimized CUDA kernel, significantly reducing decoding latency while maintaining accuracy.

Background and Motivation

Dynamic adapters enhance LLM capabilities by integrating lightweight, conditionally computed adapters into pretrained models. Despite adding few parameters, these adapters increase inference latency substantially, often by 250-950%, due to the overhead of fragmented CUDA kernel launches.

Challenge: The primary challenge is the latency overhead introduced by dynamic adapters (Figure 1). Fragmented CUDA kernel calls during the decoding phase dominate execution time and drive up latency.

Figure 1: Decoding-phase execution time profile of one dynamic adapter layer in MoRAL [39].

The Design of LoRA-Switch

LoRA-Switch, depicted in Figure 2, adopts a token-wise routing mechanism, diverging from traditional layer-wise and block-wise routing. This approach enables tighter integration with system-level optimizations.

Figure 2: Overview of LoRA-Switch.

Model Structure

LoRA-Switch attaches adapters only to the linear layers of the pretrained backbone and employs a token-wise routing strategy: for each token, a gating mechanism selects which LoRA adapters to activate. Because the selected adapters can be merged directly into the backbone weights, this architecture reduces overall latency.
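The paper does not publish reference code here, but the token-wise gating described above can be sketched as follows. All names and shapes (`token_wise_route`, `gate_weight`, the choice of top-k selection with a softmax over the selected scores) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_wise_route(hidden, gate_weight, top_k=2):
    """Select the top-k LoRA adapters for each token (hypothetical sketch).

    hidden:      (num_tokens, d_model) token hidden states
    gate_weight: (d_model, num_adapters) gating projection (assumed learnable)
    """
    logits = hidden @ gate_weight                        # (T, N) routing scores
    topk = np.argsort(-logits, axis=-1)[:, :top_k]       # best adapters per token
    sel = np.take_along_axis(logits, topk, axis=-1)
    sel = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over the k picks
    gates = sel / sel.sum(axis=-1, keepdims=True)
    return topk, gates
```

Each token thus carries its own small set of adapter indices and gate weights, which the next stage uses to merge those adapters into the backbone.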

Fused Adapter Switching

LoRA-Switch optimizes performance through pre-gated, fused adapter switching, avoiding the many small CUDA kernel calls required by traditional dynamic adapters. By merging the active adapters into the backbone weights before each token's decoding step, the per-token computation reduces to the same dense matrix multiplications as the unmodified backbone.
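A minimal NumPy sketch of the merge step, assuming the standard LoRA factorization (a rank-r update B·A added to a frozen weight W); the function name and shapes are illustrative, and the paper's actual implementation fuses this into a CUDA kernel rather than a Python loop:

```python
import numpy as np

def merge_adapters(W, A, B, idx, gates):
    """Fold the routed LoRA adapters into the backbone weight (sketch).

    W:     (d_out, d_in) frozen backbone weight
    A:     (N, r, d_in)  LoRA down-projections for N adapters
    B:     (N, d_out, r) LoRA up-projections
    idx:   adapter indices chosen by the router for this token
    gates: matching gate weights
    """
    W_merged = W.copy()
    for i, g in zip(idx, gates):
        W_merged += g * (B[i] @ A[i])  # rank-r update absorbed into W
    return W_merged                    # decoding then needs only one dense GEMM
```

Because `W_merged @ x` equals the backbone output plus the gated adapter outputs, decoding after the merge costs no more per layer than the plain backbone.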

SGMM Kernel Implementation

The SGMM kernel consolidates the GEMM operations of all adapters into a single kernel call, drastically reducing launch overhead. By replacing many small, layer-wise operations with one fused call, it improves execution efficiency and GPU throughput.
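The idea of consolidating many small adapter GEMMs into one operation can be illustrated with a batched einsum; this is only a CPU-side analogue of what a fused CUDA kernel would do on-GPU, and the function name `sgmm_delta` is an assumption, not the paper's API:

```python
import numpy as np

def sgmm_delta(x, A, B, idx, gates):
    """Apply several LoRA adapters with one fused contraction (sketch).

    Rather than launching one small GEMM per adapter, the selected adapter
    weights are gathered and contracted in a single batched einsum:
        delta = sum_i gates_i * B_i @ (A_i @ x)
    """
    A_sel, B_sel = A[idx], B[idx]  # gather: (k, r, d_in), (k, d_out, r)
    return np.einsum('k,kor,kri,i->o', gates, B_sel, A_sel, x)
```

The single contraction computes the same result as looping over the adapters, but exposes all the work to one call, which is the property the fused kernel exploits.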

Evaluation

Accuracy and Efficiency

Experiments demonstrate that LoRA-Switch achieves accuracy comparable to other dynamic adapters while significantly reducing latency: average decoding latency drops by more than 2.4 times compared to conventional methods (Table 1). This brings inference efficiency much closer to that of the pretrained backbone without dynamic adapters.

Runtime Performance

Experimental results show that LoRA-Switch also lowers peak memory usage while improving runtime efficiency. Its system-optimized design incurs less than a 30% latency increase over the original LLM, outperforming conventional dynamic adapter designs.

Conclusion

LoRA-Switch presents an optimized approach for dynamic adapter architectures in LLMs, substantially lowering inference latency without sacrificing accuracy. By integrating algorithmic innovations with system-level optimizations, this method sets a new standard for efficient LLM fine-tuning. This work provides substantial insight into dynamic adapter optimization, paving the way for future research and development in efficient model serving and deployment.
