
LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design

Published 28 May 2024 in cs.AI (arXiv:2405.17741v1)

Abstract: Recent literature has found that an effective method to customize or further improve LLMs is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this paper, we analyze the fine-grained costs of the dynamic adapters and find that the fragmented CUDA kernel calls are the root cause. Therefore, we propose LoRA-Switch, a system-algorithm co-designed architecture for efficient dynamic adapters. Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism. It switches the LoRA adapters and weights for each token and merges them into the backbone for inference. For efficiency, this switching is implemented with an optimized CUDA kernel, which fuses the merging operations for all LoRA adapters at once. Based on experiments with popular open-source LLMs on common benchmarks, our approach has demonstrated similar accuracy improvement as existing dynamic adapters, while reducing the decoding latency by more than 2.4 times.

References (42)
  1. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774 (2024).
  2. Punica: Multi-Tenant LoRA Serving. arXiv:2310.18547 [cs.DC]
  3. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. ArXiv abs/1803.05457 (2018).
  4. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG]
  5. LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin. arXiv:2312.09979 [cs.CL]
  6. Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models. arXiv:2403.03432 [cs.CL]
  7. Higher Layers Need More LoRA Experts. arXiv:2402.08562 [cs.CL]
  8. A framework for few-shot language model evaluation. https://doi.org/10.5281/zenodo.10256836
  9. Mixture of Cluster-conditional LoRA Experts for Vision-language Instruction Tuning. arXiv:2312.12379 [cs.CV]
  10. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  11. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751 [cs.LG]
  12. LoRA: Low-Rank Adaptation of Large Language Models. ArXiv abs/2106.09685 (2021). https://api.semanticscholar.org/CorpusID:235458009
  13. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
  14. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  15. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
  16. MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA based Mixture of Experts. arXiv:2404.15159 [cs.CL]
  17. Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190 [cs.CL]
  18. SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification. https://huggingface.co/Open-Orca/SlimOrca
  19. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 3214–3252. https://doi.org/10.18653/v1/2022.acl-long.229
  20. Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications. arXiv preprint arXiv:2310.18339 (2023).
  21. GPT Understands, Too. arXiv:2103.10385 [cs.CL]
  22. Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs.LG]
  23. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
  24. MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models. arXiv:2402.12851 [cs.CL]
  25. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS ’24). ACM. https://doi.org/10.1145/3620666.3651335
  26. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP.
  27. OpenChat. 2023. ShareGPT4 Dataset. Hugging Face Datasets. https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/blob/main/sharegpt_clean.json Accessed: 2024-05-11.
  28. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641 (2019).
  29. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In International Conference on Learning Representations.
  30. Snowflake AI Research Team. 2024. Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open. https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/. Accessed on April 26, 2024.
  31. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4149–4158. https://doi.org/10.18653/v1/N19-1421
  32. The Mosaic Research Team. 2024. Introducing dbrx: A New State-of-the-Art Open LLM. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm. Accessed on April 26, 2024.
  33. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
  34. Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv abs/2307.09288 (2023). https://api.semanticscholar.org/CorpusID:259950998
  35. Magicoder: Source Code Is All You Need. arXiv preprint arXiv:2312.02120 (2023).
  36. Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks. arXiv:2401.02731 [cs.AI]
  37. Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks. arXiv preprint arXiv:2401.02731 (2024).
  38. xAI. 2024. Open release of grok-1. https://x.ai/blog/grok-os
  39. MoRAL: MoE Augmented LoRA for LLMs’ Lifelong Learning. arXiv preprint arXiv:2402.11260 (2024).
  40. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. arXiv preprint arXiv:2309.12284 (2023).
  41. HellaSwag: Can a Machine Really Finish Your Sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  42. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023).

Summary

  • The paper presents LoRA-Switch, which introduces token-wise routing to reduce dynamic adapter overhead and improve inference efficiency.
  • The design fuses adapter switching into a single SGMM kernel call, consolidating fragmented CUDA operations and cutting decoding latency by more than 2.4x.
  • The implementation achieves comparable accuracy to standard methods while lowering peak memory usage and streamlining dynamic fine-tuning.

LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design

Introduction

This paper proposes LoRA-Switch, a system-algorithm co-designed architecture for improving the efficiency of dynamic adapters in LLMs. Dynamic adapters, such as Low-Rank Adapters (LoRA) combined with Mixture-of-Experts (MoE) structures, are an effective way to fine-tune LLMs, but they incur significant inference latency because of fragmented CUDA kernel calls. LoRA-Switch addresses this with a token-wise routing mechanism and an optimized CUDA kernel, significantly reducing decoding latency while maintaining accuracy.

Background and Motivation

Dynamic adapters enhance LLM capabilities by integrating lightweight, conditionally computed adapters into pretrained models. Despite adding few parameters, these adapters increase inference latency substantially, often by 250-950%, due to the overhead of fragmented CUDA kernel launches.

Challenge: The primary challenge is the latency overhead introduced by dynamic adapters (Figure 1). Fragmented CUDA kernel calls during the decoding phase dominate execution time and drive up latency.

Figure 1: Decoding-phase execution time profile of one dynamic adapter layer in MoRAL [39].

The Design of LoRA-Switch

LoRA-Switch, depicted in Figure 2, adopts a token-wise routing mechanism, diverging from traditional layer-wise and block-wise routing. This approach enables tighter integration with system-level optimizations.

Figure 2: Overview of LoRA-Switch.

Model Structure

LoRA-Switch attaches adapters only to the linear layers of the pretrained backbone and employs a token-wise routing strategy: for each token, a gating mechanism selects which LoRA adapters to activate. Because the selected adapters can be merged directly into the backbone weights, this architecture reduces overall latency.
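The paper does not publish reference code here, but the token-wise gating described above can be sketched as follows. All names and shapes (`token_wise_route`, `gate_weight`, the choice of top-k selection with a softmax over the selected scores) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_wise_route(hidden, gate_weight, top_k=2):
    """Select the top-k LoRA adapters for each token (hypothetical sketch).

    hidden:      (num_tokens, d_model) token hidden states
    gate_weight: (d_model, num_adapters) gating projection (assumed learnable)
    """
    logits = hidden @ gate_weight                        # (T, N) routing scores
    topk = np.argsort(-logits, axis=-1)[:, :top_k]       # best adapters per token
    sel = np.take_along_axis(logits, topk, axis=-1)
    sel = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over the k picks
    gates = sel / sel.sum(axis=-1, keepdims=True)
    return topk, gates
```

Each token thus carries its own small set of adapter indices and gate weights, which the next stage uses to merge those adapters into the backbone.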

Fused Adapter Switching

LoRA-Switch optimizes performance through pre-gated, fused adapter switching, avoiding the many small CUDA kernel calls required by traditional dynamic adapters. By merging the active adapters into the backbone weights before each token's decoding step, the per-token computation reduces to the same dense matrix multiplications as the unmodified backbone.
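A minimal NumPy sketch of the merge step, assuming the standard LoRA factorization (a rank-r update B·A added to a frozen weight W); the function name and shapes are illustrative, and the paper's actual implementation fuses this into a CUDA kernel rather than a Python loop:

```python
import numpy as np

def merge_adapters(W, A, B, idx, gates):
    """Fold the routed LoRA adapters into the backbone weight (sketch).

    W:     (d_out, d_in) frozen backbone weight
    A:     (N, r, d_in)  LoRA down-projections for N adapters
    B:     (N, d_out, r) LoRA up-projections
    idx:   adapter indices chosen by the router for this token
    gates: matching gate weights
    """
    W_merged = W.copy()
    for i, g in zip(idx, gates):
        W_merged += g * (B[i] @ A[i])  # rank-r update absorbed into W
    return W_merged                    # decoding then needs only one dense GEMM
```

Because `W_merged @ x` equals the backbone output plus the gated adapter outputs, decoding after the merge costs no more per layer than the plain backbone.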

SGMM Kernel Implementation

The SGMM kernel consolidates the GEMM operations of all adapters into a single kernel call, drastically reducing launch overhead. By replacing many small, layer-wise operations with one fused call, it improves execution efficiency and GPU throughput.
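The idea of consolidating many small adapter GEMMs into one operation can be illustrated with a batched einsum; this is only a CPU-side analogue of what a fused CUDA kernel would do on-GPU, and the function name `sgmm_delta` is an assumption, not the paper's API:

```python
import numpy as np

def sgmm_delta(x, A, B, idx, gates):
    """Apply several LoRA adapters with one fused contraction (sketch).

    Rather than launching one small GEMM per adapter, the selected adapter
    weights are gathered and contracted in a single batched einsum:
        delta = sum_i gates_i * B_i @ (A_i @ x)
    """
    A_sel, B_sel = A[idx], B[idx]  # gather: (k, r, d_in), (k, d_out, r)
    return np.einsum('k,kor,kri,i->o', gates, B_sel, A_sel, x)
```

The single contraction computes the same result as looping over the adapters, but exposes all the work to one call, which is the property the fused kernel exploits.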

Evaluation

Accuracy and Efficiency

Experiments demonstrate that LoRA-Switch achieves accuracy comparable to other dynamic adapters while significantly reducing latency: average decoding latency drops by more than 2.4 times compared to conventional methods (Table 1). This brings inference efficiency much closer to that of the pretrained backbone without dynamic adapters.

Runtime Performance

Experimental results show that LoRA-Switch also lowers peak memory usage while improving runtime efficiency. Its system-optimized design incurs less than a 30% latency increase over the original LLM, outperforming conventional dynamic adapter designs.

Conclusion

LoRA-Switch presents an optimized approach for dynamic adapter architectures in LLMs, substantially lowering inference latency without sacrificing accuracy. By integrating algorithmic innovations with system-level optimizations, this method sets a new standard for efficient LLM fine-tuning. This work provides substantial insight into dynamic adapter optimization, paving the way for future research and development in efficient model serving and deployment.
