ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Published 20 May 2025 in cs.LG and cs.DC | (2505.14468v1)

Abstract: Serverless computing has grown rapidly for serving LLM inference due to its pay-as-you-go pricing, fine-grained GPU usage, and rapid scaling. However, our analysis reveals that current serverless systems can effectively serve general LLMs but fail with Low-Rank Adaptation (LoRA) inference due to three key limitations: 1) massive parameter redundancy among functions, where 99% of weights are unnecessarily duplicated, 2) costly artifact loading latency beyond LLM loading, and 3) magnified resource contention when serving multiple LoRA LLMs. These inefficiencies lead to massive GPU wastage, increased Time-To-First-Token (TTFT), and high monetary costs. We propose ServerlessLoRA, a novel serverless inference system designed for faster and cheaper LoRA LLM serving. ServerlessLoRA enables secure backbone LLM sharing across isolated LoRA functions to reduce redundancy. We design a pre-loading method that pre-loads comprehensive LoRA artifacts to minimize cold-start latency. Furthermore, ServerlessLoRA employs contention-aware batching and offloading to mitigate GPU resource conflicts during bursty workloads. Experiments on industrial workloads demonstrate that ServerlessLoRA reduces TTFT by up to 86% and cuts monetary costs by up to 89% compared to state-of-the-art LLM inference solutions.


Summary

The paper introduces ServerlessLoRA, a novel approach to minimize latency and cost in LoRA-based LLM inference.
It employs pre-loading schedulers, dynamic batching, and GPU memory offloading to optimize resource management in serverless deployments.
Evaluation on industrial workloads shows TTFT reductions of up to 86% and monetary cost reductions of up to 89% compared to state-of-the-art LLM inference solutions, along with less GPU resource contention under bursty load.

Introduction

The paper "ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs" introduces a novel approach to address inefficiencies and scalability issues in serverless inference systems for Low-Rank Adaptation (LoRA)-based LLMs. The authors identify three key problems: massive parameter redundancy, costly artifact loading latency, and resource contention during inference. ServerlessLoRA is proposed as a solution to overcome these challenges by sharing backbone models, pre-loading artifacts, and optimizing resource management in serverless deployments.

Figure 1: System overview.

Serverless LLM Inference

The paper highlights serverless computing for LLM inference because it offers pay-as-you-go pricing, fine-grained resource usage, and rapid scaling. Serverful deployments keep resources provisioned regardless of load, whereas serverless platforms allocate resources dynamically based on demand, which saves cost and improves response times.

However, serving LoRA-based LLMs with serverless architectures poses a specific problem: each LoRA function carries its own copy of the backbone model, so about 99% of the weights are duplicated across functions. This redundancy wastes GPU memory and increases both TTFT and cost. Serving LoRA models efficiently therefore depends on how the platform shares the backbone across functions and how it loads the remaining LoRA artifacts.
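The effect of backbone duplication can be illustrated with simple arithmetic. The following is a minimal sketch, not taken from the paper: it assumes a nominal 7-billion-parameter backbone stored in 16-bit precision (about 14 GB) and a hypothetical adapter that is 1% of the backbone size, chosen only to mirror the 99% redundancy figure. The function name `gpu_memory_gb` and all sizes are illustrative assumptions.

```python
def gpu_memory_gb(num_functions, backbone_gb, adapter_gb, shared):
    """Total GPU memory needed by num_functions LoRA functions.

    shared=False: every function holds its own backbone copy.
    shared=True:  one backbone copy serves all functions.
    """
    if shared:
        return backbone_gb + num_functions * adapter_gb
    return num_functions * (backbone_gb + adapter_gb)


# Assumed sizes (illustrative, not measured values from the paper).
BACKBONE_GB = 7e9 * 2 / 1e9       # 7B parameters * 2 bytes (fp16)
ADAPTER_GB = 0.01 * BACKBONE_GB   # hypothetical adapter: 1% of the backbone

duplicated = gpu_memory_gb(4, BACKBONE_GB, ADAPTER_GB, shared=False)
shared = gpu_memory_gb(4, BACKBONE_GB, ADAPTER_GB, shared=True)

# Share of each function's weights that is just a backbone copy.
redundant_fraction = BACKBONE_GB / (BACKBONE_GB + ADAPTER_GB)

print(f"duplicated: {duplicated:.2f} GB, shared: {shared:.2f} GB")
print(f"backbone share per function: {redundant_fraction:.1%}")
```

Under these assumptions, four functions need roughly 56.6 GB with duplicated backbones and roughly 14.6 GB with one shared backbone, and the gap grows linearly with the number of functions.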

Figure 2: Cost-effectiveness of serverless and serverful solutions for one Llama2-7B base LLM.

System Architecture

ServerlessLoRA introduces several components to facilitate efficient LoRA inference. Key components include:

Pre-Loading Scheduler: Decides which artifacts to pre-load into GPU memory and container memory, using idle resources to improve performance while keeping overhead low.
Batching Scheduler: Aggregates requests dynamically to optimize batch sizes, balancing between latency and throughput while managing resource contention effectively.
Pre-Loading Agent: Implements pre-loading decisions and manages instances on worker nodes, ensuring efficient artifact handling and reducing cold-start latency.
Dynamic Offloader: Manages GPU memory intelligently during bursts, offloading non-essential artifacts to maximize available resources for concurrent requests.

Figure 3: Backbone LLM sharing among function instances.
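To make the role of the Batching Scheduler more concrete, here is a minimal sketch of a generic size-or-timeout batching policy. It is not the paper's algorithm and does not model contention awareness. The class names, the `max_batch_size` and `max_wait_s` parameters, and their default values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    adapter_id: str    # which LoRA function the request targets
    arrival_s: float   # arrival time in seconds


@dataclass
class BatchingScheduler:
    """Hypothetical policy: flush when the batch is full or the oldest
    request has waited long enough."""
    max_batch_size: int = 4
    max_wait_s: float = 0.05
    pending: list = field(default_factory=list)

    def add(self, request):
        self.pending.append(request)

    def poll(self, now_s):
        """Return a batch if a flush condition holds, otherwise None."""
        if not self.pending:
            return None
        oldest_wait = now_s - self.pending[0].arrival_s
        full = len(self.pending) >= self.max_batch_size
        if full or oldest_wait >= self.max_wait_s:
            batch = self.pending[:self.max_batch_size]
            self.pending = self.pending[self.max_batch_size:]
            return batch
        return None


scheduler = BatchingScheduler(max_batch_size=2, max_wait_s=0.05)
scheduler.add(Request("adapter-a", 0.00))
scheduler.add(Request("adapter-b", 0.02))
first_batch = scheduler.poll(now_s=0.02)  # flushed because the batch is full
```

A larger `max_batch_size` favors throughput, while a smaller `max_wait_s` bounds the queuing delay added to TTFT. This is the latency and throughput balance the component description refers to.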

Performance Evaluation

The evaluation reports Time-To-First-Token (TTFT), Time-Per-Output-Token (TPOT), monetary cost, throughput, and service-level objective (SLO) violation rates. ServerlessLoRA significantly reduces TTFT and monetary cost compared to both serverless and serverful baselines. Backbone sharing removes duplicated weights and pre-loading shortens cold starts, which together account for the cost and latency reductions (Figure 4).
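The latency metrics above can be computed from per-request timestamps. The sketch below uses common definitions, which may differ in detail from the paper's: TTFT is the delay from request arrival to the first generated token, TPOT is the average gap between consecutive output tokens, and an SLO violation is assumed here to mean TTFT exceeding a threshold. The function names and the threshold are illustrative.

```python
def ttft(arrival_s, token_times_s):
    """Time-To-First-Token: first token timestamp minus arrival time."""
    return token_times_s[0] - arrival_s


def tpot(token_times_s):
    """Time-Per-Output-Token: mean gap between consecutive tokens."""
    n = len(token_times_s)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (token_times_s[-1] - token_times_s[0]) / (n - 1)


def slo_violation_rate(ttfts_s, slo_s):
    """Fraction of requests whose TTFT exceeds the (assumed) SLO."""
    return sum(t > slo_s for t in ttfts_s) / len(ttfts_s)


# One request arriving at t=10.0 s that emits three tokens.
tokens = [10.5, 10.75, 11.0]
request_ttft = ttft(10.0, tokens)
request_tpot = tpot(tokens)

# Four requests measured against a hypothetical 1-second TTFT SLO.
violations = slo_violation_rate([0.5, 2.0, 3.0, 0.25], slo_s=1.0)
```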

Moreover, ServerlessLoRA retains scalability by dynamically adjusting to workload demands without compromising responsiveness. It maximizes throughput without incurring high SLO violation rates, demonstrating robustness under varied computational loads.

Figure 5: Average TTFT of the workloads under "Predictable", "Normal", and "Bursty" arrival patterns.

Implementation Details

ServerlessLoRA is implemented using Python and CUDA to facilitate backbone sharing via CUDA Inter-Process Communication (IPC). The serverless architecture adheres to isolation principles while allowing multiple function instances to utilize shared backbone memory efficiently. Pre-loading mechanisms and adaptive batching ensure that functions are ready to serve requests with minimal delay.
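Backbone sharing is possible because LoRA leaves the backbone weights unchanged and adds a low-rank update per adapter, so the output of a layer is W x + scale * B (A x). The sketch below illustrates this in plain Python with two functions that hold a reference to the same backbone matrix and their own adapters. It is only an in-process analogy: the paper shares GPU memory across isolated function instances via CUDA IPC, which this example does not attempt to reproduce. The class `LoRAFunction` and all values are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRAFunction:
    """One LoRA function: shared backbone plus private adapter."""

    def __init__(self, backbone_w, lora_a, lora_b, scale=1.0):
        self.backbone_w = backbone_w  # shared reference, never modified
        self.lora_a = lora_a          # shape r x d_in
        self.lora_b = lora_b          # shape d_out x r
        self.scale = scale

    def forward(self, x):
        base = matvec(self.backbone_w, x)
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


# One backbone matrix, two rank-1 adapters.
W = [[1.0, 0.0],
     [0.0, 1.0]]
fn_a = LoRAFunction(W, lora_a=[[1.0, 1.0]], lora_b=[[1.0], [0.0]])
fn_b = LoRAFunction(W, lora_a=[[1.0, -1.0]], lora_b=[[0.0], [2.0]])

x = [2.0, 3.0]
out_a = fn_a.forward(x)
out_b = fn_b.forward(x)
```

Both functions read the same `W` object yet produce different outputs, because only the small `lora_a` and `lora_b` matrices differ between them.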

Figure 6: Trace examples of "Predictable" (CoV ≤ 1), "Normal" (1 < CoV ≤ 4), and "Bursty" request arrival patterns.
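The arrival-pattern categories in the caption can be reproduced with a short classifier. This sketch rests on two assumptions not stated in the text above: that CoV (coefficient of variation) is computed over inter-arrival times, and that "Bursty" covers everything above the "Normal" range, that is, CoV greater than 4. The function name `classify_arrivals` is hypothetical.

```python
import statistics


def classify_arrivals(arrival_times_s):
    """Label a trace by the CoV of its inter-arrival times.

    Assumes sorted timestamps with at least two arrivals.
    """
    gaps = [b - a for a, b in zip(arrival_times_s, arrival_times_s[1:])]
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        raise ValueError("all requests arrived at the same instant")
    cov = statistics.pstdev(gaps) / mean_gap  # std / mean
    if cov <= 1:
        return "Predictable"
    if cov <= 4:
        return "Normal"
    return "Bursty"  # assumed: CoV > 4


# Evenly spaced requests: CoV is 0.
steady = [0.0, 1.0, 2.0, 3.0, 4.0]

# A short cluster followed by one long gap: CoV is close to 2.
mixed = [0.0, 0.001, 0.002, 0.003, 0.004, 10.0]

# A dense cluster of 30 requests followed by a long silence: CoV above 5.
burst = [i * 0.001 for i in range(30)] + [100.0]

labels = [classify_arrivals(t) for t in (steady, mixed, burst)]
```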

Conclusion

ServerlessLoRA adapts serverless inference to LoRA-based LLMs by addressing parameter redundancy, artifact loading latency, and resource contention. By sharing one backbone across isolated functions, pre-loading LoRA artifacts, and applying contention-aware batching and offloading, it reduces TTFT by up to 86% and monetary cost by up to 89% relative to state-of-the-art LLM inference solutions. These results make serving many specialized LoRA models on serverless platforms considerably more practical.
