ServerlessLLM: Low-Latency Serverless Inference for Large Language Models

Published 25 Jan 2024 in cs.LG and cs.DC (arXiv:2401.14351v2)

Abstract: This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for LLMs. By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) \emph{fast multi-tier checkpoint loading}, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) \emph{efficient live migration of LLM inference}, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) \emph{startup-time-optimized model scheduling}, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10 - 200X across various LLM inference workloads.

Summary

  • The paper introduces a locality-enhanced serverless inference system that cuts checkpoint download times and improves LLM inference latency.
  • The paper employs a novel multi-tier checkpoint format and token-based live migration process to enable dynamic, low-latency server allocation.
  • Experiments show a 10–200x latency improvement over conventional systems, demonstrating significant gains in efficiency and scalability.

Overview of ServerlessLLM

ServerlessLLM introduces a locality-enhanced serverless inference system designed specifically for LLMs. It leverages the underutilized storage bandwidth and capacity available on GPU servers, reducing remote checkpoint downloads and expediting checkpoint loading.

Checkpoint Loading Optimization

The core of ServerlessLLM's design is a new loading-optimized checkpoint format combined with a multi-tier checkpoint loading system, which together make far better use of the storage bandwidth available on GPU servers. A key component is a loading interface that bridges LLM libraries and ServerlessLLM's model manager, enabling rapid, direct data transfer from storage to GPUs. As a result, ServerlessLLM outperforms loaders such as PyTorch's and Safetensors by a substantial margin across various LLM workloads.
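
The chunk-based, sequential-read idea behind a loading-optimized format can be illustrated with a small sketch. All names here, including `load_checkpoint_chunked` and the 4 MiB chunk size, are illustrative stand-ins, not the paper's actual implementation:

```python
CHUNK_SIZE = 4 << 20  # hypothetical 4 MiB chunks, sized for large sequential reads

def load_checkpoint_chunked(path, copy_to_device):
    """Read a checkpoint file in large sequential chunks, handing each chunk
    to copy_to_device as soon as it arrives (a stand-in for pipelining reads
    from storage with copies to the GPU)."""
    loaded = 0
    with open(path, "rb", buffering=0) as f:  # unbuffered: large reads go straight through
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            copy_to_device(chunk)  # in a real system: pinned host memory -> GPU copy
            loaded += len(chunk)
    return loaded
```

In the real system, each chunk would land in pinned host memory and be copied asynchronously to the GPU, so reads from storage and transfers to the device overlap rather than running back-to-back.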

Locality-Driven Inference and Live Migration

ServerlessLLM innovates live migration for LLM inference within serverless systems to effectively ensure locality-driven server allocation while preserving low latency. Two primary mechanisms power this live migration: an efficient token-based migration that identifies the minimal set of tokens needed for precise inference transfer and a two-stage live migration process that enables ongoing LLM inference transfer without affecting user experience. This novel approach allows ServerlessLLM to dynamically allocate servers based on locality, offering lower latency than methods reliant on model inference time prediction or those that preempt ongoing inferences.
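
A minimal sketch of the token-based, two-stage idea, under the assumption that inference state (the KV cache) can be recomputed from the token sequence alone; the class and function names below are hypothetical:

```python
class InferenceSession:
    """Toy stand-in for an LLM inference session: its state is just the token
    sequence, from which a real engine would recompute the KV cache."""
    def __init__(self, prompt_tokens):
        self.tokens = list(prompt_tokens)

    def step(self, next_token):
        self.tokens.append(next_token)


def live_migrate(src, make_dst_session):
    """Two-stage migration of an ongoing inference from src to a new server."""
    # Stage 1: snapshot the tokens seen so far; the destination recomputes its
    # state from them while the source keeps generating.
    snapshot = list(src.tokens)
    dst = make_dst_session(snapshot)
    # Stage 2: the source pauses; only the small delta of tokens produced
    # during stage 1 is sent, so the visible pause stays short.
    delta = src.tokens[len(snapshot):]
    for t in delta:
        dst.step(t)
    return dst
```

Because stage 2 transfers only the handful of tokens generated during stage 1, the interruption seen by the user is brief regardless of how large the underlying KV cache has grown.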

Locality-Aware Server Allocation

ServerlessLLM integrates models to accurately estimate the loading time for checkpoints from different storage tiers and the time required for migrating an ongoing LLM inference to another server. Using these estimations, ServerlessLLM can intelligently schedule models, capitalizing on local checkpoint placement. This capability is crucial for enabling the system to evaluate each server's status in a cluster and to allocate resources for minimizing startup latency.
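
The scheduling decision described above can be sketched as picking the server with the smallest estimated startup time, where startup time combines checkpoint load time (from whichever storage tier holds the checkpoint) with any wait for an ongoing inference to migrate away. The tier bandwidths and server records below are illustrative placeholders, not measured values from the paper:

```python
# Hypothetical per-tier bandwidths in GB/s; a real system would measure these.
TIER_BW = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}

def est_load_time(ckpt_gb, tier):
    """Estimated seconds to load a checkpoint of ckpt_gb GB from a tier."""
    return ckpt_gb / TIER_BW[tier]

def pick_server(servers, ckpt_gb):
    """servers: {name: {"tier": <fastest tier holding the checkpoint>,
                        "migration_s": <seconds to migrate away any ongoing job>}}.
    Returns the server minimizing estimated startup time."""
    def startup(name):
        info = servers[name]
        return est_load_time(ckpt_gb, info["tier"]) + info.get("migration_s", 0.0)
    return min(servers, key=startup)
```

The point of the estimate is that a busy server holding a local copy on SSD can still beat an idle server that must download the checkpoint remotely, which is why the scheduler weighs migration cost against locality rather than simply choosing free capacity.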

Comprehensive Experiments and Results

ServerlessLLM is rigorously tested through microbenchmarks and real-world traces. These experiments include comparing ServerlessLLM against baseline loaders such as Safetensors and PyTorch, as well as running diverse LLM inference workloads in a GPU cluster. ServerlessLLM achieves a 10-200x latency improvement over state-of-the-art systems, validating its model loading efficiency, the efficacy of its inference migration, and its optimized server allocation strategy.

With its innovative design and experimentally demonstrated performance advantages, ServerlessLLM positions itself as an efficient and cost-effective foundation for LLM inference services, paving the way for more scalable and responsive AI-powered applications.
