
semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Published 28 Apr 2025 in cs.CL, cs.DC, and cs.LG | (2504.19867v1)

Abstract: Existing LLM serving systems fall into two categories: 1) a unified system where prefill phase and decode phase are co-located on the same GPU, sharing the unified computational resource and storage, and 2) a disaggregated system where the two phases are disaggregated to different GPUs. The design of the disaggregated system addresses the latency interference and sophisticated scheduling issues in the unified system but leads to storage challenges including 1) replicated weights for both phases that prevent flexible deployment, 2) KV cache transfer overhead between the two phases, 3) storage imbalance that causes substantial wasted space of the GPU capacity, and 4) suboptimal resource adjustment arising from the difficulties in migrating KV cache. Such storage inefficiency delivers poor serving performance under high request rates. In this paper, we identify that the advantage of the disaggregated system lies in the disaggregated computation, i.e., partitioning the computational resource to enable the asynchronous computation of two phases. Thus, we propose a novel LLM serving system, semi-PD, characterized by disaggregated computation and unified storage. In semi-PD, we introduce a computation resource controller to achieve disaggregated computation at the streaming multi-processor (SM) level, and a unified memory manager to manage the asynchronous memory access from both phases. semi-PD has a low-overhead resource adjustment mechanism between the two phases, and a service-level objective (SLO) aware dynamic partitioning algorithm to optimize the SLO attainment. Compared to state-of-the-art systems, semi-PD maintains lower latency at higher request rates, reducing the average end-to-end latency per request by 1.27-2.58x on DeepSeek series models, and serves 1.55-1.72x more requests adhering to latency constraints on Llama series models.

Summary

  • The paper introduces semi-PD, a system that integrates phase-wise disaggregated computation with unified storage to overcome latency interference and storage inefficiencies.
  • The system employs a dynamic resource controller and an SLO-aware algorithm to efficiently manage resource partitioning between prefill and decode phases.
  • Performance evaluation shows up to 2.58× lower latency and 1.72× increased throughput compared to state-of-the-art methods.

Efficient LLM Serving with Phase-Wise Disaggregated Computation and Unified Storage

This paper explores a novel system for serving LLMs called semi-PD, which strategically combines disaggregated computation with unified storage. The design addresses significant inefficiencies in existing LLM-serving architectures, particularly focusing on computational and storage paradigms. The study emphasizes enhancing service-level objective (SLO) compliance while reducing latency and increasing throughput.

Introduction and Motivation

The rapid proliferation of LLM deployments in applications such as chatbots and code assistants demands serving systems that sustain high throughput with minimal latency. Traditional serving systems fall into two categories: unified and disaggregated designs. Unified systems, where the prefill and decode phases share resources, suffer from latency interference between the two phases. Disaggregated systems alleviate this computational interference, but at the cost of storage inefficiency. The paper identifies four critical issues in disaggregated designs: replicated weights, KV cache transfer overhead, storage imbalance, and resource adjustment overhead (Figure 1).

Figure 1: Illustration of the pros and cons of the different computation and storage patterns. semi-PD can have the advantages of both disaggregated computation and unified storage.

System Design

Disaggregated Computation and Unified Storage

Semi-PD introduces a serving system that effectively combines the advantages of disaggregated computation and unified storage. This system deploys a computation resource controller to manage SM-level disaggregation, enabling efficient resource partitioning between prefill and decode phases. For storage, semi-PD uses a unified memory manager to coordinate asynchronous access, addressing issues of KV cache transfer and resource adjustment overhead.
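The paper does not reproduce pseudocode for the unified memory manager, but its core benefit can be illustrated with a toy paged-pool allocator. The sketch below is our own illustration (class and method names are assumptions, not from the paper): because prefill and decode draw blocks from one shared pool, a request's KV cache never has to be copied when it moves from the prefill phase to the decode phase.

```python
class UnifiedKVCacheManager:
    """Toy paged KV-cache pool shared by the prefill and decode phases.

    Both phases allocate from a single free list on the same GPU, so a
    prefill-to-decode handoff is just a pointer exchange, not a transfer.
    """

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.request_blocks: dict[str, list[int]] = {}

    def allocate(self, request_id: str, n: int) -> list[int]:
        # Grab n physical blocks for a request (prefill, or decode growth).
        if n > len(self.free_blocks):
            raise MemoryError("KV cache pool exhausted")
        blocks = [self.free_blocks.pop() for _ in range(n)]
        self.request_blocks.setdefault(request_id, []).extend(blocks)
        return blocks

    def handoff(self, request_id: str) -> list[int]:
        # Prefill -> decode: decode reuses the same block ids in place,
        # eliminating the KV cache transfer of a disaggregated system.
        return self.request_blocks[request_id]

    def release(self, request_id: str) -> None:
        # Request finished: return its blocks to the shared pool.
        self.free_blocks.extend(self.request_blocks.pop(request_id))
```

In a two-GPU disaggregated design, `handoff` would instead be a cross-device copy of every cached block; sharing one pool also removes the storage imbalance between per-phase memory partitions.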

Low-Overhead Resource Adjustment

Semi-PD’s standout feature is a low-overhead switching mechanism that dynamically adjusts computational resources between the two phases. A resident process keeps the weights and KV cache loaded, so repartitioning avoids the latency of reloading model state or restarting workers. The adjustment is guided by an SLO-aware algorithm that periodically retunes the resource division based on real-time demand and latency requirements (Figure 2).
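The summary does not reproduce the paper's exact partitioning algorithm, so the sketch below shows only one plausible feedback rule under our own assumptions (the function name, step size, and clamping bounds are illustrative): shift SM share toward whichever phase is further from meeting its SLO, using measured tail TTFT for prefill and tail TPOT for decode.

```python
def adjust_sm_split(p_ttft: float, p_tpot: float,
                    slo_ttft: float, slo_tpot: float,
                    prefill_share: float, step: float = 0.05) -> float:
    """Hedged sketch of an SLO-aware repartitioning step.

    p_ttft / p_tpot are measured tail latencies; slo_* are their targets.
    Returns the new fraction of SMs assigned to prefill, clamped so
    neither phase is ever starved entirely.
    """
    ttft_pressure = p_ttft / slo_ttft   # > 1 means prefill misses its SLO
    tpot_pressure = p_tpot / slo_tpot   # > 1 means decode misses its SLO
    if ttft_pressure > tpot_pressure:
        prefill_share += step           # give prefill more SMs
    elif tpot_pressure > ttft_pressure:
        prefill_share -= step           # give decode more SMs
    return min(0.9, max(0.1, prefill_share))
```

Run periodically, such a rule converges toward a split where both pressures balance; the low-overhead switching mechanism is what makes frequent retuning affordable.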

Figure 2: System overview of semi-PD.

Evaluation

The evaluation of semi-PD demonstrates substantial reductions in request latency and increases in throughput under varied deployment scenarios. Compared to state-of-the-art methods like DistServe and vLLM, semi-PD shows up to 2.58× lower latency and 1.72× more requests served under given SLOs. The system consistently achieves high SLO adherence due to its adaptive resource management, even under high-load conditions (Figure 3).

Figure 3: The P90 TTFT and TPOT comparison on Llama series models (lower is better). For Llama3.1-405B, semi-PD is compared only against vLLM-S and vLLM-D, as DistServe could not be deployed due to storage constraints.

Implications and Future Work

The findings from semi-PD suggest significant potential for improving LLM-serving infrastructures, particularly in environments with dynamic workloads and tight latency constraints. Future developments could focus on further optimizing the low-overhead resource adjustment mechanism for even finer granular control and adapting the architecture for emerging hardware accelerators.

Conclusion

Semi-PD demonstrates significant potential in addressing the dual challenges of computational interference and storage inefficiency in LLM serving. Its architectural innovations in phased disaggregated computation and unified storage provide a compelling framework for enhancing LLM service performance, particularly in adherence to stringent latency constraints. With its demonstrated reductions in latency and improvements in throughput, semi-PD offers a robust solution for next-generation LLM serving systems.
