
Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing

Published 28 Apr 2025 in cs.DC | (2504.19516v4)

Abstract: Modern LLM serving systems confront inefficient GPU utilization due to the fundamental mismatch between compute-intensive prefill and memory-bound decode phases. While current practices attempt to address this by organizing these phases into hybrid batches, such solutions create an inefficient tradeoff that sacrifices either throughput or latency, leaving substantial GPU resources underutilized. We identify two key root causes: 1) the prefill phase suffers from suboptimal compute utilization due to wave quantization and attention bottlenecks, and 2) hybrid batches disproportionately prioritize latency over throughput, wasting both compute and memory bandwidth. To mitigate these issues, we present Bullet, a novel spatial-temporal orchestration system that eliminates these inefficiencies through precise phase coordination. Bullet enables concurrent execution of prefill and decode phases, while dynamically provisioning GPU resources using real-time performance modeling. By integrating SLO-aware scheduling and adaptive resource allocation, Bullet maximizes utilization without compromising latency targets. Experimental evaluations on real-world workloads demonstrate that Bullet delivers 1.26x average throughput gains (up to 1.55x) over state-of-the-art systems, while consistently meeting latency constraints.

Summary

  • The paper introduces Bullet, a novel system that dynamically orchestrates spatial-temporal GPU resource sharing to boost LLM serving throughput by an average of 1.26× without sacrificing latency.
  • It employs SLO-aware scheduling and adaptive resource allocation to concurrently manage compute-intensive prefill and memory-bound decode tasks, optimizing GPU utilization.
  • Experimental results reveal up to 1.55× throughput gains over current methods, offering a cost-effective solution for improving large-scale model serving efficiency.

Introduction

The paper "Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing" (2504.19516) introduces a technique to address inefficient GPU utilization in LLM serving systems. It attributes these inefficiencies to the fundamental mismatch between the compute-intensive prefill and memory-bound decode phases, which leaves GPU resources underutilized under current serving practices. The authors propose a system named Bullet that improves GPU utilization through dynamic spatial-temporal orchestration.

Methodology

Challenges and Root Causes

The paper elucidates the challenges in LLM serving systems, namely the inefficient trade-off between throughput and latency when combining compute-intensive prefill and memory-bound decode phases into hybrid batches. The root causes of inefficiencies are pinpointed as suboptimal compute utilization due to wave quantization and attention bottlenecks during the prefill phase. Additionally, the prioritization of latency over throughput in hybrid batches leads to wasted compute and memory bandwidth.

Proposed Solution: Bullet

The authors present Bullet, a spatial-temporal orchestration system designed to improve GPU utilization. Bullet enables concurrent execution of prefill and decode phases by dynamically provisioning GPU resources based on real-time performance modeling. The system incorporates SLO-aware scheduling and adaptive resource allocation to maximize GPU utilization while adhering to latency targets. The spatial-temporal orchestration ensures precise phase coordination, eliminating inefficiencies and optimizing resource usage.
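To make the SLO-aware provisioning idea concrete, here is a minimal, hedged sketch. It is not Bullet's actual model; it assumes a toy linear relationship in which decode latency grows with the fraction of GPU compute lent to a concurrent prefill, and all constants (`DECODE_SLO_MS`, `BASE_DECODE_MS`, `SLOWDOWN_PER_SHARE`) are illustrative:

```python
# Hedged sketch of SLO-aware resource provisioning, NOT the paper's model.
# Assumption: decode latency rises roughly linearly with the GPU share
# handed to a concurrent prefill (a toy performance model).

DECODE_SLO_MS = 50.0          # hypothetical time-between-tokens target
BASE_DECODE_MS = 30.0         # decode latency with the whole GPU to itself
SLOWDOWN_PER_SHARE = 60.0     # extra ms per unit of GPU share given to prefill

def predicted_decode_ms(prefill_share: float) -> float:
    """Toy model: decode slows as prefill takes a larger GPU share."""
    return BASE_DECODE_MS + SLOWDOWN_PER_SHARE * prefill_share

def max_prefill_share(slo_ms: float = DECODE_SLO_MS) -> float:
    """Largest prefill share that still keeps decode within its SLO."""
    headroom = slo_ms - BASE_DECODE_MS
    return max(0.0, min(1.0, headroom / SLOWDOWN_PER_SHARE))

share = max_prefill_share()
print(f"prefill may use up to {share:.0%} of the GPU")  # → 33% here
```

The point of the sketch is the inversion: given a latency model and an SLO, the scheduler solves for the largest resource share it can safely give the other phase, rather than picking a fixed split.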

Experimental Evaluation

The evaluation of Bullet demonstrates significant improvements in throughput and resource utilization across real-world workloads. Experimental results reveal that Bullet achieves an average throughput gain of 1.26 times, with peaks up to 1.55 times compared to state-of-the-art methods. Importantly, these throughput gains are accomplished without compromising latency constraints, showcasing the system's efficacy in balancing throughput and latency.

Implications and Future Directions

The development and implementation of Bullet have practical implications for enhancing the performance and efficiency of LLM serving systems. By optimizing the utilization of GPU resources, the system supports more efficient and cost-effective deployments in environments requiring high throughput and low latency. From a theoretical perspective, Bullet contributes to the understanding of dynamic resource allocation strategies, promoting advancements in spatial-temporal resource orchestration.

Future developments might focus on extending the proposed methodology to accommodate additional phases in LLM serving or exploring its applicability in diverse hardware architectures. Furthermore, the integration of more sophisticated machine learning models for real-time performance prediction could enhance Bullet's adaptability and efficiency in varying operational environments.

Conclusion

The paper provides compelling evidence for the effectiveness of spatial-temporal GPU resource sharing in LLM serving systems through the adoption of Bullet. By addressing the inefficiencies inherent in current serving strategies and explicitly optimizing resource allocation, the authors have demonstrated a significant leap in the throughput and utilization of GPU resources. The insights and methodologies presented set a critical foundation for future research and development in efficient large-scale model serving paradigms.

Explain it Like I'm 14

Simple Explanation of “Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration”

What is this paper about?

This paper introduces Bullet, a system that helps computers answer lots of AI questions (like chatbots do) faster and more efficiently. It focuses on how to better use powerful chips called GPUs when serving LLMs, such as those behind popular AI assistants.

What problem are they trying to solve?

When an AI model answers a question, it usually goes through two steps:

  • Prefill: The model reads and processes the whole user prompt. This step is heavy on math and uses lots of GPU “muscle.”
  • Decode: The model then writes the answer one token (word or piece of a word) at a time. This step frequently needs to fetch a lot of stored information, so it’s limited more by memory speed than by pure math power.
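A back-of-envelope calculation shows why the two steps stress the GPU so differently. For one d×d weight matrix, the arithmetic intensity (FLOPs per byte of weights read) scales with the number of tokens processed at once; the machine-balance constant below is hypothetical, and the numbers are purely illustrative:

```python
# Back-of-envelope arithmetic intensity for one d x d weight matrix in fp16.
# Assumption: weight reads dominate memory traffic; numbers are illustrative.

def arithmetic_intensity(num_tokens: int, d: int = 4096) -> float:
    flops = 2 * num_tokens * d * d          # one multiply-add per weight per token
    bytes_moved = 2 * d * d                 # fp16 weights read once from memory
    return flops / bytes_moved              # ≈ num_tokens FLOPs per byte

MACHINE_BALANCE = 150.0  # hypothetical GPU break-even point (FLOPs per byte)

prefill = arithmetic_intensity(num_tokens=2048)   # whole prompt at once
decode = arithmetic_intensity(num_tokens=1)       # one token at a time
print(prefill >= MACHINE_BALANCE, decode >= MACHINE_BALANCE)  # True False
```

Prefill amortizes each weight read over thousands of tokens, so it is limited by math throughput; decode reads the same weights for a single token, so it is limited by memory bandwidth.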

These two steps stress the GPU in very different ways. Current systems often mix these steps together in one batch to keep the GPU busy, but that creates a bad tradeoff:

  • If they try to respond quickly (low latency), they don’t handle as many requests per second (lower throughput).
  • If they try to handle many requests (high throughput), individual responses might slow down (higher latency).

Either way, a lot of the GPU’s power is still left unused.

What are the key questions the paper asks?

  • Why are GPUs underused when serving LLMs, even with smart batching?
  • Can we run the prefill and decode steps at the same time without slowing responses?
  • How do we split and schedule GPU resources so we meet response-time promises while serving more users?

How does Bullet work?

Think of the GPU as a big, shared kitchen:

  • Prefill is like a chef doing intense cooking (a lot of heavy work on the stove).
  • Decode is like a chef who keeps running to the pantry for ingredients (lots of waiting on memory).

Instead of putting both chefs in the same line where they get in each other’s way, Bullet:

  • Runs prefill and decode concurrently, but in a carefully organized way.
  • Splits the GPU “by space and time” (spatial-temporal orchestration):
    • Spatial: Give different parts of the GPU to different tasks.
    • Temporal: Schedule tasks at different moments so they don’t trip over each other.
  • Uses a real-time performance model to decide how much GPU each step should get right now. It’s like a traffic controller that watches current conditions and adjusts lanes and lights on the fly.
  • Is “SLO-aware.” An SLO (Service Level Objective) is a promise like “your reply will arrive within X milliseconds.” Bullet plans and schedules work to meet these time promises while still pushing for high throughput.
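The "traffic controller" idea above can be sketched as a simple feedback loop. This is a hedged toy illustration, not Bullet's actual controller: each tick it nudges the spatial split toward decode when the measured latency misses the SLO, and lends resources back to prefill when there is slack (the step size and clamp bounds are made up):

```python
# Toy "traffic controller": each tick, nudge the spatial split between
# decode and prefill based on measured decode latency vs. the SLO.
# A hedged illustration, not Bullet's actual control algorithm.

def adjust_split(decode_share: float, measured_ms: float,
                 slo_ms: float, step: float = 0.05) -> float:
    """Grow decode's GPU share when it misses the SLO, shrink it when
    there is slack; clamp so both phases always keep some resources."""
    if measured_ms > slo_ms:
        decode_share += step          # decode too slow: give it more of the GPU
    elif measured_ms < 0.8 * slo_ms:
        decode_share -= step          # plenty of slack: lend resources to prefill
    return min(0.9, max(0.1, decode_share))

share = 0.5
for measured in (60.0, 58.0, 35.0, 35.0):   # simulated decode latencies (ms)
    share = adjust_split(share, measured, slo_ms=50.0)
print(f"decode share settles at {share:.2f}")
```

The real system replaces this crude rule with a performance model that predicts, rather than merely reacts to, the effect of each split, but the control-loop shape is the same.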

Along the way, the authors point out two root causes of waste:

  1. Prefill sometimes can’t use all GPU compute due to “wave quantization” and “attention bottlenecks.”
    • Wave quantization (simple view): The math work comes in chunks that don’t always fit the GPU perfectly, like passengers boarding a bus in uneven groups—some seats stay empty even if the bus is mostly full.
    • Attention bottlenecks: The attention mechanism must look up a lot of information, which can slow things down (like repeatedly checking a big notebook while writing).
  2. Hybrid batches (mixing prefill and decode in one batch) often lean too much toward keeping latency low, leaving compute units or memory bandwidth unused.
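The bus analogy for wave quantization can be put in numbers. A GPU kernel runs its thread blocks in "waves" of up to one block per streaming multiprocessor (SM); a partial last wave leaves SMs idle. The sketch below uses the 108 SMs of an NVIDIA A100 as the example bus size:

```python
# Wave quantization in numbers (toy illustration): thread blocks run in
# waves of SM-count size, so a small final wave leaves most SMs idle.

import math

def wave_utilization(num_blocks: int, num_sms: int = 108) -> float:
    """Fraction of SM slots doing useful work across all waves
    (108 SMs as on an NVIDIA A100)."""
    waves = math.ceil(num_blocks / num_sms)
    return num_blocks / (waves * num_sms)

print(f"{wave_utilization(216):.0%}")  # 2 full waves: every seat taken
print(f"{wave_utilization(220):.0%}")  # a 4-block third wave drags the average down
```

Going from 216 to 220 blocks adds a nearly empty third wave, so utilization drops from 100% to about 68% even though only 2% more work was launched, which is exactly the "empty seats on the bus" effect described above.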

Bullet’s approach avoids these pitfalls by coordinating the two phases precisely and by adjusting resource splits dynamically using live measurements and predictions.

What did they find?

  • Bullet improved the number of requests served per second (throughput) by an average of 1.26× and up to 1.55× compared to leading systems.
  • It still met latency targets (responses stayed within the promised time). In short: more users served, same snappy feel.

Why is this important?

  • Better GPU use means lower costs for running AI at scale (fewer machines needed for the same workload).
  • Users get fast replies even when many people are using the service at once.
  • This helps AI services be more reliable, greener (less wasted power), and more affordable.

What’s the bigger impact?

Bullet shows that:

  • Treating the prefill and decode steps differently—and coordinating them smartly—can unlock a lot of hidden performance.
  • Real-time, data-driven scheduling on GPUs can keep both speed (latency) and productivity (throughput) high at the same time. This idea could influence future AI serving systems, making LLMs more efficient and widely accessible.
