Optimal Scheduling Algorithms for LLM Inference: Theory and Practice

Published 1 Aug 2025 in cs.LG and cs.DC | (2508.01002v2)

Abstract: With the growing use of LLM-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles-optimal tiling and dynamic resource allocation-that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under mild conditions. To address practical Service Level Objectives (SLOs) such as serving requests with different Time Between Token (TBT) constraints, we design the SLO-Aware LLM Inference (SLAI) scheduler. SLAI uses real-time measurements to prioritize decode requests that are close to missing their TBT deadlines and reorders prefill requests based on known prompt lengths to further reduce the Time To First Token (TTFT) delays. We evaluate SLAI on the Openchat ShareGPT4 dataset using the Mistral-7B model on an NVIDIA RTX ADA 6000 GPU. Compared to Sarathi-Serve, SLAI reduces the median TTFT by 53% and increases the maximum serving capacity by 26% such that median TTFT is below 0.5 seconds, while meeting tail TBT latency constraints.

Summary

  • The paper introduces a theoretical framework and a resource-aware dynamic (RAD) scheduler that optimizes both throughput and latency in LLM inference.
  • The authors design an SLO-aware scheduler (SLAI) that minimizes Time To First Token (TTFT) while meeting heterogeneous Time Between Tokens (TBT) constraints.
  • Experimental results on the Mistral-7B model show a 53% reduction in median TTFT and a 26% improvement in maximum serving capacity compared to baseline systems.

Summary of "Optimal Scheduling Algorithms for LLM Inference: Theory and Practice"

Introduction

The paper "Optimal Scheduling Algorithms for LLM Inference: Theory and Practice" (2508.01002) addresses the challenge of optimizing inference serving systems for LLMs such as GPT-4 and Mistral-7B, which are increasingly deployed across industries for applications like chatbots and coding assistants. Inference in these systems proceeds in two phases: a prefill-phase that processes the full input prompt, and a decode-phase that generates output tokens one at a time. The paper argues that this structure calls for new scheduling strategies to optimize throughput and latency, the latter measured by Time To First Token (TTFT) and Time Between Tokens (TBT).
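The two-phase structure and the two latency metrics can be made concrete with a toy model (this is an illustrative sketch, not the paper's implementation; the `step_time` parameter and the assumption that prefill parallelizes over prompt tokens are simplifications for exposition):

```python
import time

def serve_request(prompt_tokens, max_new_tokens, step_time=0.001):
    """Toy model of LLM inference: one prefill pass over the whole
    prompt, then one decode step per generated token.

    `step_time` stands in for the cost of a forward pass; all timing
    constants here are illustrative, not measured.
    """
    start = time.monotonic()

    # Prefill: the entire prompt is processed in a single pass, which
    # parallelizes over prompt tokens (hence the division by 8 here).
    time.sleep(step_time * len(prompt_tokens) / 8)
    ttft = time.monotonic() - start  # Time To First Token

    # Decode: autoregressive, one token per step.
    tbt = []  # Time Between Tokens, one gap per subsequent token
    last = time.monotonic()
    for _ in range(max_new_tokens - 1):
        time.sleep(step_time)
        now = time.monotonic()
        tbt.append(now - last)
        last = now
    return ttft, tbt
```

The asymmetry this exposes is the crux of the scheduling problem: prefill is compute-dense and amortizes over many tokens at once, while decode issues many small sequential steps, so batching the two phases together requires care.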

Contributions

The authors present several key contributions:

  1. Theoretical Framework: They propose a comprehensive framework for modeling request routing and scheduling in LLM inference systems, capturing the salient features for practical policies.
  2. Throughput Optimal Scheduler: A Resource-Aware Dynamic (RAD) scheduler is introduced, achieving throughput optimality under mild assumptions by using optimal tiling and dynamic resource allocation between prefill and decode workloads.
  3. Practical Insights: Practical insights into approximating the proposed design principles in deployments are discussed, particularly focusing on Service Level Objectives (SLOs).
  4. SLO-Aware Scheduler: They design a scheduler called SLO-Aware LLM Inference (SLAI) to minimize TTFT while adhering to heterogeneous TBT constraints, focusing on prioritizing critical requests to meet SLO deadlines effectively.
  5. Experimental Performance: SLAI is evaluated against Sarathi-Serve, showing significant efficiency improvements in TTFT and maximum serving capacity.
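The SLAI priorities described above, prioritizing decode requests closest to missing their TBT deadlines and ordering prefills by known prompt length, can be sketched with two priority queues. This is a simplified illustration of the stated principles, not the paper's algorithm; the `budget_tokens` batch limit, the data classes, and the one-token-per-decode accounting are assumptions made for the example:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class DecodeReq:
    slack: float                    # seconds until the TBT deadline is missed
    rid: int = field(compare=False) # request id, not used for ordering

@dataclass(order=True)
class PrefillReq:
    prompt_len: int                 # known at arrival; shorter first cuts TTFT
    rid: int = field(compare=False)

def next_batch(decode_heap, prefill_heap, budget_tokens):
    """Assemble one iteration's batch: decode requests with the least
    deadline slack go first (each consumes one token of budget), then
    the remaining budget is filled with the shortest pending prefills."""
    batch = []
    # Urgent decodes first: least slack at the top of the min-heap.
    while decode_heap and budget_tokens > 0:
        req = heapq.heappop(decode_heap)
        batch.append(("decode", req.rid))
        budget_tokens -= 1
    # Remaining budget goes to prefills, shortest prompt first.
    while prefill_heap and prefill_heap[0].prompt_len <= budget_tokens:
        req = heapq.heappop(prefill_heap)
        batch.append(("prefill", req.rid))
        budget_tokens -= req.prompt_len
    return batch
```

Ordering decodes by deadline slack is an earliest-deadline-first policy on the TBT constraint, and ordering prefills by prompt length resembles shortest-remaining-processing-time scheduling, which is why it reduces median TTFT.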

Experimental Findings

The experimental evaluation using the Mistral-7B model demonstrates that SLAI reduces median TTFT by 53% and improves maximum serving capacity by 26% compared to Sarathi-Serve. The improvement is noted under constraints that the median TTFT does not exceed 0.5 seconds, while tail TBT latency is maintained within required limits.

Implications

The implications of this work are both practical and theoretical. Practically, the proposed algorithms provide a way to significantly enhance LLM inference efficiency and fulfill stringent SLOs, thus improving user experience while optimizing hardware utilization. Theoretically, the paper contributes insights into optimal scheduling policies that can guide future research and development of LLM serving systems.

Conclusion

The paper identifies critical design principles for optimizing LLM inference systems and provides robust solutions to the throughput and latency issues essential for real-world LLM applications. Future work may explore adaptive scheduling tactics and broader heterogeneous computing environments, enhancing both single-node and distributed inference systems.
