
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Published 15 Apr 2025 in cs.LG, cs.AI, cs.DC, math.OC, and stat.ML | arXiv:2504.11320v1

Abstract: LLMs are indispensable in today's applications, but their inference procedure -- generating responses by processing text in segments and using a memory-heavy Key-Value (KV) cache -- demands significant computational resources, particularly under memory constraints. This paper formulates LLM inference optimization as a multi-stage online scheduling problem where sequential prompt arrivals and KV cache growth render conventional scheduling ineffective. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design. Building on this, we propose the Waiting for Accumulated Inference Threshold (WAIT) algorithm, which uses multiple thresholds to schedule incoming prompts optimally when output lengths are known, and extend it to Nested WAIT for cases with unknown output lengths. Theoretical analysis shows that both algorithms achieve near-optimal performance against the fluid benchmark in heavy traffic conditions, balancing throughput, latency, and Time to First Token (TTFT). Experiments with the Llama-7B model on an A100 GPU using both synthetic and real-world datasets demonstrate improved throughput and reduced latency relative to established baselines like vLLM and Sarathi. This work bridges operations research and machine learning, offering a rigorous framework for the efficient deployment of LLMs under memory constraints.

Summary

  • The paper introduces a fluid dynamics-based online scheduling approach that achieves asymptotically optimal throughput for LLM inference.
  • WAIT leverages known prompt output lengths to form batches under memory constraints, reducing latency and resource usage.
  • Nested WAIT extends the method to handle unknown output lengths, dynamically managing memory and ensuring high throughput under varied loads.


Introduction

This paper tackles the challenge of optimizing inference for LLMs, which are integral to various NLP applications. The inference procedure requires significant computational resources due to the memory-intensive Key-Value (KV) cache used during the process. The paper proposes a novel approach to LLM inference optimization by framing it as a multi-stage online scheduling problem. This involves a fluid dynamics approximation to establish a benchmark for effective scheduling strategies, ultimately leading to the development of the WAIT and Nested WAIT algorithms.

Fluid Dynamics Approximation

The core contribution of this paper is the introduction of a fluid dynamics model to approximate the stochastic behavior of prompt arrivals and processing in LLM inference. This model serves as a benchmark for analyzing the system's equilibrium state, where prompt arrivals balance completions and memory usage stabilizes. Memory constraints are critical as they determine the maximum number of prompts a system can handle concurrently.

The paper begins with a single-type fluid model and extends it to a multi-type setup. The equilibrium analysis reveals that the optimal throughput is governed by the sum of arrival rates and the inherent time cost per batch iteration.
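The fluid benchmark can be illustrated with a small back-of-the-envelope calculation. The sketch below is our own simplification, not the paper's formulation: `arrival_rates`, `output_lens`, `iter_cost`, and `memory` are assumed symbols standing in for the per-type arrival rates, output lengths, per-iteration time cost, and KV-cache capacity.

```python
# Hypothetical sketch of the fluid-model throughput benchmark.
# Assumed symbols: arrival_rates[i] = arrival rate of prompt type i,
# output_lens[i] = decode steps for type i, iter_cost = seconds per
# batch iteration, memory = KV-cache capacity in tokens.

def fluid_throughput_bound(arrival_rates, output_lens, iter_cost, memory):
    """Upper bound on equilibrium throughput (prompts/sec).

    In equilibrium the system completes prompts no faster than they
    arrive, and no faster than memory permits: each in-flight prompt of
    type i holds a KV cache that grows toward output_lens[i] tokens.
    """
    demand = sum(arrival_rates)  # total arrival rate
    # Average decode work per prompt, weighted by the arrival mix.
    avg_steps = sum(l * d for l, d in zip(arrival_rates, output_lens)) / demand
    # Memory-limited batch size if each prompt holds ~avg_steps KV tokens.
    max_batch = memory / avg_steps
    # A batch of max_batch prompts finishes every avg_steps iterations.
    service_rate = max_batch / (avg_steps * iter_cost)
    return min(demand, service_rate)
```

When memory is plentiful the bound is simply the total arrival rate; when memory is tight, the memory-limited service rate binds instead, matching the paper's observation that throughput is governed jointly by arrival rates and the per-iteration time cost.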

WAIT Algorithm

The WAIT algorithm is designed for settings where the output length of prompts is known at arrival. WAIT optimizes scheduling by maintaining thresholds for each prompt type, leveraging the fluid model to guide batch formation and scheduling decisions. The algorithm strategically accumulates prompts until specific thresholds are met, ensuring efficient resource usage and minimizing latency under heavy traffic conditions.
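The accumulate-until-threshold idea can be sketched in a few lines. This is a minimal illustration of the mechanism, not the paper's exact procedure: the threshold values are taken as inputs here, whereas in the paper they are derived from the fluid model.

```python
from collections import deque

class WaitScheduler:
    """Toy sketch of WAIT-style batching: hold arriving prompts in
    per-type queues and launch a batch only once every type's queue
    has accumulated to its threshold."""

    def __init__(self, thresholds):
        # thresholds: dict mapping prompt type -> required queue length
        self.thresholds = thresholds
        self.queues = {t: deque() for t in thresholds}

    def arrive(self, prompt_type, prompt):
        self.queues[prompt_type].append(prompt)

    def try_form_batch(self):
        """Return a batch once every type meets its threshold; otherwise
        return None and keep waiting (the 'W' in WAIT)."""
        if all(len(self.queues[t]) >= k for t, k in self.thresholds.items()):
            batch = []
            for t, k in self.thresholds.items():
                batch.extend(self.queues[t].popleft() for _ in range(k))
            return batch
        return None
```

Waiting until the thresholds are met trades a small amount of queueing delay for well-filled batches, which is what yields the near-optimal throughput in the heavy-traffic analysis.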

Heavy traffic analysis demonstrates that WAIT achieves asymptotic optimality, matching the fluid benchmark's throughput while keeping latency and time to first token (TTFT) manageable (Figure 1).

Figure 1: Example of batching and scheduling.

Nested WAIT Algorithm

In real-world scenarios, the output length may not be known at the time of arrival, which complicates scheduling. The Nested WAIT algorithm extends WAIT to accommodate unknown output lengths by organizing processing into a hierarchical framework of segments. Each segment handles prompts up to a specific output length, effectively managing random arrivals and the dynamically growing KV cache.

The algorithm continues to use thresholds per segment, making it robust against the uncertainties of batch composition. It provides theoretical guarantees similar to WAIT, with memory bounds ensuring high-probability performance without exceeding capacity (Figure 2).
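The segment structure can be illustrated with a routing rule. This is our own simplification under stated assumptions: `boundaries` is a hypothetical list of per-segment output-length limits, and a prompt that reaches a segment's limit without finishing is promoted to the next segment, so its (unknown) output length is revealed gradually.

```python
# Illustrative sketch of Nested WAIT's segment hierarchy (a
# simplification, not the paper's exact construction). Segment j is
# responsible for prompts whose generation has progressed to fewer
# than boundaries[j] output tokens.

def route_to_segment(tokens_generated, boundaries):
    """Return the index of the segment responsible for a prompt that
    has produced `tokens_generated` tokens so far.

    `boundaries` are increasing segment limits, e.g. [32, 128, 512].
    """
    for j, limit in enumerate(boundaries):
        if tokens_generated < limit:
            return j
    # The final segment keeps long prompts until they complete.
    return len(boundaries) - 1
```

Because each segment only ever holds prompts below its length limit, the scheduler can bound each segment's KV-cache footprint even though individual output lengths are unknown in advance.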

Figure 2: Pipeline of the Nested WAIT Algorithm.

Numerical Experiments

Experiments compare WAIT and Nested WAIT against baseline methods such as vLLM and Sarathi, using both synthetic and real-world datasets. The results show that the proposed algorithms consistently outperform the baselines in throughput across both low- and high-demand scenarios.

The figures show that WAIT reduces latency while improving throughput, approaching the computed fluid-dynamics benchmark in practical settings (Figure 3).

Figure 3: Average throughput and latency across algorithms on the synthetic dataset.

Conclusion

The work bridges the disciplines of operations research and machine learning by providing a mathematical framework applicable to LLM inference under memory constraints. Future research directions include exploring multi-GPU systems, volatile arrival dynamics, and evolving system-level optimizations, all of which can further improve LLM deployment efficiency.

The analytical foundation laid by this study paves the way for robust, scalable scheduling solutions that optimize the performance of LLMs in diverse operating environments.
