Optimal Scheduling for LLM Inference
- Optimal scheduling algorithms for LLM inference focus on minimizing latency and maximizing throughput by efficiently allocating GPU resources under unpredictable output lengths.
- The conservative A_max method uses predicted upper bounds to avoid memory overflow but often leads to resource under-utilization, while the adaptive A_min refines lower-bound estimates to maintain robust performance.
- Empirical evaluations demonstrate that A_min achieves near-optimal throughput and latency on real-world LLM workloads, making it ideal for complex, uncertain inference environments.
Optimal scheduling algorithms for LLM inference seek to minimize latency, maximize throughput, and efficiently utilize compute and memory resources, particularly under the constraints of unknown or uncertain output lengths, GPU KV-cache limitations, and unpredictable request characteristics. This area combines algorithmic advances in online and robust scheduling, learning-to-rank methods, competitive analysis, and practical system design, with particular emphasis on resilience against prediction error and adaptability to real workload distributions.
1. Problem Setting and Scheduling Challenges
LLM inference is an online, multi-job service in which each job represents a user request. Upon arrival, the prompt length is known, but the output length, the dominant factor in both KV-cache memory consumption and service duration, is unknown. Each active request contributes m_i + o_i tokens to the cache, where m_i is the prompt length and o_i is the number of output tokens generated so far. The scheduler allocates GPU resources under a hard memory constraint, seeking to minimize total end-to-end latency, defined as the sum across jobs of each job's completion time.
The fundamental challenge is the uncertainty in output lengths: for each job, only an interval [a_i, b_i] containing the true output length ℓ_i is available, derived from lightweight predictors. Underestimating ℓ_i leads to memory overflow, while overestimating causes severe under-utilization of available resources. The scheduler must decide, online and non-preemptively (but allowing for opportunistic cancellation and restart), which requests to start and in what order.
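As a concrete sketch of this job model (illustrative names, not the paper's code), each request can be represented by its known prompt length, its prediction interval, and its progress so far:

```python
from dataclasses import dataclass

@dataclass
class Request:
    """One inference job: the prompt length is known on arrival,
    the true output length is hidden from the scheduler."""
    prompt_len: int     # m_i, known
    lo: int             # a_i, predicted lower bound on output length
    hi: int             # b_i, predicted upper bound on output length
    generated: int = 0  # o_i, output tokens produced so far

    def cache_tokens(self) -> int:
        # Current KV-cache footprint: prompt plus generated tokens.
        return self.prompt_len + self.generated

def cache_usage(batch: list) -> int:
    # Total KV-cache tokens held by the running batch.
    return sum(r.cache_tokens() for r in batch)
```

The hard constraint is that `cache_usage` must stay below GPU capacity at every decode iteration, even as each active job's `generated` count grows by one per step.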
2. Robust Scheduling Algorithms under Output-Length Uncertainty
2.1. A_max: Conservative Upper-Bound Scheduling
This baseline algorithm operates under maximal caution: each request's output length is assumed to be its predicted upper bound b_i. At every step, as many jobs as possible are started such that, accounting for their projected memory usage (based on b_i), the cache capacity will never be exceeded at any future iteration. While this avoids memory overflow, it wastes substantial capacity when upper bounds are loose, significantly inflating total latency.
The competitive ratio of A_max, measured against the hindsight-optimal scheduler with perfect length knowledge, grows with the looseness ratio α = b/a of the prediction intervals. As α → ∞ (intervals widen), performance deteriorates rapidly (Chen et al., 20 Aug 2025).
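A minimal sketch of the A_max admission rule (plain dicts, illustrative names). Summing each job's worst-case footprint, prompt plus upper bound b_i, gives a conservative sufficient condition for never exceeding capacity; the paper's per-iteration accounting is finer-grained than this:

```python
def admit_amax(pending, active, capacity):
    """Start as many pending jobs as possible such that, even if every
    job runs to its predicted upper bound 'hi', the cache never
    overflows. Jobs are dicts with 'prompt' and 'hi' token counts."""
    # Worst-case peak usage of the jobs already running.
    peak = sum(r["prompt"] + r["hi"] for r in active)
    started = []
    for r in list(pending):
        need = r["prompt"] + r["hi"]  # this job's worst-case footprint
        if peak + need <= capacity:
            peak += need
            active.append(r)
            pending.remove(r)
            started.append(r)
    return started
```

Note how a loose `hi` blocks admission of jobs that would in fact have fit, which is exactly the under-utilization the text describes.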
2.2. A_min: Adaptive Lower-Bound Scheduling
A_min is an adaptively robust scheduling algorithm that initializes every job with its predicted lower bound a_i as the output-length estimate. During execution, if memory would be overrun as jobs accumulate output tokens, the scheduler evicts the jobs with the smallest attained progress (the number of generated tokens), updates a_i to this value, and returns these jobs to the pending queue for possible later restart.
Key properties:
- Uses only lower bounds, requiring minimal prediction accuracy.
- Refines estimates on-the-fly; jobs are evicted and their estimates corrected only if they threaten cache overflow.
- Per-batch complexity is O(n log n), as jobs are sorted by attained progress to prioritize evictions.
- Theoretical guarantee: a competitive ratio of O(log α), i.e., A_min maintains a logarithmic competitive ratio even for wide uncertainty intervals. It remains close to optimal both in theory and across real LLM workloads (Chen et al., 20 Aug 2025).
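The eviction loop can be sketched as follows (dict-based and illustrative; restart semantics are simplified here to full recomputation from the prompt):

```python
def amin_decode_step(active, pending, capacity):
    """One decode iteration of an A_min-style scheduler (sketch).
    Each active job emits one token; if the cache would overflow, the
    least-progressed jobs are evicted, their lower-bound estimate 'lo'
    is raised to the progress they attained, and they are requeued."""
    for r in active:
        r["generated"] += 1  # one new output token per active job

    def usage():
        return sum(r["prompt"] + r["generated"] for r in active)

    # Sort by attained progress so the cheapest-to-lose jobs go first.
    active.sort(key=lambda r: r["generated"])
    while usage() > capacity and active:
        victim = active.pop(0)              # smallest progress
        victim["lo"] = victim["generated"]  # refined lower-bound estimate
        victim["generated"] = 0             # will restart from scratch
        pending.append(victim)
```

Evicting the least-progressed job minimizes wasted work, and the refined `lo` means a restarted job is admitted with a more accurate memory estimate.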
3. Theoretical Performance and Competitive Analysis
The competitive ratio framework evaluates online algorithms against the ideal offline scheduler, which knows all true output lengths ℓ_i in advance. Under homogeneous intervals [a, b], the main results are:
| Algorithm | Competitive Ratio | Sensitivity to Interval Width |
|---|---|---|
| A_max | Grows with α = b/a | Blows up as α → ∞ |
| A_min | O(log α) | Remains moderate as α → ∞ |
As interval estimates become loose (α → ∞), A_min's performance remains within a logarithmic factor of optimal, while A_max becomes impractical. When interval accuracy is high (α ≈ 1), both algorithms perform comparably and near-optimally (Chen et al., 20 Aug 2025).
4. Empirical Evaluation and Practical Implementation
Experiments on real-world datasets (e.g., 2000 samples from LMSYS-Chat-1M using LLaMA2-70B) empirically validate the algorithms under three interval prediction settings:
- Extreme [1, 1000] intervals for all jobs,
- Binned intervals of width 100,
- Overlapping intervals around the true output length.
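The three settings can be generated roughly as follows (a sketch; the exact parameterization of the overlapping setting is not reproduced in this summary, so a symmetric multiplicative window stands in for it):

```python
def make_intervals(true_lens, setting, bin_width=100, delta=0.5):
    """Produce (lo, hi) prediction intervals for each true output length."""
    out = []
    for length in true_lens:
        if setting == "extreme":
            out.append((1, 1000))  # same loose interval for every job
        elif setting == "binned":
            lo = (length - 1) // bin_width * bin_width + 1
            out.append((lo, lo + bin_width - 1))  # bucket of width 100
        elif setting == "overlapping":
            # Stand-in parameterization: a symmetric window of
            # relative half-width delta around the true length.
            out.append((max(1, int(length * (1 - delta))),
                        int(length * (1 + delta))))
    return out
```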
Results:
- Under wide intervals ([1, 1000]), A_min matches the performance of the hindsight-optimal scheduler, while A_max suffers 2–3× higher latency.
- As intervals narrow, A_max improves but remains inferior to A_min.
- Even at extreme looseness (α = 1000), A_min tracks the offline optimum, demonstrating robustness to poor upper-bound predictions (Chen et al., 20 Aug 2025).
Implementation notes:
- Per-iteration overhead of A_min is modest at typical GPU memory capacities.
- Lower-bound predictions can be generated with lightweight regressors or quantile-based methods.
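As one hedged example of such a predictor (illustrative, not the paper's method; bucketing by prompt features would happen upstream), a conservative lower bound can be taken as a low empirical quantile of historical output lengths:

```python
def quantile_lower_bound(history, q=0.1):
    """Predict a conservative lower bound a_i as the q-th empirical
    quantile of previously observed output lengths."""
    s = sorted(history)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]
```

A low quantile rarely overestimates, which suits A_min: an occasional underestimate only triggers an eviction-and-refine step rather than a hard failure.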
5. Algorithm Selection and Deployment Considerations
- Tight interval predictions (α ≈ 1): A_max is simple and effective.
- Loose or asymmetric intervals (α ≫ 1), or when only lower bounds are reliably predicted: A_min offers adaptivity and optimality guarantees.
- In all scenarios with significant tail risk or predictor uncertainty, the adaptivity of A_min makes it the default robust choice.
6. Broader Implications and Extensions
This adaptively robust approach to LLM inference scheduling is broadly extensible:
- Easily incorporates lightweight learning-based predictors for output length intervals.
- Generalizes to multi-GPU settings, batch-heterogeneous architectures, and distributed inference systems by suitable memory tracking and coordinated eviction schemes.
- The methodology provides a template for robust resource allocation under uncertainty where hard constraints (e.g., memory, energy) and variable service requirements prevail.
The “adaptively optimal” design of A_min, aggressively batching jobs under lower-bound estimates and invoking on-the-fly corrections upon detecting imminent capacity violation, demonstrates that learning-augmented algorithms can achieve practical, robust, and theoretically sound scheduling in the high-variance environment of LLM serving (Chen et al., 20 Aug 2025).