Optimal Scheduling for LLM Inference
- Optimal scheduling algorithms for LLM inference focus on minimizing latency and maximizing throughput by efficiently allocating GPU resources under unpredictable output lengths.
- The conservative A_max method uses predicted upper bounds to avoid memory overflow but often leads to resource under-utilization, while the adaptive A_min refines lower-bound estimates to maintain robust performance.
- Empirical evaluations demonstrate that A_min achieves near-optimal throughput and latency on real-world LLM workloads, making it ideal for complex, uncertain inference environments.
Optimal scheduling algorithms for LLM inference seek to minimize latency, maximize throughput, and efficiently utilize compute and memory resources, particularly under the constraints of unknown or uncertain output lengths, GPU KV-cache limitations, and unpredictable request characteristics. This area combines algorithmic advances in online and robust scheduling, learning-to-rank methods, competitive analysis, and practical system design, with particular emphasis on resilience against prediction error and adaptability to real workload distributions.
1. Problem Setting and Scheduling Challenges
LLM inference is an online, multi-job service in which each job represents a user request. Upon arrival, the prompt length is known, but the output length, the dominant factor in both KV-cache memory consumption and service duration, is unknown. Each active request contributes m_i + o_i tokens to the cache, where m_i is the prompt length and o_i is the number of output tokens generated so far. The scheduler allocates GPU resources under a hard memory constraint, seeking to minimize total end-to-end latency, defined as the sum across jobs of each job's completion time.
The fundamental challenge is the uncertainty in output lengths: for each job, only an interval [a_i, b_i] containing the true output length ℓ_i is available, derived from lightweight predictors. Underestimating ℓ_i leads to memory overflow, while overestimating causes severe under-utilization of available resources. The scheduler must decide, online and non-preemptively (but allowing for opportunistic cancellation and restart), which requests to start and in what order.
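As a concrete sketch of this job model (illustrative names, not the paper's code), each request can be represented by its known prompt length, its prediction interval, and its progress so far:

```python
from dataclasses import dataclass

@dataclass
class Request:
    """One inference job: the prompt length is known on arrival,
    the true output length is hidden from the scheduler."""
    prompt_len: int     # m_i, known
    lo: int             # a_i, predicted lower bound on output length
    hi: int             # b_i, predicted upper bound on output length
    generated: int = 0  # o_i, output tokens produced so far

    def cache_tokens(self) -> int:
        # Current KV-cache footprint: prompt plus generated tokens.
        return self.prompt_len + self.generated

def cache_usage(batch: list) -> int:
    # Total KV-cache tokens held by the running batch.
    return sum(r.cache_tokens() for r in batch)
```

The hard constraint is that `cache_usage` must stay below GPU capacity at every decode iteration, even as each active job's `generated` count grows by one per step.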
2. Robust Scheduling Algorithms under Output-Length Uncertainty
2.1. A_max: Conservative Upper-Bound Scheduling
This baseline algorithm operates under maximal caution: each request's output length is assumed to be its predicted upper bound b_i. At every step, as many jobs as possible are started such that, accounting for their projected memory usage (based on b_i), the cache capacity will never be exceeded at any future iteration. While this avoids memory overflow, it wastes substantial capacity when upper bounds are loose, significantly inflating total latency.
The competitive ratio of A_max, measured against the hindsight-optimal scheduler with perfect length knowledge, grows with the looseness ratio α = b/a of the prediction intervals. As α → ∞ (intervals widen), performance deteriorates rapidly (Chen et al., 20 Aug 2025).
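A minimal sketch of the A_max admission rule (plain dicts, illustrative names). Summing each job's worst-case footprint, prompt plus upper bound b_i, gives a conservative sufficient condition for never exceeding capacity; the paper's per-iteration accounting is finer-grained than this:

```python
def admit_amax(pending, active, capacity):
    """Start as many pending jobs as possible such that, even if every
    job runs to its predicted upper bound 'hi', the cache never
    overflows. Jobs are dicts with 'prompt' and 'hi' token counts."""
    # Worst-case peak usage of the jobs already running.
    peak = sum(r["prompt"] + r["hi"] for r in active)
    started = []
    for r in list(pending):
        need = r["prompt"] + r["hi"]  # this job's worst-case footprint
        if peak + need <= capacity:
            peak += need
            active.append(r)
            pending.remove(r)
            started.append(r)
    return started
```

Note how a loose `hi` blocks admission of jobs that would in fact have fit, which is exactly the under-utilization the text describes.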
2.2. A_min: Adaptive Lower-Bound Scheduling
A_min is an adaptively robust scheduling algorithm that initializes every job with its predicted lower bound a_i as the output-length estimate. During execution, if memory would be overrun as jobs accumulate output tokens, the scheduler evicts the jobs with the smallest attained progress (the number of generated tokens), updates a_i to this value, and returns these jobs to the pending queue for possible later restart.
Key properties:
- Uses only lower bounds, requiring minimal prediction accuracy.
- Refines estimates on-the-fly; jobs are evicted and their estimates corrected only if they threaten cache overflow.
- Per-batch complexity is O(n log n), as jobs are sorted by attained progress to prioritize evictions.
- Theoretical guarantee: a competitive ratio of O(log α), i.e., A_min maintains a logarithmic competitive ratio even for wide uncertainty intervals. It remains close to optimal both in theory and across real LLM workloads (Chen et al., 20 Aug 2025).
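The eviction loop can be sketched as follows (dict-based and illustrative; restart semantics are simplified here to full recomputation from the prompt):

```python
def amin_decode_step(active, pending, capacity):
    """One decode iteration of an A_min-style scheduler (sketch).
    Each active job emits one token; if the cache would overflow, the
    least-progressed jobs are evicted, their lower-bound estimate 'lo'
    is raised to the progress they attained, and they are requeued."""
    for r in active:
        r["generated"] += 1  # one new output token per active job

    def usage():
        return sum(r["prompt"] + r["generated"] for r in active)

    # Sort by attained progress so the cheapest-to-lose jobs go first.
    active.sort(key=lambda r: r["generated"])
    while usage() > capacity and active:
        victim = active.pop(0)              # smallest progress
        victim["lo"] = victim["generated"]  # refined lower-bound estimate
        victim["generated"] = 0             # will restart from scratch
        pending.append(victim)
```

Evicting the least-progressed job minimizes wasted work, and the refined `lo` means a restarted job is admitted with a more accurate memory estimate.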
3. Theoretical Performance and Competitive Analysis
The competitive ratio framework evaluates online algorithms against the ideal offline scheduler, which knows all true output lengths ℓ_i in advance. Under homogeneous intervals [a, b], the main results are:
| Algorithm | Competitive Ratio | Sensitivity to Interval Width |
|---|---|---|
| A_max | Grows with α = b/a | Blows up as α → ∞ |
| A_min | O(log α) | Remains moderate as α → ∞ |
As interval estimates become loose (α → ∞), A_min's performance remains within a logarithmic factor of optimal, while A_max becomes impractical. When interval accuracy is high (α ≈ 1), both algorithms perform comparably and near-optimally (Chen et al., 20 Aug 2025).
4. Empirical Evaluation and Practical Implementation
Experiments on real-world datasets (e.g., 2000 samples from LMSYS-Chat-1M using LLaMA2-70B) empirically validate the algorithms under three interval prediction settings:
- Extreme [1, 1000] intervals for all jobs,
- Binned intervals of width 100,
- Overlapping intervals around the true output length.
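The three settings can be generated roughly as follows (a sketch; the exact parameterization of the overlapping setting is not reproduced in this summary, so a symmetric multiplicative window stands in for it):

```python
def make_intervals(true_lens, setting, bin_width=100, delta=0.5):
    """Produce (lo, hi) prediction intervals for each true output length."""
    out = []
    for length in true_lens:
        if setting == "extreme":
            out.append((1, 1000))  # same loose interval for every job
        elif setting == "binned":
            lo = (length - 1) // bin_width * bin_width + 1
            out.append((lo, lo + bin_width - 1))  # bucket of width 100
        elif setting == "overlapping":
            # Stand-in parameterization: a symmetric window of
            # relative half-width delta around the true length.
            out.append((max(1, int(length * (1 - delta))),
                        int(length * (1 + delta))))
    return out
```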
Results:
- Under wide intervals ([1, 1000]), A_min matches the performance of the hindsight-optimal scheduler, while A_max suffers 2–3× higher latency.
- As intervals narrow, A_max improves but remains inferior to A_min.
- Even at extreme looseness (α = 1000), A_min tracks the offline optimum, demonstrating robustness to poor upper-bound predictions (Chen et al., 20 Aug 2025).
Implementation notes:
- Per-iteration overhead of A_min is modest at typical GPU memory capacities.
- Lower-bound predictions can be generated with lightweight regressors or quantile-based methods.
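As one hedged example of such a predictor (illustrative, not the paper's method; bucketing by prompt features would happen upstream), a conservative lower bound can be taken as a low empirical quantile of historical output lengths:

```python
def quantile_lower_bound(history, q=0.1):
    """Predict a conservative lower bound a_i as the q-th empirical
    quantile of previously observed output lengths."""
    s = sorted(history)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]
```

A low quantile rarely overestimates, which suits A_min: an occasional underestimate only triggers an eviction-and-refine step rather than a hard failure.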
5. Algorithm Selection and Deployment Considerations
- Tight interval predictions (α ≈ 1): A_max is simple and effective.
- Loose or asymmetric intervals (α ≫ 1), or when only lower bounds are reliably predicted: A_min offers adaptivity and optimality guarantees.
- In all scenarios with significant tail risk or predictor uncertainty, the adaptivity of A_min makes it the default robust choice.
6. Broader Implications and Extensions
This adaptively robust approach to LLM inference scheduling is broadly extensible:
- Easily incorporates lightweight learning-based predictors for output length intervals.
- Generalizes to multi-GPU settings, batch-heterogeneous architectures, and distributed inference systems by suitable memory tracking and coordinated eviction schemes.
- The methodology provides a template for robust resource allocation under uncertainty where hard constraints (e.g., memory, energy) and variable service requirements prevail.
The “adaptively optimal” design of A_min, aggressively batching jobs under lower-bound estimates and invoking on-the-fly corrections upon detecting imminent capacity violation, demonstrates that learning-augmented algorithms can achieve practical, robust, and theoretically sound scheduling in the high-variance environment of LLM serving (Chen et al., 20 Aug 2025).