
ISRTF Scheduling for LLM Inference

Updated 7 January 2026
  • ISRTF Scheduling is a priority-based inference strategy that uses neural response-length predictors to estimate remaining tokens and dynamically reorder tasks.
  • It employs fixed batch windows (K tokens) to balance efficient GPU utilization with SRTF principles, reducing head-of-line delays in LLM systems.
  • The approach demonstrates significant reductions in job completion and queuing times, with empirical benchmarks showing robust improvements over FCFS scheduling.

Iterative Shortest Remaining Time First (ISRTF) scheduling is a priority-based inference scheduling strategy designed for LLM serving systems. ISRTF addresses the head-of-line blocking problem inherent in first-come-first-served (FCFS) and naive batching schedulers by leveraging response length prediction and periodic priority updates to match the dynamics of auto-regressive decoding in LLMs. ISRTF is exemplified in the ELIS ("Efficient LLM Iterative Scheduling System with Response Length Predictor") system, demonstrating substantial improvements in average job completion time in both simulated and production deployments (Choi et al., 14 May 2025).

1. Motivation and Background

Traditional LLM serving systems often rely on FCFS scheduling, primarily for its simplicity and easy batching for GPU inference. However, FCFS causes short inference tasks to wait for earlier, longer tasks, leading to head-of-line blocking and inefficiency, particularly in high-variance user-driven workloads. Classic scheduling theory suggests shortest remaining job first (SRJF) as optimal for minimizing average job completion time, but direct adoption is hindered by the need to accurately predict LLM inference durations, which depend on unknown output lengths and the non-preemptible nature of batched decoding.
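The completion-time advantage of shortest-first ordering is easy to verify on a toy single-server model (illustrative only; the job durations and the sequential-service assumption here are hypothetical, not from the paper):

```python
def mean_completion_time(durations):
    """Mean completion time when jobs run back-to-back on one server."""
    t, total = 0.0, 0.0
    for d in durations:
        t += d            # this job finishes at cumulative elapsed time t
        total += t
    return total / len(durations)

# Hypothetical decode durations (seconds) in arrival order: one long job first
arrival_order = [30.0, 2.0, 3.0, 1.0]
fcfs = mean_completion_time(arrival_order)         # long job blocks the rest
sjf = mean_completion_time(sorted(arrival_order))  # shortest-first ordering
```

Here FCFS yields a mean completion time of 33.25 s, while shortest-first yields 11.5 s: the single 30-second job no longer delays the three short ones.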

ISRTF circumvents these challenges by:

  • Integrating a neural response length predictor that estimates, for each active request, the number of tokens left to decode.
  • Reassessing and adjusting scheduling priorities every $K$ tokens (the "batch window") rather than every token, harmonizing SRTF principles with hardware-efficient batching.

This design mitigates head-of-line blocking while maintaining the industrial requirement for high-throughput GPU utilization.

2. Mathematical Foundations of Remaining Time Estimation

At the core of ISRTF scheduling is the response length predictor. For a given prompt $x$ (possibly concatenated with partial output tokens), the predictor $\hat{L}(x;\theta)$ is realized as a deep neural network: a frozen BGE encoder followed by eight fully connected layers ($h = 1024$ neurons per layer with ReLU), trained to regress the total output length. The objective function is mean squared error:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{L}(x_i;\theta) - L_i \right)^2$$

For an in-process job $j$ starting iteration $k$, the input is $x_j^{(k)} = [\text{prompt}_j,\ \text{generated tokens}_{j,1:(k-1)K}]$, and the predicted number of remaining tokens is

$$\hat{R}_j^{(k)} = \hat{L}(x_j^{(k)}) - (k-1)K$$

The expected remaining latency is then

$$\hat{T}_j^{(k)} = \alpha_j + \beta_j \hat{R}_j^{(k)}$$

with $\alpha_j$ the measured time-to-first-token and $\beta_j$ the time-per-output-token for job $j$.
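In code, the per-window estimate reduces to two arithmetic steps once the predictor's total-length estimate is available (a minimal sketch; the predictor output and the timing constants below are hypothetical example values):

```python
def predicted_remaining_tokens(L_hat: float, k: int, K: int) -> float:
    """R_hat_j^(k) = L_hat(x_j^(k)) - (k-1)*K, clamped at zero."""
    return max(0.0, L_hat - (k - 1) * K)

def predicted_remaining_latency(alpha: float, beta: float, r_hat: float) -> float:
    """T_hat_j^(k) = alpha_j + beta_j * R_hat_j^(k)."""
    return alpha + beta * r_hat

# Example: predictor estimates 180 total tokens; job enters its 3rd window of K=50
r = predicted_remaining_tokens(L_hat=180.0, k=3, K=50)          # 180 - 100 = 80 tokens left
t = predicted_remaining_latency(alpha=0.4, beta=0.05, r_hat=r)  # 0.4 + 0.05 * 80 = 4.4 s
```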

3. ISRTF Scheduling Algorithm and Batch Windowing

ISRTF operates in discrete batch windows:

  1. For each job in the job pool, remaining time is re-estimated at the start of each $K$-token window using the response length predictor.
  2. Jobs are reprioritized according to predicted $\hat{T}_j$; up to $M$ jobs (the GPU batch size) with the lowest remaining times are selected.
  3. The selected batch is decoded for up to $K$ tokens; upon completion, jobs with completed outputs are removed from the pool.
  4. Jobs that are not yet complete remain in the pool and participate in the next round.

Pseudocode sketch:

while JobPool not empty:
    # Re-estimate remaining work for every active job
    for j in JobPool:
        L_pred = hat_L(j.prompt + j.generated)         # predicted total output length
        R_j = max(0, L_pred - (j.iteration - 1) * K)   # predicted remaining tokens
        T_j = alpha_j + beta_j * R_j                   # predicted remaining latency
        j.priority = T_j
    # Pick the M jobs with the smallest predicted remaining time
    Batch = select_top_M(JobPool, key=lambda j: j.priority)
    outputs = decode(Batch, max_tokens=K)              # one K-token decode window
    update_jobs(JobPool, outputs)                      # remove completed jobs

Prioritization and preemption occur only at batch boundaries, not every token, balancing scheduling optimality and system efficiency.
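The loop above can be made concrete in a toy, runnable form. A stub predictor that simply returns the true output length stands in for the neural model, and per-job timing constants are omitted for brevity (job names, lengths, and the K/M values here are hypothetical):

```python
from dataclasses import dataclass

K, M = 50, 2  # batch window (tokens) and GPU batch size

@dataclass
class Job:
    name: str
    true_length: int       # ground-truth output length, used by the stub predictor
    generated: int = 0
    finish_round: int = -1

def stub_predictor(job: Job) -> int:
    """Stands in for the neural response-length predictor."""
    return job.true_length

def isrtf_schedule(jobs):
    """Run batch-window SRTF until every job completes; record finish rounds."""
    pool, round_no = list(jobs), 0
    while pool:
        round_no += 1
        # Priority = predicted remaining tokens (alpha/beta latency terms omitted)
        pool.sort(key=lambda j: stub_predictor(j) - j.generated)
        for j in pool[:M]:                       # decode the M shortest-remaining jobs
            j.generated = min(j.true_length, j.generated + K)
            if j.generated >= j.true_length:
                j.finish_round = round_no
        pool = [j for j in pool if j.finish_round < 0]
    return jobs

jobs = isrtf_schedule([Job("long", 200), Job("short", 40), Job("mid", 90)])
```

With these lengths, the short and mid jobs finish in rounds 1 and 2 respectively, while the long job (which would have blocked them under FCFS) completes in round 5.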

4. Predictor Architecture, Training, and Performance

ISRTF's predictive component is trained on ≈143,000 prompt–answer pairs collected via vLLM across 13 open LLMs (LLaMA variants, Vicuna, OPT, GPT-NeoX). The dataset is deduplicated and split 60%/20%/20% for train/validation/test. The final model, with a frozen BGE-base-en-v1.5 encoder and eight-layer MLP head, achieves MAE = 19.923 tokens, RMSE = 34.327 tokens, and $R^2 = 0.852$, outperforming fine-tuned baselines. Training uses Adam with learning rate $1 \times 10^{-4}$ and batch size 16, converging after 16 epochs on NVIDIA A100 hardware.
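The reported MAE, RMSE, and $R^2$ are standard regression measures; on a toy set of predictions they would be computed as follows (the true/predicted lengths below are made-up illustrative values, not the paper's data):

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R^2 for a length predictor's outputs."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical true vs. predicted output lengths (tokens)
mae, rmse, r2 = regression_metrics([100, 40, 250, 80], [110, 35, 240, 90])
```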

Performance evaluation during batch window scheduling demonstrates that prediction updates introduce minimal overhead (11 ms per window vs. 8610 ms total LLM latency for LLaMA2-13B), making real-time application practical.

5. Comparative Performance and Quantitative Results

ISRTF is quantitatively benchmarked against FCFS and an oracle SJF scheduler (knowing true output lengths) using realistic Poisson-like request streams derived from FabriX service traces ($\mathit{Gamma}(\alpha = 0.73,\ \beta = 10.41)$). Metrics include average job completion time (JCT), queuing delay, and throughput.

| Model | Load × Avg | FCFS JCT (s) | ISRTF JCT (s) | SJF JCT (s) |
|---|---|---|---|---|
| OPT-13B | 1.0× | 77.83 | 73.57 | 20.35 |
| OPT-13B | 3.0× | 116.46 | 98.74 | 43.63 |
| OPT-13B | 5.0× | 118.13 | 118.11 | 43.63 |
| LLaMA2-13B | 1.0× | 240.25 | 212.60 | 70.55 |
| LLaMA2-13B | 3.0× | 350.55 | 352.53 | 133.11 |
| LLaMA2-13B | 5.0× | 451.59 | 377.29 | 125.59 |
| Vicuna-13B | 1.0× | 93.42 | 73.43 | 32.34 |
| Vicuna-13B | 3.0× | 134.96 | 118.22 | 58.39 |
| Vicuna-13B | 5.0× | 144.23 | 131.38 | 60.98 |

ISRTF consistently achieves 7.36% average and up to 21.4% maximum reduction in JCT relative to FCFS. Major gains are attributed to decreased queuing delays (up to 16.8%). Scheduling overhead remains negligible (0.13% of total latency).
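The per-row relative reductions follow directly from the benchmark table; for example, computing only from the table values above:

```python
# (FCFS JCT, ISRTF JCT) pairs taken from the benchmark table, in seconds
rows = {
    "OPT-13B @1.0x": (77.83, 73.57),
    "LLaMA2-13B @5.0x": (451.59, 377.29),
    "Vicuna-13B @1.0x": (93.42, 73.43),
}

def reduction_pct(fcfs: float, isrtf: float) -> float:
    """Relative JCT reduction of ISRTF over FCFS, in percent."""
    return 100.0 * (fcfs - isrtf) / fcfs

for name, (fcfs, isrtf) in rows.items():
    print(f"{name}: {reduction_pct(fcfs, isrtf):.1f}% lower JCT than FCFS")
# The Vicuna-13B @1.0x row reproduces the reported 21.4% maximum reduction.
```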

ELIS system-level benchmarks on H100 hardware show linear scalability with worker count: peak throughput (delay < 0.5 s) increases from 2.31 RPS (10 workers) to 18.77 RPS (50 workers).

6. System Architecture and Integration

The ISRTF algorithm is deployed in the ELIS system as a Kubernetes-native service. The scheduler is implemented in a central frontend pod managing both the ISRTF logic and response-length prediction, while backend stateful worker pods run modified vLLM engines supporting:

  • Iteration-wise execution (fixed $K$-token decode windows)
  • SRTF-based preemption logic (overriding FCFS defaults)
  • Robust pod-to-pod communication and memory management for efficient dispatch.

Preemption events are rare due to typical batch pool sizes and GPU KV-cache limits. Statefulness in backend pods ensures network address stability, critical for tight scheduler–worker coordination. Asynchronous scheduler logic supports high throughput and efficient multi-GPU scaling.

7. Practical Implications, Limitations, and Lessons

ISRTF offers an effective mitigation for head-of-line blocking in LLM inference serving at scale, requiring only moderate modifications to system architectures that already employ batched decoding. Scheduler overhead is minimal, and gains are robust across open models and production traffic patterns. The predictor-based approach is general: improvements to response-length estimation translate directly to scheduling gains.

A plausible implication is that prolonged windows (large $K$) may trade off some scheduling optimality for batching efficiency, while small $K$ values approach classic SRTF at the cost of hardware inefficiency. Empirically, $K = 50$ represents a robust compromise in the evaluated systems.

ISRTF is most impactful under variable, bursty workloads with significant request length heterogeneity. In production, proper tuning of scheduling interval, predictor deployment, and system integration is crucial for achieving the documented performance benefits (Choi et al., 14 May 2025).
