
Iterative SRTF Scheduler for GPU Kernels

Updated 15 January 2026
  • Iterative SRTF Scheduler is a preemptive GPU scheduling policy that leverages online runtime prediction to dynamically allocate thread blocks among competing kernels.
  • It employs a staircase model and simple slicing technique to monitor per-kernel progress and accurately estimate remaining execution time.
  • Empirical evaluations demonstrate that iterative SRTF significantly improves throughput and fairness over FIFO, closely approximating the optimal SJF performance.

An iterative Shortest Remaining Time First (SRTF) scheduler in the context of concurrent GPU kernel execution is a preemptive policy that leverages online runtime prediction to dynamically steer thread block allocation among competing kernels. In contrast to the default non-preemptive FIFO mapping on contemporary GPUs, iterative SRTF seeks to maximize system throughput and fairness by always selecting the kernel with the least estimated work remaining. Achieving this requires tightly integrated structures for monitoring per-kernel progress, predicting runtime at microsecond granularity, and dispatching thread blocks in a preemptive, priority-ordered fashion based on online estimates refined during execution (Pai et al., 2014).

1. Limitations of FIFO and the Justification for SRTF

On modern NVIDIA GPUs such as Fermi and Kepler, the hardware Thread Block Scheduler (TBS) can simultaneously execute multiple kernels by dispatching their thread blocks across several Streaming Multiprocessors (SMs). The default scheduling policy is non-preemptive FIFO: a newly launched kernel must wait for all thread blocks of preceding kernels to be dispatched before execution begins. This approach has been shown to severely degrade both turnaround time and fairness in multi-programmed environments. Specifically, FIFO serializes short kernels behind long ones, causing significant turnaround slowdowns, and makes overall throughput highly dependent on kernel arrival order. Furthermore, it introduces starvation for short jobs when long jobs arrive first. In contrast, the idealized Shortest Job First (SJF) policy, which always executes the kernel with the minimal remaining work, offers optimal performance and fairness but requires perfect a priori knowledge of each kernel’s runtime, which is unavailable in practice (Pai et al., 2014).

2. The Staircase Model for Runtime Estimation

SRTF in the GPU context adopts a highly structured approach to runtime prediction rooted in the regularity of GPU grid launches. Each kernel consists of G thread blocks distributed evenly across S SMs, where each SM can host up to R blocks concurrently. The per-block execution latency is denoted b, and the launch parameters (G, S, R) are static and discoverable at kernel launch. The total time for one SM to process its allocated blocks is expressed as:

T_\text{SM} = \lceil (G/S)/R \rceil \cdot b

For a uniformly pipelined execution, the model simplifies kernel runtime to:

T \approx \lceil G/(SR) \rceil \, b

or, under the coarse assumption R = 1:

T \approx (G/S)\, b

This "Staircase" model provides a tractable, low-overhead foundation for estimating kernel runtime dynamically, with the only significant unknown before runtime being the average per-block latency b (Pai et al., 2014).
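The staircase estimate is simple enough to express directly. The following is a minimal Python sketch of the formulas above; the function names and the example launch parameters are illustrative, not taken from the paper's implementation.

```python
import math

def staircase_runtime(G, S, R, b):
    """Staircase-model runtime estimate from grid size G, SM count S,
    per-SM residency R, and average per-block latency b (in cycles)."""
    blocks_per_sm = math.ceil(G / S)       # blocks assigned to each SM
    waves = math.ceil(blocks_per_sm / R)   # "staircase" steps of R blocks each
    return waves * b

def staircase_runtime_coarse(G, S, b):
    """Coarse variant assuming R = 1: T ~ (G/S) * b."""
    return (G / S) * b

# Example: 1024 blocks, 16 SMs, residency 8, block latency 10_000 cycles
# -> ceil(1024/16) = 64 blocks per SM, ceil(64/8) = 8 waves -> 80_000 cycles
print(staircase_runtime(1024, 16, 8, 10_000))
```

Because the model depends only on launch parameters plus one measured quantity (b), it can be evaluated in constant time whenever a new latency sample arrives.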

3. Online Structural Runtime Predictor and "Simple Slicing"

Exploiting the homogeneity of thread block workloads, the iterative SRTF scheduler employs an online “Simple Slicing” runtime predictor on each SM. This structure maintains:

  • Total_Blocks: ⌈G/S⌉, the total blocks assigned to the SM.
  • Done_Blocks: Number of blocks completed by the SM.
  • Resident_Blocks: Active residency, initially set to R.
  • t: Observed latency (cycles) of a representative completed block in the current "slice."
  • Active_Cycles: Actual execution cycles accrued by the kernel on the SM.

A "slice" is defined as an epoch during which conditions such as residency, competing kernels, and arrivals/departures are stable and t does not drift. On block completion within a slice, t is resampled. The predictor estimates the cycles for a given kernel on an SM as:

\text{Pred\_Cycles} = \text{Active\_Cycles} + \frac{(\text{Total\_Blocks} - \text{Done\_Blocks}) \cdot t}{\text{Resident\_Blocks}}

To generate a global remaining-time estimate \hat{R} for a kernel, the dispatch system adopts the maximum Pred_Cycles among all SMs. This estimation begins immediately after the completion of a kernel's first block and is continuously refined as execution progresses. Empirically, prediction errors typically fall within a factor of 0.5–2, but since SRTF only requires the relative ordering of remaining times, this granularity suffices for effective scheduling (Pai et al., 2014).
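The per-SM predictor state and the Pred_Cycles formula can be sketched as a small Python class. This is a minimal illustration assuming the five state variables described above; the field and function names mirror the text, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class SMPredictor:
    """Per-SM 'Simple Slicing' predictor state for one kernel."""
    total_blocks: int      # ceil(G / S): blocks assigned to this SM
    done_blocks: int       # blocks this SM has completed
    resident_blocks: int   # current residency (initially R)
    t: float               # sampled per-block latency in the current slice
    active_cycles: float   # cycles the kernel has actually run on this SM

    def pred_cycles(self) -> float:
        """Pred_Cycles = Active_Cycles
           + (Total_Blocks - Done_Blocks) * t / Resident_Blocks"""
        remaining = self.total_blocks - self.done_blocks
        return self.active_cycles + remaining * self.t / self.resident_blocks

def remaining_estimate(predictors):
    """Global estimate for a kernel: the maximum Pred_Cycles over all SMs."""
    return max(p.pred_cycles() for p in predictors)
```

For example, an SM with 64 assigned blocks, 16 done, residency 8, t = 10,000 cycles, and 160,000 accrued cycles yields Pred_Cycles = 160,000 + 48 × 10,000 / 8 = 220,000.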

4. Preemptive Thread-Block Scheduling Procedure

The SRTF scheduling system maintains a priority queue of runnable kernels, keyed by their current Pred_Cycles (predicted remaining time). The scheduling algorithm proceeds as follows:

  1. Recompute Pred_Cycles for all kernels across SMs, collecting the maximum prediction for each kernel.
  2. Select the kernel K* with the smallest predicted remaining cycles.
  3. Preemption Logic: If K* differs from the SM's current kernel, halt the issuance of further blocks for the incumbent and start issuing blocks from K*, subject to resource constraints. Ongoing blocks from the previous kernel continue to completion due to hardware non-preemptability.
  4. Sampling for New Kernels: When a new kernel arrives, an initial runtime estimate is obtained by sampling its first block(s) on one SM as soon as resources are available. This enables prompt and fair comparison with existing kernels’ remaining times and allows rapid switchover if the new kernel has less remaining work.

This mechanism is iterative, continuously updating estimates and dynamically steering block issuance order in response to estimated remaining times (Pai et al., 2014).
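The dispatch decision above can be sketched in a few lines of Python. This is illustrative pseudologic under the assumption that per-kernel estimates have already been collected (in the paper the policy is realized in the hardware thread block scheduler); the function names are hypothetical.

```python
def pick_kernel(kernels, estimates):
    """Select the runnable kernel with the smallest predicted remaining
    cycles. `estimates` maps kernel id -> max Pred_Cycles over all SMs."""
    return min(kernels, key=lambda k: estimates[k])

def schedule_step(sm_current, kernels, estimates):
    """One SRTF iteration for one SM: possibly redirect block issuance.
    Returns the kernel whose blocks the SM should issue next. Ongoing
    blocks of the incumbent still run to completion (hardware
    non-preemptability); only future block issuance is redirected."""
    k_star = pick_kernel(kernels, estimates)
    if k_star != sm_current:
        sm_current = k_star  # preempt: switch issuance to k_star
    return sm_current

# Example: kernel "B" has less predicted remaining work, so it preempts "A".
print(schedule_step("A", ["A", "B"], {"A": 500_000, "B": 120_000}))  # -> B
```

In a real scheduler this step would run on every estimate update or kernel arrival/completion event, which is what makes the policy iterative.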

5. Evaluation Metrics and Empirical Outcomes

The effectiveness of iterative SRTF scheduling is quantitatively evaluated through three principal metrics:

  • System Throughput (STP): \sum_i (C_i^\text{alone} / C_i^\text{shared}); higher is better.
  • Average Normalized Turnaround Time (ANTT): (1/P) \sum_i (T_i^\text{shared} / T_i^\text{alone}); lower is better.
  • Fairness (StrictF): F = (\min_i \text{slowdown}_i) / (\max_i \text{slowdown}_i); closer to 1 is better.

SRTF improves STP by 1.18× and ANTT by 2.25× over FIFO. Benchmarked against MPMax, a contemporary resource allocation policy for concurrent kernels, SRTF further improves both STP and ANTT. The adaptive variant (SRTF/Adaptive), which regulates per-SM residency to improve fairness, achieves a 2.95× fairness improvement (roughly threefold over FIFO) at the cost of only ~5% STP relative to pure SRTF (Pai et al., 2014). Overall, SRTF closes about half the performance gap to the theoretical SJF oracle, reducing turnaround times and balancing co-running kernel slowdowns.
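The three metrics follow directly from per-kernel alone and shared runtimes. Below is a minimal Python sketch of their definitions; the timing values in the example are made up for illustration.

```python
def stp(alone, shared):
    """System throughput: sum of per-kernel alone/shared runtime ratios."""
    return sum(a / s for a, s in zip(alone, shared))

def antt(alone, shared):
    """Average normalized turnaround time (lower is better)."""
    P = len(alone)
    return sum(s / a for a, s in zip(alone, shared)) / P

def strict_fairness(alone, shared):
    """min slowdown / max slowdown; 1.0 means perfectly balanced."""
    slowdowns = [s / a for a, s in zip(alone, shared)]
    return min(slowdowns) / max(slowdowns)

alone  = [100.0, 400.0]   # runtimes when each kernel runs alone
shared = [150.0, 500.0]   # runtimes under co-execution
print(stp(alone, shared))              # 100/150 + 400/500 ~ 1.467
print(antt(alone, shared))             # (1.5 + 1.25) / 2 = 1.375
print(strict_fairness(alone, shared))  # 1.25 / 1.5 ~ 0.833
```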

6. Implementation Considerations and SRTF/Adaptive Enhancements

Preemption Overheads

Due to hardware constraints, thread block preemption is coarse-grained. Upon a scheduler switch, the system must wait for all resident blocks of the previous kernel to complete on an SM before populating it with blocks from the new kernel. The “handoff delay” is therefore determined by the per-block latency and resident block count. Zero-sampling simulation results indicate up to 3% additional throughput could be gained in an ideal, fully preemptive system.

Dynamic Resource Sharing and Fairness

Greedy SRTF may result in complete starvation of long kernels, prompting the need for adaptive scheduling. SRTF/Adaptive supplements the base policy with the following mechanism:

  • Periodically monitor instant per-kernel slowdown.
  • When (max_i slowdown_i − min_i slowdown_i) > τ, with τ a threshold (e.g., 0.5), switch to "shared mode," capping each kernel's residency per SM (e.g., R_i = ⌊R_total / #kernels⌋).
  • Continue SRTF within each residency cap, updating remaining-time estimates as usual.

This regime ensures no kernel is starved while retaining most throughput benefits of pure SRTF.
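The mode switch can be sketched as a small policy function. This is a minimal illustration under the assumptions stated above; the threshold value and function name are illustrative.

```python
TAU = 0.5  # slowdown-gap threshold that triggers shared mode (example value)

def residency_caps(slowdowns, r_total, n_kernels):
    """SRTF/Adaptive residency policy: if the gap between the worst and
    best per-kernel slowdown exceeds TAU, cap each kernel's per-SM
    residency at floor(R_total / #kernels); otherwise run greedy SRTF,
    where any kernel may fill the whole SM."""
    if max(slowdowns) - min(slowdowns) > TAU:
        cap = r_total // n_kernels   # R_i = floor(R_total / #kernels)
        return [cap] * n_kernels     # shared mode: equal residency caps
    return [r_total] * n_kernels     # greedy mode: no per-kernel cap

# Example: gap 1.9 - 1.1 = 0.8 > 0.5, so shared mode caps residency at 8//2 = 4.
print(residency_caps([1.1, 1.9], r_total=8, n_kernels=2))  # -> [4, 4]
```

SRTF ordering still applies within the caps, so short kernels keep their priority while long kernels are guaranteed forward progress.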

Co-Runner Variability, Slices, and Estimation Robustness

Changes in per-block latency (t) caused by new arrivals or resource reallocation require the system to segment execution into "slices," with prediction intervals reset at each major event. SRTF tracks and compensates for SM-level execution and cache variance by maintaining independent predictors per SM and selecting the global maximum of each kernel's remaining time (Pai et al., 2014).

7. Cohesive System Design and Impact

The iterative SRTF scheduler integrates:

  • A low-overhead, structure-aware predictor utilizing the Staircase execution model
  • Per-SM monitoring with event-driven (“slice”) updates to maintain prediction accuracy
  • A global, priority-based preemptive block dispatch system informed by continuous online estimation
  • Optional fairness-adaptive capping to balance throughput-fairness trade-offs

Empirical assessment demonstrates that these elements collectively close approximately 49% of the FIFO–SJF optimality gap, more than double FIFO fairness, and significantly shorten mean turnaround for short kernels—all with modest additional hardware or software requirements to the TBS and predictor logic (Pai et al., 2014).


For further detail and implementation specifics, see "Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels" (Pai et al., 2014).
