Iterative SRTF Scheduler for GPU Kernels
- Iterative SRTF Scheduler is a preemptive GPU scheduling policy that leverages online runtime prediction to dynamically allocate thread blocks among competing kernels.
- It employs a staircase execution model and a simple slicing technique to monitor per-kernel progress and accurately estimate remaining execution time.
- Empirical evaluations demonstrate that iterative SRTF significantly improves throughput and fairness over FIFO, closing roughly half the gap to the oracle SJF policy.
An iterative Shortest Remaining Time First (SRTF) scheduler in the context of concurrent GPU kernel execution is a preemptive policy that leverages online runtime prediction to dynamically steer thread block allocation among competing kernels. In contrast to the default non-preemptive FIFO mapping on contemporary GPUs, iterative SRTF seeks to maximize system throughput and fairness by always selecting the kernel with the least estimated work remaining. Achieving this requires tightly integrated structures for monitoring per-kernel progress, predicting runtime at microsecond granularity, and dispatching thread blocks in a preemptive, priority-ordered fashion based on online estimates refined during execution (Pai et al., 2014).
1. Limitations of FIFO and the Justification for SRTF
On modern NVIDIA GPUs such as Fermi and Kepler, the hardware Thread Block Scheduler (TBS) can simultaneously execute multiple kernels by dispatching their thread blocks across several Streaming Multiprocessors (SMs). The default scheduling policy is non-preemptive FIFO: a newly launched kernel must wait for all thread blocks of preceding kernels to be dispatched before execution begins. This approach has been shown to severely degrade both turnaround time and fairness in multi-programmed environments. Specifically, FIFO serializes short kernels behind long ones, causing significant turnaround slowdowns, and makes overall throughput highly dependent on kernel arrival order. Furthermore, it introduces starvation for short jobs when long jobs arrive first. In contrast, the idealized Shortest Job First (SJF) policy, which always executes the kernel with the minimal remaining work, offers optimal performance and fairness but requires perfect a priori knowledge of each kernel’s runtime, which is unavailable in practice (Pai et al., 2014).
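The turnaround penalty of FIFO ordering can be illustrated with a toy calculation (a minimal Python sketch; the runtimes are illustrative, not taken from the paper):

```python
def fifo_turnaround(runtimes):
    """Average normalized turnaround time (ANTT) under non-preemptive
    FIFO: each kernel waits for all earlier launches to finish."""
    t, antt = 0, 0.0
    for r in runtimes:            # runtimes in launch order, all arriving at t=0
        t += r                    # completion time of this kernel
        antt += t / r             # turnaround normalized to isolated runtime
    return antt / len(runtimes)

def sjf_turnaround(runtimes):
    """Same metric under the SJF oracle: shortest kernels run first."""
    return fifo_turnaround(sorted(runtimes))

jobs = [1000, 10, 10, 10]         # one long kernel launched ahead of three short ones
print(fifo_turnaround(jobs))      # → 76.75: short kernels serialized behind the long one
print(sjf_turnaround(jobs))       # far lower: short kernels no longer wait
```

The long kernel barely notices either ordering, but the short kernels' normalized turnaround collapses under SJF, which is exactly the asymmetry SRTF exploits.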
2. The Staircase Model for Runtime Estimation
SRTF in the GPU context adopts a highly structured approach to runtime prediction rooted in the regularity of GPU grid launches. Each kernel consists of $B$ thread blocks distributed evenly across $N$ SMs, where each SM can host up to $R$ blocks concurrently. The per-block execution latency is denoted $t_{block}$, and the launch parameters (i.e., $B$, $N$, $R$) are static and discoverable at kernel launch. The total time for one SM to process its allocated $\lceil B/N \rceil$ blocks is expressed as:

$$T_{SM} = \left\lceil \frac{\lceil B/N \rceil}{R} \right\rceil \cdot t_{block}$$

For a uniformly pipelined execution, the model simplifies kernel runtime to:

$$T \approx \left\lceil \frac{B}{N \cdot R} \right\rceil \cdot t_{block}$$

or, under a coarse assumption ($B \gg N \cdot R$):

$$T \approx \frac{B \cdot t_{block}}{N \cdot R}$$

This "Staircase" model provides a tractable and low-overhead foundation for estimating kernel runtime dynamically, with the only significant unknown (before runtime) being the average per-block latency $t_{block}$ (Pai et al., 2014).
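A minimal Python sketch of the staircase estimate, where B is the grid's total block count, N the SM count, R the per-SM residency limit, and t_block the average per-block latency (parameter names assumed for illustration):

```python
import math

def staircase_runtime(B, N, R, t_block):
    """Staircase estimate: each SM executes its ceil(B/N) blocks in
    'waves' of up to R concurrently resident blocks."""
    blocks_per_sm = math.ceil(B / N)
    waves = math.ceil(blocks_per_sm / R)     # number of "stairs"
    return waves * t_block

def staircase_runtime_coarse(B, N, R, t_block):
    """Coarse approximation, reasonable when B >> N * R."""
    return B * t_block / (N * R)

print(staircase_runtime(1024, 16, 8, 1000))         # → 8000
print(staircase_runtime_coarse(1024, 16, 8, 1000))  # → 8000.0
```

When B divides evenly across SMs and waves, the two forms agree exactly; the ceilings only matter for partial final waves.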
3. Online Structural Runtime Predictor and "Simple Slicing"
Exploiting the homogeneity of thread block workloads, the iterative SRTF scheduler employs an online “Simple Slicing” runtime predictor on each SM. This structure maintains:
- Total_Blocks: $\lceil B/N \rceil$, the total number of blocks assigned to the SM.
- Done_Blocks: Number of blocks completed by the SM.
- Resident_Blocks: Active residency, initially set to $R$.
- $t_{block}$: Observed latency (cycles) of a representative completed block in the current "slice."
- Active_Cycles: Actual execution cycles accrued by the kernel on the SM.

A "slice" is defined as an epoch during which conditions such as residency, competing kernels, and arrivals/departures are stable and $t_{block}$ does not drift. On block completion within a slice, $t_{block}$ is resampled. The predictor estimates the remaining cycles for a given kernel on an SM as:

$$\text{Remaining\_Cycles} = \left\lceil \frac{\text{Total\_Blocks} - \text{Done\_Blocks}}{\text{Resident\_Blocks}} \right\rceil \cdot t_{block}$$
To generate a global remaining-time estimate for a kernel, the dispatch system adopts the maximum among all SMs. This estimation process begins immediately after the completion of the first block of any arriving kernel and is continuously refined as execution progresses. Empirically, prediction errors typically range within a factor of 0.5–2, but since SRTF only requires relative ordering for remaining times, this granularity suffices for effective scheduling (Pai et al., 2014).
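The per-SM predictor state and the max-over-SMs aggregation can be sketched as follows (a simplified Python model; the resampling policy shown and the infinite estimate before the first sample are assumptions):

```python
import math

class SMPredictor:
    """'Simple Slicing' predictor state for one kernel on one SM."""

    def __init__(self, total_blocks, resident_blocks):
        self.total_blocks = total_blocks        # Ceil(B/N) blocks assigned here
        self.done_blocks = 0                    # blocks completed so far
        self.resident_blocks = resident_blocks  # active residency (initially R)
        self.t_block = None                     # sampled per-block latency (cycles)

    def on_block_complete(self, observed_cycles):
        """Resample t_block from a block completed in the current slice."""
        self.done_blocks += 1
        self.t_block = observed_cycles

    def new_slice(self, resident_blocks):
        """Residency changes or co-runner arrivals/departures end a slice."""
        self.resident_blocks = resident_blocks
        self.t_block = None                     # force a fresh sample

    def remaining_cycles(self):
        if self.t_block is None:                # nothing sampled yet
            return float("inf")
        waves = math.ceil((self.total_blocks - self.done_blocks)
                          / self.resident_blocks)
        return waves * self.t_block

def kernel_remaining(per_sm_predictors):
    """Global remaining-time estimate: the maximum across all SMs."""
    return max(p.remaining_cycles() for p in per_sm_predictors)

p = SMPredictor(total_blocks=64, resident_blocks=8)
p.on_block_complete(1000)
print(kernel_remaining([p]))   # → 8000
```

Taking the maximum across SMs is conservative: the kernel is not finished until its slowest SM drains.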
4. Preemptive Thread-Block Scheduling Procedure
The SRTF scheduling system maintains a priority queue of runnable kernels, keyed by their current estimated remaining time. The scheduling algorithm proceeds as follows:
- Recompute the remaining-cycle estimates for all kernels across SMs, collecting the maximum prediction for each kernel.
- Select the kernel with the smallest predicted remaining cycles.
- Preemption Logic: If the selected kernel differs from the SM's current kernel, halt the issuance of further blocks for the incumbent and start issuing blocks from the selected kernel, subject to resource constraints. Ongoing blocks from the previous kernel run to completion due to hardware non-preemptability.
- Sampling for New Kernels: When a new kernel arrives, an initial runtime estimate is obtained by sampling its first block(s) on one SM as soon as resources are available. This enables prompt and fair comparison with existing kernels’ remaining times and allows rapid switchover if the new kernel has less remaining work.
This mechanism is iterative, continuously updating estimates and dynamically steering block issuance order in response to estimated remaining times (Pai et al., 2014).
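The iterative loop can be condensed into a toy single-SM simulation (a Python sketch; real preemption happens at block granularity rather than cycle granularity, and the remaining-work values stand in for live predictor output):

```python
def run_srtf(arrivals):
    """arrivals: list of (arrival_cycle, name, total_cycles).
    Returns each kernel's finish cycle under SRTF dispatch."""
    t, remaining, finish = 0, {}, {}
    pending = sorted(arrivals)
    while pending or remaining:
        while pending and pending[0][0] <= t:    # admit new arrivals
            _, name, cycles = pending.pop(0)
            remaining[name] = cycles
        if not remaining:                        # idle until the next arrival
            t = pending[0][0]
            continue
        k = min(remaining, key=remaining.get)    # SRTF: least work left wins
        remaining[k] -= 1                        # issue one cycle of work
        t += 1
        if remaining[k] == 0:
            del remaining[k]
            finish[k] = t
    return finish

print(run_srtf([(0, "long", 100), (5, "short", 10)]))
# → {'short': 15, 'long': 110}
```

Note how the short kernel arriving at cycle 5 immediately preempts the long incumbent (90+ cycles remaining) and finishes at cycle 15 instead of waiting until cycle 110.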
5. Evaluation Metrics and Empirical Outcomes
The effectiveness of iterative SRTF scheduling is quantitatively evaluated through three principal metrics:
| Metric | Definition | Desired Effect |
|---|---|---|
| System Throughput (STP) | Sum over kernels of normalized progress, $T^{alone}_i / T^{shared}_i$ | Higher is better |
| Average Normalized Turnaround Time (ANTT) | Mean over kernels of $T^{shared}_i / T^{alone}_i$ | Lower is better |
| Fairness (StrictF) | Ratio of the minimum to the maximum per-kernel normalized progress | Closer to 1 is better |
SRTF improves STP by 1.18× and ANTT by 2.25× over FIFO. When benchmarked against MPMax, a contemporary resource-allocation policy for concurrent kernels, SRTF further improves both STP and ANTT. The adaptive variant (SRTF/Adaptive), which regulates per-SM residency to improve fairness, achieves a 2.95× fairness improvement over FIFO at the cost of only ~5% STP reduction relative to pure SRTF (Pai et al., 2014). Overall, SRTF narrows the performance gap to the theoretical SJF oracle by about half, reducing turnaround times and balancing co-running kernels' slowdowns.
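These metrics can be computed from isolated and co-scheduled runtimes; the sketch below uses the common multiprogram definitions based on normalized progress, which are assumed here rather than quoted from the paper's evaluation harness:

```python
def multiprogram_metrics(isolated, shared):
    """isolated[k], shared[k]: runtime of kernel k alone vs co-scheduled."""
    progress = {k: isolated[k] / shared[k] for k in isolated}   # normalized progress
    stp = sum(progress.values())                                # higher is better
    antt = sum(shared[k] / isolated[k] for k in isolated) / len(isolated)
    fairness = min(progress.values()) / max(progress.values())  # 1 = perfectly fair
    return stp, antt, fairness

# Two identical kernels, each slowed 2x by sharing: perfectly fair,
# but no throughput gain over running them back to back.
print(multiprogram_metrics({"a": 10, "b": 10}, {"a": 20, "b": 20}))
# → (1.0, 2.0, 1.0)
```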
6. Implementation Considerations and SRTF/Adaptive Enhancements
Preemption Overheads
Due to hardware constraints, thread-block preemption is coarse-grained. Upon a scheduler switch, the system must wait for all resident blocks of the previous kernel to complete on an SM before populating it with blocks from the new kernel. The handoff delay is therefore determined by the per-block latency and the resident block count. Simulation with zero sampling overhead indicates that an ideal, fully preemptive system would gain at most about 3% additional throughput.
Dynamic Resource Sharing and Fairness
Greedy SRTF may result in complete starvation of long kernels, prompting the need for adaptive scheduling. SRTF/Adaptive supplements the base policy with the following mechanism:
- Periodically monitor each kernel's instantaneous slowdown relative to running alone.
- When the minimum slowdown-normalized progress falls below a threshold $\theta$ (e.g., $\theta = 0.5$), switch to "shared mode," capping each kernel's residency per SM (e.g., to an equal share of $R$).
- Continue SRTF within each residency cap, updating remaining-time estimates as usual.
This regime ensures no kernel is starved while retaining most throughput benefits of pure SRTF.
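The mode switch can be sketched as follows (Python; the threshold of 0.5 and the equal-share residency cap are illustrative assumptions, not the paper's exact policy):

```python
def residency_caps(progress, max_residency, theta=0.5):
    """progress: kernel -> instantaneous normalized progress
    (1.0 means no slowdown versus running alone)."""
    if min(progress.values()) < theta:
        # Shared mode: split SM residency evenly so no kernel starves.
        cap = max(1, max_residency // len(progress))
        return {k: cap for k in progress}
    # Pure SRTF mode: the selected kernel may occupy the whole SM.
    return {k: max_residency for k in progress}

print(residency_caps({"short": 1.0, "long": 0.2}, max_residency=8))
# → {'short': 4, 'long': 4}
```

Once the starved kernel's progress recovers above the threshold, the caps revert and pure SRTF resumes.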
Co-Runner Variability, Slices, and Estimation Robustness
Changes in per-block latency ($t_{block}$) caused by new arrivals or resource reallocation require the system to segment execution into "slices," with prediction intervals reset at each major event. SRTF tracks and compensates for SM-level execution and cache variance by maintaining independent predictors per SM and selecting the global maximum for each kernel's remaining time (Pai et al., 2014).
7. Cohesive System Design and Impact
The iterative SRTF scheduler integrates:
- A low-overhead, structure-aware predictor utilizing the Staircase execution model
- Per-SM monitoring with event-driven (“slice”) updates to maintain prediction accuracy
- A global, priority-based preemptive block dispatch system informed by continuous online estimation
- Optional fairness-adaptive capping to balance throughput-fairness trade-offs
Empirical assessment demonstrates that these elements collectively close approximately 49% of the FIFO–SJF optimality gap, more than double FIFO's fairness, and significantly shorten mean turnaround for short kernels, all with modest additions to the TBS and predictor logic (Pai et al., 2014).
For further detail and implementation specifics, see "Preemptive Thread Block Scheduling with Online Structural Runtime Prediction for Concurrent GPGPU Kernels" (Pai et al., 2014).