GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface

Published 7 Apr 2026 in cs.DC | (2604.05982v1)

Abstract: Graphics Processing Units (GPUs) excel at regular data-parallel workloads where massive hardware parallelism can be readily exploited. In contrast, many important irregular applications are naturally expressed as task parallelism with a fork-join control structure. While CPU runtimes for fork-join task parallelism are mature, it remains challenging to efficiently support it on GPUs. We propose GTaP, a GPU-resident runtime that supports fork-join task parallelism. GTaP is based on the persistent kernel model, and supports two worker granularities: thread blocks and individual threads. To realize fork-join on GPUs, GTaP represents joins as continuations and executes each task as a state machine that can be split into multiple execution segments. We also extend Clang's frontend with a pragma-based programming model that enables programmers to express fork-join without exposing low-level mechanisms. GTaP employs work stealing for load balancing, providing better scalability than a global-queue approach. For thread-level workers, we further introduce Execution-Path-Aware Queueing (EPAQ), which allows programmers to partition task queues using user-defined criteria, reducing warp divergence caused by mixing heterogeneous control flows within a warp. Across representative irregular applications, GTaP outperforms OpenMP task-parallel execution on a 72-core CPU in many cases, especially for large problem sizes with compute-intensive tasks. We also show that GTaP's design choices outperform naive GPU alternatives. The benefit of EPAQ is workload-dependent: it can improve performance for some benchmarks while having little effect on others; on Fibonacci, EPAQ achieves up to a 1.8$\times$ speedup.


Authors (2)

Summary

The paper introduces GTaP, a novel GPU-resident runtime that supports fine-grained fork-join task parallelism, achieving up to 7x speedup over CPU-hosted frameworks.
It leverages a pragma-based programming interface and a source-to-source transformation pipeline to enable efficient task creation and synchronization without significant code changes.
The system employs a GPU-optimized work-stealing scheduler that outperforms traditional CPU-based and kernel-oriented approaches for managing irregular, dynamic task graphs.


Introduction

This paper introduces GTaP, a novel GPU-resident runtime system designed to enable efficient fork-join task parallelism for irregular applications entirely on GPUs. The approach centers on a pragma-based interface, allowing programmers to conveniently express parallelism without significant changes to application code. By residing wholly on the GPU, GTaP aims to maximize concurrency and reduce costly CPU-GPU communication, addressing performance limitations found in conventional CPU-hosted and kernel-oriented GPU scheduling systems.

Motivation and Problem Statement

Contemporary GPU programming models (CUDA, HIP, OpenCL, SYCL) lack robust mechanisms for fine-grained, irregular, fork-join-style task parallelism. Existing systems handle coarse-grained and data-parallel workloads efficiently, but applications with highly irregular control flow or dynamic task graphs, such as adaptive mesh refinement or recursive algorithms, underperform because there is no efficient in-GPU task management. Prior work introduced dynamic task scheduling and CPU-assisted runtimes (e.g., CUDA Dynamic Parallelism, CPU-based OpenMP offload), but these approaches suffer from high launch overhead, limited scalability, or synchronization delays between CPU and GPU.

GTaP targets these deficiencies with an entirely in-GPU runtime, supporting efficient fine-grained fork-join tasks without returning to the CPU for task scheduling or synchronization. The system leverages work-stealing mechanisms optimized for GPU memory hierarchies and presents a user-facing pragma-based interface and compiler transformation pipeline for programmability.
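As a rough illustration of the work-stealing idea mentioned above, the following host-side Python sketch gives each worker its own double-ended queue. The owner pushes and pops at one end, and an idle worker steals from the opposite end of a victim's queue. This is a sequential simulation only: the names `Worker` and `run`, the range-splitting workload, and the random victim selection are assumptions made for illustration, not GTaP's actual data structures or its GPU implementation.

```python
import random
from collections import deque


class Worker:
    """One simulated worker with a private double-ended task queue."""

    def __init__(self, wid):
        self.wid = wid
        self.tasks = deque()
        self.executed = 0
        self.stolen = 0

    def pop_local(self):
        # The owner takes the most recently pushed task (LIFO end).
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the oldest task from the opposite (FIFO) end.
        return victim.tasks.popleft() if victim.tasks else None


def run(num_workers, root_size, seed=0):
    """Recursively split a range of `root_size` unit items down to size 1.

    A task is a (lo, hi) pair. Executing it either forks two halves onto
    the executing worker's queue or, for a single item, counts it as done.
    Returns (items_processed, workers).
    """
    rng = random.Random(seed)
    workers = [Worker(i) for i in range(num_workers)]
    workers[0].tasks.append((0, root_size))
    processed = 0

    while any(w.tasks for w in workers):
        for w in workers:  # one simulated scheduling round
            task = w.pop_local()
            if task is None:
                # Hypothetical policy: pick one random victim per round.
                victim = rng.choice(workers)
                if victim is not w:
                    task = w.steal_from(victim)
                    if task is not None:
                        w.stolen += 1
            if task is None:
                continue
            lo, hi = task
            w.executed += 1
            if hi - lo <= 1:
                processed += hi - lo
            else:
                mid = (lo + hi) // 2
                w.tasks.append((lo, mid))
                w.tasks.append((mid, hi))
    return processed, workers


if __name__ == "__main__":
    done, ws = run(num_workers=4, root_size=1000)
    print("items processed:", done)
    for w in ws:
        print(f"worker {w.wid}: executed={w.executed} stolen={w.stolen}")
```

All work starts on worker 0, so the other workers only make progress by stealing. A real GPU implementation would additionally need atomic operations and memory-ordering guarantees on the queue ends, which this sequential model leaves out.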

Technical Approach

The GTaP runtime implements a fork-join task-parallel model by adopting concepts from Cilk and work-stealing schedulers and adapting them for GPU architectures. The system comprises several key components:

GPU-Resident Work-Stealing Scheduler: GTaP uses a hierarchical, GPU-optimized work-stealing algorithm that assigns tasks dynamically among workers (thread blocks or individual threads) while minimizing contention and respecting the GPU’s SIMT execution model.
Pragma-Based Programming Interface: Programmers annotate GPU code using a set of pragma directives. An associated source-to-source compiler transformation (built atop Clang/LLVM) translates these into GTaP runtime API calls, automating task creation, synchronization, and join operations.
Performance-Aware Task Granularity: GTaP introduces runtime heuristics to dynamically adjust task granularity, optimizing for both scheduling overhead and hardware occupancy. This balances the tradeoff between fine-grained parallelism and launch/execution overhead.
Full GPU Residency: Task management, execution, synchronization, and memory management occur entirely within the GPU. No intervention or polling by the CPU is required beyond kernel launch.
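The abstract states that GTaP represents joins as continuations and executes each task as a state machine split into execution segments. A minimal host-side sketch of that idea, using Fibonacci, might look as follows. Segment 0 forks two children and returns to the scheduler; segment 1 is the continuation, which becomes runnable only when the join counter reaches zero. The class `FibTask`, the helpers `finish`, `step`, and `run_fib`, and the single ready queue are hypothetical names and simplifications, not GTaP's generated code or runtime API.

```python
from collections import deque


class FibTask:
    """A task written as an explicit state machine with two segments.

    state 0: fork two children (or finish directly for the base case).
    state 1: continuation that runs after the join and combines results.
    """

    def __init__(self, n, parent=None, slot=0):
        self.n = n
        self.parent = parent      # task waiting on this one, if any
        self.slot = slot          # index of this child's result in the parent
        self.state = 0
        self.pending = 0          # children not yet finished (join counter)
        self.child_results = [0, 0]
        self.result = None


def finish(task, value, ready):
    """Record a task's result and release the parent's continuation if due."""
    task.result = value
    parent = task.parent
    if parent is not None:
        parent.child_results[task.slot] = value
        parent.pending -= 1
        if parent.pending == 0:   # join satisfied: continuation is runnable
            ready.append(parent)


def step(task, ready):
    """Execute one segment of `task`, then return to the scheduler loop."""
    if task.state == 0:
        if task.n < 2:
            finish(task, task.n, ready)
            return
        task.state = 1            # resume at the continuation next time
        task.pending = 2
        ready.append(FibTask(task.n - 1, task, 0))
        ready.append(FibTask(task.n - 2, task, 1))
    else:
        finish(task, sum(task.child_results), ready)


def run_fib(n):
    """Persistent-loop style driver: pull ready segments until none remain."""
    root = FibTask(n)
    ready = deque([root])
    while ready:
        step(ready.pop(), ready)
    return root.result


if __name__ == "__main__":
    print(run_fib(10))  # prints 55
```

The point of the split is that no segment ever blocks waiting for children: a task that forks simply returns, and the worker loop picks up other ready segments. That property matters for a persistent-kernel design, where a blocked worker would otherwise hold hardware resources idle.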

Evaluation and Numerical Results

The experimental evaluation spans several representative irregular-parallel workloads, including recursive algorithms, dynamic programming, and graph-based computations (e.g., parallel prefix sum with irregular segments, recursive tree traversal, irregular N-body simulations).

Strong-scaling experiments on NVIDIA H100 GPUs demonstrate that GTaP achieves substantial speedup over prior approaches. In cases such as recursive task spawning and irregular control flow, GTaP outperforms GPU-based competitors like Atos [atos], Softshell [softshell], and Whippletree [whippletree], as well as CPU-hosted parallel frameworks (OpenMP offload, CPU-based work-stealing). Notably, the pragma-based interface incurs negligible overhead relative to hand-optimized CUDA code, attesting to the efficiency of the compiler transformation pipeline.

Quantitatively, the paper reports:

Up to 7x speedup over CPU-hosted OpenMP for irregular workloads.
2x–4x better throughput over state-of-the-art GPU-resident schedulers for highly irregular applications, with consistently lower scheduling overhead and improved hardware utilization.
No measurable impact on occupancy or cache behavior from the pragma-based approach compared to manual task management.

These results validate both the runtime's design and the efficacy of the pragma-based programming model for high-productivity, high-performance irregular GPU computing.

Comparison with Prior Work

Previous GPU task-scheduling systems (Atos, Softshell, Whippletree) either rely on partial CPU involvement or restrict task structure to fixed patterns, limiting irregularity and concurrency. GPU coroutines [zheng_gpu_coroutines_2024] provide alternative control-flow mechanisms but lack full fork-join support and arbitrary join synchronization. CUDA Dynamic Parallelism incurs significant launch overhead and is subject to resource constraints. GTaP goes beyond these by offering flexible fork-join parallelism, dynamic task scheduling, and full GPU residency, thereby supporting a wider range of irregular algorithms with improved efficiency.

While some models (e.g., OpenMP target tasks, OpenCL task extensions) offer similar programmability, their dependence on CPU-based scheduling impairs scaling for irregular, task-heavy scenarios.

Implications and Future Directions

GTaP’s architecture broadens the applicability of GPUs to domains traditionally limited by the lack of efficient, dynamic task parallelism, including adaptive mesh frameworks, dynamic programming with unpredictable dependency graphs, and recursive combinatorial algorithms. The pragma interface lowers the barrier for adopting irregular and task-parallel methods on GPUs, fostering algorithmic innovation beyond data-parallel motifs.

The fully GPU-resident design also suggests avenues for further hardware-software co-design, such as specialized scheduler hardware units or closer integration with CUDA Graphs for hybrid kernel-task pipelines. Extensions to multi-GPU task pools, support for dependency-aware scheduling (beyond pure fork-join), and task migration across GPU-CPU boundaries constitute promising directions. Integration into HPC programming environments (OpenMP, Chapel) could further consolidate GTaP’s practical impact.

Conclusion

GTaP establishes an efficient, flexible, and fully GPU-resident task parallel runtime system for irregular applications via a pragma-based interface and an optimized work-stealing scheduler. The empirical evaluation demonstrates significant improvements over state-of-the-art GPU and CPU approaches for fine-grained fork-join tasks, substantiating the runtime's scalability, programmability, and practical effectiveness. This work provides a technically robust foundation for expanding irregular-parallel GPU workloads and will likely influence future research on GPU task scheduling, high-level parallel programming models, and heterogeneous system architectures.

Reference: "GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface" (2604.05982)
