- The paper introduces USF, a user-space framework that deploys custom scheduling policies without kernel modifications, reducing interference in oversubscribed environments.
- It demonstrates speedups of up to 28%, and up to a 14.7x improvement in complex multi-runtime and multi-process scenarios.
- SCHED_COOP, the default policy, mitigates lock preemption issues and enhances throughput in HPC and AI workloads by enabling cooperative, non-preemptive scheduling.
User-Space Coordination for Multi-Runtime and Multi-Process Oversubscription
Motivation and Context
The convergence of HPC and AI workloads has introduced new challenges in resource management, particularly for thread scheduling in oversubscribed environments where multiple parallel runtimes and application processes contend for finite compute resources. Traditional OS schedulers, including Linux's EEVDF and the earlier CFS design, enforce fairness through periodic preemption but are fundamentally application-agnostic, which limits their ability to minimize interference, especially under complex, nested, and heterogeneous runtime compositions. Lock-Holder Preemption (LHP), Lock-Waiter Preemption (LWP), and scalability collapse are persistent issues, and developers are often forced into exclusive node usage or explicit resource partitioning to avoid pathological scheduling interference, which reduces overall system throughput and flexibility. Unified-runtime approaches and kernel-level scheduling customizations exist, but their adoption is hindered by API and privilege requirements, lack of transparency, and software heterogeneity across the HPC ecosystem.
USF and SCHED_COOP: Design and Implementation
The paper introduces the User-space Scheduling Framework (USF), a highly transparent, fully user-space approach that enables the deployment of custom scheduling policies for POSIX threads (pthreads) applications without requiring privileged operations or kernel modifications. USF is realized by extending the GNU C library (glibc), resulting in a "glibcv" API interposition layer where all pthread-related operations—creation, blocking, affinity hints, fork, and synchronization—are redirected to a nOS-V runtime core. This core, leveraging shared memory for multi-process coordination, maintains a centralized scheduler that coordinates logical entities (tasks) across both threads and processes, providing seamless multi-runtime and multi-process scheduling.
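USF interposes on the pthread API by extending glibc itself; the following standalone sketch only illustrates the general interposition mechanism (intercept a pthread call, do scheduler bookkeeping, forward to the real implementation) using `dlsym(RTLD_NEXT, ...)`. The names `intercepted_creations` and `interposition_demo` are illustrative, not part of USF.

```c
// Illustrative sketch of pthread API interposition (not the actual USF code).
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

static int intercepted_creations = 0;  // bookkeeping a scheduler core might do

// A strong definition in the executable shadows the library's pthread_create;
// dlsym(RTLD_NEXT, ...) retrieves the real symbol so the call is forwarded.
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start)(void *), void *arg) {
    static create_fn real_create = NULL;
    if (!real_create)
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");
    intercepted_creations++;  // e.g., register the new thread as a task
    return real_create(thread, attr, start, arg);
}

static void *worker(void *arg) { return arg; }

// Creates and joins one thread through the interposed entry point and
// returns how many creations the shim observed.
int interposition_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return intercepted_creations;
}
```

The same pattern is commonly packaged as an `LD_PRELOAD` shared object; USF instead bakes the redirection into its modified glibc, which is what makes it transparent to unmodified binaries.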
SCHED_COOP is presented as the default scheduling policy for USF. It mimics real-time cooperative semantics and guarantees that threads only yield the core when they block, terminate, or call yield explicitly. Notably, unlike kernel SCHED_RR, SCHED_COOP is fully available to unprivileged users and ensures LWP/LHP cannot occur between participating threads. There is no involuntary preemption among SCHED_COOP threads. TLS compatibility is maintained through careful task-worker affinity management.
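The run-until-block-or-yield semantics can be visualized with a toy single-core simulation (not USF code): a FIFO ready queue where the running task keeps the core until it voluntarily yields, with no timer tick that could evict it mid-critical-section.

```c
// Toy single-core simulation of cooperative (SCHED_COOP-like) semantics:
// a FIFO ready queue; the running task keeps the core until it yields or ends.
#include <assert.h>
#include <string.h>

#define MAX_TASKS 8

typedef struct {
    char name;
    int steps_left;       // work remaining
    int steps_per_slice;  // how long the task runs before voluntarily yielding
} Task;

static Task queue[MAX_TASKS];
static int head = 0, tail = 0;

static void enqueue(Task t) { queue[tail++ % MAX_TASKS] = t; }
static int  empty(void)     { return head == tail; }

// Runs all tasks to completion; appends one character per executed step to
// `trace` so the non-preemptive execution order can be inspected.
void run_cooperative(char *trace) {
    int pos = 0;
    while (!empty()) {
        Task t = queue[head++ % MAX_TASKS];
        // The task runs uninterrupted: no involuntary preemption can occur.
        int burst = t.steps_per_slice < t.steps_left ? t.steps_per_slice
                                                     : t.steps_left;
        for (int i = 0; i < burst; i++)
            trace[pos++] = t.name;
        t.steps_left -= burst;
        if (t.steps_left > 0)
            enqueue(t);  // voluntary yield: go to the back of the FIFO queue
    }
    trace[pos] = '\0';
}
```

A lock holder in this model can never be descheduled while holding the lock, which is exactly why LHP/LWP cannot arise among participating threads.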
Modifications to Synchronization and Blocking APIs
To ensure maximum transparency and coverage, the framework instruments all standard glibc synchronization primitives. For instance, it augments pthread_mutex_t with FIFO wait queues and uses nOS-V to suspend and resume waiters at precise scheduling points, triggering worker swaps as necessary. Blocking operations, including polling and sleep primitives, are similarly intercepted. Furthermore, a caching strategy for pthreads is implemented, akin to thread pooling, to amortize thread (de)allocation overhead in thread-intensive scenarios.
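A common user-space way to obtain the FIFO ordering described above is a ticket discipline. The sketch below is an illustration of that idea using a plain pthread mutex and condition variable, not USF's actual augmented `pthread_mutex_t` (where the wait would instead suspend the task and hand its worker back to the nOS-V scheduler).

```c
// Sketch of a FIFO (ticket-ordered) lock: waiters take a ticket and are
// admitted strictly in arrival order. Illustrative only.
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    unsigned long next_ticket;  // next ticket to hand out
    unsigned long now_serving;  // ticket currently allowed to proceed
} fifo_lock_t;

void fifo_lock_init(fifo_lock_t *l) {
    pthread_mutex_init(&l->m, NULL);
    pthread_cond_init(&l->cv, NULL);
    l->next_ticket = l->now_serving = 0;
}

void fifo_lock(fifo_lock_t *l) {
    pthread_mutex_lock(&l->m);
    unsigned long my_ticket = l->next_ticket++;
    // In USF this wait would be a scheduling point that swaps in another
    // task; here we simply block on a condition variable.
    while (l->now_serving != my_ticket)
        pthread_cond_wait(&l->cv, &l->m);
    pthread_mutex_unlock(&l->m);
}

void fifo_unlock(fifo_lock_t *l) {
    pthread_mutex_lock(&l->m);
    l->now_serving++;
    pthread_cond_broadcast(&l->cv);  // wake waiters; only the next ticket runs
    pthread_mutex_unlock(&l->m);
}
```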
A key limitation is the inability to transparently intercept busy-wait custom barriers, since these do not utilize standard blocking primitives. Although not unique to USF, this limits the applicability of cooperative user-space scheduling in certain legacy and highly optimized code; the proposed mitigation is to insert explicit yields on busy-wait paths, which already helps in oversubscribed cases even under the stock Linux scheduler.
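The recommended adaptation for such spin paths looks like the following sketch: a consumer spins on a flag but yields the core on every iteration, so a producer sharing the oversubscribed core can make progress. The names `producer` and `spin_wait_with_yield` are illustrative.

```c
// Busy-wait with an explicit yield, the adaptation advocated for custom spin
// barriers. Without the yield, a cooperative policy could spin indefinitely
// when the producer shares the oversubscribed core with the spinner.
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static atomic_int ready  = 0;
static atomic_int result = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store(&result, 42);  // "work" the consumer is waiting for
    atomic_store(&ready, 1);    // release the spinner
    return NULL;
}

// Spins until `ready` is set, yielding the core on every iteration so a
// co-located producer thread can run; returns the produced value.
int spin_wait_with_yield(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (!atomic_load(&ready))
        sched_yield();          // cooperative scheduling point
    pthread_join(t, NULL);
    return atomic_load(&result);
}
```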
Evaluation: Multi-Runtime and Multi-Process Scenarios
The evaluation comprehensively benchmarks USF and SCHED_COOP across nested-runtime scenarios (e.g., OpenMP inside BLAS called from parallel OmpSs-2/Nanos6 tasks) and multi-process oversubscription cases. All experiments are performed on a high-end dual-socket Intel Sapphire Rapids system without kernel modifications.
Strong numerical results are presented: speedups of up to 28%, system throughput improvements of up to 2.4x, and up to a 14.7x improvement in complex multi-runtime, multi-process scenarios where the baseline scheduler behaves pathologically.
Implications and Discussion
The results demonstrate the significant untapped efficiency in oversubscribed workloads when thread scheduling is exposed to user-driven, application-aware policies at the user-space level. USF, by operating just above the kernel, maintains full compatibility with existing tooling and the full range of POSIX applications, with no requirement for kernel modifications or special privileges. SCHED_COOP, as a cooperative scheduling policy, mitigates classical performance pathologies such as LHP/LWP, context-switch overheads, and destructive preemption, enabling legitimate and productive use of oversubscription in modern multi-runtime, multi-process HPC/AI workloads.
USF's compatibility with Thread-Local Storage and its extensibility towards custom blocking primitives addresses pain points found in prior user-space thread management systems (e.g., ULTs and lightweight threading libraries), which typically cannot transparently support complex, TLS-dependent or high-level parallel programming models.
Practical implications include:
- Drastically reduced need for ad hoc resource partitioning or exclusive node locking in shared clusters.
- Enhanced composability and integration of heterogeneous parallel programming models, as each can retain its optimal runtime structure without risking pathological oversubscription performance collapse.
- Acceleration of AI inference microservices and scientific ensemble workflows (e.g., multiple co-evolving MD simulations) on resource-rich but highly contended infrastructure.
Theoretically, the work reopens the design space for system-level scheduling in user space, demonstrating that kernel-level scheduling is not the only locus for high-efficiency, robust scheduling under extreme concurrency. USF's approach is orthogonal—rather than competitive—to OS-level scheduler extensibility frameworks, and offers a deployment model that aligns with the constraints of production systems.
Limitations and Future Directions
While USF and SCHED_COOP substantially advance practical scheduling for oversubscribed concurrent workloads, several technical limitations remain. First, busy-wait barrier detection relies on manual code adaptation; full transparency may require binary or lower-level runtime instrumentation. Second, blocking I/O (e.g., MPI or filesystem) is not currently handled within the USF cooperative semantics, potentially stalling cores. Finally, the shared-memory approach to multi-process coordination carries security considerations; these are managed in the current design but could be further addressed by kernel-level integration.
Planned extensions include supporting futex-based synchronization, adding io_uring-powered I/O integration (as in recent asynchronous task-aware I/O systems), and exploring a native in-kernel implementation of SCHED_COOP for further reductions in scheduling latency.
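To fold futex-based synchronization into the cooperative model, the runtime would have to handle raw kernel-level waits and wakes of the kind shown in this minimal Linux-specific sketch (the helper `futex` wrapper and `futex_demo` are illustrative, not USF API):

```c
// Minimal raw futex wait/wake pair (Linux-only), the low-level blocking path
// a futex extension would need to turn into a user-space scheduling point.
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int futex_word = 0;

static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waker(void *arg) {
    (void)arg;
    atomic_store(&futex_word, 1);               // publish the new value
    futex(&futex_word, FUTEX_WAKE_PRIVATE, 1);  // wake one sleeping waiter
    return NULL;
}

// Blocks in the kernel on the futex word until the waker flips it; returns
// the final value of the word.
int futex_demo(void) {
    pthread_t t;
    pthread_create(&t, NULL, waker, NULL);
    // FUTEX_WAIT only sleeps if the word still holds the expected value 0,
    // so the waker's store cannot be missed (no lost-wakeup race).
    while (atomic_load(&futex_word) == 0)
        futex(&futex_word, FUTEX_WAIT_PRIVATE, 0);  // returns on wake/EAGAIN
    pthread_join(t, NULL);
    return atomic_load(&futex_word);
}
```

Intercepting this path would let USF park the waiting task in its own scheduler instead of blocking the worker thread in the kernel.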
Conclusion
USF and its SCHED_COOP policy provide an extensible, transparent framework for user-space thread and process scheduling in heavily oversubscribed HPC and AI environments. The system unlocks the composition of multiple parallel runtimes and workloads within the same node, achieving up to 2.4x system throughput improvements—and significantly higher in pathological baseline cases—without kernel changes or privileged operations. This paradigm challenges the conventional wisdom of strictly partitioned scheduling domains and offers a concrete, adoptable path toward more efficient, composable, and application-aware parallel execution at scale.
Reference: "Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads" (2601.20435).