Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads

Published 28 Jan 2026 in cs.DC and cs.OS | (2601.20435v1)

Abstract: The convergence of high-performance computing (HPC) and AI is driving increasingly complex parallel applications and workloads. These workloads often combine multiple parallel runtimes within the same application or across co-located jobs, creating scheduling demands that place significant stress on traditional OS schedulers. Under oversubscription (more ready threads than cores), OS schedulers rely on periodic preemption to multiplex cores, often introducing interference that degrades performance. In this paper, we present (1) the User-space Scheduling Framework (USF), a novel, seamless process scheduling framework implemented entirely in user space. USF enables users to implement their own process scheduling algorithms without requiring special permissions. We evaluate USF with (2) its default cooperative policy, SCHED_COOP, designed to reduce interference by switching threads only upon blocking. This approach mitigates well-known issues such as Lock-Holder Preemption (LHP), Lock-Waiter Preemption (LWP), and scalability collapse. We implement USF and SCHED_COOP by extending the GNU C library with the nOS-V runtime, enabling seamless coordination across multiple runtimes (e.g., OpenMP) without invasive application changes. Evaluations show gains of up to 2.4x in oversubscribed multi-process scenarios, including nested BLAS workloads, multi-process PyTorch inference with LLaMA-3, and Molecular Dynamics (MD) simulations.

Summary

  • The paper introduces USF, a user-space framework that deploys custom scheduling policies without kernel modifications, reducing interference in oversubscribed environments.
  • It demonstrates up to 28% speedup in nested multi-runtime BLAS workloads and up to 14.7x improvement in severely oversubscribed multi-runtime compositions.
  • SCHED_COOP, the default policy, mitigates lock preemption issues and enhances throughput in HPC and AI workloads by enabling cooperative, non-preemptive scheduling.

User-Space Coordination for Multi-Runtime and Multi-Process Oversubscription

Motivation and Context

The convergence of HPC and AI workloads has introduced new challenges in resource management, particularly with thread scheduling in oversubscribed environments where multiple parallel runtimes and application processes contend for finite compute resources. Traditional OS schedulers, including the Linux EEVDF and CFS-based designs, enforce fairness by periodic preemption but are fundamentally application-agnostic. This limits their capacity to minimize interference, especially under complex, nested, and heterogeneous runtime compositions. Lock-Holder Preemption (LHP), Lock-Waiter Preemption (LWP), and scalability collapse are persistent issues, and developers are often forced into exclusive node usage or explicit resource partitioning to avoid pathological scheduling interference, which reduces overall system throughput and flexibility. Unified runtime approaches and kernel-level scheduling customizations exist, but their adoption is hindered by API and privilege requirements, lack of transparency, and software heterogeneity across the HPC ecosystem.

USF and SCHED_COOP: Design and Implementation

The paper introduces the User-space Scheduling Framework (USF), a highly transparent, fully user-space approach that enables the deployment of custom scheduling policies for POSIX-threads applications without requiring privileged operations or kernel modifications. USF is realized by extending the GNU C library (glibc), resulting in a "glibcv" API interposition layer where all pthread-related operations—creation, blocking, affinity hints, fork, and synchronization—are redirected to a nOS-V runtime core. This core, leveraging shared memory for multi-process coordination, maintains a centralized scheduler that coordinates logical entities (tasks) across both threads and processes, providing seamless multi-runtime and multi-process scheduling.

SCHED_COOP is presented as the default scheduling policy for USF. It mimics real-time cooperative semantics, guaranteeing that threads yield the core only when they block, terminate, or call yield explicitly. Notably, unlike the kernel's SCHED_RR, SCHED_COOP is fully available to unprivileged users, and it ensures LHP and LWP cannot occur between participating threads: there is no involuntary preemption among SCHED_COOP threads. TLS compatibility is maintained through careful task-worker affinity management.

Modifications to Synchronization and Blocking APIs

To ensure maximum transparency and coverage, the framework instruments all standard glibc synchronization primitives. For instance, it augments pthread_mutex_t with FIFO wait queues and uses nOS-V to suspend and resume waiters at precise scheduling points, triggering worker swaps as necessary. Blocking operations, including polling and sleep primitives, are similarly intercepted. Furthermore, a caching strategy for pthreads is implemented, akin to thread pooling, to amortize thread (de)allocation overhead in thread-intensive scenarios.

A key limitation is the inability to transparently intercept custom busy-wait barriers, since these do not use standard blocking primitives. Although not unique to USF, this limits the applicability of cooperative user-space scheduling in certain legacy and highly optimized codes; the authors advocate inserting yields on busy-wait paths, which is already beneficial in oversubscribed cases even under the Linux scheduler.

Evaluation: Multi-Runtime and Multi-Process Scenarios

The evaluation comprehensively benchmarks USF and SCHED_COOP across nested-runtime cases (e.g., OpenMP in BLAS called from parallel OmpSs-2/Nanos6 tasks) and multi-process oversubscription cases. All experiments are performed on a high-end dual-socket Intel Sapphire Rapids system without kernel modifications.

Strong numerical results are presented:

  • In a nested Matrix Multiplication case spanning Nanos6 and BLIS-OpenMP, SCHED_COOP enables up to 28% speedup over an optimized Linux-scheduler baseline, and up to 11.8% over the best single-runtime baseline, in the region of meaningful oversubscription. The improvements are attributed to user-space cooperative scheduling avoiding wasteful preemption and destructive interference at critical synchronization points, and to the aggressive elimination of idle busy-wait time.
  • A manual integration with direct nOS-V API usage yields marginal additional gains, establishing a practical upper bound for seamless USF deployment.
  • For complex runtime compositions (OpenMP, LLVM libomp, pthreadpool), SCHED_COOP achieves up to 14.7x speedup in severely oversubscribed Cholesky configurations where naive pthread implementations caused catastrophic resource thrashing due to thread creation/destruction and suboptimal scheduler interaction.
  • In realistic multi-process AI inference (multi-service Python workloads using PyTorch/BLAS), SCHED_COOP outperforms even carefully tuned partitioning heuristics, sustaining both high throughput and low latency as request rates rise and the system becomes highly oversubscribed, with up to 2.4x performance gains over the best baseline.
  • In dual-ensemble LAMMPS/DeePMD-kit molecular dynamics simulations, SCHED_COOP achieves the highest system throughput and memory bandwidth, outperforming both colocated and co-executed configurations that use partitioned resources or rely on the default scheduler.

    Figure 1: Overview of key performance gains across oversubscribed multi-runtime and multi-process evaluation scenarios under SCHED_COOP.

Implications and Discussion

The results demonstrate the significant untapped efficiency in oversubscribed workloads when thread scheduling is exposed to user-driven, application-aware policies at the user-space level. USF, by operating just above the kernel, maintains full compatibility with existing tooling and the full range of POSIX applications, with no requirement for kernel modifications or special privileges. SCHED_COOP, as a cooperative scheduling policy, mitigates classical performance pathologies such as LHP/LWP, context-switch overheads, and destructive preemption, enabling legitimate and productive use of oversubscription in modern multi-runtime, multi-process HPC/AI workloads.

USF's compatibility with Thread-Local Storage and its extensibility towards custom blocking primitives addresses pain points found in prior user-space thread management systems (e.g., ULTs and lightweight threading libraries), which typically cannot transparently support complex, TLS-dependent or high-level parallel programming models.

Practical implications include:

  • Drastically reduced need for ad hoc resource partitioning or exclusive node locking in shared clusters.
  • Enhanced composability and integration of heterogeneous parallel programming models, as each can retain its optimal runtime structure without risking pathological oversubscription performance collapse.
  • Acceleration of AI inference microservices and scientific ensemble workflows (e.g., multiple co-evolving MD simulations) on resource-rich but highly contended infrastructure.

Theoretically, the work reopens the design space for system-level scheduling in user space, demonstrating that kernel-level scheduling is not the only locus for high-efficiency, robust scheduling under extreme concurrency. USF's approach is orthogonal—rather than competitive—to OS-level scheduler extensibility frameworks, and offers a deployment model that aligns with the constraints of production systems.

Limitations and Future Directions

While USF and SCHED_COOP substantially advance practical scheduling for oversubscribed concurrent workloads, several technical limitations remain. First, busy-wait barrier detection relies on manual code adaptation; full transparency may require binary or lower-level runtime instrumentation. Second, blocking I/O (e.g., MPI or filesystem) is not currently handled within the USF cooperative semantics, potentially stalling cores. Finally, the shared-memory approach for multi-process coordination introduces security considerations; these are currently managed in user space but could be further addressed by kernel-level integration.

Planned extensions include supporting futex-based synchronization, adding io_uring-powered I/O integration (as in recent asynchronous task-aware I/O systems), and exploring a native in-kernel implementation of SCHED_COOP for further reductions in scheduling latency.

Conclusion

USF and its SCHED_COOP policy provide an extensible, transparent framework for user-space thread and process scheduling in heavily oversubscribed HPC and AI environments. The system unlocks the composition of multiple parallel runtimes and workloads within the same node, achieving up to 2.4x system throughput improvements (and substantially more in pathological baseline cases) without kernel changes or privileged operations. This paradigm challenges the conventional wisdom of strictly partitioned scheduling domains and offers a concrete, adoptable path toward more efficient, composable, and application-aware parallel execution at scale.

Reference: "Rethinking Thread Scheduling under Oversubscription: A User-Space Framework for Coordinating Multi-runtime and Multi-process Workloads" (2601.20435).
