User-space Scheduling Framework (USF)
- User-space Scheduling Framework (USF) is a system that defines and enforces process/thread scheduling policies entirely in user space, giving applications fine-grained control over thread placement and execution order.
- USF intercepts standard pthread APIs and adopts a cooperative scheduling model that minimizes involuntary preemptions, reducing cache pollution and synchronization delays.
- USF demonstrates significant performance gains in oversubscribed settings, showing up to 6.86× speedup in nested runtimes and enhanced throughput for multi-process AI tasks.
A User-space Scheduling Framework (USF) enables the definition, customization, and enforcement of process and thread scheduling policies entirely in user space, without modifying the operating system kernel or requiring special privileges. USFs target environments where system-level scheduler abstractions are inadequate due to fine-grained parallelism, synchronization intensity, or highly variable workload characteristics. Such frameworks intercept or replace standard scheduling decisions, often integrating deeply with the runtime libraries of parallel programming frameworks, and are designed to avoid pathologies associated with kernel-space preemption—such as lock-holder and lock-waiter preemption—especially under oversubscribed conditions common in contemporary high-performance computing (HPC), AI, and cloud microservice environments (Roca et al., 28 Jan 2026).
1. Motivation and Problem Scope
Oversubscription, defined as the presence of more runnable threads than available CPU cores, exposes significant limitations in conventional kernel-level scheduling, such as Linux’s CFS or EEVDF. Preemptive, time-sliced kernel strategies introduce involuntary context switches that pollute caches, trigger TLB shootdowns, and disrupt fine-grained synchronization. These effects lead to phenomena such as Lock-Holder Preemption (LHP) and Lock-Waiter Preemption (LWP), causing scalability collapse in nested or composite runtime scenarios (Roca et al., 28 Jan 2026). Existing alternatives, including kernel modifications via eBPF, ghOSt, or F4, impose prohibitive privilege and maintenance requirements. Pure user-level threading libraries can break compatibility with software depending on pthreads and thread-local storage (TLS). USF provides a transparent alternative: user-level scheduling with pthread/TLS support, multi-runtime and multi-process capability, and no need for kernel patches.
2. Framework Architecture and Integration
USF, as realized in (Roca et al., 28 Jan 2026), interposes on user-level threading via a modified GNU C Library (“glibcv”) and a runtime component (“nOS-V”) implementing centralized, per-core, and per-process scheduling queues. Core control flow comprises:
- Interception of standard pthread APIs (creation, exit, join, affinity manipulation).
- Registration of new threads with the user-level scheduler (nosv_attach) and lifecycle management (nosv_detach).
- Transparent interception of blocking operations (mutex, cond_wait, poll, sleep), which interface with scheduler logic (nosv_pause, nosv_submit).
- Management of run queues per core and synchronization object wait queues.
Thread affinity hints are recorded but not directly enforced at the OS level; instead, USF’s runtime manages the real mapping for scheduling efficiency and resource locality.
The framework provides a set of APIs (e.g., nosv_attach, nosv_pause, nosv_submit, nosv_waitfor) for task registration, pausing, resumption, and blocking with a timeout.
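As a concrete illustration, interception of pthread_create can be sketched with standard symbol interposition (dlsym with RTLD_NEXT). The sketch below is a deliberate simplification: attach_count stands in for the bookkeeping the real nosv_attach/nosv_detach calls would perform, the shim names are invented, and error handling is elided.

```c
/* Hypothetical sketch of pthread_create interception in the style of glibcv.
 * attach_count is a stand-in for nosv_attach()/nosv_detach() bookkeeping. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdlib.h>

static int attach_count = 0;

typedef struct {
    void *(*user_fn)(void *);
    void *user_arg;
} shim_arg_t;

/* Trampoline: register the new thread with the user-level scheduler
 * before running the user's start routine, deregister on exit. */
static void *shim_start(void *raw) {
    shim_arg_t a = *(shim_arg_t *)raw;
    free(raw);
    __atomic_add_fetch(&attach_count, 1, __ATOMIC_SEQ_CST); /* ~nosv_attach */
    void *ret = a.user_fn(a.user_arg);
    /* ~nosv_detach would run here */
    return ret;
}

/* Interposed pthread_create: resolves the real glibc symbol via RTLD_NEXT
 * and substitutes the trampoline as the start routine. */
int pthread_create(pthread_t *t, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg) {
    static int (*real_create)(pthread_t *, const pthread_attr_t *,
                              void *(*)(void *), void *);
    if (!real_create)
        real_create = (int (*)(pthread_t *, const pthread_attr_t *,
                               void *(*)(void *), void *))
                      dlsym(RTLD_NEXT, "pthread_create");
    shim_arg_t *a = malloc(sizeof *a);
    a->user_fn = fn;
    a->user_arg = arg;
    return real_create(t, attr, shim_start, a);
}

static void *worker(void *arg) { (void)arg; return NULL; }

/* Demo: one thread created through the interposed API is registered once. */
static int run_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;
    pthread_join(t, NULL);
    return attach_count;
}
```

Compiled into the application (or preloaded as a shared object), every thread created through the standard API is transparently registered with the user-level scheduler before its start routine runs, without source changes.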
3. Scheduling Policies and Algorithms
USF decouples scheduling policy from mechanism, allowing implementation of arbitrary policies in user space. The default policy, “SCHED_COOP” (“cooperative, run-until-block”), schedules threads non-preemptively until they block, yield (mapped to nosv_pause), or exit. This approach is constructed upon the following core concepts:
- Run Queues and Affinity: Each core maintains a FIFO of ready tasks. Upon becoming idle (e.g., task blocks), a core selects its next task, favoring prior affinity and NUMA locality.
- Cooperative Advancement: Threads are never interrupted involuntarily; context switches occur exclusively at explicit scheduling points (blocking, yield, termination).
- Synchronization Awareness: Blocking and unblocking primitives update wait queues and core run queues, mitigating LHP/LWP pathologies.
- Per-process Quanta: To share cores fairly among concurrent processes, a round-robin quantum (typically 20 ms) is employed.
Policy extensibility is intrinsic: users may define custom task selection logic (select_task), compile their policy into nOS-V, and register via initialization hooks.
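A custom task-selection policy of this kind might look as follows; the run-queue layout and the hook name select_task_affinity are illustrative assumptions, not the real nOS-V interface. The policy scans a per-core FIFO and prefers a task that last ran on the requesting core, falling back to plain FIFO order.

```c
/* Hypothetical sketch of a pluggable select_task policy; struct layout
 * and names are assumptions for illustration, not the nOS-V API. */
#include <stddef.h>

typedef struct task {
    int id;
    int last_core;        /* core the task last ran on (affinity hint) */
    struct task *next;
} task_t;

typedef struct {
    task_t *head, *tail;  /* per-core FIFO of ready tasks */
} run_queue_t;

static void enqueue(run_queue_t *q, task_t *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

/* Policy hook: pick the next task for `core`, favoring tasks whose
 * last_core matches (cache/NUMA locality), else the FIFO head. */
static task_t *select_task_affinity(run_queue_t *q, int core) {
    task_t *prev = NULL, *best_prev = NULL, *best = NULL;
    for (task_t *t = q->head; t; prev = t, t = t->next) {
        if (t->last_core == core) { best = t; best_prev = prev; break; }
        if (!best) { best = t; best_prev = prev; }  /* FIFO fallback */
    }
    if (!best) return NULL;
    /* Unlink the chosen task from the queue. */
    if (best_prev) best_prev->next = best->next;
    else q->head = best->next;
    if (q->tail == best) q->tail = best_prev;
    return best;
}
```

In the real framework, such a callback would be compiled into nOS-V and registered via its initialization hooks; the sketch only shows the selection logic itself.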
4. Implementation Details and Kernel Interaction
USF’s implementation centers on a binary-compatible replacement for glibc (“glibcv”) with minimal kernel interaction. Blocking/unblocking of threads is handled via system call wrappers and, where necessary, hybrid spinning and timed waits (e.g., hybrid spin→nosv_waitfor(5 ms) for epoll timeouts). Thread and run queue data structures use O(1) enqueue and dequeue operations. Thread caching minimizes dynamic thread creation overhead by reusing terminated threads.
No kernel modifications or privileges are required; all scheduling logic, affinity management, and context switching are achieved through user-level hooks and maintained data structures.
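The hybrid spin-then-block pattern described above can be sketched as follows; waitfor_stub stands in for nosv_waitfor, and SPIN_BUDGET is an assumed tuning constant rather than a documented value.

```c
/* Hypothetical sketch of the hybrid spin-then-timed-wait used for blocking
 * paths such as epoll timeouts. nosv_waitfor is stubbed with nanosleep. */
#include <stdatomic.h>
#include <time.h>

#define SPIN_BUDGET 10000   /* iterations before yielding to the scheduler */

/* Stand-in for nosv_waitfor(ns): block the thread for a bounded interval. */
static void waitfor_stub(long ns) {
    struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
    nanosleep(&ts, NULL);
}

/* Returns 1 if the condition was observed during the spin phase (no context
 * switch needed), 0 if the thread had to fall back to timed blocking. */
static int hybrid_wait(atomic_int *flag) {
    for (int i = 0; i < SPIN_BUDGET; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return 1;                        /* fast path */
    }
    while (!atomic_load_explicit(flag, memory_order_acquire))
        waitfor_stub(5 * 1000000L);          /* 5 ms timed block */
    return 0;
}
```

The design trade-off is the usual one: brief spinning avoids a scheduler round-trip when the condition is satisfied almost immediately, while the timed block bounds CPU waste when it is not.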
5. Performance Evaluation and Benchmarking
Extensive evaluation on platforms such as Marenostrum 5 (dual Sapphire Rapids 8480+, 112 threads) demonstrates that USF with SCHED_COOP yields significant performance gains in oversubscribed environments:
- In nested-runtime matrix multiplication (MatMul) benchmarks, SCHED_COOP improved throughput by up to 1.28×, and manual nOS-V integration by up to 1.36×, over baseline kernel scheduling under high oversubscription.
- For Cholesky decomposition with varying oversubscription levels, SCHED_COOP achieved up to 6.86× speedup versus baseline for BLIS or OpenBLAS workloads nested with OpenMP or TBB threads.
- Multi-process AI inference microservices (e.g., LLaMA-1B, GPT-2, RoBERTa) experienced up to 2.4× throughput gains at high load versus baseline kernel, outperforming both static partitioning and non-partitioned scheduling.
- Co-execution of molecular dynamics ensembles (LAMMPS+DeePMD-kit) yielded ∼4% gain over the best native kernel scheduling in memory bandwidth-limited scenarios, demonstrating efficient overlap and minimized blocking noise.
Profiling confirms that worker threads are pinned 1:1 to cores, simplifying trace analysis and reducing scheduler-induced noise (Roca et al., 28 Jan 2026).
6. Guidelines for Extension and Tuning
USF’s design enables straightforward customization and tuning:
- Custom scheduling policies are defined via select_task callbacks in nOS-V, and are exposed to the framework without kernel recompilation.
- Best practices include setting passive wait policies in nested runtimes, using nosv_pause or occasional sched_yield in tight spin-loops, and balancing granularities to avoid excessive or insufficient task generation per runtime level.
- For debugging, glibcv provides USF_DEBUG logging, state dumping (nosv_dump_state), and compatibility with low-level profilers.
- Affinity, quantum, and priority tunables may be adjusted via environment variables or pthread_setschedparam extensions to glibcv.
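Reading such a tunable at startup might look like the following sketch; the variable name USF_QUANTUM_MS and the 20 ms default are assumptions for illustration, not documented glibcv settings.

```c
/* Hypothetical sketch: reading the round-robin quantum from the environment,
 * falling back to a default when unset or malformed. */
#include <stdlib.h>

#define DEFAULT_QUANTUM_MS 20   /* assumed per-process round-robin quantum */

static long quantum_ms_from_env(void) {
    const char *s = getenv("USF_QUANTUM_MS");
    if (!s) return DEFAULT_QUANTUM_MS;
    char *end;
    long v = strtol(s, &end, 10);
    /* Reject trailing garbage and non-positive values. */
    return (*end == '\0' && v > 0) ? v : DEFAULT_QUANTUM_MS;
}
```

Validating and defaulting in one place keeps a misconfigured environment from silently degenerating into a zero-length quantum.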
7. Relationship to Other User-space Scheduling Approaches
USF is part of a broader trend toward moving scheduling abstractions from the kernel to user space. Other systems include LibPreemptible (Lisa et al., 2023), which achieves hardware-assisted preemptive scheduling with microsecond-scale granularity on Intel Sapphire Rapids via UINTR; and SFS (Fu et al., 2022), which orchestrates existing Linux FIFO and CFS mechanisms in user space to approximate Shortest Remaining Time First for function-as-a-service deployments. While LibPreemptible focuses on deadline-oriented policies and hardware-delivered user interrupts, and SFS targets short-task prioritization under serverless computing, USF emphasizes compatibility with standard pthreads/TLS, multi-runtime and multi-process coordination, and seamless deployment without kernel modification. Each approach addresses specific trade-offs in overhead, granularity, policy expressiveness, and target workload classes.
A plausible implication is that as workloads grow more sophisticated, and as hardware platforms expose finer control mechanisms, user-space scheduling frameworks like USF may become the reference standard for flexible resource management in high-density computational environments (Roca et al., 28 Jan 2026, Lisa et al., 2023, Fu et al., 2022).