Locality-Centric Dynamic Scheduling Scheme
- Locality-centric dynamic scheduling schemes are strategies that use explicit models of data, memory, and communication locality to minimize remote transfers and improve overall system performance.
- They dynamically adapt task, process, or resource mappings based on real-time locality metrics such as cache affinity and NUMA tagging, enabling precise and efficient scheduling.
- These approaches are applied in OS process scheduling, cloud computing, and hardware accelerators to reduce overhead, balance load, and maintain fairness.
A locality-centric dynamic scheduling scheme is a dynamic scheduling strategy that incorporates explicit models of data, memory, or communication locality into scheduling decisions to maximize system performance, reduce conflict-induced overheads, and, in many cases, control power or fairness. These schemes dynamically adapt task, process, or resource mappings at runtime based on observed or predicted access patterns, leveraging on-chip or in-memory reuse and minimizing costly remote or off-chip transfers. Locality-centric approaches are essential across OS process scheduling, cloud/job paradigms, and hardware accelerators, and they are characterized by mechanisms for measuring, modeling, and acting on locality at runtime.
1. Locality Metrics and Formal Models
Locality-centric schedulers use explicit, application-specific locality metrics to guide mapping and migration.
- Cache/data sharing metrics (MPSoC Process Scheduling): Each process is represented by the set of memory elements it accesses; the degree of data sharing between processes i and j is |M_i ∩ M_j|, where M_k is the set of memory elements accessed by process k. These values are summarized in a global sharing matrix used by the scheduler (0710.4652).
- Access-locality and NUMA affinity (ccNUMA/Task Queues): Tasks/blocks are tagged with their locality domain (LD); scheduling strives to maximize the fraction of tasks processed in their home domain, and thereby the achievable local-memory bandwidth (0902.1884, Wittmann et al., 2010).
- Bank-level DRAM locality (Memory Controllers): Per-core metrics such as the row-buffer hit rate capture the temporal proximity of accesses to the same DRAM row, directly informing scheduling priorities. Additional metrics for bank-level parallelism and historical service rates are used in RL-driven schedulers (Sanchez et al., 2019).
- Scheduling graphs and affinity matrices (Demand-Aware Networking): The overlap between consecutive demand snapshots (the fraction of communication edges that persist from one interval to the next) quantifies temporal locality in datacenter traffic, guiding dynamic matching algorithms that update only the "delta" of changed edges (Hanauer et al., 2023).
- Prefix/KV cache hits (LLM Serving): Prefix locality is formalized by the prefix cache hit rate, i.e., the fraction of prompt tokens whose key/value entries are already resident in the prefix cache. Batch construction and scheduling exploit this metric to minimize redundant computation and optimize GPU memory usage (Cao et al., 24 Jan 2025).
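To make the sharing-matrix metric above concrete, here is a minimal sketch of how such a matrix might be computed from per-process access sets; all names (`sharing_matrix`, `access_sets`) are illustrative, not taken from the cited scheduler:

```python
# Sketch: building a global sharing matrix from per-process access sets.
# S[i][j] counts the memory elements shared by processes i and j.

def sharing_matrix(access_sets):
    """Return the symmetric matrix S with S[i][j] = |M_i ∩ M_j|."""
    n = len(access_sets)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = len(access_sets[i] & access_sets[j])
    return S

# Example: three processes touching overlapping cache lines.
access = [{0x10, 0x20, 0x30}, {0x20, 0x30, 0x40}, {0x50}]
S = sharing_matrix(access)
# S[0][1] == 2 (lines 0x20 and 0x30 are shared); process 2 shares nothing.
```

A scheduler would co-locate processes 0 and 1 (high sharing) while process 2 can run on any core without penalty.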
2. Core Scheduling Algorithms
Key algorithmic ideas span OS-level, user-level, and hardware-accelerator settings:
- MPSoC OS Scheduler (0710.4652):
- Ready-to-run processes are partitioned by data sharing: non-sharing processes are distributed across cores, while sharing/dependent processes are sequenced on the same core.
- Scheduling uses the global sharing matrix in two stages: first selecting independent processes by minimal sharing sum, then mapping processes to cores so as to maximize reuse with the processes already assigned there.
- Hybrid Static/Dynamic Task Scheduling (Donfack et al., 2011):
- Dense matrix factorization is scheduled statically in an initial region for critical-path tasks (maximizing cache affinity), then dynamically for trailing updates, with depth-first traversal to minimize queue overhead.
- The ratio of static to dynamic scheduling is optimized according to load imbalance bounds and communication costs.
- NUMA/LD-Aware Task Queuing (0902.1884, Wittmann et al., 2010):
- Tasks are enqueued into per-LD queues based on first-touch memory placement; threads dequeue from their own LD, stealing from others only if needed.
- Load balancing is preserved within each LD; cross-LD stealing is a tunable trade-off between desired locality and throughput.
- Dynamic Network Matching (Hanauer et al., 2023):
- Dynamic and batch-dynamic algorithms update only that part of the topology affected by recent demand changes.
- Algorithms such as dyn-greedy and dyn-kEC run in time proportional to the changed portion of the demand, rather than the full network, when only a small fraction of the demand has changed, exploiting high temporal locality in network demands.
- Locality-Driven Memory Scheduling (Sanchez et al., 2019):
- RL-based controller observes access patterns, row-hit rates, per-bank parallelism, and starvation; learns to prioritize cores to maximize locality and fairness.
- Prefix-Locality–Aware LLM Batch Schedulers (Cao et al., 24 Jan 2025):
- Batches are constructed by sorting requests by longest prefix match; service quanta and deficit counters (DLPM) guarantee both locality and fair allocation.
- Distributed scenarios layer per-client per-worker deficit management and scalable prefix-to-worker lookup structures.
- Transformer Accelerator Operand Scheduling (Fan et al., 28 Jan 2026):
- QK operand flow is dynamically reordered both intra- and inter-head to maximize reuse of Query/Key vectors on chip.
- Classification of operand "types" and an adaptively chosen "heavy size" drive the phase-wise Q/K feed order and sustain MAC-array utilization.
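The per-LD queueing-with-stealing pattern described above can be sketched compactly. This is a single-threaded illustration (a real implementation needs per-domain locks or concurrent queues, as Section 3 notes); the class and method names are hypothetical:

```python
# Sketch of locality-domain (LD) task queues with optional cross-LD stealing.
from collections import deque

class LDScheduler:
    def __init__(self, num_domains):
        # One queue per locality domain.
        self.queues = [deque() for _ in range(num_domains)]

    def enqueue(self, task, home_ld):
        # Tasks carry a home-domain tag, e.g. set by first-touch placement.
        self.queues[home_ld].append(task)

    def dequeue(self, my_ld, allow_steal=True):
        # Prefer local work; steal from other domains only if the local
        # queue is empty and stealing is enabled.
        if self.queues[my_ld]:
            return self.queues[my_ld].popleft()
        if allow_steal:
            for ld, q in enumerate(self.queues):
                if ld != my_ld and q:
                    return q.popleft()
        return None
```

Disabling `allow_steal` maximizes locality at the cost of possible idle threads; enabling it trades some affinity for load balance, which is exactly the tunable trade-off discussed above.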
3. Dynamic/Runtime Mechanisms and Cost Models
Effective locality-centric dynamic scheduling requires low-overhead, scalable runtime techniques.
- Measurement, Instrumentation, and Feedback:
- OS and runtime layers maintain access counters, sharing matrices, and profiling structures (e.g., the global sharing matrix of (0710.4652), per-task locality tags (0902.1884, Wittmann et al., 2010), and RL-agent state (Sanchez et al., 2019)).
- Performance counters and hardware events are periodically sampled for online model updates.
- Queue and Stealing Coordination:
- NUMA queue implementations use per-domain locks or optimized concurrent-queue structures.
- Task stealing is hierarchically or affinity-constrained to prioritize local over remote extraction.
- Online Learning and Cost Adaptation:
- RL-based policies (e.g., CADS) update parameters at each decision point using standard Q-learning, encoding both throughput and fairness objectives (Sanchez et al., 2019).
- In ARMS, per-task/cost models are continuously updated based on observed execution times on resource partitions indexed by type and software topology address (Abduljabbar et al., 2021).
- Bounded Overheads:
- Most schemes incur only a few percent overhead, paid at quantum boundaries, task launch, or queue manipulation, with scheduling and monitoring structures designed for low-order-polynomial update time in the process count, or constant work per task in hybrid static/dynamic approaches (0710.4652, Donfack et al., 2011, 0902.1884).
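The Q-learning update that RL-based policies like CADS apply at each decision point has a standard shape. The following is a generic tabular sketch, not the actual controller; the state and action encodings in the comment are hypothetical examples:

```python
# Generic tabular Q-learning step of the kind an RL-driven memory
# scheduler might run at each decision point.
from collections import defaultdict

# Q maps a state to {action: value}. Encodings are application-defined,
# e.g. state = (row-hit-rate bucket, bank-parallelism bucket), action = core id.
Q = defaultdict(dict)

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
    q_sa = Q[state].get(action, 0.0)
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] = q_sa + alpha * (reward + gamma * best_next - q_sa)
```

In a locality-centric setting the reward would combine a locality term (e.g., row-hit rate) with a fairness or starvation term, so the learned policy encodes both objectives, as described above.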
4. Application Domains and Case Studies
Locality-centric dynamic scheduling is essential in diverse compute scenarios:
| Domain | Key Mechanisms / Outcomes | Representative Reference |
|---|---|---|
| Embedded MPSoCs | OS process-to-core mapping via global sharing matrix; 20–40% reduction in completion time | (0710.4652) |
| NUMA/Multicore | Locality-tagged task queues, per-LD stealing; nearly maximal node bandwidth, 3× faster than round-robin | (0902.1884, 1101.0093) |
| Dense Linear Algebra | Hybrid static/dynamic for panel vs. update phase; up to 64% improvement vs. fully dynamic | (Donfack et al., 2011) |
| LLM Serving | Deficit-longest-prefix-match batching with fairness bounds, 2.87× throughput, up to 7.18× P99 latency reduction | (Cao et al., 24 Jan 2025) |
| Transformer Accel. | Schedule TopK token attention operands to maximize Q/K on-chip reuse; up to 1.76× throughput | (Fan et al., 28 Jan 2026) |
| Multicore MCs | RL-based controller, explicit row-hit/bank metrics; up to 20% CPI improvement | (Sanchez et al., 2019) |
| Datacenter Networking | Incremental (batch-)dynamic matching optimizing on traffic locality; up to 5× speedup for small-delta batches | (Hanauer et al., 2023) |
Key benchmark and architecture outcomes substantiate performance gains that derive directly from data reuse, minimized remote traffic, enhanced bank or cache hit rates, and scheme adaptivity.
5. Trade-Offs, Extensions, and Limitations
Locality-centric dynamic scheduling presents several trade-offs:
- Performance vs. Overhead: Aggressive migration or fine-grained quanta can yield higher locality but may increase TLB shootdown, data remapping, or scheduler coordination costs (0710.4652).
- Load Balance vs. Locality: Dynamic stealing/assignment between domains or across devices optimizes resource utilization but may reduce data or cache affinity (Wittmann et al., 2010, 0902.1884).
- Fairness vs. Locality: The DLPM/D²LPM class in LLM serving explicitly quantifies fairness bounds as a function of batch/quantum size, providing tunable parameters to manage the locality-fairness Pareto frontier (Cao et al., 24 Jan 2025).
- Adaptivity to Heterogeneity: Extensions support node/resource heterogeneity (e.g., weighting sharing matrix scores by core power, integrating multi-level cache sharing matrices, leveraging real-time performance feedback) (0710.4652, Abduljabbar et al., 2021).
Scheme-specific generalizations include support for heterogeneous architectures, multi-level cache hierarchies, explicit integration with static compiler-provided data mappings, and plug-and-play applicability to new accelerator engines or scheduling substrates.
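The fairness-versus-locality knob in deficit-based schedulers can be illustrated with a plain deficit round. This is a generic deficit-counter sketch in the spirit of DLPM, not its actual algorithm; the client structure, request costs, and `quantum` parameter are all hypothetical:

```python
# Sketch: one round of deficit-based serving. Each client accrues `quantum`
# units of credit per round and is served while its credit covers the next
# request's cost, bounding how far any client can fall behind.
from collections import deque

def serve_round(clients, quantum):
    """Run one deficit round; return the (client, cost) pairs served."""
    served = []
    for c in clients:
        c["deficit"] += quantum
        while c["queue"] and c["queue"][0] <= c["deficit"]:
            cost = c["queue"].popleft()
            c["deficit"] -= cost
            served.append((c["name"], cost))
    return served

clients = [
    {"name": "a", "deficit": 0, "queue": deque([3, 3])},
    {"name": "b", "deficit": 0, "queue": deque([5])},
]
serve_round(clients, quantum=4)   # client b's costly request waits a round
```

A larger quantum lets the scheduler keep serving locality-friendly requests from one client for longer (better reuse), while a smaller quantum tightens the fairness bound: the batch/quantum size is the tunable parameter on the locality-fairness frontier discussed above.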
6. Comparative Evaluation and Empirical Outcomes
Empirical studies consistently demonstrate the quantitative benefit of locality-centric scheduling across platforms and workloads:
- Execution Time and Throughput: LS/LSM policies in MPSoCs yield 20–40% reductions in completion time and 15–25% lower cache misses (0710.4652).
- Bandwidth and Scalability: NUMA-aware task queues approach single-domain theoretical bandwidth ceilings with <5% overhead and minimal imbalance (0902.1884, Wittmann et al., 2010).
- Fairness-Preserving Locality: DLPM/D²LPM achieves up to 2.87× throughput and 7.18× latency reduction vs. non-locality-aware baselines, with formal fairness bounds (Cao et al., 24 Jan 2025).
- Accelerator and Memory Efficiency: Scheduling operand flows in SATA yields up to 1.76× QK-attention throughput and 2.94× energy efficiency (Fan et al., 28 Jan 2026).
- Datacenter Responsiveness: Batch-dynamic matching reduces topology reconfiguration time by 3–5× for small batch deltas, achieving near-static throughput with sublinear recourse (Hanauer et al., 2023).
- Dynamic Adaptivity: ARMS’s fully online moldability achieves up to 3.5× speedup on memory- or compute-bound chains, with little or no pre-tuning (Abduljabbar et al., 2021).
These results are robust across simulation and real-system benchmarks, with observed advantages increasing as the degree of locality or traffic persistence grows.
7. Extensions, Generalizations, and Outlook
Locality-centric dynamic scheduling is a general paradigm with broad applicability.
- Beyond Hardware Locality: Recent advances (e.g., (Fan et al., 28 Jan 2026, Chen et al., 5 Dec 2025)) extend the concept to operand, prefix, or infrastructure locality, as in heterogeneous serverless environments exploiting "warm start" reuse and predictive scheduling with online learning.
- Supporting Heterogeneous Resources: State- or affinity-aware placement can integrate heterogeneity in compute capability, memory subsystem layout, network topology, or workload (e.g., core-speed weighting (0710.4652); distributed quantum structures (Cao et al., 24 Jan 2025)).
- Integrating Compiler and Runtime: Passing locality information from compiler analysis to runtime schedulers, or using runtime feedback to enforce “moldable” mappings, closes the semantic gap between static program structure and dynamic scheduling needs (0710.4652, Abduljabbar et al., 2021).
- Theoretical Boundaries: Many underlying scheduling problems are NP-hard; dynamic/batch-dynamic incremental algorithms rely on high temporal traffic or access overlap to stay within tractable computational budgets (Hanauer et al., 2023).
- Adaptivity Under Dynamic Load: Locality-centric schemes adapt at runtime to process/task arrival rates, DAG parallelism shifts, or changing server/resource count, providing robustness to practical deployment variability.
These strategies form a foundation for locality-aware, load-balanced, and power/performance–optimized scheduling in emerging and future multicore, accelerator, and cloud platforms.