Papers
Topics
Authors
Recent
Search
2000 character limit reached

Design Principles of Dynamic Resource Management for High-Performance Parallel Programming Models

Published 25 Mar 2024 in cs.DC | (2403.17107v1)

Abstract: With Dynamic Resource Management (DRM) the resources assigned to a job can be changed dynamically during its execution. From the system's perspective, DRM opens a new level of flexibility in resource allocation and job scheduling and therefore has the potential to improve system efficiency metrics such as the utilization rate, job throughput, energy efficiency, and responsiveness. From the application perspective, users can tailor the resources they request to their needs offering potential optimizations in queuing time or charged costs. Despite these obvious advantages and many attempts over the last decade to establish DRM in HPC, it remains a concept discussed in academia rather than being successfully deployed on production systems. This stems from the fact that support for DRM requires changes in all the layers of the HPC system software stack including applications, programming models, process managers, and resource management software, as well as an extensive and holistic co-design process to establish new techniques and policies for scheduling and resource optimization. In this work, we therefore start with the assumption that resources are accessible by processes executed either on them (e.g., on CPU) or controlling them (e.g., GPU-offloading). Then, the overall DRM problem can be decomposed into dynamic process management (DPM) and dynamic resource mapping or allocation (DRA). The former determines which processes (or which change in processes) must be managed and the latter identifies the resources where they will be executed. The interfaces for such \mbox{DPM/DPA} in these layers need to be standardized, which requires a careful design to be interoperable while providing high flexibility. Based on a survey of existing approaches we propose design principles, that form the basis of a holistic approach to DMR in HPC and provide a prototype implementation using MPI.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 3 tweets with 0 likes about this paper.