Papers
Topics
Authors
Recent
Search
2000 character limit reached

Towards Aggregated Asynchronous Checkpointing

Published 4 Dec 2021 in cs.DC | (2112.02289v1)

Abstract: High-Performance Computing (HPC) applications need to checkpoint massive amounts of data at scale. Multi-level asynchronous checkpoint runtimes like VELOC (Very Low Overhead Checkpoint Strategy) are gaining popularity among application scientists for their ability to leverage fast node-local storage and flush independently to stable, external storage (e.g., parallel file systems) in the background. Currently, VELOC adopts a one-file-per-process flush strategy, which results in a large number of files being written to external storage, thereby overwhelming metadata servers and making it difficult to transfer and access checkpoints as a whole. This paper discusses the viability and challenges of designing aggregation techniques for asynchronous multi-level checkpointing. To this end we implement and study two aggregation strategies, their limitations, and propose a new aggregation strategy specifically for asynchronous multi-level checkpointing.

Citations (4)

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.