EasyCrash: Exploring Non-Volatility of Non-Volatile Memory for High Performance Computing Under Failures

Published 24 Jun 2019 in cs.PF | (1906.10081v1)

Abstract: Emerging non-volatile memory (NVM) is promising for building future HPC systems. By leveraging the non-volatility of NVM used as main memory, we can restart an application after a crash using the data objects that remain on NVM. This paper explores this approach to handling failures in HPC, based on the observation that many HPC applications have sufficient intrinsic fault tolerance. To improve the likelihood of successful recomputation with correct outcomes and negligible performance loss, we introduce EasyCrash, a framework that decides how to selectively persist application data objects during execution. Our evaluation shows that EasyCrash enables correct recomputation for 54% of the crashes that otherwise could not recompute correctly, while incurring a negligible performance overhead (1.5% on average). With EasyCrash and the applications' intrinsic fault tolerance combined, 82% of crashes can recompute successfully. When EasyCrash is used with a traditional checkpoint scheme, it improves system efficiency by up to 24% (15% on average).

Citations (5)

Authors (3)
