Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications
Abstract: Growing resilience concerns in today's large-scale computing systems call for not only generic fault-tolerance approaches but also application-level resilience, driven by demanding efficiency targets and domain-specific requirements. Scientific applications within a particular domain generally obey that domain's conservation laws, which can serve as an error-detection criterion for studying the resilience of applications that share similar program characteristics. However, achieving application resilience raises two challenges: (a) how to identify the invariants of a given application domain from its known conservation laws, and (b) how to use those invariants to detect and recover from failures efficiently during application runs. In this work, we target several continuum dynamics software packages, FleCSALE [1] and CODY [2], study their resilience to soft errors online (injected using an open-source fault injector), and investigate opportunities for non-intrusive, lightweight failure recovery (checksum-based invariant checking). We propose a checksum-retry approach to achieve these goals. Experimental results on a virtualized platform with extensive fault-injection campaigns demonstrate the effectiveness and efficiency of the proposed approach.
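The checksum-retry idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use FleCSALE or CODY code. It assumes a toy 1D periodic diffusion kernel whose total "mass" (the sum of the field) is conserved, uses that sum as the checksum, and re-executes a step when the checksum drifts. All names (`flip_bit`, `diffuse`, `step_with_retry`), the tolerance, and the retry limit are hypothetical choices made for illustration.

```python
import math
import struct


def flip_bit(value, bit):
    """Flip one bit of a 64-bit float to emulate a transient soft error."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return out


def diffuse(u, alpha=0.1):
    """One explicit diffusion step on a periodic 1D grid.

    The stencil only redistributes values between neighbours, so the
    sum of the field is conserved up to rounding (assumed toy kernel).
    """
    n = len(u)
    return [u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
            for i in range(n)]


def checksum(u):
    """Conserved quantity used as the invariant: the total of the field."""
    return math.fsum(u)


def step_with_retry(u, step, fault=None, rel_tol=1e-9, max_retries=3):
    """Run one step, verify the invariant, and re-execute on violation.

    `fault` is an optional (index, bit) pair injected on the first
    attempt only, modelling a transient error. Returns the accepted
    state and the number of retries that were needed.
    """
    expected = checksum(u)
    for attempt in range(max_retries + 1):
        candidate = step(u)  # the input state u is never modified
        if fault is not None and attempt == 0:
            idx, bit = fault
            candidate[idx] = flip_bit(candidate[idx], bit)
        actual = checksum(candidate)
        drift = abs(actual - expected)
        if math.isfinite(actual) and drift <= rel_tol * max(1.0, abs(expected)):
            return candidate, attempt
    raise RuntimeError("invariant still violated after retries")


if __name__ == "__main__":
    u0 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
    # Flip a high exponent bit of cell 3 during the first attempt.
    state, retries = step_with_retry(u0, diffuse, fault=(3, 62))
    print("retries needed:", retries)
```

Because the step keeps its input untouched, a retry needs no separate checkpoint in this sketch. A check of this kind only catches errors that move the conserved quantity beyond the tolerance: a flip in a low mantissa bit would pass undetected here.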