Acceleration of a production Solar MHD code with Fortran standard parallelism: From OpenACC to `do concurrent'

Published 5 Mar 2023 in cs.MS, astro-ph.IM, cs.DC, and cs.PL | (2303.03398v2)

Abstract: There is growing interest in using standard language constructs for accelerated computing, avoiding the need for (often vendor-specific) external APIs. These constructs hold the potential to be more portable and much more `future-proof'. For Fortran codes, the current focus is on the {\tt do concurrent} (DC) loop. While there have been some successful examples of GPU-acceleration using DC for benchmark and/or small codes, its widespread adoption will require demonstrations of its use in full-size applications. Here, we look at the current capabilities and performance of using DC in a production application called Magnetohydrodynamic Algorithm outside a Sphere (MAS). MAS is a state-of-the-art model for studying coronal and heliospheric dynamics, is over 70,000 lines long, and has previously been ported to GPUs using MPI+OpenACC. We attempt to eliminate as many of its OpenACC directives as possible in favor of DC. We show that using the NVIDIA {\tt nvfortran} compiler's Fortran 202X preview implementation, unified managed memory, and modified MPI launch methods, we can achieve GPU acceleration across multiple GPUs without using a single OpenACC directive. However, doing so results in a slowdown between 1.25x and 3x. We discuss what future improvements are needed to avoid this loss, and show how we can still retain close