Investigate the EpiTo assembly speed-up discrepancy

Determine why the EpiTo system (Arm Neoverse N1 CPU with NVIDIA A100 GPU) shows a larger assembly-phase speed-up than the other platforms when running the OpenFOAM laplacianFoam proof-of-concept with ISO C++ PSTL offload. Specifically, establish whether differences in CPU core performance between the Arm Neoverse N1 and the Intel Xeon and NVIDIA Grace CPUs account for the discrepancy, and quantify any additional contributing factors. GPU performance matched expectations, so it is not the suspected cause.

Background

In the single-GPU results for the assembly phase on the smallest mesh, the EpiTo system achieved a speed-up of 5.6x over the CPU baseline, whereas the other systems achieved between 1.98x and 2.82x. The authors note that GPU timings aligned with expectations across architectures, which suggests that the discrepancy arises on the CPU side.
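The size of the CPU-side gap these figures would imply can be estimated with a short calculation. This is only a sketch: it assumes the speed-up is defined as CPU baseline time divided by GPU time, and the symbols $S$, $T_{\mathrm{cpu}}$, and $T_{\mathrm{gpu}}$ are introduced here for illustration and do not appear in the source. It also assumes, purely for illustration, that the GPU assembly times on the compared systems are equal, whereas the source states only that GPU timings matched expectations.

```latex
% Assumed definition of speed-up on a given system:
S = \frac{T_{\mathrm{cpu}}}{T_{\mathrm{gpu}}}

% Ratio of CPU baseline times between EpiTo (E) and another system (o):
\frac{T_{\mathrm{cpu}}^{E}}{T_{\mathrm{cpu}}^{o}}
  = \frac{S^{E}}{S^{o}} \cdot \frac{T_{\mathrm{gpu}}^{E}}{T_{\mathrm{gpu}}^{o}}

% Illustrative assumption: T_gpu^E = T_gpu^o, so the second factor is 1.
% With S^E = 5.6 and S^o between 1.98 and 2.82:
\frac{5.6}{2.82} \approx 1.99
\;\le\;
\frac{T_{\mathrm{cpu}}^{E}}{T_{\mathrm{cpu}}^{o}}
\;\le\;
\frac{5.6}{1.98} \approx 2.83
```

Under these assumptions, the EpiTo CPU baseline would need to be roughly 2 to 2.8 times slower than the baselines on the other systems to explain the gap on its own. Where GPU assembly times differ between systems, the second factor is no longer 1 and the estimate shifts accordingly.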

The paper hypothesizes that the Arm Neoverse N1 CPU cores are less performant than the cores in the other tested platforms, but it defers a thorough investigation to future work. It therefore remains unresolved which factors cause the anomalous speed-up on EpiTo and how CPU baseline performance interacts with GPU-offloaded assembly.

References

While a deeper investigation is left for future work, our main hypothesis is that the cores of the Arm Neoverse N1 CPU are less performant with respect to the cores available on the other platforms.

Building an Accelerated OpenFOAM Proof-of-Concept Application using Modern C++ (2507.18268 - Malenza et al., 24 Jul 2025) in Section: Evaluation, Subsection: Performance results on single GPU