- The paper demonstrates that using up to 75% pooled memory results in less than 18% performance degradation for most scientific workloads, supporting the feasibility of the approach.
- The methodology employs an emulator and profiler to assess diverse memory compositions and high-bandwidth configurations across various HPC applications.
- The study identifies challenges with interference in shared memory pools, emphasizing the need for effective system-level coordination in dynamic memory environments.
Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems
Introduction
The paper discusses the potential of Compute Express Link (CXL)-enabled memory pooling in addressing the limitations of current high-performance computing (HPC) systems, which suffer from low memory utilization due to statically configured memory resources. CXL's ability to decouple memory capacity from bandwidth provision allows for more granular memory management, improving resource utilization. This study evaluates composable memory subsystems and showcases the benefits of CXL-enabled memory pooling for various HPC workloads.
CXL-enabled Memory Subsystem Design
CXL is a standard that allows processors, accelerators, and memory to be interconnected with low latency and high bandwidth, significantly enhancing memory disaggregation capabilities. The paper introduces a potential design of a composable memory system using CXL links (Figure 1). This system allows dynamic configuration of memory resources, such as scaling bandwidth or incorporating different memory types. The dynamic configuration can cater to the specific needs of various workloads, optimizing performance and utilization.
Figure 1: A potential composable memory system design. CXL enables multiple memory organizations on one system.
Methodology
The research employs an emulator to explore the effects of diverse memory compositions in CXL-enabled systems. A profiling tool was developed to analyze dynamic memory usage patterns across various applications. The emulator and profiler were applied to seven scientific and six graph applications to assess the performance impact of various memory configurations.
Emulation was conducted on NUMA systems, representing CXL-enabled memory subsystems, to measure the effect of pooled memory on execution times and bandwidth scaling.
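This summary does not describe how the emulation was set up. One common way to approximate pooled memory on a two-socket Linux machine is to run an application on one NUMA node while binding its allocations to another, so that the remote node stands in for the pool. The sketch below builds such a command line with the standard numactl flags --cpunodebind and --membind. The application path, its arguments, and the node numbering are hypothetical, and this is not necessarily the procedure the authors followed.

```python
import shlex


def numactl_command(cpu_node: int, mem_node: int, app_argv: list[str]) -> list[str]:
    """Build a numactl invocation that runs app_argv on the cores of
    cpu_node while allocating all of its memory from mem_node."""
    if cpu_node < 0 or mem_node < 0:
        raise ValueError("NUMA node ids must be non-negative")
    return [
        "numactl",
        f"--cpunodebind={cpu_node}",
        f"--membind={mem_node}",
        *app_argv,
    ]


# Hypothetical application and input file, used only for illustration.
app = ["./my_hpc_app", "--input", "mesh.dat"]

# Assumption: node 0 is the local socket, node 1 plays the role of the pool.
local_only = numactl_command(cpu_node=0, mem_node=0, app_argv=app)
pooled_only = numactl_command(cpu_node=0, mem_node=1, app_argv=app)

print(shlex.join(local_only))
print(shlex.join(pooled_only))
```

Comparing the run times of the two commands gives a rough upper bound on the cost of placing all data in the emulated pool. Mixed local and pooled compositions need additional control over how much memory the local node can supply.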
Composable Memory Capacity
The study evaluated the performance impact of different compositions of local and pooled memory (Figure 2). For most scientific workloads, using up to 75% pooled memory resulted in less than 18% performance degradation, indicating that a large share of memory capacity can be composed from the pool at a modest performance cost.
Figure 2: Four compositions of the memory subsystem using a variable amount of local memory and pooled memory.
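The arithmetic behind such a comparison can be sketched as follows: split an application's memory footprint into local and pooled parts, then express the slowdown relative to an all-local baseline. Only the 75% fraction and the 18% bound come from the text above. The footprint, the run times, and the function names are hypothetical and do not represent measurements from the paper.

```python
def split_memory(footprint_gib: float, pooled_fraction: float) -> tuple[float, float]:
    """Return (local_gib, pooled_gib) for a given pooled share of the footprint."""
    if not 0.0 <= pooled_fraction <= 1.0:
        raise ValueError("pooled_fraction must be between 0 and 1")
    pooled = footprint_gib * pooled_fraction
    return footprint_gib - pooled, pooled


def degradation_pct(baseline_s: float, measured_s: float) -> float:
    """Slowdown of measured_s relative to baseline_s, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return 100.0 * (measured_s - baseline_s) / baseline_s


# Hypothetical 64 GiB footprint with 75% of it placed in the pool.
local_gib, pooled_gib = split_memory(64.0, 0.75)
print(f"local={local_gib} GiB, pooled={pooled_gib} GiB")

# Hypothetical run times in seconds: all-local baseline vs. 75% pooled.
slowdown = degradation_pct(baseline_s=100.0, measured_s=117.0)
print(f"degradation={slowdown:.1f}%  within 18% bound: {slowdown < 18.0}")
```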
Composable Memory Bandwidth
For bandwidth-intensive applications, a high-bandwidth configuration using CXL links was evaluated (Figure 3). The results indicated significant performance improvements, particularly for applications like OpenFOAM and Hypre, suggesting that CXL-enabled memory systems could serve as cost-effective alternatives to expensive HBM-based systems.
Figure 3: An emulated high-bandwidth configuration of the memory system. Increased CXL links provide more bandwidth.
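A back-of-the-envelope model can illustrate why adding links helps bandwidth-bound codes. The sketch below assumes that aggregate bandwidth is the local bandwidth plus a fixed amount per CXL link, and that a purely bandwidth-bound kernel's run time is the bytes moved divided by that aggregate. Both assumptions are simplifications introduced here, and all numeric values are placeholders, not figures from the paper or from the CXL specification.

```python
def aggregate_bandwidth_gbs(local_gbs: float, per_link_gbs: float, n_links: int) -> float:
    """Total bandwidth in GB/s, assuming traffic spreads perfectly across
    local memory and n_links identical CXL links (an idealization)."""
    if n_links < 0:
        raise ValueError("n_links must be non-negative")
    return local_gbs + per_link_gbs * n_links


def streaming_time_s(bytes_moved: float, bandwidth_gbs: float) -> float:
    """Run time of a purely bandwidth-bound kernel moving bytes_moved bytes."""
    if bandwidth_gbs <= 0:
        raise ValueError("bandwidth_gbs must be positive")
    return bytes_moved / (bandwidth_gbs * 1e9)


# Placeholder values chosen only to make the arithmetic easy to follow.
LOCAL_GBS = 100.0
PER_LINK_GBS = 50.0
BYTES_MOVED = 4e11

baseline = streaming_time_s(BYTES_MOVED, aggregate_bandwidth_gbs(LOCAL_GBS, PER_LINK_GBS, 0))
for links in (0, 1, 2, 4):
    bw = aggregate_bandwidth_gbs(LOCAL_GBS, PER_LINK_GBS, links)
    t = streaming_time_s(BYTES_MOVED, bw)
    print(f"links={links} bandwidth={bw:.0f} GB/s time={t:.2f} s speedup={baseline / t:.2f}x")
```

Real applications mix compute and memory phases, so measured gains would typically be smaller than this idealized scaling suggests.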
Interference in Shared Memory Pools
Experiments demonstrated that interference in shared memory pools could degrade performance, especially for bandwidth-sensitive jobs. Application performance was tested under varying sharing conditions (Figure 4). Results underscored the necessity for system-level coordination to mitigate interference and effectively manage shared resources.
Figure 4: An emulated configuration of a memory pool shared by multiple hosts, evaluating the impact of interference.
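To see why bandwidth-sensitive jobs are the most exposed, consider a toy model in which hosts sharing a pool divide its bandwidth by max-min fairness. The source does not state how a real pool arbitrates between hosts, so this policy is an assumption made for illustration, and the capacity and demand values are hypothetical.

```python
def max_min_share(capacity: float, demands: list[float]) -> list[float]:
    """Allocate capacity among demands using max-min fairness:
    small demands are fully satisfied, the rest split what remains equally."""
    alloc = [0.0] * len(demands)
    remaining = capacity
    # Visit hosts from the smallest demand to the largest.
    pending = sorted(range(len(demands)), key=lambda i: demands[i])
    while pending:
        fair = remaining / len(pending)
        i = pending[0]
        if demands[i] <= fair:
            # This host needs no more than its fair share, so grant it fully.
            alloc[i] = demands[i]
            remaining -= demands[i]
            pending.pop(0)
        else:
            # Every remaining host wants more than the fair share.
            for j in pending:
                alloc[j] = fair
            break
    return alloc


# Hypothetical pool of 100 GB/s shared by three hosts with different demands.
demands = [10.0, 60.0, 80.0]
shares = max_min_share(100.0, demands)
for demand, share in zip(demands, shares):
    print(f"demand={demand:.0f} GB/s granted={share:.0f} GB/s ({100 * share / demand:.0f}% of demand)")
```

Under this model the light host is unaffected while the two bandwidth-hungry hosts each receive only part of what they request, which matches the qualitative observation that bandwidth-sensitive jobs suffer most from sharing.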
Conclusion
The paper highlights the promise of CXL-enabled memory pooling in enhancing memory utilization and performance within HPC environments. By decoupling bandwidth and capacity provisioning, CXL allows for customized memory configurations that align with specific workload requirements. Although challenges, such as managing interference on shared pools, remain, the potential benefits in scalable bandwidth and resource efficiency offer a compelling case for further exploration of CXL-enabled systems in HPC contexts.