
Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

Published 9 May 2021 in cs.AR, cs.DC, and cs.PF | (2105.03814v7)

Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM, a benchmark suite of 16 workloads from different application domains (e.g., linear algebra, databases, graph processing, neural networks, bioinformatics).

Citations (74)

Summary

  • The paper demonstrates that UPMEM’s PIM system integrates DPUs in DRAM to reduce data movement and boost energy efficiency for memory-bound workloads.
  • It employs the comprehensive PrIM benchmark suite across diverse domains to reveal significant performance and energy gains over CPUs, and advantages over GPUs only for selected workloads.
  • The study highlights the need for enhanced DPU intercommunication and advanced arithmetic capabilities to better support complex computations in future PIM systems.

An Expert Analysis of UPMEM's Processing-In-Memory Architecture

The paper provides a comprehensive overview of the UPMEM processing-in-memory (PIM) architecture, which represents a significant step towards implementing PIM systems with commercially available hardware. It evaluates the architecture, presents a suite of benchmarks tailored for PIM (PrIM), and compares UPMEM's performance and energy efficiency against modern CPUs and GPUs.

Architecture and Evaluation

The UPMEM PIM system integrates processing capabilities directly within memory, using DRAM Processing Units (DPUs) embedded in conventional 2D DRAM technology. The architecture bypasses the traditional data-movement bottleneck between memory and CPU cores by performing computations directly in memory. Each DRAM chip contains multiple DPUs, and each DPU has its own instruction memory (IRAM), a scratchpad working memory (WRAM), and an associated 64 MB MRAM bank, supporting concurrent execution of up to 24 hardware threads called tasklets.
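The execution pattern described above can be sketched as a toy model. The snippet below is an illustration in Python, not UPMEM SDK code (the real SDK is C-based); the names `NR_DPUS` and `NR_TASKLETS` mirror SDK conventions, but everything here is a simulation of the SPMD pattern: the host splits data across DPUs, tasklets within a DPU divide that DPU's slice, and the host performs the final reduction, since DPUs cannot communicate with each other directly.

```python
# Illustrative Python model of the UPMEM SPMD pattern (not SDK code):
# the host partitions data across DPUs; within each DPU, tasklets
# stride over that DPU's slice; the host reduces the partial results.

NR_DPUS = 4        # number of DPUs (illustrative; real systems have thousands)
NR_TASKLETS = 24   # up to 24 hardware threads (tasklets) per DPU

def dpu_kernel(slice_):
    """Model one DPU: each tasklet sums a strided subset of the slice."""
    partials = [sum(slice_[t::NR_TASKLETS]) for t in range(NR_TASKLETS)]
    return sum(partials)  # intra-DPU reduction across tasklets

def host_sum(data):
    # The host splits the input into equal-sized per-DPU slices (the
    # SDK's parallel transfers likewise require equal-sized buffers).
    per_dpu = len(data) // NR_DPUS
    slices = [data[i * per_dpu:(i + 1) * per_dpu] for i in range(NR_DPUS)]
    # Inter-DPU communication goes through the host: final reduction here.
    return sum(dpu_kernel(s) for s in slices)

print(host_sum(list(range(96))))  # 0 + 1 + ... + 95 = 4560
```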

This architecture can, in principle, substantially reduce latency and energy consumption for memory-bound workloads. However, the design decision to employ in-order cores with limited arithmetic units means the architecture fits best with tasks built on simple arithmetic routines. The paper's microbenchmark analysis provides empirical evidence: although each DPU has ample internal bandwidth to its MRAM bank, most workloads are fundamentally compute-bound on the DPU, because the pipeline saturates well before memory bandwidth does, and because operations beyond simple integer arithmetic (e.g., integer multiplication/division and floating point) run at a small fraction of the throughput of natively supported operations.
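A roofline-style calculation makes the compute-bound observation concrete. The peak figures below are illustrative assumptions for a single DPU, not the paper's measurements: roughly one simple integer operation per cycle when enough tasklets keep the pipeline full, and a few hundred MB/s of sustained MRAM bandwidth.

```python
# Roofline-style sketch for a single DPU. Peak figures are illustrative
# assumptions, not measurements: ~350 MOPS pipeline throughput (one
# simple integer op per cycle at ~350 MHz with the pipeline kept full)
# and ~628 MB/s sustained MRAM->WRAM bandwidth.

PEAK_OPS_PER_S = 350e6       # assumed pipeline throughput (simple int ops)
MRAM_BW_BYTES_PER_S = 628e6  # assumed sustained MRAM bandwidth

def attainable_perf(ops_per_byte):
    """Attainable throughput (ops/s) at a given arithmetic intensity."""
    return min(PEAK_OPS_PER_S, MRAM_BW_BYTES_PER_S * ops_per_byte)

# Ridge point: above this intensity, the pipeline (not MRAM) is the limit.
ridge = PEAK_OPS_PER_S / MRAM_BW_BYTES_PER_S
print(f"ridge point: {ridge:.3f} ops/byte")

# Even ~1 op per 4-byte word (0.25 op/byte) sits near the ridge, which
# is why workloads that are memory-bound on CPUs often become
# compute-bound on the DPU.
print(f"{attainable_perf(0.25) / 1e6:.0f} MOPS at 0.25 op/byte")
```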

Benchmark Suite Analysis

The study introduces PrIM, a suite of 16 benchmarks spanning domains from dense/sparse linear algebra to graph processing and bioinformatics. The suite assesses the suitability of the UPMEM system for varied workload characteristics, with particular attention to memory access and synchronization patterns. The findings indicate that workloads combining high memory bandwidth demands with low computational intensity, and requiring little inter-DPU synchronization, benefit most from the UPMEM architecture.

Performance Comparisons and Future Directions

Comparative analysis with modern CPU and GPU systems indicates that UPMEM outperforms CPUs across many categories, achieving substantial improvements in energy efficiency due to reduced data movement. However, against GPUs, its performance advantage is confined to specific types of workloads characterized by low computational intensity and minimal inter-DPU communication demands.

For future architectures, the authors suggest enhancing direct communication between DPUs, incorporating more sophisticated arithmetic units capable of handling complex operations natively, and effectively utilizing the memory hierarchy to support diverse application types. Further software optimizations and refined libraries for common operations could supplement this hardware development.

Conclusion

In essence, the UPMEM PIM system represents meaningful progress toward efficient, scalable, memory-centric processing architectures. While limitations remain, particularly relative to GPUs on computation-heavy tasks, the architecture's promise for memory-bound applications is evident. The findings and benchmark suite presented in this evaluation lay a foundation for future developments in PIM technologies and may help drive a shift toward memory-centric computing paradigms.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of unresolved issues the paper leaves open. Each item identifies what is missing or uncertain and suggests concrete directions for future research.

  • Quantify and reduce data layout transposition overhead: The SDK’s transparent main-memory↔MRAM transposition is required for correct data mapping but its latency/energy cost, sensitivity to buffer sizes, and impact on end-to-end performance are not characterized. Investigate algorithmic and hardware alternatives (e.g., different memory mappings, in-DPU remapping engines, or compiler-driven packed layouts) and measure their benefits.
  • Enable and evaluate direct DPU-to-DPU communication: Current systems force inter-DPU communication via the host CPU, which severely limits scalability for communication-heavy workloads (e.g., BFS, Needleman–Wunsch). Explore hardware interconnects (ring/mesh/NoC across DPUs within and across DIMMs), lightweight message passing or RDMA-like primitives, and their bandwidth/latency requirements; prototype runtime and API support and quantify gains.
  • Relax parallel transfer constraints and improve host↔MRAM concurrency: Parallel CPU↔MRAM transfers require equal-sized buffers and CPU/DPU cannot concurrently access the same MRAM bank. Study hardware/runtime changes to support variable-size parallel transfers, fine-grained scheduling, and safe concurrent host/DPU access (e.g., coherence, lock-based protocols, ownership transfer), and evaluate overlap strategies.
  • Systematic overlap of communication and computation: The paper uses only synchronous kernel launches. Assess asynchronous execution, double-buffering, and pipelined MRAM↔WRAM DMA overlapping with compute to hide transfer costs. Provide quantitative models and guidelines for effective overlap under varying kernel sizes and tasklet counts.
  • Develop a PIM-specific performance model (roofline-like): Workloads classified as memory-bound on CPU become compute-bound on UPMEM. Derive a roofline model that incorporates DPU pipeline throughput, MRAM↔WRAM bandwidth, DMA latencies, tasklet-induced parallelism, and inter-DPU communication costs to predict performance and guide mapping/tuning.
  • Floating-point capability and complex arithmetic support: Performance degrades with floating-point operations and integer mul/div. Clarify whether FP ops are software-emulated or hardware-supported; characterize throughput/latency per op and per data type, and evaluate microarchitectural enhancements (e.g., FP units, SIMD lanes, fused ops) and their area/thermal trade-offs.
  • Detailed MRAM/WRAM microarchitectural behavior under contention: Beyond sustained bandwidth, characterize access latencies, arbitration/port conflicts, alignment penalties, strided/random access patterns, and the impact of 24 tasklets on WRAM/MRAM service times. Provide guidance on buffer sizes, alignment, and loop structuring to avoid hotspots.
  • Barrier, mutex, semaphore overheads and scalability: The paper introduces synchronization primitives but does not quantify their costs. Measure per-primitive latency under varying tasklet counts and sharing patterns; propose scalable intra-DPU synchronization schemes (e.g., hierarchical barriers, lock-free data structures) and their effects on throughput.
  • IRAM capacity and dynamic code loading: With only 24 KB IRAM (≈4,096 48-bit instructions), large kernels or library-heavy code may require dynamic loading. Quantify IRAM utilization constraints, DMA costs to load instructions, and techniques like code compression, function splitting, and multi-phase kernels; evaluate impacts on realistic applications.
  • Cacheless programming model and software-managed locality: DPUs lack caches; programmers must explicitly orchestrate MRAM↔WRAM transfers. Investigate compiler/runtime support for software-managed caching (tiling, prefetching, replacement policies) and auto-tuning of transfer granularities, with empirical evaluation across memory access patterns.
  • Energy measurement methodology and breakdown: The paper reports energy trends but does not detail measurement setup, per-component breakdown (host DRAM, DPUs, MRAM transfers, transposition), or idle power. Establish standardized instrumentation and reporting (e.g., shunt-based per-DIMM measurement, firmware counters), and present energy-per-byte/operation metrics.
  • Reliability, yield, and fault tolerance: The presence of faulty DPUs is noted but not analyzed. Characterize failure modes (DPU vs MRAM), ECC on MRAM/WRAM/IRAM, error rates, and environmental factors; design runtime policies for detection, isolation, remapping, and graceful degradation in large deployments.
  • Security and isolation: Security implications (e.g., MRAM DMA access control, side-channel/leakage across tasklets/DPUs, integrity of transposition routines) are not discussed. Define threat models and evaluate isolation mechanisms (permissioned MRAM regions, per-DPU sandboxes, secure DMA, attestation) for multi-tenant scenarios.
  • Fairness and coverage in CPU/GPU comparisons: The study uses specific CPU/GPU baselines; the optimization levels, library choices, and hardware generations significantly affect results. Extend comparisons to diverse, current GPU/CPU architectures, report tuning details, and include sensitivity analyses (e.g., batch sizes, tiling, precision) for fairness and reproducibility.
  • Generalizability to broader workloads: PrIM contains 16 workloads, but important classes (e.g., modern DNN training/inference pipelines with mixed precision, key-value stores, streaming analytics, graph algorithms beyond BFS) are underexplored. Add representative workloads, especially those with irregular access and dynamic communication, and study end-to-end pipelines including host-side orchestration.
  • Multi-node scale-out and memory-controller contention: The interaction between many UPMEM DIMMs and host memory controllers/channels, especially in multi-socket systems, is not deeply analyzed. Characterize controller-level contention, scheduling policies, and cross-socket traffic; explore topology-aware placement and NUMA-aware runtime policies.
  • Tasklet scheduling, load balancing, and heterogeneity: With ≥11 tasklets needed to fill the pipeline, the impact of >11 on contention and diminishing returns is only partially shown. Develop runtime mechanisms for dynamic load balancing, work stealing across DPUs, and heterogeneous tasklet configurations tuned to kernel characteristics.
  • Impact of MRAM capacity limits on algorithm design: Each DPU has 64 MB MRAM, potentially forcing partitioning/tiling for large datasets. Quantify partitioning overheads, inter-DPU merging costs, and algorithmic transformations required; provide reusable patterns for scalable tiling with minimal host interaction.
  • Tooling, profiling, and observability: There is limited visibility into DPU microarchitectural events (stalls, pipeline occupancy, DMA overlap). Develop and validate profiling tools (performance counters, trace collection, ISA-level instrumentation) and integrate them with standard toolchains for actionable feedback to developers.
  • Co-design of CPU–PIM execution: The paper does not explore optimal division of labor between host and DPUs (e.g., control-heavy phases on CPU, data-parallel phases on DPUs) under asynchronous execution. Create scheduling frameworks that co-optimize partitioning, transfer overlap, and resource usage to minimize end-to-end time and energy.
  • Thermal and power-density constraints: Large UPMEM deployments may face thermal limits; the paper does not provide thermal/power-density characterization. Measure per-DIMM/DPU thermal behavior under sustained workloads, assess cooling requirements, and study DVFS or power capping policies for safe, efficient operation.
  • Cross-architecture PIM benchmarking: Results focus on UPMEM; there is no comparative study against other PIM approaches (3D-stacked PNM, analog PUM). Build portable benchmarks and evaluation methodology to compare performance/energy, programmability, and cost across PIM architectures.
  • Memory consistency and semantics for shared MRAM: Since host/DPU cannot simultaneously access a bank and coherence is absent, formalize memory consistency models and design minimal hardware/software coherence or explicit ownership-transfer semantics; evaluate their programmability and performance impacts.
  • API/runtime evolution for irregular, variable-sized data: Current SDK favors regular, equal-sized parallel transfers and static SPMD tasking. Propose runtime features for irregular data (scatter/gather, variable-sized partitions), asynchronous callbacks, and collective operations (reduce, all-to-all) optimized for PIM constraints.
  • Quantification of synchronization-induced bottlenecks in specific apps: While BFS and NW are called out as problematic, the exact breakdown (communication vs compute vs synchronization) is not provided. Perform fine-grained tracing and quantify bottlenecks to guide algorithmic redesign (e.g., frontier compression, hierarchical reductions).
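Several of the items above (asynchronous execution, double buffering, pipelined DMA) reduce to a simple analytical sketch. The model below, with purely hypothetical timing parameters, shows why overlapping MRAM↔WRAM transfers with computation caps the per-tile cost at the longer of the two phases rather than their sum.

```python
# Back-of-envelope model of communication/computation overlap via
# double buffering. All timing parameters are hypothetical.

def total_time(n_tiles, t_dma, t_compute, overlap):
    """Time to process n_tiles tiles, where each tile needs one
    MRAM->WRAM DMA of t_dma seconds and t_compute seconds of work."""
    if not overlap:
        return n_tiles * (t_dma + t_compute)  # fully serialized
    # Double buffering: after the first DMA, transfers hide behind
    # compute (or vice versa), so the longer phase dominates each step.
    return t_dma + (n_tiles - 1) * max(t_dma, t_compute) + t_compute

serial = total_time(100, t_dma=2e-6, t_compute=3e-6, overlap=False)
pipelined = total_time(100, t_dma=2e-6, t_compute=3e-6, overlap=True)
print(f"serial {serial * 1e6:.0f} us, pipelined {pipelined * 1e6:.0f} us")
```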
