Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture
Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM, a benchmark suite of 16 workloads from different application domains (e.g., linear algebra, databases, graph processing, neural networks, bioinformatics).
Knowledge Gaps, Limitations, and Open Questions
Below is a single, consolidated list of unresolved issues the paper leaves open. Each item identifies what is missing or uncertain and suggests concrete directions for future research.
- Quantify and reduce data layout transposition overhead: The SDK’s transparent main-memory↔MRAM transposition is required for correct data mapping but its latency/energy cost, sensitivity to buffer sizes, and impact on end-to-end performance are not characterized. Investigate algorithmic and hardware alternatives (e.g., different memory mappings, in-DPU remapping engines, or compiler-driven packed layouts) and measure their benefits.
- Enable and evaluate direct DPU-to-DPU communication: Current systems force inter-DPU communication via the host CPU, which severely limits scalability for communication-heavy workloads (e.g., BFS, Needleman–Wunsch). Explore hardware interconnects (ring/mesh/NoC across DPUs within and across DIMMs), lightweight message passing or RDMA-like primitives, and their bandwidth/latency requirements; prototype runtime and API support and quantify gains.
- Relax parallel transfer constraints and improve host↔MRAM concurrency: Parallel CPU↔MRAM transfers require equal-sized buffers and CPU/DPU cannot concurrently access the same MRAM bank. Study hardware/runtime changes to support variable-size parallel transfers, fine-grained scheduling, and safe concurrent host/DPU access (e.g., coherence, lock-based protocols, ownership transfer), and evaluate overlap strategies.
- Systematic overlap of communication and computation: The paper uses only synchronous kernel launches. Assess asynchronous execution, double-buffering, and pipelined MRAM↔WRAM DMA overlapping with compute to hide transfer costs. Provide quantitative models and guidelines for effective overlap under varying kernel sizes and tasklet counts.
- Develop a PIM-specific performance model (roofline-like): Workloads classified as memory-bound on CPU become compute-bound on UPMEM. Derive a roofline model that incorporates DPU pipeline throughput, MRAM↔WRAM bandwidth, DMA latencies, tasklet-induced parallelism, and inter-DPU communication costs to predict performance and guide mapping/tuning.
- Floating-point capability and complex arithmetic support: Performance degrades with floating-point operations and integer mul/div. Clarify whether FP ops are software-emulated or hardware-supported; characterize throughput/latency per op and per data type, and evaluate microarchitectural enhancements (e.g., FP units, SIMD lanes, fused ops) and their area/thermal trade-offs.
- Detailed MRAM/WRAM microarchitectural behavior under contention: Beyond sustained bandwidth, characterize access latencies, arbitration/port conflicts, alignment penalties, strided/random access patterns, and the impact of 24 tasklets on WRAM/MRAM service times. Provide guidance on buffer sizes, alignment, and loop structuring to avoid hotspots.
- Barrier, mutex, semaphore overheads and scalability: The paper introduces synchronization primitives but does not quantify their costs. Measure per-primitive latency under varying tasklet counts and sharing patterns; propose scalable intra-DPU synchronization schemes (e.g., hierarchical barriers, lock-free data structures) and their effects on throughput.
- IRAM capacity and dynamic code loading: With only 24 KB IRAM (≈4,096 48-bit instructions), large kernels or library-heavy code may require dynamic loading. Quantify IRAM utilization constraints, DMA costs to load instructions, and techniques like code compression, function splitting, and multi-phase kernels; evaluate impacts on realistic applications.
- Cacheless programming model and software-managed locality: DPUs lack caches; programmers must explicitly orchestrate MRAM↔WRAM transfers. Investigate compiler/runtime support for software-managed caching (tiling, prefetching, replacement policies) and auto-tuning of transfer granularities, with empirical evaluation across memory access patterns.
- Energy measurement methodology and breakdown: The paper reports energy trends but does not detail measurement setup, per-component breakdown (host DRAM, DPUs, MRAM transfers, transposition), or idle power. Establish standardized instrumentation and reporting (e.g., shunt-based per-DIMM measurement, firmware counters), and present energy-per-byte/operation metrics.
- Reliability, yield, and fault tolerance: The presence of faulty DPUs is noted but not analyzed. Characterize failure modes (DPU vs MRAM), ECC on MRAM/WRAM/IRAM, error rates, and environmental factors; design runtime policies for detection, isolation, remapping, and graceful degradation in large deployments.
- Security and isolation: Security implications (e.g., MRAM DMA access control, side-channel/leakage across tasklets/DPUs, integrity of transposition routines) are not discussed. Define threat models and evaluate isolation mechanisms (permissioned MRAM regions, per-DPU sandboxes, secure DMA, attestation) for multi-tenant scenarios.
- Fairness and coverage in CPU/GPU comparisons: The study uses specific CPU/GPU baselines; the optimization levels, library choices, and hardware generations significantly affect results. Extend comparisons to diverse, current GPU/CPU architectures, report tuning details, and include sensitivity analyses (e.g., batch sizes, tiling, precision) for fairness and reproducibility.
- Generalizability to broader workloads: PrIM contains 16 workloads, but important classes (e.g., modern DNN training/inference pipelines with mixed precision, key-value stores, streaming analytics, graph algorithms beyond BFS) are underexplored. Add representative workloads, especially those with irregular access and dynamic communication, and study end-to-end pipelines including host-side orchestration.
- Multi-node scale-out and memory-controller contention: The interaction between many UPMEM DIMMs and host memory controllers/channels, especially in multi-socket systems, is not deeply analyzed. Characterize controller-level contention, scheduling policies, and cross-socket traffic; explore topology-aware placement and NUMA-aware runtime policies.
- Tasklet scheduling, load balancing, and heterogeneity: At least 11 tasklets are needed to fill the DPU pipeline, but the impact of running more than 11 tasklets on contention and diminishing returns is only partially shown. Develop runtime mechanisms for dynamic load balancing, work stealing across DPUs, and heterogeneous tasklet configurations tuned to kernel characteristics.
- Impact of MRAM capacity limits on algorithm design: Each DPU has 64 MB MRAM, potentially forcing partitioning/tiling for large datasets. Quantify partitioning overheads, inter-DPU merging costs, and algorithmic transformations required; provide reusable patterns for scalable tiling with minimal host interaction.
- Tooling, profiling, and observability: There is limited visibility into DPU microarchitectural events (stalls, pipeline occupancy, DMA overlap). Develop and validate profiling tools (performance counters, trace collection, ISA-level instrumentation) and integrate them with standard toolchains for actionable feedback to developers.
- Co-design of CPU–PIM execution: The paper does not explore optimal division of labor between host and DPUs (e.g., control-heavy phases on CPU, data-parallel phases on DPUs) under asynchronous execution. Create scheduling frameworks that co-optimize partitioning, transfer overlap, and resource usage to minimize end-to-end time and energy.
- Thermal and power-density constraints: Large UPMEM deployments may face thermal limits; the paper does not provide thermal/power-density characterization. Measure per-DIMM/DPU thermal behavior under sustained workloads, assess cooling requirements, and study DVFS or power capping policies for safe, efficient operation.
- Cross-architecture PIM benchmarking: Results focus on UPMEM; there is no comparative study against other PIM approaches (3D-stacked PNM, analog PUM). Build portable benchmarks and evaluation methodology to compare performance/energy, programmability, and cost across PIM architectures.
- Memory consistency and semantics for shared MRAM: Since host/DPU cannot simultaneously access a bank and coherence is absent, formalize memory consistency models and design minimal hardware/software coherence or explicit ownership-transfer semantics; evaluate their programmability and performance impacts.
- API/runtime evolution for irregular, variable-sized data: Current SDK favors regular, equal-sized parallel transfers and static SPMD tasking. Propose runtime features for irregular data (scatter/gather, variable-sized partitions), asynchronous callbacks, and collective operations (reduce, all-to-all) optimized for PIM constraints.
- Quantification of synchronization-induced bottlenecks in specific apps: While BFS and NW are called out as problematic, the exact breakdown (communication vs compute vs synchronization) is not provided. Perform fine-grained tracing and quantify bottlenecks to guide algorithmic redesign (e.g., frontier compression, hierarchical reductions).
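To make the roofline-style modeling called for above concrete, the following minimal C sketch computes attainable per-DPU throughput as the minimum of a compute ceiling and a bandwidth-limited bound. The `peak_gops` and `mram_bw_gbps` values are placeholder assumptions for illustration, not measured UPMEM figures.

```c
/* Roofline sketch: attainable throughput is capped either by the DPU's
 * compute peak or by MRAM->WRAM bandwidth times arithmetic intensity
 * (operations performed per byte moved). Parameter values are assumed
 * for illustration, not measured on real hardware. */
static const double peak_gops = 0.9;    /* per-DPU compute ceiling, GOPS (assumed) */
static const double mram_bw_gbps = 0.7; /* sustained MRAM->WRAM bandwidth, GB/s (assumed) */

double attainable_gops(double ops_per_byte) {
    double bandwidth_bound = mram_bw_gbps * ops_per_byte;
    return bandwidth_bound < peak_gops ? bandwidth_bound : peak_gops;
}
```

Under these assumed parameters, a kernel with arithmetic intensity below `peak_gops / mram_bw_gbps` (about 1.3 ops/byte) is memory-bound on the DPU; above it, compute-bound. A full PIM roofline would add further ceilings for host-mediated inter-DPU communication and CPU↔MRAM transfer bandwidth.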
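The software-managed locality pattern discussed above (explicit MRAM↔WRAM staging in the absence of caches) can be sketched in plain C. Here `memcpy` stands in for the SDK's `mram_read`/`mram_write` DMA calls, and the tile size is a placeholder (real WRAM is 64 KB per DPU, shared by all tasklets).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_ELEMS 256  /* working-buffer size in elements; placeholder value */

/* Streaming tile pattern a cacheless DPU kernel must implement by hand:
 * stage a tile into the small working memory, compute on it, write it
 * back. memcpy models the mram_read/mram_write DMA transfers. */
void scale_in_tiles(int32_t *big_array, size_t n, int32_t factor) {
    int32_t wram[TILE_ELEMS];  /* stand-in for a WRAM buffer */
    for (size_t off = 0; off < n; off += TILE_ELEMS) {
        size_t len = n - off < TILE_ELEMS ? n - off : TILE_ELEMS;
        memcpy(wram, big_array + off, len * sizeof(int32_t));  /* mram_read  */
        for (size_t i = 0; i < len; i++)
            wram[i] *= factor;                                 /* compute    */
        memcpy(big_array + off, wram, len * sizeof(int32_t));  /* mram_write */
    }
}
```

Double buffering would extend this by staging tile k+1 while computing on tile k; on UPMEM, such overlap typically comes from running multiple tasklets per DPU, since each tasklet's DMA transfer blocks.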
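As a small aid to reasoning about the 64 MB per-DPU MRAM capacity limit noted above, the helper below computes how many DPUs a dataset must be partitioned across once output/scratch space is reserved in each bank. This is an illustrative sizing sketch, not part of the UPMEM SDK.

```c
#include <stdint.h>

#define MRAM_BYTES (64ull * 1024 * 1024)  /* per-DPU MRAM capacity (64 MB) */

/* Minimum number of DPUs needed to hold `total_bytes` of input when each
 * DPU also reserves `reserved_bytes` of its MRAM bank for outputs and
 * scratch buffers. Illustrative helper; assumes reserved_bytes < MRAM_BYTES. */
uint64_t dpus_needed(uint64_t total_bytes, uint64_t reserved_bytes) {
    uint64_t usable = MRAM_BYTES - reserved_bytes;
    return (total_bytes + usable - 1) / usable;  /* ceiling division */
}
```

For example, a 1 GB input with half of each bank reserved for outputs already requires 32 DPUs, before accounting for any replication or halo regions that tiled algorithms may need.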