WWW: What, When, Where to Compute-in-Memory
Abstract: Matrix multiplication is the dominant computation during Machine Learning (ML) inference. To perform such multiplications efficiently, compute-in-memory (CiM) paradigms have emerged as a highly energy-efficient solution. However, integrating compute into memory raises key questions: 1) What type of CiM to use: given the multitude of CiM design characteristics, their suitability must be assessed from an architecture perspective. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial than standard processing cores. 3) Where to integrate CiM: each memory level has a different bandwidth and capacity, creating different data-reuse opportunities for CiM integration. To answer these questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture-evaluation methodology with a tailored mapping algorithm. The mapping algorithm aims to achieve the highest weight reuse and reduced data movement for a given CiM prototype and workload. Our analysis considers the integration of CiM prototypes into the cache levels of a tensor-core-like architecture, and shows that CiM-integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to an established baseline at INT-8 precision. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.
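To make the mapping objective concrete, below is a minimal back-of-the-envelope sketch in Python of how a weight-stationary mapping for a GEMM could be scored by the operand traffic it generates at one memory level. The function name, tile parameters, and cost model here are illustrative assumptions for exposition, not the paper's actual mapping algorithm.

```python
import math

def weight_stationary_traffic(M, K, N, tile_k, tile_n):
    # Hypothetical cost model (not the paper's algorithm): estimate the
    # number of operand elements moved into a CiM-enabled memory level
    # for C[M,N] = A[M,K] @ B[K,N] when the weight matrix B is held
    # stationary, so each weight tile is fetched exactly once.
    tiles_k = math.ceil(K / tile_k)  # reduction steps over the K dimension
    tiles_n = math.ceil(N / tile_n)  # column tiles of the weight matrix

    weights = K * N              # full weight reuse: each element of B loaded once
    inputs = M * K * tiles_n     # A is re-fetched for every column tile of B
    partials = M * N * tiles_k   # partial sums of C spilled once per K-tile
    return weights + inputs + partials

# A mapper in this spirit would sweep tilings and keep the cheapest one.
for tk, tn in [(128, 128), (256, 512)]:
    t = weight_stationary_traffic(4096, 1024, 1024, tk, tn)
    print(f"tile_k={tk}, tile_n={tn}: {t:,} elements moved")
```

Under this toy model, larger weight tiles reduce the input and partial-sum traffic, mirroring the intuition that a CiM array holding more stationary weights moves less data, subject to the capacity of the memory level it occupies.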