WWW: What, When, Where to Compute-in-Memory
Abstract: Matrix multiplication is the dominant computation during Machine Learning (ML) inference. To perform such multiplications efficiently, compute-in-memory (CiM) paradigms have emerged as a highly energy-efficient solution. However, integrating compute into memory raises key questions: 1) What type of CiM to use: given the multitude of CiM design characteristics, their suitability must be assessed from an architecture perspective. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial than standard processing cores. 3) Where to integrate CiM: each memory level has a different bandwidth and capacity, creating different data-reuse opportunities for CiM integration. To answer these questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture-evaluation methodology with a tailored mapping algorithm. The mapping algorithm aims to achieve the highest weight reuse and reduced data movement for a given CiM prototype and workload. Our analysis considers the integration of CiM prototypes into the cache levels of a tensor-core-like architecture, and shows that CiM-integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to an established baseline at INT-8 precision. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.
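To make the mapping objective concrete, below is a minimal back-of-the-envelope sketch in Python of how a weight-stationary mapping for a GEMM could be scored by the operand traffic it generates at one memory level. The function name, tile parameters, and cost model here are illustrative assumptions for exposition, not the paper's actual mapping algorithm.

```python
import math

def weight_stationary_traffic(M, K, N, tile_k, tile_n):
    # Hypothetical cost model (not the paper's algorithm): estimate the
    # number of operand elements moved into a CiM-enabled memory level
    # for C[M,N] = A[M,K] @ B[K,N] when the weight matrix B is held
    # stationary, so each weight tile is fetched exactly once.
    tiles_k = math.ceil(K / tile_k)  # reduction steps over the K dimension
    tiles_n = math.ceil(N / tile_n)  # column tiles of the weight matrix

    weights = K * N              # full weight reuse: each element of B loaded once
    inputs = M * K * tiles_n     # A is re-fetched for every column tile of B
    partials = M * N * tiles_k   # partial sums of C spilled once per K-tile
    return weights + inputs + partials

# A mapper in this spirit would sweep tilings and keep the cheapest one.
for tk, tn in [(128, 128), (256, 512)]:
    t = weight_stationary_traffic(4096, 1024, 1024, tk, tn)
    print(f"tile_k={tk}, tile_n={tn}: {t:,} elements moved")
```

Under this toy model, larger weight tiles reduce the input and partial-sum traffic, mirroring the intuition that a CiM array holding more stationary weights moves less data, subject to the capacity of the memory level it occupies.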