In-DRAM Primitives: In-Memory Compute
- In-DRAM primitives are low-level computational operations executed directly within DRAM, utilizing charge-sharing and sense amplifiers for in-place data processing.
- They enable efficient bulk data movement and bitwise logic, with primitives like RowClone and triple-row activation offering significant latency and energy improvements.
- By integrating computation with memory, these primitives support a memory-centric computing paradigm that minimizes data transfer overhead and enhances system performance.
In-DRAM primitives are low-level computational operations executed directly within DRAM chips, without transferring data to or from the processor or an external accelerator. These operations leverage the analog and digital properties of DRAM circuit elements (such as charge sharing, sense amplifiers, and wordline/bitline coupling) to implement data movement, logic, arithmetic, and value-generation functions entirely inside memory. By exploiting these in-place transformations, In-DRAM primitives can achieve orders-of-magnitude improvements in latency and energy efficiency for operations on large memory-resident datasets. This capability underpins the broader paradigm of processing using or in memory (PUM/PIM), also called processing using DRAM (PUD), which enables new architectural strategies for scalable, memory-centric computing.
1. DRAM Circuit Fundamentals and Primitives
DRAM chips are architected as arrays of banks, each with multiple subarrays containing thousands of rows and columns. The essential mechanism underpinning most In-DRAM primitives is the ability to rapidly transfer, sense, and manipulate the analog charge state of multiple DRAM cells via their shared bitlines and sense amplifiers.
A canonical primitive is RowClone, which exploits the fact that activating (ACTIVATE) a DRAM row moves an entire row of cell data into the sense amplifiers. If two rows within the same subarray are connected to the same sense latches, copying is reduced to two back-to-back ACTIVATEs and a PRECHARGE. Other primitives employ similar microarchitectural features, orchestrating timing and command sequences to induce desired transformations (e.g., triple-row activation for bitwise logic; programmable delays for value generation).
Key DRAM timing constraints, such as $t_{RCD}$ (ACTIVATE to READ/WRITE), $t_{RAS}$ (row active time), and $t_{RP}$ (precharge time), are often intentionally violated or customized to obtain the behavior these primitives require (Seshadri et al., 2016).
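The RowClone command sequence described above can be illustrated with a purely digital toy model. This is a minimal sketch, not a circuit-accurate simulation: the `ToySubarray` class and its methods are hypothetical names introduced here, and the only behavior it assumes is that an ACTIVATE issued while the sense amplifiers are still latched overwrites the newly opened row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToySubarray:
    """Digital toy model of one DRAM subarray (hypothetical, not a device API)."""
    rows: List[List[int]]
    # Values latched in the sense amplifiers; None means bitlines are precharged.
    sense_amps: Optional[List[int]] = None

    def activate(self, row: int) -> None:
        if self.sense_amps is None:
            # Normal ACTIVATE: the cells drive the precharged bitlines and
            # the sense amplifiers latch the row's contents.
            self.sense_amps = list(self.rows[row])
        else:
            # ACTIVATE without an intervening PRECHARGE: the already-latched
            # sense amplifiers overwrite the newly connected cells.
            self.rows[row] = list(self.sense_amps)

    def precharge(self) -> None:
        # Close the open row and return the bitlines to the idle state.
        self.sense_amps = None


def rowclone_fpm(sub: ToySubarray, src: int, dst: int) -> None:
    """Copy row src to row dst: two back-to-back ACTIVATEs, then a PRECHARGE."""
    sub.activate(src)
    sub.activate(dst)
    sub.precharge()


sub = ToySubarray(rows=[[1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 0, 0]])
rowclone_fpm(sub, src=0, dst=1)
print(sub.rows[1])  # [1, 0, 1, 1]
```

The model ignores timing, charge levels, and refresh; it only captures why the copy needs no data to leave the subarray.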
2. Categories and Mechanisms of In-DRAM Primitives
The main classes of In-DRAM primitives, as established in research, are summarized below.
| Primitive | Mechanism Summary | Typical Operations |
|---|---|---|
| RowClone | Back-to-back ACTIVATEs to rows in same subarray; sense amps | Bulk copy, zero/init |
| Triple-Row Activation (IDAO) | Simultaneous activation of three rows; analog charge-sharing resolves to majority | Bitwise AND, OR |
| Programmable Timing (e.g., CODIC) | Fine-grained control of wordline, sense amp, and precharge delays | PUF, RNG, secure erase |
| k-Row Activation | Simultaneous k-row ACTs with reference rows for analog thresholding | k-input AND/OR/NAND/NOR |
| Value Generation (Dataplant) | Margining SA/cell behavior via timing to expose process variation | PUF, cold-boot erase |
| StoB Conversion (AGNI) | Peripheral analog units convert stochastic bitlines to binary | In-DRAM popcount/accum |
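The triple-row activation entry in the table can be made concrete with an idealized model in which each bitline settles to the majority of its three cells. The sketch below assumes perfect charge sharing and uses hypothetical helper names; fixing the third row to all 0s or all 1s turns the majority into a bulk AND or OR.

```python
from typing import List


def triple_row_activate(a: List[int], b: List[int], c: List[int]) -> List[int]:
    """Idealized triple-row activation: per-bitline majority of three cells."""
    return [1 if (x + y + z) >= 2 else 0 for x, y, z in zip(a, b, c)]


def bulk_and(a: List[int], b: List[int]) -> List[int]:
    # A control row of all 0s makes MAJ(a, b, 0) equal to a AND b.
    return triple_row_activate(a, b, [0] * len(a))


def bulk_or(a: List[int], b: List[int]) -> List[int]:
    # A control row of all 1s makes MAJ(a, b, 1) equal to a OR b.
    return triple_row_activate(a, b, [1] * len(a))


a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(bulk_and(a, b))  # [0, 0, 0, 1]
print(bulk_or(a, b))   # [0, 1, 1, 1]
```

Unlike this model, which leaves its inputs untouched, real hardware would typically operate on copies held in reserved rows, which is one reason such rows are set aside per subarray.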
- Bulk Copy/Initialization (RowClone):
  - Intra-subarray Fast-Parallel Mode (FPM): Two back-to-back ACTIVATEs overwrite the destination row's cells with the source row's data, using the sense amplifiers as the transfer medium. For a 4 KB row, FPM yields 85 ns latency, lower energy than a conventional CPU copy, and up to a 12× speedup (Seshadri et al., 2016).
  - Inter-bank or inter-subarray Pipelined-Serial Mode (PSM): Employs a new TRANSFER command to move cache lines between banks, still fully on-chip.
- Bitwise Logic (IDAO, k-row):
  - Triple-row activation lets each sense amplifier resolve the majority of the three cells connected to its bitline, which yields AND and OR directly (and, via composition, NAND/NOR/XOR). For 4 KB operands, AND via triple-row activation completes in 200–340 ns, with lower energy than the baseline (Seshadri et al., 2016).
  - k-input AND/OR/NAND/NOR can be realized by carefully controlling reference rows and analog thresholds during activation, enabling scalable, functionally complete Boolean logic (Yuksel et al., 2024).
- In-DRAM Arithmetic and Bit-Serial SIMD:
  - Frameworks such as SIMDRAM generalize the use of majority (MAJ) and NOT gates implemented via triple-row and double-row DRAM activations. Bit-serial arithmetic primitives (add, multiply, eq, gt) are constructed from these gates and mapped to DRAM rows and columns such that each column acts as an independent SIMD lane (Hajinazar et al., 2020).
  - Proteus and similar substrates extend the basic model with dynamic bit-precision, parallel subarray mapping (OBPS), and adaptive micro-architectural selection to optimize latency and energy for variable-width operations (Oliveira et al., 29 Jan 2025).
- Value Generation Primitives (PUF, RNG, Secure Erase):
  - Dataplant: Fine-grained timing manipulation of sense amplifier and PRECHARGE events exploits process variation to produce robust, fast, and reproducible physical unclonable function (PUF) responses and to mitigate cold-boot attacks. It achieves a <0.5% intra-chip bit-flip rate across the evaluated temperature range and a 19.5× speedup versus software memory scrubbing (Orosa et al., 2019).
  - CODIC: Introduces programmable internal delay elements for wordline, EQ, and sense amplifier edges, enabling high-throughput, robust PUF signature extraction and sub-second self-destruction of DRAM contents at power-on (Orosa et al., 2021).
  - D-RaNGe: Intentionally violates the activation-to-read delay ($t_{RCD}$) to induce random activation failures, which serve as a high-entropy true random number generator (TRNG) with up to 8.3 Mb/s throughput (Olgun et al., 2022).
- Specialized Analog/Hybrid Arithmetic Primitives:
  - AGNI: Minimal modifications to DRAM peripherals enable in-DRAM stochastic-to-binary (StoB) conversion for deep learning accelerators, with 55 ns iso-latency operation for 16–256 bits, yielding 28–350× lower energy-delay product (EDP) than prior designs (Shivanandamurthy et al., 2023).
  - PIM-DRAM and MVDRAM: Leverage charge sharing and multi-row activation for multiply-accumulate (MAC) primitives, including analog inner products for matrix–vector multiplication without core modification (Roy et al., 2021, Kubo et al., 31 Mar 2025).
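The bit-serial SIMD idea in the list above can be sketched in a few lines. The example assumes a vertical data layout (row i holds bit i of every element, so each column is one lane) and builds a full adder from MAJ and NOT only. The helper names are hypothetical and the sketch does not reproduce any framework's actual command sequence.

```python
from typing import List

Row = List[int]  # one DRAM row: one bit per column (SIMD lane)


def maj(a: Row, b: Row, c: Row) -> Row:
    """Per-lane majority, standing in for a triple-row activation."""
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(a, b, c)]


def inv(a: Row) -> Row:
    """Per-lane NOT."""
    return [1 - x for x in a]


def to_vertical(values: List[int], bits: int) -> List[Row]:
    """Row i holds bit i (least significant first) of every element."""
    return [[(v >> i) & 1 for v in values] for i in range(bits)]


def from_vertical(rows: List[Row]) -> List[int]:
    lanes = len(rows[0])
    return [sum(rows[i][lane] << i for i in range(len(rows)))
            for lane in range(lanes)]


def bit_serial_add(a_rows: List[Row], b_rows: List[Row]) -> List[Row]:
    """Add all lanes in parallel, one bit position per step."""
    carry = [0] * len(a_rows[0])
    out = []
    for a, b in zip(a_rows, b_rows):
        new_carry = maj(a, b, carry)
        # sum = MAJ(NOT carry_out, carry_in, MAJ(a, b, NOT carry_in))
        out.append(maj(inv(new_carry), carry, maj(a, b, inv(carry))))
        carry = new_carry
    return out  # final carry is dropped, so results are modulo 2**bits


a = to_vertical([3, 5, 7, 15], bits=4)
b = to_vertical([1, 6, 9, 1], bits=4)
print(from_vertical(bit_serial_add(a, b)))  # [4, 11, 0, 0]
```

The number of steps grows with the bit width rather than with the number of elements, which is why wide rows favor this style of computation.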
3. Microarchitectural Modifications and Overheads
Implementation of In-DRAM primitives generally requires only modest changes to the DRAM chip, controller, and host software:
- Command scheduling for wordlines and sense amplifiers is extended to allow non-standard, back-to-back ACTIVATEs (e.g., RowClone FPM); a few rows per subarray are reserved for scratch data and constant logic values; and, where applicable, new commands are supported (e.g., TRANSFER, AAP for triple-row activation) (Seshadri et al., 2016).
- Peripheral additions (e.g., AGNI’s analog lane and comparators, Dataplant/CODIC's programmable delay gates) incur area overheads typically below 1% per chip (Shivanandamurthy et al., 2023, Orosa et al., 2019, Orosa et al., 2021).
- Memory controller logic partitions memory requests into primitive-aligned chunks, manages OS-exposed register state (e.g., minimum granularity, subarray mapping), and ensures coherence via cache flush/evict where needed (Seshadri et al., 2016).
- Operating systems require new allocation routines to guarantee subarray-locality for FPM and may expose custom system calls for primitive invocation (Olgun et al., 2021).
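The controller-side partitioning described in this list might look roughly like the following sketch. Everything here is an assumption made for illustration: the row and cache-line sizes, the linear address-to-subarray mapping, and the two mode labels, where `FALLBACK` stands for PSM or a conventional processor copy. Cache flushing before the in-DRAM copy is noted in a comment but not modeled.

```python
from typing import List, NamedTuple

ROW_BYTES = 4096          # assumed row size (the 4 KB row used in the text)
LINE_BYTES = 64           # assumed cache-line size
ROWS_PER_SUBARRAY = 512   # assumed; real devices vary


class Chunk(NamedTuple):
    mode: str   # "FPM" or "FALLBACK" (PSM or a processor copy)
    src: int
    dst: int
    nbytes: int


def subarray_of(addr: int) -> int:
    """Hypothetical linear mapping: consecutive rows fill one subarray."""
    return (addr // ROW_BYTES) // ROWS_PER_SUBARRAY


def plan_copy(src: int, dst: int, nbytes: int) -> List[Chunk]:
    """Split a copy into whole-row FPM chunks plus fallback remainders."""
    if src % LINE_BYTES or dst % LINE_BYTES or nbytes % LINE_BYTES:
        raise ValueError("this sketch only handles cache-line-aligned requests")
    chunks = []
    done = 0
    while done < nbytes:
        s, d, left = src + done, dst + done, nbytes - done
        whole_row = (s % ROW_BYTES == 0 and d % ROW_BYTES == 0
                     and left >= ROW_BYTES)
        if whole_row and subarray_of(s) == subarray_of(d):
            # Dirty cached copies of the source row would need flushing
            # before this chunk is issued (not modeled here).
            mode, n = "FPM", ROW_BYTES
        else:
            # Stop at the next row boundary of either operand.
            mode = "FALLBACK"
            n = min(left, ROW_BYTES - s % ROW_BYTES, ROW_BYTES - d % ROW_BYTES)
        chunks.append(Chunk(mode, s, d, n))
        done += n
    return chunks


plan = plan_copy(src=0, dst=8 * ROW_BYTES, nbytes=2 * ROW_BYTES + 128)
print([c.mode for c in plan])  # ['FPM', 'FPM', 'FALLBACK']
```

Overlapping source and destination ranges are not handled; a real controller would also need the device's actual address mapping, which is generally not public.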
4. Performance, Energy, and Application Impact
Reported latency, energy, and speedup figures for representative In-DRAM primitives are summarized below:
| Operation | Latency (ns) | Energy (µJ) | Speedup vs. Baseline |
|---|---|---|---|
| RowClone-FPM | 85 | 0.04 | 12× |
| RowClone-PSM | 510 | 1.1 | 2× |
| IDAO (AND/OR) | 200–320 | 0.10–0.16 | 4.8×–7.6× |
| D-RaNGe (TRNG) | 220 | — | — |
| AGNI StoB | 55 | — | 28–350× lower EDP |
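To put the RowClone latencies in the table into perspective, the short calculation below converts them into an effective copy bandwidth. It assumes that each latency corresponds to copying one 4 KB row, as in the FPM description earlier; the table itself does not state the transfer size, so the results are indicative only.

```python
ROW_BYTES = 4096  # assumed: one 4 KB row per operation


def effective_bandwidth_gb_per_s(nbytes: int, latency_ns: float) -> float:
    """Bytes per nanosecond is numerically equal to GB/s (10**9 bytes/s)."""
    return nbytes / latency_ns


latency_ns = {"RowClone-FPM": 85, "RowClone-PSM": 510}
for name, ns in latency_ns.items():
    print(f"{name}: {effective_bandwidth_gb_per_s(ROW_BYTES, ns):.1f} GB/s")
# RowClone-FPM: 48.2 GB/s
# RowClone-PSM: 8.0 GB/s
```

These are per-operation figures within a single bank; operations in different banks could in principle overlap, which this calculation does not account for.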
Application-level effects are also pronounced:
- fork()/copy-on-write: up to 2.2× speedup, 80% DRAM energy reduction (Seshadri et al., 2016)
- Bulk zeroing: 1.7× IPC gain, 41× DRAM energy reduction (Seshadri et al., 2016)
- FastBit bitmap queries: 30% end-to-end query speedup (Seshadri et al., 2016)
- DNN inference (MVDRAM): speedups of $1.3\times$ or more and energy savings of $2.3\times$ or more for LLM matrix-vector multiplication (Kubo et al., 31 Mar 2025).
5. Limitations, Challenges, and Extensions
Granularity and Alignment: Most high-throughput modes (e.g., RowClone FPM) require source and destination to be aligned in the same subarray and operate at whole-row granularity; fallback to PSM or serial modes reduces speedup (Seshadri et al., 2016).
Functional Scope: While basic bitwise and bulk data movement primitives are robust, generalizing to arbitrary arithmetics requires complex multi-stage command sequences and more scratch rows, especially for large bitwidth or composite arithmetic operations (Hajinazar et al., 2020, Oliveira et al., 29 Jan 2025).
Reliability and Variation: Process, temperature, and data-pattern variation affect operation success rates and device-to-device consistency. For example, functionally complete Boolean logic (up to 16-input) achieves a 94–98% average success rate in modern DDR4, with <2% degradation across the tested temperature range (Yuksel et al., 2024).
Coherence and Integration Overhead: The necessity to flush/evict or invalidate in-cache lines prior to primitive execution, as well as memory allocation constraints for intra-subarray primitives, introduces overhead that must be balanced against intrinsic speedup (Olgun et al., 2022).
Extensibility: Multiple works have proposed extensions:
- Arbitrary logic (NAND/NOR/XOR) via programmable triple/k-row activation and control row utilization (Seshadri et al., 2016, Hajinazar et al., 2020, Yuksel et al., 2024).
- Data-aware dynamic bit-precision, subarray-level fine-grained parallelism, and adaptive algorithm selection for higher throughput and energy efficiency (Oliveira et al., 29 Jan 2025).
- Value generation and scaling of PUF/RNG primitives to future DRAM generations and memory technologies (Orosa et al., 2021, Orosa et al., 2019).
6. Cross-Layer System and Programming Support
Robust exploitation of In-DRAM primitives requires integration at all layers:
- FPGA-based frameworks such as PiDRAM provide end-to-end experimental infrastructure for deploying, evaluating, and extending In-DRAM primitives on real commodity DRAM (Olgun et al., 2022, Olgun et al., 2021).
- Programming models and runtimes (e.g., bbop in Proteus, pLUTo/Shared-PIM) provide abstractions for mapping high-level computational kernels to underlying In-DRAM command sequences, exposing interfaces for precision, mapping, and operation type selection (Oliveira et al., 29 Jan 2025, Mamdouh et al., 2024).
- Operating system support for page/subarray-aware memory allocations, synchronization, and security primitives is necessary for correctness and maximal yield.
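Subarray-aware allocation, the last item above, can be sketched as a free list kept per subarray so that rows meant to interact (for example, the source and destination of an FPM copy) are placed together. The class and method names are hypothetical, and a real operating system would work with pages and the device's address mapping rather than raw row indices.

```python
from typing import Dict, List, Tuple


class SubarrayAwareAllocator:
    """Hypothetical row allocator that keeps one free list per subarray."""

    def __init__(self, num_subarrays: int, rows_per_subarray: int) -> None:
        self.free: Dict[int, List[int]] = {
            s: list(range(rows_per_subarray)) for s in range(num_subarrays)
        }

    def alloc_colocated(self, count: int) -> List[Tuple[int, int]]:
        """Return `count` (subarray, row) pairs that share one subarray."""
        for sub, rows in self.free.items():
            if len(rows) >= count:
                taken = rows[:count]
                del rows[:count]
                return [(sub, r) for r in taken]
        raise MemoryError("no single subarray has enough free rows")

    def release(self, sub: int, row: int) -> None:
        """Return a row to its subarray's free list."""
        self.free[sub].append(row)


alloc = SubarrayAwareAllocator(num_subarrays=2, rows_per_subarray=4)
print(alloc.alloc_colocated(3))  # [(0, 0), (0, 1), (0, 2)]
print(alloc.alloc_colocated(2))  # [(1, 0), (1, 1)]
```

The first-fit policy is chosen only for brevity; it fragments subarrays quickly, and a practical allocator would need a policy that balances co-location against utilization.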
7. Future Directions and Prospects
The ongoing trajectory of In-DRAM primitive research includes:
- Broader support for functionally-complete logic (beyond AND/OR/NOT) in COTS DRAM without device-specific tuning (Yuksel et al., 2024).
- Enhanced reliability and programmability (CODIC-style) to enable adaptive operation tuning and debugging (Orosa et al., 2021).
- High-precision, in-place analog operations, higher-order reductions, and reconfigurable adaptation to new applications (e.g., LLM accelerators, security co-processors) (Shivanandamurthy et al., 2023, Kubo et al., 31 Mar 2025).
- Standardization of controller and software stack interfaces, e.g., via new ISA extensions and memory-mapped command interfaces, to facilitate application-level deployment and cross-platform compatibility (Olgun et al., 2021, Oliveira, 27 Aug 2025).
- Exploration of In-DRAM primitives in new memory technologies, such as 3D-stacked DRAM and NVM, leveraging their unique circuit properties for further acceleration and security primitives (Orosa et al., 2019).
In summary, In-DRAM primitives emerge as a foundational mechanism to tightly couple low-overhead, high-throughput compute with memory systems, fundamentally altering the balance of data movement, compute, and storage in next-generation systems (Seshadri et al., 2016).