Chiplet-Based Memory Modules

Updated 27 January 2026
  • Chiplet-based memory modules are modular systems assembled from distinct chiplets like DRAM, SRAM, and controllers to optimize performance and yield.
  • They leverage 2.5D and 3D packaging with high-speed interconnects to overcome limitations of monolithic dies and tailor bandwidth, latency, and capacity.
  • These architectures enable HPC and ML accelerators to achieve energy efficiency, fine-grained resource scaling, and flexible design trade-offs.

Chiplet-based memory modules are modular, physically disaggregated memory sub-systems assembled from independently fabricated chiplets—specialized dies such as DRAM, SRAM, non-volatile memory, or memory controllers—interconnected within a single package or on a shared interposer. By leveraging advanced packaging (2.5D or 3D integration), chiplet-based memory modules address the historic bottlenecks of monolithic dies: limited reticle size, technology lock-in, diminishing yield with die area scaling, and sub-optimal trade-offs between bandwidth, latency, and memory capacity. Modern accelerators for high-performance computing (HPC) and ML exploit these architectures to achieve algorithm-tailored bandwidth, maximize compute-to-memory proximity, and enable fine-grained resource scaling for diverse workloads (Scheffler et al., 13 Jan 2025, Wang et al., 19 Nov 2025, Sharma et al., 7 Oct 2025, Sharma et al., 2023, Kiyawat et al., 15 Nov 2025, Orenes-Vera et al., 2023, Krishnan et al., 2021, Peng et al., 2023, Paulin et al., 2024).

1. Physical Organization and Packaging Technologies

Chiplet-based memory modules use a diverse set of integration strategies, typically falling into 2.5D (multiple dies on a passive silicon or organic interposer) or 3D (vertical stacking with through-silicon vias, TSVs) approaches. An illustrative example is Occamy, where two 73 mm² compute chiplets (fabricated in GlobalFoundries 12 nm FinFET) and two HBM2E memory stacks are mounted face-down on a 65 nm passive silicon interposer (“Hedwig”). This interposer provides up to sixteen power/ground domains, routes high-speed HBM2E data channels (eight per stack, <4.9 mm length, 2.5 µm trace/4.1 µm pitch), and supports point-to-point die-to-die (D2D) links with energy-efficient signaling (~1.6 pJ/bit) (Scheffler et al., 13 Jan 2025, Paulin et al., 2024).
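
For a rough sense of the energy scale involved, the sketch below converts the ~1.6 pJ/bit D2D signaling figure into transfer energy and time. The payload size, link width, and per-wire signaling rate are illustrative assumptions, not Occamy specifications.

```python
# Back-of-the-envelope energy and time estimate for an interposer D2D transfer,
# using the ~1.6 pJ/bit signaling figure quoted above. Payload size, link width,
# and signaling rate are illustrative assumptions, not Occamy specifications.

D2D_ENERGY_PJ_PER_BIT = 1.6   # from the interposer description above
PAYLOAD_BYTES = 1 << 30       # assume a 1 GiB inter-chiplet transfer
LINK_WIDTH_BITS = 256         # assumed parallel link width
LINK_RATE_GHZ = 1.0           # assumed per-wire signaling rate

bits = PAYLOAD_BYTES * 8
energy_mj = bits * D2D_ENERGY_PJ_PER_BIT * 1e-9          # pJ -> mJ
link_bw_gbs = LINK_WIDTH_BITS * LINK_RATE_GHZ / 8        # GB/s
transfer_ms = PAYLOAD_BYTES / (link_bw_gbs * 1e9) * 1e3

print(f"Transfer energy: {energy_mj:.1f} mJ for {PAYLOAD_BYTES >> 30} GiB")
print(f"Link bandwidth: {link_bw_gbs:.1f} GB/s, transfer time: {transfer_ms:.1f} ms")
```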

Other approaches, such as UCIe-based memory modules, enable direct logic-to-memory (SoC-to-DRAM or SoC-to-LPDDR6) connectivity using advanced PHYs at fine bump pitches (down to 25 µm), significantly improving areal bandwidth density relative to conventional wide, parallel DRAM buses (Sharma et al., 7 Oct 2025). Heterogeneous integration (HI), where DRAM, SRAM, non-volatile ReRAM, and controller logic are fabricated in independent process nodes, further optimizes per-component performance and yield (Sharma et al., 2023, Kiyawat et al., 15 Nov 2025, Wang et al., 19 Nov 2025).
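
To make the notion of areal bandwidth density concrete, the following sketch estimates bandwidth per mm² of bump field from pitch alone. Only the 25 µm bump pitch comes from the text; the per-bump data rate and the fraction of bumps carrying data (versus power/ground/clock) are assumptions.

```python
# Illustrative areal bandwidth density for a fine-pitch die-to-die PHY.
# Only the 25 um bump pitch comes from the text; the per-bump data rate and the
# fraction of bumps carrying data (vs. power/ground/clock) are assumptions.

BUMP_PITCH_UM = 25.0        # fine bump pitch quoted above
DATA_RATE_GBPS = 16.0       # assumed per-bump signaling rate, Gb/s
DATA_BUMP_FRACTION = 0.5    # assume half the bumps carry data signals

bumps_per_mm2 = (1000.0 / BUMP_PITCH_UM) ** 2            # square-grid assumption
data_bumps_per_mm2 = bumps_per_mm2 * DATA_BUMP_FRACTION
bw_density_gbs_per_mm2 = data_bumps_per_mm2 * DATA_RATE_GBPS / 8   # GB/s per mm^2

print(f"{bumps_per_mm2:.0f} bumps/mm^2 -> "
      f"~{bw_density_gbs_per_mm2:.0f} GB/s per mm^2 of bump field")
```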

2. Memory Architecture, Interconnect Topology, and Dataflow

Chiplet-based memory architectures are characterized by hierarchical, multi-level interconnects and explicitly managed data movement. In Occamy, for example, each compute chiplet attaches to an HBM2E stack through eight wide PHY+controller macros, interfacing to the chiplet core clusters via a crossbar hierarchy: cluster-level (nine cores sharing 128 KiB scratchpad RAM), group-level (four clusters per group with independent 512-bit crossbars, 64 GB/s), and chiplet-level (aggregating to 381.5 GB/s raw DRAM bandwidth per chiplet; two chiplets → 763 GB/s aggregate) (Scheffler et al., 13 Jan 2025). The D2D links allow for distributed execution, with measured inter-chiplet access latencies in the 27–61 cycle range.
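
A simple way to reason about the cost of distributed execution is a NUMA-style expected-latency model, sketched below. The remote latency is taken as the midpoint of the 27–61 cycle range above; the local access latency and the remote-access fractions are illustrative assumptions.

```python
# Simple NUMA-style expected-latency model for a two-chiplet module. The 27-61
# cycle inter-chiplet range comes from the text; the local access latency and
# the remote-access fractions are illustrative assumptions.

LOCAL_LATENCY_CYC = 20.0              # assumed chiplet-local access latency
REMOTE_LATENCY_CYC = (27 + 61) / 2    # midpoint of the measured D2D range

def avg_latency(remote_fraction: float) -> float:
    """Expected access latency given the fraction of cross-chiplet accesses."""
    return (1 - remote_fraction) * LOCAL_LATENCY_CYC + remote_fraction * REMOTE_LATENCY_CYC

for frac in (0.0, 0.1, 0.5):
    print(f"{frac:4.0%} remote accesses -> {avg_latency(frac):.1f} cycles on average")
```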

A different approach, employed in Hemlet, uses a heterogeneous collection of analog-CIM (RRAM), digital-CIM (SRAM), and intermediate-data-processing (IDP) chiplets. These are interconnected via a mesh network-on-package (NoP) with up to eight unidirectional links per chiplet (32 GB/s per link), supporting a total NoP bandwidth of up to 256 GB/s (Wang et al., 19 Nov 2025). In highly parallel systems such as SIAM or DCRA, the interconnect takes the form of a folded torus or 2D mesh at both the intra-die (NoC) and inter-die (NoP) levels, facilitating reconfigurability and robust scaling for irregular or sparse workloads (Krishnan et al., 2021, Orenes-Vera et al., 2023).

The dataflow in these systems is typically managed via scratchpad- or DMA-driven explicit data movement, eschewing general-purpose caching in favor of deterministic latency, bandwidth partitioning, and contention management. The absence of hardware cache coherence and the use of software-managed memory hierarchies are prevalent in high-throughput, deterministic accelerator designs (Scheffler et al., 13 Jan 2025, Sharma et al., 2023, Paulin et al., 2024).
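
The explicit data-movement style can be illustrated with a double-buffering sketch. The dma_start/dma_wait interface, buffer names, and tile size below are hypothetical placeholders, not any specific vendor API.

```python
# Minimal sketch of double-buffered, DMA-driven data movement between DRAM and
# a cluster scratchpad, in the explicit (non-cached) style described above.
# dma_start/dma_wait and the tile size are hypothetical placeholders, not a real API.

TILE_ELEMS = 4096  # assumed scratchpad tile size (elements)

def dma_start(dst_buf, src_tile):
    """Hypothetical async DRAM -> scratchpad copy; modeled here as an immediate copy."""
    dst_buf[:len(src_tile)] = src_tile
    return object()            # opaque handle standing in for a transfer ID

def dma_wait(handle):
    """Hypothetical: block until the given transfer completes."""
    pass

def process_stream(dram_data, scratchpad_a, scratchpad_b, compute):
    """Overlap the DMA-in of tile i+1 with compute on tile i (double buffering)."""
    tiles = [dram_data[i:i + TILE_ELEMS] for i in range(0, len(dram_data), TILE_ELEMS)]
    buffers = [scratchpad_a, scratchpad_b]
    pending = dma_start(buffers[0], tiles[0])
    for i, tile in enumerate(tiles):
        dma_wait(pending)                          # tile i now resides in local scratchpad
        if i + 1 < len(tiles):                     # prefetch tile i+1 into the other buffer
            pending = dma_start(buffers[(i + 1) % 2], tiles[i + 1])
        compute(buffers[i % 2][:len(tile)])        # deterministic-latency compute on local data

# Toy usage: stream 10,000 elements through two scratchpad buffers.
process_stream(list(range(10000)), [0] * TILE_ELEMS, [0] * TILE_ELEMS, compute=sum)
```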

3. Memory Technologies and Performance Metrics

DRAM-based chiplet modules usually employ wide-I/O standards (HBM2E, HBM4, LPDDR6) in stacked configurations, with each channel delivering on the order of ~50 GB/s and eight or more channels per stack. For instance, each HBM2E stack in Occamy delivers 8 × 47.68 GB/s ≈ 381.5 GB/s per chiplet (Scheffler et al., 13 Jan 2025). Performance is characterized by sustained bandwidth, latency (10–150 ns depending on memory level and data-movement software), utilization (e.g., 83% FPU utilization for stencil codes in Occamy), and bytes-per-FLOP (B/F) ratios.
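
The arithmetic behind these figures, and the resulting bytes-per-FLOP ratio, is sketched below. The channel count and per-channel rate are taken from the text; the peak compute throughput is an assumed value used only for illustration.

```python
# Peak-bandwidth and bytes-per-FLOP (B/F) arithmetic using the HBM2E figures
# quoted above. The peak compute throughput is an assumed, illustrative value.

CHANNELS_PER_STACK = 8
GBPS_PER_CHANNEL = 47.68                 # GB/s, from the text
PEAK_COMPUTE_GFLOPS = 768.0              # assumed per-chiplet peak throughput (illustrative)

stack_bw = CHANNELS_PER_STACK * GBPS_PER_CHANNEL   # per-chiplet DRAM bandwidth
module_bw = 2 * stack_bw                           # two chiplets per module
bytes_per_flop = stack_bw / PEAK_COMPUTE_GFLOPS

print(f"Per-chiplet bandwidth: {stack_bw:.1f} GB/s, module aggregate: {module_bw:.1f} GB/s")
print(f"B/F ratio: {bytes_per_flop:.2f} bytes per FLOP at the assumed compute peak")
```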

Non-volatile memory technologies such as RRAM are used in in-memory-compute (IMC) modules; they support multi-bit cell encoding (e.g., two bits per RRAM cell) and analog or digital vector-matrix multiplication, and are often organized as crossbar arrays partitioned into tiles, groups, and processing engines (Wang et al., 19 Nov 2025, Krishnan et al., 2021).
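
The following idealized sketch shows how such a vector-matrix multiply can be bit-sliced across 2-bit cells and partitioned into crossbar tiles. The 128 × 128 tile size is an assumption, and ADC quantization and analog non-idealities are ignored.

```python
import numpy as np

# Idealized sketch of mapping a vector-matrix multiply onto 2-bit RRAM crossbar
# tiles, as in the IMC modules described above. The tile size is an assumption;
# ADC quantization and analog non-idealities are ignored.

TILE_ROWS, TILE_COLS = 128, 128   # assumed crossbar dimensions
BITS_PER_CELL = 2                 # two bits per RRAM cell, per the text

def crossbar_vmm(x, W, weight_bits=8):
    """Compute x @ W by slicing W into 2-bit planes and 128x128 tiles."""
    n_slices = weight_bits // BITS_PER_CELL
    Wq = W.astype(np.int64)
    acc = np.zeros(W.shape[1], dtype=np.int64)
    for s in range(n_slices):                          # bit-slice the weights
        plane = (Wq >> (s * BITS_PER_CELL)) & (2**BITS_PER_CELL - 1)
        partial = np.zeros(W.shape[1], dtype=np.int64)
        for r in range(0, W.shape[0], TILE_ROWS):      # one analog MAC per tile
            for c in range(0, W.shape[1], TILE_COLS):
                tile = plane[r:r + TILE_ROWS, c:c + TILE_COLS]
                partial[c:c + TILE_COLS] += x[r:r + TILE_ROWS] @ tile
        acc += partial << (s * BITS_PER_CELL)          # shift-and-add across bit slices
    return acc

x = np.random.randint(0, 16, size=256)
W = np.random.randint(0, 256, size=(256, 256))
assert np.array_equal(crossbar_vmm(x, W), x @ W)       # matches the digital reference
```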

Energy and area efficiency are central: SIAM, a chiplet-based IMC system, achieves up to 130× higher energy efficiency than a V100 GPU, with interconnect energy of ~0.54 pJ/bit on the NoP and crossbar compute energy modeled per operation (Krishnan et al., 2021). UCIe-based DRAM modules demonstrate up to 10× higher bandwidth density and up to 3× lower energy per bit versus HBM4 or LPDDR6 (Sharma et al., 7 Oct 2025).

4. Design Trade-offs, Scalability, and Technology Flexibility

Chiplet-based memory modules expose multiple axes of design-time and run-time trade-offs. At the silicon level, smaller chiplets improve manufacturing yield, enable per-function process optimization (logic at advanced nodes, DRAM at cost- or density-optimized nodes), and relax reticle-size constraints (Scheffler et al., 13 Jan 2025, Orenes-Vera et al., 2023). However, a higher chiplet count increases package and interposer overhead, with crossbar and interconnect area potentially occupying up to 25% of total die area (Scheffler et al., 13 Jan 2025, Peng et al., 2023).

High-radix crossbars and high-bisection-bandwidth NoC designs improve aggregate bandwidth at the cost of area and power. Fine-grained power domains (core, PHY, DRAM, I/O), multi-level voltage/frequency islands, and hierarchical interconnect segmentation reduce system-level power consumption and thermal hotspots (Scheffler et al., 13 Jan 2025).

Systems such as Sangam demonstrate the separation of memory and compute chiplets—DRAM process for dense memory arrays, logic node for bank-attached PIM engines—enabling full DRAM capacity utilization, higher logic density and performance, and support for CXL-attached pooled memory (Kiyawat et al., 15 Nov 2025). UCIe-based modules offer technology-agnostic memory extension and cost-efficient reuse of commodity DRAM devices, further mitigating technology lock-in (Sharma et al., 7 Oct 2025).

Scalability is maintained by linear (or near-linear) replication: bandwidth, memory capacity, and compute resources scale with chiplet count, modulo global bottlenecks (NoP, DMA engine saturation, interposer bandwidth). Configurability at compile-time (e.g., SRAM cache vs. scratchpad partition, logical grid shape, software task mapping) and package-time (chiplet counts and placement) further enable hardware-software co-optimization across applications (Orenes-Vera et al., 2023, Peng et al., 2023).
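
A toy model captures this behavior: capacity replicates linearly with chiplet count while deliverable bandwidth saturates at a shared package-level ceiling. All constants in the sketch below are illustrative assumptions, not measured values.

```python
# Toy scaling model: per-chiplet resources replicate linearly, but deliverable
# bandwidth saturates at a shared package-level (NoP / interposer) ceiling.
# All constants are illustrative assumptions, not measured values.

PER_CHIPLET_BW_GBS = 380.0      # e.g. one HBM stack per chiplet
NOP_CEILING_GBS = 2000.0        # assumed aggregate NoP / interposer limit
PER_CHIPLET_CAPACITY_GB = 16.0  # assumed memory capacity per chiplet

def module_profile(n_chiplets: int):
    capacity = n_chiplets * PER_CHIPLET_CAPACITY_GB                    # scales linearly
    bandwidth = min(n_chiplets * PER_CHIPLET_BW_GBS, NOP_CEILING_GBS)  # bottlenecked
    return capacity, bandwidth

for n in (1, 2, 4, 8, 16):
    cap, bw = module_profile(n)
    print(f"{n:2d} chiplets: {cap:6.0f} GB capacity, {bw:7.0f} GB/s deliverable bandwidth")
```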

5. System-Level Integration, Application Mapping, and Use Cases

Chiplet-based memory modules are now foundational in heterogeneous accelerators for ML (transformers, LLMs, ViTs), HPC (stencil, SpMM), and data-analytic workloads (graph, sparse linear algebra). Examples include:

  • ML Accelerators: Hemlet achieves 1.44–4.07× system speedup vs. monolithic CIM, supporting on-chip ViT model storage and high-throughput inference via analog/digital-heterogeneous arrays (Wang et al., 19 Nov 2025).
  • HPC: Occamy demonstrates up to 89% dense DGEMM FPU utilization and 42% on sparse-dense LA kernels, with up to 11× improvement in normalized compute density over previous designs (Scheffler et al., 13 Jan 2025, Paulin et al., 2024).
  • LLM Serving: Sangam modules, attached via CXL, act as GPU co-processors or replacements and enable order-of-magnitude throughput and energy gains for memory-bound LLM decoding kernels (Kiyawat et al., 15 Nov 2025). Chiplet Cloud aggregates thousands of chiplets to deliver petabyte-scale SRAM-backed memory for LLMs at up to 97× lower cost per token than GPUs (Peng et al., 2023).

The following table summarizes decisive architectural parameters and performance characteristics:

System        | Memory Chiplets           | Integration            | Peak BW / Module    | Use Cases
--------------|---------------------------|------------------------|---------------------|--------------------------
Occamy        | 2 × HBM2E, 8 ch/stack     | 2.5D, Si interposer    | 763 GB/s            | Dense/sparse ML, HPC
Hemlet        | RRAM/SRAM CIM             | 2.5D, NoP mesh         | 256 GB/s (NoP)      | ViT inference
Sangam        | DRAM + logic (decoupled)  | 2.5D, CXL attach       | Up to 400 GB/s/chip | LLM memory-bound kernels
Chiplet Cloud | SRAM w/ compression       | Intra-chip + 2D torus  | 2.75 TB/s/chip      | Large LLM serving

Memory performance, capacity, and interconnect are configured according to the demands of the algorithmic dataflow: high-parallelism streaming, burst access, and irregular pointer-chasing each have tailored mapping strategies (Sharma et al., 2023, Orenes-Vera et al., 2023, Krishnan et al., 2021).

6. Analytical Models, Lessons Learned, and Design Guidelines

Quantitative performance modeling is integral for architecture tuning:

  • Bandwidth Formula (per chiplet): $BW_{\text{peak}} = N_{c} \cdot b_{\text{per chan}} \cdot f_{\ell}$, where $N_{c}$ is the channel count, $b_{\text{per chan}}$ is the channel width, and $f_{\ell}$ is the channel I/O frequency (Scheffler et al., 13 Jan 2025). A numeric sketch of these models follows the list.
  • Effective Bandwidth per Tile (DCRA): $E_{\text{BW}} = B_{\text{SRAM}} \cdot H + B_{\text{HBM}} \cdot (1-H)$, where $H$ is the cache/scratchpad hit rate (Orenes-Vera et al., 2023).
  • Latency Models: Aggregate of on-chip cycles, crossbar/interposer hops, and round-trip wire/PHY costs (Scheffler et al., 13 Jan 2025, Krishnan et al., 2021).
  • Energy per Bit/Access: Ranges from 0.18 pJ/bit (SRAM read) to 3.7 pJ/bit (HBM2E transfer), with interposer and NoP overhead tuned by signaling regime and distance (Orenes-Vera et al., 2023, Wang et al., 19 Nov 2025, Krishnan et al., 2021).
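
The sketch below evaluates these models numerically. The channel parameters are chosen to reproduce the per-chiplet HBM2E figure quoted earlier; the SRAM/HBM per-tile bandwidths and hit rates are illustrative assumptions.

```python
# Numeric sketch of the analytical models listed above. Channel parameters are
# chosen to reproduce the per-chiplet HBM2E figure quoted earlier; the SRAM/HBM
# bandwidths and hit rates in the DCRA-style model are illustrative assumptions.

# Peak bandwidth per chiplet: BW_peak = N_c * b_per_chan * f_l
N_C = 8                  # channels per stack
B_PER_CHAN_BYTES = 16    # 128-bit channel width
F_L_GHZ = 2.98           # channel I/O frequency (illustrative, ~47.68 GB/s per channel)
bw_peak = N_C * B_PER_CHAN_BYTES * F_L_GHZ
print(f"BW_peak ~= {bw_peak:.1f} GB/s per chiplet")

# Effective bandwidth per tile: E_BW = B_SRAM * H + B_HBM * (1 - H)
B_SRAM, B_HBM = 128.0, 32.0    # GB/s, assumed per-tile bandwidths
for hit_rate in (0.2, 0.5, 0.9):
    e_bw = B_SRAM * hit_rate + B_HBM * (1 - hit_rate)
    print(f"H = {hit_rate:.1f} -> E_BW = {e_bw:.1f} GB/s")

# Energy per bit, from the range quoted above (SRAM read vs. HBM2E transfer)
for label, pj_per_bit in (("SRAM read", 0.18), ("HBM2E transfer", 3.7)):
    print(f"{label}: {pj_per_bit * 8:.1f} pJ per byte")
```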

Among the key principles: economic analysis (yield, manufacturing, and assembly cost) favors smaller, more numerous chiplets for high yield and flexible process targeting; however, this requires careful optimization of crossbar/interposer area, periphery wiring, and interconnect signaling to avoid diminishing returns as chiplet counts scale (Peng et al., 2023, Sharma et al., 7 Oct 2025).
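
The standard negative-binomial die-yield model makes this trade-off concrete, as in the sketch below; the defect density, clustering factor, wafer cost, and per-die assembly cost are all assumed values.

```python
# Sketch of the chiplet-size vs. yield trade-off using the standard
# negative-binomial die-yield model. Defect density, clustering factor,
# wafer cost, and per-die assembly cost are all illustrative assumptions.

import math

WAFER_DIAMETER_MM = 300.0
WAFER_COST = 10000.0        # assumed cost per processed wafer (arbitrary units)
D0 = 0.001                  # assumed defect density, defects per mm^2
ALPHA = 3.0                 # assumed defect-clustering parameter
ASSEMBLY_COST_PER_DIE = 2.0 # assumed extra packaging/assembly cost per chiplet

def yield_rate(area_mm2: float) -> float:
    """Negative-binomial yield: Y = (1 + A*D0/alpha)^(-alpha)."""
    return (1.0 + area_mm2 * D0 / ALPHA) ** (-ALPHA)

def cost_per_good_mm2(area_mm2: float) -> float:
    wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2
    dies_per_wafer = wafer_area / area_mm2          # ignores edge losses
    good_dies = dies_per_wafer * yield_rate(area_mm2)
    total_cost = WAFER_COST + dies_per_wafer * ASSEMBLY_COST_PER_DIE
    return total_cost / (good_dies * area_mm2)

for area in (50, 100, 400, 800):
    print(f"{area:4d} mm^2 die: yield {yield_rate(area):5.1%}, "
          f"cost per good mm^2 {cost_per_good_mm2(area):.3f}")
```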

7. Outlook, Technology Gaps, and Research Directions

Chiplet-based memory modules represent an inflection point in memory and accelerator architecture, with many research questions still open. They will remain central to memory- and bandwidth-bound application domains as advanced packaging, process decoupling, and architectural co-optimization continue to redefine the limits of system performance, energy efficiency, and cost scaling.
