STAR Memory Architecture is a post-hierarchical memory design that decomposes the address space into five classes (SRAM, StRAM, DRAM, LtRAM, NAND) with distinct retention, latency, and density features.
It leverages explicit OS-level abstractions to dynamically allocate and migrate data based on application-specific access intensity and data lifetime, enhancing overall system efficiency.
Evaluation results demonstrate significant improvements, such as 33% lower read energy in ML inference and a 2.25× speedup in pointer traversal compared to conventional hierarchical memory systems.
The STAR memory architecture is a post-hierarchical memory design that introduces explicit operating system (OS) abstractions for Short-Term RAM (StRAM) and Long-Term RAM (LtRAM), in addition to conventional SRAM, DRAM, and NAND flash. By tailoring memory management policies and device selection to application-specific data lifetime, access intensity, and endurance requirements, STAR targets improved system-level trade-offs in cost, density, energy efficiency, and performance. Underlying this paradigm is a recognition of the stagnation in traditional SRAM and DRAM scaling and the urgent need for scalable, workload-optimized alternatives that transcend rigid vertical hierarchies (Li et al., 5 Aug 2025).
1. Architectural Overview and Memory Class Taxonomy
STAR decomposes the main memory address space into five peer classes: SRAM, StRAM, DRAM, LtRAM, and NAND flash. Only StRAM and LtRAM are newly exposed to OS-level control; the others persist as established memory classes. Each class is characterized by distinct physical retention, endurance, energy, and latency properties, thereby enabling application-driven memory placement—particularly for data with nonuniform temporal and spatial reuse.
The following table summarizes the core distinguishing parameters of each memory class:
| Class | Retention | Latency (R/W) | Endurance | Density | Static Power | Primary Uses |
|-------|-----------|---------------|-----------|---------|--------------|--------------|
| SRAM  | ∞ (powered) | 1 ns / 1 ns | ∞ | 0.02 μm²/bit | high | L1/L2 caches |
| StRAM | 10–100 ms | 1–3 ns / 1–3 ns | ∞ | 0.01 μm²/bit | low | Buffers, queues, ML activations |
| DRAM  | 32 ms | 15 ns / 15 ns | ∞ (refresh) | 0.055 μm²/bit | medium | Main memory |
| LtRAM | ≥60 s | 8–20 ns / 50–200 ns | 10⁶–10⁹ writes/cell | 0.005 μm²/bit | ~0 | Model weights, code, indices |
| NAND  | ∞ (persistent) | 10 μs / 1 ms | 10³–10⁴ erases | 0.0005 μm²/bit | ~0 | Archival storage |
Within this classification, STAR enables the OS and system layers to allocate, migrate, and evict data objects directly into memory classes that are optimized for their individual kinetic and spatial footprints (Li et al., 5 Aug 2025).
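To make the taxonomy concrete, the class parameters from the table can be encoded in a small lookup table and queried by a fit-for-purpose placement routine. The sketch below is illustrative only: the `choose_class` helper and its thresholds are our own assumptions, not STAR's actual allocator.

```python
# Illustrative encoding of the STAR class taxonomy (values from the table above).
# The placement heuristic below is hypothetical, for illustration only.
MEMORY_CLASSES = {
    # name: (retention_s, read_latency_ns, write_latency_ns, density_um2_per_bit)
    "SRAM":  (float("inf"), 1,      1,         0.02),
    "StRAM": (0.1,          3,      3,         0.01),
    "DRAM":  (0.032,        15,     15,        0.055),
    "LtRAM": (60,           20,     200,       0.005),
    "NAND":  (float("inf"), 10_000, 1_000_000, 0.0005),
}

def choose_class(active_time_s: float, write_fraction: float) -> str:
    """Fit-for-purpose placement: long-lived, read-mostly data goes to LtRAM;
    short-lived, write-heavy data to StRAM; everything else defaults to DRAM."""
    if active_time_s > 60 and write_fraction < 0.1:
        return "LtRAM"
    if active_time_s < 0.1 and write_fraction > 0.5:
        return "StRAM"
    return "DRAM"
```

For example, model weights (long-lived, read-dominated) would map to LtRAM, while a transient network queue would map to StRAM.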
2. Device Technologies and Scalability
Device selection for StRAM and LtRAM targets both the required retention window and cost/density scaling:
StRAM: Implemented via gain-cell eDRAM (2T/3T), advanced SRAM variants, or similar short-retention, high-write endurance arrays (retention ~10–100 ms, cell area ≤0.01 μm²). StRAM arrays may replace or supplement L2 caches for ephemeral, high-bandwidth, temporally local data.
LtRAM: Realized via resistive RAM (RRAM)—including 3D vertical (V-RRAM) stacks of 8–64 layers—magnetoresistive RAM (MRAM, especially STT-MRAM), and ferroelectric RAM (FeRAM). Managed-retention DRAM hybrid schemes are also viable. Typical retention exceeds 60 seconds (no refresh), endurance up to 10⁹ writes/cell, and densities as low as 0.005 μm²/bit (planar) or 0.002 μm²/bit (3D V-RRAM at 64 layers). These properties enable large working sets (e.g., model weights in ML inference) to be mapped to non-volatile, high-density, read-optimized media (Li et al., 5 Aug 2025).
The scaling advantage is clear: RRAM technologies achieve 2×–10× the density of HBM-DRAM at equivalent nodes, and projected cost per GB of LtRAM is $0.005, compared to $0.03 for DRAM and $0.20 for HBM. This suggests LtRAM devices are viable for large-scale, cost-sensitive deployments.
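The cited cost and density figures can be checked with back-of-the-envelope arithmetic; the 1 TB capacity below is an arbitrary example, not a figure from the paper.

```python
# Cost of a 1 TB working set at the projected per-GB prices cited above.
COST_PER_GB = {"LtRAM": 0.005, "DRAM": 0.03, "HBM": 0.20}
capacity_gb = 1024  # 1 TB, arbitrary example size

costs = {cls: capacity_gb * c for cls, c in COST_PER_GB.items()}
# LtRAM ~ $5, DRAM ~ $31, HBM ~ $205 for the same capacity.

def mbit_per_mm2(cell_area_um2_per_bit: float) -> float:
    """Bit density implied by a cell area: 1 mm^2 = 1e6 um^2."""
    return 1e6 / cell_area_um2_per_bit / 1e6  # bits/mm^2 -> Mbit/mm^2

# 3D V-RRAM at 0.002 um^2/bit -> 500 Mbit/mm^2, vs ~18 Mbit/mm^2 for DRAM at 0.055 um^2/bit.
```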
3. Quantitative Trade-off Models and Design Optimization
STAR introduces formal models to balance performance, cost, and endurance across heterogeneous memory. The energy of a data object $i$ placed in class $j$ is modeled as:

$$E_\mathrm{total}(i,j) = N_r^i \cdot E_{rj} + N_w^i \cdot E_{wj} + P^{\mathrm{stat}}_j \cdot T_i$$

where $N_r^i$, $N_w^i$, and $T_i$ are the object's read count, write count, and active lifetime, and $E_{rj}$, $E_{wj}$, and $P^{\mathrm{stat}}_j$ are the per-access energies and leakage power of class $j$.

Placement optimization is stated as:

$$\min \sum_i \left[ \sum_j x_{ij} \cdot \left( E_\mathrm{total}(i,j) + C^{pb}_j \cdot S_i \right) \right]$$

subject to bandwidth, endurance, and exclusivity constraints, where $x_{ij}$ is a binary assignment, $C^{pb}_j$ is cost-per-bit, and $S_i$ is object size.

Volatile cells (StRAM, DRAM) require refresh, adding:

$$E_{\mathrm{refresh}}(j) = \lceil T_i / t_r \rceil \cdot E_{rj} \cdot S_n$$

STAR thus enables both per-object and global cost minimization across efficiency frontiers (Li et al., 5 Aug 2025).

4. System-Level Management and Non-Hierarchical Integration

OS and system software expose STAR classes through extensions to allocation APIs (e.g., `malloc(size, RAM_CLASS=STRAM|LTRAM)`), memory-mapping flags (`mmap(..., MAP_STRAM)`), and page-advice calls (`madvise(addr, len, MADV_STRAM)`). Hardware counters and compiler attributes can annotate page lifetimes and update rates, supporting automated or semi-automated data placement:

- If active time $T_\mathrm{active} > \tau_\mathrm{thresh}$ with low write intensity, pages are mapped to LtRAM.
- For high update intensity, StRAM is preferred.
- Default or ambiguous objects are mapped to DRAM.

Eviction and promotion are performed by periodic OS daemons or hardware events, migrating pages when reuse or access statistics cross class thresholds (Li et al., 5 Aug 2025).

Memory classes form a flat, non-hierarchical address space; the OS orchestrates movement and access based on fit-for-purpose criteria, rather than strictly placing hot data "closer" to compute.

5. Performance and Evaluation Results

Experimental and simulated results for STAR-based deployments demonstrate significant trade-offs relative to conventional hierarchies:

- **Machine Learning Inference (70B-parameter model):** LtRAM (RRAM, 3D-64) yields 33% lower read energy per query at $0.03/GB, while HBM achieves 120 mJ/query at $0.20/GB.
- **Graph/Pointer Traversal:** StRAM (eDRAM) provides a 2.25× speedup in lookup latency (200 ns vs. 450 ns for DRAM) and halves energy (20 nJ vs. 40 nJ).
- **Scaling:** DRAM and SRAM cell areas have plateaued at 0.055 μm²/bit and 0.02 μm²/bit, while RRAM (3D V-RRAM) reaches 0.002 μm²/bit and yields 10× HBM density (Li et al., 5 Aug 2025).
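These outcomes follow from the per-object cost model of Section 3. Below is a minimal sketch of that model with a brute-force solver standing in for the paper's constrained optimization; every numeric parameter in `CLASSES` is made up for illustration and does not come from the paper.

```python
from math import ceil

# Per-class parameters: (E_read_J, E_write_J, P_static_W, cost_per_bit_usd, refresh_period_s or None).
# All values are illustrative placeholders, not figures from the paper.
CLASSES = {
    "StRAM": (1e-12, 1e-12, 1e-6, 1e-11, 0.05),
    "DRAM":  (5e-12, 5e-12, 5e-6, 3e-11, 0.032),
    "LtRAM": (3e-12, 2e-11, 0.0,  5e-12, None),  # non-volatile: no refresh
}

def e_total(n_reads, n_writes, lifetime_s, size_bits, cls):
    """E_total(i,j) = N_r*E_r + N_w*E_w + P_stat*T, plus refresh for volatile classes."""
    e_r, e_w, p_stat, _, t_r = CLASSES[cls]
    e = n_reads * e_r + n_writes * e_w + p_stat * lifetime_s
    if t_r is not None:  # E_refresh = ceil(T / t_r) * E_r * S
        e += ceil(lifetime_s / t_r) * e_r * size_bits
    return e

def place(n_reads, n_writes, lifetime_s, size_bits):
    """Pick the class minimizing energy plus cost-per-bit * size (unconstrained sketch)."""
    def objective(cls):
        return e_total(n_reads, n_writes, lifetime_s, size_bits, cls) + CLASSES[cls][3] * size_bits
    return min(CLASSES, key=objective)
```

With these placeholder numbers, a long-lived, read-dominated object (e.g., model weights) lands in LtRAM, since its refresh and leakage terms vanish, while a short-lived, write-heavy buffer favors StRAM.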
For sub-ns read/write access at μW-class power, cross-point STT-MRAM arrays with balanced sneak-current suppression and parallelism further enable dense, high-throughput, ultra-low-latency LtRAM blocks (Zhao et al., 2012).
6. Cross-Point STT-MRAM as an LtRAM Instantiation

A canonical instantiation of LtRAM for STAR is the cross-point STT-MRAM. In this structure, each word comprises $N$ magnetic tunnel junctions (MTJs) sharing just two selection transistors. The mean cell area per bit ($A_{\rm cell}$) is:

$$A_{\rm cell} = \frac{2A_\mathrm{trans} + N \cdot A_\mathrm{MTJ}}{N}$$

For $A_\mathrm{trans} = 56F^2$ and $A_\mathrm{MTJ} = F^2$, this gives $A_{\rm cell} \approx 2.75F^2$ at $N = 64$, the transistor overhead being amortized across the word. Balanced differential sense amplifiers, combined with parallel read schemes and self-enabled writes, suppress sneak currents and maximize performance, reducing read current (e.g., from 35 μA to 18 μA per word at $N = 4$).

The switching physics follows the macrospin Landau–Lifshitz–Gilbert equation with the Slonczewski spin-torque term, where the critical switching current is:

$$I_c = \frac{2e\alpha}{\hbar\eta} M_s V_{\rm MTJ}(H_k + H_d)$$

Empirical values (CoFeB/MgO/CoFeB, 65 nm) yield sub-2 ns read/write latencies, word-parallel access, effectively unlimited endurance ($>10^{15}$ cycles), and densities superior to 1T-1MTJ MRAM (Zhao et al., 2012).
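The area amortization in the cross-point organization is easy to verify numerically from the cell-area formula; the function name and the choice of sample word widths below are ours.

```python
def cell_area_f2(n_mtj_per_word: int, a_trans_f2: float = 56.0, a_mtj_f2: float = 1.0) -> float:
    """Mean cell area per bit in F^2: two shared selection transistors
    amortized over N MTJs per word, i.e. (2*A_trans + N*A_MTJ) / N."""
    return (2 * a_trans_f2 + n_mtj_per_word * a_mtj_f2) / n_mtj_per_word

# Transistor overhead shrinks as the word widens:
# N=4 -> 29 F^2 per bit, N=16 -> 8 F^2, N=64 -> 2.75 F^2.
```

The asymptote as N grows is simply the MTJ area itself, which is why wide words approach raw cross-point density.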
7. Research Challenges and Roadmap
Further development of STAR mandates advances in high-density, high-endurance LtRAM devices with sub-20 ns read access, reliable refresh controllers for StRAM, multi-class profiling and page-migration infrastructure, and integration of non-hierarchical allocation with adaptive cache coherence. Packaging, interconnect, and thermal solutions for large-scale, high-bandwidth 3D-stacked LtRAM form a parallel research avenue (Li et al., 5 Aug 2025).
The STAR memory architecture, by explicitly incorporating fit-for-purpose memory classes via StRAM and LtRAM, offers a technically grounded path to extend system scalability and efficiency beyond the limitations imposed by SRAM/DRAM scaling ceilings and rigid hierarchical models. It synthesizes advances in non-volatile circuits, OS-level management, and cost-aware optimization for next-generation heterogeneous computing.