SRAM Reprogramming & Power Gating (SRPG)

Updated 21 January 2026

SRPG is an approach combining dynamic SRAM reprogramming and power gating, enabling efficient subarray-level control and scalable in-memory computing.
It utilizes architectural enhancements like dual-mode drivers, sleep transistors, and overlap scheduling to reduce power consumption and minimize latency.
Device-level innovations, including STT-MTJ and MEFET integration, ensure persistent data retention and achieve significant standby power reductions.

SRAM Reprogramming and Power Gating (SRPG) encompasses architectural and microarchitectural schemes enabling dynamic modification of SRAM cell contents (reprogramming), rapid and energy-efficient transitions to low-power retention states (power gating), and fine-grained resource utilization for scalable inference and nonvolatile data retention. SRPG subsumes a diverse set of approaches—bulk parallel write drivers, device-level enhancements (e.g., STT-MTJ or MEFET integration), bank- and subarray-level sleep transistors, and layered control protocols—spanning high-throughput inference accelerators (Chong et al., 20 Jan 2026), nonvolatile SRAM circuits (Monga et al., 2019), and emerging post-CMOS in-memory compute architectures (Najafi et al., 2023). The following sections detail the established methodologies, control logic, device/circuit innovations, energy and performance metrics, and architectural implications of SRPG.

1. Microarchitectural Modifications for SRPG

SRPG requires changes to the SRAM macro so that both reprogramming and power gating can be applied at fine spatial granularity.

Subarray Partitioning and Bank Hierarchy: The exemplar SRAM-DCIM macro in PRIMAL (Chong et al., 20 Jan 2026) divides a 256×64 array into $M=8$ banks (256×8 bits per bank), each subdivided into $K=4$ subarrays (256×2 bits). This enables column-granular gating, aligning resource allocation with workload sparsity (such as LoRA rank-specific model inference) and mapping flexibility.
Local Power-Gating Transistors: Every bank integrates a header PMOS “sleep” transistor (SleepB) controlling the VDD rail; an auxiliary retention supply (VDD_RET) ensures data preservation when main VDD is gated.
Dual-Mode Write Drivers: Fast bulk reprogramming utilizes wide-bitline drivers in “reprogram mode,” switching to narrow “compute mode” drivers for energy-efficient computation.
Row Decoder Enhancements: Bypass logic within the row decoder permits one-cycle activation of entire banks during LoRA weight uploads.
Sleep-Wake Timing: Wake-up latency from retention to full VDD is characterized as $\tau_\text{wakeup} \approx 2$ ns (2 cycles at 1 GHz), with negligible sleep-entry delay ($<1$ ns).
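The bank and subarray geometry above can be made concrete with a small sketch. The class name `MacroGeometry`, its methods, and the contiguous column-to-bank layout are assumptions made for illustration; the source gives only the array dimensions and the values of $M$ and $K$.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MacroGeometry:
    """Geometry of the 256x64 macro described above (M=8 banks, K=4 subarrays)."""
    rows: int = 256
    cols: int = 64
    banks: int = 8       # M
    subarrays: int = 4   # K, per bank

    @property
    def cols_per_bank(self) -> int:
        return self.cols // self.banks

    @property
    def cols_per_subarray(self) -> int:
        return self.cols_per_bank // self.subarrays

    def locate(self, col: int) -> Tuple[int, int]:
        """Map a column index to (bank, subarray).

        Assumption: columns are laid out contiguously, bank by bank.
        """
        if not 0 <= col < self.cols:
            raise ValueError(f"column {col} out of range")
        bank, offset = divmod(col, self.cols_per_bank)
        return bank, offset // self.cols_per_subarray

    def subarrays_needed(self, used_cols: int) -> int:
        """Subarrays that must stay powered for `used_cols` columns packed from column 0."""
        if not 0 <= used_cols <= self.cols:
            raise ValueError(f"used_cols {used_cols} out of range")
        return -(-used_cols // self.cols_per_subarray)  # ceiling division


geom = MacroGeometry()
print(geom.cols_per_bank, geom.cols_per_subarray)  # 8 2
print(geom.locate(9))                              # (1, 0)
print(geom.subarrays_needed(16), "of", geom.banks * geom.subarrays)  # 8 of 32
```

Under this assumed layout, a mapping that occupies 16 of the 64 columns leaves 24 of the 32 subarrays as candidates for gating.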

Device-level innovations complement these architectural changes. STT-MTJ-based nvSRAM cells (Monga et al., 2019) and MEFET-based ME-SRAM (Najafi et al., 2023) introduce nonvolatile storage elements and additional gating transistors for both backup-restore and normally-off operations.

2. Control Logic and Overlap Scheduling

SRPG integrates distributed FSMs and global schedulers for seamless coordination of reprogramming and power gating with computational pipelining.

Local SRPG FSMs: In PRIMAL, each Compute Tile (CT) cycles through the states IDLE → REPROG → ACTIVE_COMPUTE → IDLE, driven by “layer start” and “layer done” tokens from the Network Main Controller (NMC) (Chong et al., 20 Jan 2026).
Global Sliding Window Scheduling: The NMC issues paired instruction streams across CTs with a window depth of 2: one stream triggers PIM compute ops, the other weight-load ops. Reprogramming of the next tile, REPROG(CT_{l+1}), overlaps with computation on the current tile, COMPUTE(CT_l); when $T_\text{reprog} \leq T_\text{comp}$, the reprogramming latency is fully hidden and the pipeline does not stall at the layer boundary.
SRAM Bank Activation Masking: Layer-aware bank masking, determined at mapping time based on LoRA matrix allocation, preloads a bank mask register to enable selective gating of unused banks/subarrays.
Bank-Level Gating FSMs: At runtime, banks not required for the upcoming layer are immediately put into retention via SleepB controls. The wake-up delay $\tau_\text{wakeup}$ is accounted for when reactivating banks, so that activation is synchronized with data access.
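A minimal sketch of bank masking and the resulting gating decisions at a layer boundary follows. The function names, the bitmask encoding (bit `b` set means bank `b` is required), and the rule of issuing wake-up exactly `WAKEUP_CYCLES` before first access are assumptions for illustration, not details taken from PRIMAL.

```python
NUM_BANKS = 8       # M in the macro described above
WAKEUP_CYCLES = 2   # tau_wakeup of about 2 ns at 1 GHz


def bank_mask(required_banks, num_banks=NUM_BANKS):
    """Build a mask register value with one bit per required bank."""
    mask = 0
    for b in required_banks:
        if not 0 <= b < num_banks:
            raise ValueError(f"bank {b} out of range")
        mask |= 1 << b
    return mask


def gating_actions(prev_mask, next_mask, num_banks=NUM_BANKS):
    """Return (banks_to_wake, banks_to_sleep) for a layer transition."""
    wake = [b for b in range(num_banks)
            if (next_mask >> b) & 1 and not (prev_mask >> b) & 1]
    sleep = [b for b in range(num_banks)
             if (prev_mask >> b) & 1 and not (next_mask >> b) & 1]
    return wake, sleep


def wake_issue_cycle(first_access_cycle, wakeup_cycles=WAKEUP_CYCLES):
    """Latest cycle at which wake-up can be issued so the bank is ready in time."""
    return max(0, first_access_cycle - wakeup_cycles)


layer_a = bank_mask([0, 1, 2])
layer_b = bank_mask([2, 3])
print(gating_actions(layer_a, layer_b))  # ([3], [0, 1])
print(wake_issue_cycle(10))              # 8
```

Bank 2 is needed by both layers in this example, so it stays at full VDD across the transition and incurs no wake-up delay.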

This overlap scheduling paradigm also extends to nvSRAM backup-restore cycles (Monga et al., 2019) and ME-SRAM checkpointing (Najafi et al., 2023), where store and restore signals precede, accompany, or overlap computational transitions, minimizing idle-residence energy.
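The effect of depth-2 overlap on end-to-end latency can be sketched with a simple timing model. The per-layer `(t_reprog, t_comp)` values are hypothetical, and the model assumes that only the first layer's reprogramming is exposed and that each later reprogram runs concurrently with the preceding layer's compute.

```python
def serial_latency(layers):
    """Total time when each layer reprograms and then computes, with no overlap."""
    return sum(t_reprog + t_comp for t_reprog, t_comp in layers)


def boundary_stalls(layers):
    """Stall at each layer boundary: max(0, T_reprog of next layer - T_comp of current)."""
    return [max(0, layers[i + 1][0] - layers[i][1])
            for i in range(len(layers) - 1)]


def overlapped_latency(layers):
    """Total time when REPROG of layer l+1 overlaps COMPUTE of layer l."""
    if not layers:
        return 0
    first_reprog = layers[0][0]  # nothing to hide the first reprogram behind
    compute = sum(t_comp for _, t_comp in layers)
    return first_reprog + compute + sum(boundary_stalls(layers))


# Hypothetical (t_reprog, t_comp) pairs in arbitrary time units.
layers = [(2, 5), (3, 4), (6, 4)]
print(serial_latency(layers))      # 24
print(boundary_stalls(layers))     # [0, 2]
print(overlapped_latency(layers))  # 17
```

Only the second boundary stalls, because the third layer's reprogram (6 units) exceeds the second layer's compute (4 units).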

3. Device and Circuit-Level Power Gating Mechanisms

Device-level SRPG architectures employ power-gating transistors and self-termination circuits to minimize both leakage and dynamic power.

Power-Gate Transistor Placement: PRIMAL inserts header PMOS sleep transistors per SRAM bank (Chong et al., 20 Jan 2026). nvSRAM places a high-µ NMOS sleep transistor as footer (ground) or header (VDD) for each block (Monga et al., 2019). ME-SRAM arrays employ sleep transistors on VDD at sub-bank granularity (Najafi et al., 2023).
Retention Supply & Non-Volatile Elements: VDD_RET remains active in PRIMAL to maintain cell states in retention (Chong et al., 20 Jan 2026). STT-MTJ and MEFET elements in nvSRAM/ME-SRAM allow complete VDD gating, with logic levels recovered via backup-restore circuits.
Self-Write Termination for MTJ Devices: nvSRAM incorporates low-threshold buffer chains that sense node voltage during MTJ-write. When the MTJ switches, a pulse-shaped network discharges the write signal, terminating current flow and optimizing energy (Monga et al., 2019).
MEFET Backup/Restore Protocols: ME-SRAM’s non-volatile operation involves toggling gate voltage to flip AFM order (±100 mV), with sub-0.2 ns backup+restore cycles and $>10^{12}$ endurance (Najafi et al., 2023).

A plausible implication is that future architectures will migrate toward pervasive subarray-level gating and autonomous self-termination detection to reduce both static and dynamic power, even in multi-layer signal processing pipelines.

4. Analytical Models and Performance Metrics

SRPG's energy savings and latency hiding can be quantified with simple analytical power and latency models.

Dynamic Power Scaling: For $N$ total banks with $A$ active and $G=N-A$ gated: $P_\text{dyn}(N,A) = P_\text{full}\cdot\frac{A}{N} + P_\text{ret}\cdot\left(1-\frac{A}{N}\right)$ (Chong et al., 20 Jan 2026)
Reprogramming/Compute Overlap: Given $S_w$ weight words to program at per-word programming time $t_\text{prog}$, reprogram latency is $T_\text{reprog} = S_w t_\text{prog}$. Compute on $S_c$ words at per-MAC time $t_\text{mac}$ takes $T_\text{comp} = S_c t_\text{mac}$. The overlap factor is $\alpha = \min(1, T_\text{reprog}/T_\text{comp})$, and the residual stall is $\max(0, T_\text{reprog}-T_\text{comp})$.
Layer Energy: $E_\text{total} = E_\text{compute} + E_\text{reprog} - E_\text{saved\_idle}$ (Chong et al., 20 Jan 2026), where $E_\text{compute} = P_\text{comp} T_\text{comp}$, $E_\text{reprog} = P_\text{prog} T_\text{reprog}$, and $E_\text{saved\_idle}$ is the energy saved by gating idle power domains.
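The three models above translate directly into code. The sketch below mirrors the formulas term by term; the function names and all numeric inputs are hypothetical and chosen only to exercise the expressions.

```python
def gated_power(p_full, p_ret, n_banks, active):
    """P_dyn(N, A) = P_full * A/N + P_ret * (1 - A/N)."""
    frac = active / n_banks
    return p_full * frac + p_ret * (1 - frac)


def overlap(t_reprog, t_comp):
    """Return (alpha, stall) with alpha = min(1, T_reprog/T_comp), stall = max(0, T_reprog - T_comp)."""
    alpha = min(1.0, t_reprog / t_comp)
    stall = max(0.0, t_reprog - t_comp)
    return alpha, stall


def layer_energy(p_comp, t_comp, p_prog, t_reprog, e_saved_idle):
    """E_total = P_comp*T_comp + P_prog*T_reprog - E_saved_idle."""
    return p_comp * t_comp + p_prog * t_reprog - e_saved_idle


# Hypothetical inputs in arbitrary but consistent units.
print(gated_power(p_full=8.0, p_ret=0.0, n_banks=8, active=2))  # 2.0
print(overlap(t_reprog=3.0, t_comp=4.0))                        # (0.75, 0.0)
print(overlap(t_reprog=6.0, t_comp=4.0))                        # (1.0, 2.0)
print(layer_energy(2.0, 4.0, 1.0, 3.0, 0.5))                    # 10.5
```

With zero retention power, the first call reduces to pure proportional scaling: 2 of 8 banks active draws one quarter of full power.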

Device-level expressions for write energy per bit (STT-MTJ):

$E_\text{write} = V_\text{write} \times I_\text{write} \times t_\text{pulse}$

and for ME-SRAM backup: $E_\text{backup} \approx \frac{1}{2} C_\text{ME} V_\text{pst}^2$ (Najafi et al., 2023)
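Both device-level expressions can be evaluated with a few lines of code. All numeric values below (write voltage, current, pulse width, $C_\text{ME}$, and the use of 0.1 V for $V_\text{pst}$) are placeholders chosen for illustration and are not parameters reported in the cited papers.

```python
def stt_write_energy(v_write, i_write, t_pulse):
    """E_write = V_write * I_write * t_pulse, in joules for SI inputs."""
    return v_write * i_write * t_pulse


def me_backup_energy(c_me, v_pst):
    """E_backup ~= 0.5 * C_ME * V_pst^2, in joules for SI inputs."""
    return 0.5 * c_me * v_pst ** 2


# Placeholder operating points, not measured values.
e_write = stt_write_energy(v_write=1.0, i_write=50e-6, t_pulse=5e-9)
e_backup = me_backup_energy(c_me=1e-15, v_pst=0.1)

print(f"STT-MTJ write: {e_write * 1e12:.3f} pJ")   # STT-MTJ write: 0.250 pJ
print(f"ME backup:     {e_backup * 1e18:.1f} aJ")  # ME backup:     5.0 aJ
```

The quadratic dependence on $V_\text{pst}$ is why a capacitively switched element at low voltage can back up a bit for far less energy than a current-driven write, whose energy grows linearly with pulse duration.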

Area and Timing Overheads:
- Sleep transistor insertion +3% area (SRAM macro),
- Dual-mode drivers +5% bank logic,
- Bank-mask register file $<100\,\mu\text{m}^2$/CT,
- Wake-up latency $\approx 2$ ns (Chong et al., 20 Jan 2026),
- nvSRAM subcircuit reduction –25.8% transistor count, backup write energy –17.9% (Monga et al., 2019),
- ME-SRAM cell area $\approx 1.5\text{--}2\times$ vs 6T, backup+restore $\approx 0.16$ ns (Najafi et al., 2023).

These expressions allow SRPG benefits and trade-offs to be estimated before implementation.

5. Measured Benefits and Device Comparisons

SRPG schemes yield substantial improvements in energy efficiency, throughput, and leakage control in both in-memory compute fabric and nonvolatile memory architectures.

Architecture	Power Savings	Throughput Gain	Area/Latency Overhead
PRIMAL (SRPG) (Chong et al., 20 Jan 2026)	30% system power; 25% energy/inference	+10% effective	+3–5% area; 2 ns wake-up
nvSRAM (Monga et al., 2019)	~100% leakage cut in standby	N/A	–25.8% transistor, <15 ns backup
ME-SRAM (Najafi et al., 2023)	$>100\times$ static power reduction	N/A	$1.5\text{--}2\times$ cell size, 0.16 ns backup+restore

Isolated from other system optimizations, PRIMAL’s SRPG achieves idle-CT power gating savings of up to 80% (leakage $1.2$ W $\rightarrow 0.24$ W per CT), 45% dynamic power reduction in subarray-level gating, pipeline stall reduction (+8% throughput), and overall system power reduction of $\approx 30\%$ in Llama-13B inference (LoRA rank 8) (Chong et al., 20 Jan 2026).

nvSRAM demonstrates a backup write energy reduction from $0.313$ pJ to $0.257$ pJ per bit by leveraging self-write termination, with no degradation in CMOS read/write latency and endurance exceeding $10^{15}$ cycles (Monga et al., 2019).

ME-SRAM cuts static leakage by more than $100\times$ relative to conventional SRAM and power-gated approaches, completes checkpointing in 0.16 ns total (backup plus restore), and improves read/write power-delay product (PDP) and read static noise margin (RSNM) over 6T and MRAM-based nvSRAM designs (Najafi et al., 2023).
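The PRIMAL and nvSRAM percentages quoted in this section follow from the before/after values given alongside them. The short check below uses only numbers stated in the text; the helper name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# PRIMAL idle-CT leakage: 1.2 W -> 0.24 W per CT.
print(f"{percent_reduction(1.2, 0.24):.1f}%")     # 80.0%

# nvSRAM backup write energy: 0.313 pJ -> 0.257 pJ per bit.
print(f"{percent_reduction(0.313, 0.257):.1f}%")  # 17.9%
```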

6. Architectural Implications and Directions

SRPG enables sophisticated, hierarchical power control and mutable memory mapping in large-scale PIM compute fabrics and nonvolatile caches.

Processing-in-Memory Scalability: SRPG schemes exploit LoRA model sparsity by selectively activating subarrays and banks, so that the SRAM power footprint during inference and checkpointing scales with the resources actually in use (Chong et al., 20 Jan 2026).
Nonvolatile/Normally-Off Memory: The integration of STT-MTJ (nvSRAM) and MEFET (ME-SRAM) architectures enables blocks to be power-gated with persistent data retention, affording full leakage abatement in standby and fast restore at wake-up (Monga et al., 2019, Najafi et al., 2023).
Fine-Grained Resource Utilization: Subarray-level gating and bank masking, as implemented in PRIMAL, provide dynamic adaptation to workload–layer requirements, optimizing both throughput and energy per inference.
System-Level Orchestration: Overlap scheduling and hierarchical power domain control mask reprogramming and wake-up delays within compute pipelines, contributing to sub-linear scaling in both energy and stall time as model size and parallelism increase.

A plausible implication is that as model sizes and heterogeneity increase, DRAM/SRAM fabrics will see even finer-grained, device-aware SRPG schemes, including analog-compute and hybrid retention/gating control.

7. Common Misconceptions and Limitations

SRPG is sometimes construed as synonymous with coarse block-level sleep transitions; however, only schemes leveraging subarray activation, reprogramming-compute overlap, and nonvolatile device integration (STT-MTJ, MEFET) realize the measured reductions in energy, area, and leakage documented above.

Write Latency Overheads: STT-MTJ and MEFET reprogramming operate at sub-15 ns and sub-0.2 ns, respectively; restore typically adds 50–100 ps, which is negligible compared to CPU pipeline idling (Monga et al., 2019, Najafi et al., 2023).
Area Overheads: ME-SRAM cell area increases by 1.5–2× over 6T, but energy and leakage benefits outweigh capacity losses in edge computing and in-situ processing contexts (Najafi et al., 2023).
Power-Gating Wake-Up IR Drop: Sleep transistor sizing in ME-SRAM and nvSRAM must accommodate wake-up droop, but retention biasing and self-termination circuits mitigate error rates (Monga et al., 2019, Najafi et al., 2023).
Compatibility with High-Frequency Access: PRIMAL’s overlap protocol and dual-mode write logic ensure minimal pipeline stalls even at layer boundaries (Chong et al., 20 Jan 2026).

In summary, SRPG approaches, as developed in PRIMAL, nvSRAM, and ME-SRAM, incorporate multi-tiered control logic, device-integrated retention, subarray resolution, and analytic scheduling for comprehensive performance, energy, and resilience benefits.