Systolic Array Architectures
- Systolic Array Architectures are grid-based arrangements of processing elements that rhythmically perform parallel computations to accelerate matrix-heavy AI workloads.
- Innovative dataflows such as Weight-Stationary, Row-Stationary, TrIM, and DiP optimize throughput, reduce memory accesses, and improve energy efficiency.
- Modern designs integrate in-array nonlinear functions, support diverse models like CNNs and Transformers, and incorporate fault-tolerant reconfigurability for robust performance.
A systolic array (SA) is a spatial architecture composed of a grid of tightly-coupled processing elements (PEs) that rhythmically compute and exchange data with minimal global control and optimal local reuse. SAs are widely deployed for the energy-efficient acceleration of matrix-heavy workloads such as deep neural networks (DNNs), especially in AI inference. Their primary advantage is high spatiotemporal data reuse achieved by patterning computation and communication into regular, predictable dataflows that maximize throughput and minimize costly memory accesses between computing cores and main memory. Systolic array design is an active area of research, with substantial focus on dataflow optimization, power and area efficiency, flexibility for different model families (CNN, RNN, Transformer, SSM, KAN), support for sparsity, in-array nonlinear computation, pipelining innovations, and reliability mechanisms.
1. Systolic Array Architecture and Dataflows
SAs are typically organized as 2D meshes of PEs, each containing a multiplier, adder, local buffers or registers, and routing logic for operand and partial sum (psum) movement. The architecture is parameterized by its physical array dimensions, topology (grid, diagonal, multi-pod), and the nature of data movement, known as the dataflow.
Fundamental dataflows include:
- Weight-Stationary (WS): Weights remain fixed in each PE; activations and psums flow through the array. WS optimizes for weight reuse but often incurs redundant input fetches (notably in convolution): after lowering convolutions to matrix-multiplication form, duplicated input patches translate into high memory traffic (Sestito et al., 2024).
- Row-Stationary (RS): Inputs and weights are streamed and partially cached; intermediate psums are accumulated locally. RS often requires large on-chip scratchpads (SRAMs), leading to high area and energy cost.
- Output-Stationary (OS): Each PE accumulates a complete output before passing it out, minimizing psum movement.
- Advanced Dataflows: Recent work introduces specialized flows such as TrIM’s triangular input movement (Sestito et al., 2024), DiP’s permuted weight-diagonal input (Abdelmaksoud et al., 2024), and programmable dataflows for SSM models (ProDF) (Raja et al., 29 Jul 2025). These aim to optimize for tensor sparsity, reduce memory access, minimize synchronization or pipeline stalls, and tailor to emerging AI workloads.
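To make the dataflow taxonomy above concrete, the following sketch models an output-stationary array at cycle level: each PE (i, j) accumulates one output element, and the row/column skew ensures matching operands meet at the right PE. This is an illustrative model, not taken from any of the cited designs.

```python
import numpy as np

def os_systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array.

    PE (i, j) accumulates C[i, j]. Row i of A is injected from the left
    with a skew of i cycles; column j of B enters from the top with a
    skew of j cycles, so operand pair k reaches PE (i, j) at cycle
    t = k + i + j."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    total_cycles = K + M + N - 2  # compute time plus wavefront fill/drain
    for t in range(total_cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j  # operand index arriving at PE (i, j) this cycle
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C
```

The `total_cycles` term makes the fill/drain overhead discussed below explicit: the last wavefront reaches the bottom-right PE only after K + M + N - 3 cycles.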
PEs typically support integer (e.g., INT8, INT16, or even 3-bit) or reduced-precision floating-point arithmetic; modern variants add energy-efficient exact and approximate compute modes (Jaswal et al., 31 Aug 2025) and even direct nonlinear function computation (Sun et al., 2024).
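A minimal sketch of the integer arithmetic path inside such a PE (illustrative, with an assumed symmetric per-tensor quantization scheme): INT8 operands are multiplied and accumulated in INT32, as is standard practice to avoid overflow across long dot products.

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric per-tensor quantization to INT8 (a common PE input format)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def pe_mac(a_q, w_q):
    """One PE's multiply-accumulate chain: INT8 operands widened to an
    INT32 accumulator so long reductions cannot overflow."""
    return int(np.sum(a_q.astype(np.int32) * w_q.astype(np.int32)))
```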
2. Dataflow Innovations, Memory Efficiency, and Throughput
Classical SAs treat matrix multiply as a bulk operation with relatively high initial latency (array fill time), but nearly perfect utilization at scale. Nonetheless, performance, memory traffic, and power are highly sensitive to dataflow and pipeline choices.
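The fill-latency effect can be captured by a first-order analytical model (a sketch under assumed weight-stationary mapping, with a fill/drain skew of rows + cols - 2 cycles per weight tile; real arrays differ in detail):

```python
import math

def ws_array_utilization(M, K, N, rows, cols):
    """First-order utilization model for a weight-stationary rows x cols
    array computing (M x K) @ (K x N): the K x N weight matrix is tiled
    onto the array, activations stream one row per cycle, and each tile
    pays a fill/drain skew of rows + cols - 2 cycles."""
    tiles = math.ceil(K / rows) * math.ceil(N / cols)
    cycles = tiles * (M + rows + cols - 2)
    useful_macs = M * K * N
    return useful_macs / (rows * cols * cycles)
```

The model reproduces the qualitative claim above: for large batch/row counts M utilization approaches 1, while short GEMMs are dominated by fill time.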
- Triangular Input Movement (TrIM): TrIM (Sestito et al., 2024) maximizes input reuse for CNN convolution by shifting new inputs right-to-left and then diagonally upward across the SA, enabled by per-row shift-register buffers. Each input is reused multiple times as it traverses the array, dramatically reducing off-chip memory accesses: roughly 10× fewer reads than WS and up to 16× fewer than RS, with up to 1.818× higher throughput per PE than RS and 15.6× fewer registers required. The approach is particularly effective for small convolution kernels, large input feature maps (FMAPs), and high compute utilization (Sestito et al., 2024).
- Diagonal Input Permutated (DiP) Dataflow: DiP (Abdelmaksoud et al., 2024) eliminates input/output synchronization FIFOs by injecting inputs diagonally and pre-permuting weights. This enables 33–49% higher throughput, up to 1.93× better energy efficiency per area, and up to 1.49× lower latency on transformer workloads, while reducing the register overhead of boundary FIFOs by up to 20% versus WS arrays.
- In-Array Im2col and Bi-Directional Propagation (Axon): Axon (Nayan et al., 10 Jan 2025) injects operands along the principal diagonal and uses bi-directional propagation, halving fill latency and enabling efficient, low-overhead in-hardware im2col for GEMM-based convolution, with corresponding fill-time speedups and energy reductions.
- Sparse and Structure-Aware Dataflows: Architectural extensions such as VUSA (Helal et al., 1 Jun 2025) and Sense (Sun et al., 2022) efficiently exploit unstructured or blockwise sparsity by dynamically upsizing or clustering workloads to match nonzero distributions. VUSA achieves 37% area and 68% power reduction (16 GOP/s vs. 17.2 GOP/s at 85% sparsity) by mapping computations to a reduced set of physical MACs while preserving logical array width.
- Multi-Mode and Flexible Arrays: FlexSA (Lym et al., 2020) and ArrayFlex (Peltekis et al., 2022) support dynamic reconfiguration of the array layout (splitting, merging, or collapsing pipeline stages) to optimize for pruned models, varying tile sizes, and layer-wise latency/throughput/energy trade-offs. FlexSA increases utilization by 37% and reduces energy by 28% relative to naively split arrays, while ArrayFlex achieves up to 1.8× energy-delay product improvements.
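The redundant-read problem that motivates TrIM and Axon is easy to quantify: lowering a convolution via im2col copies each interior input pixel once per kernel position. The following sketch (single channel, stride 1, no padding, for illustration only) shows both the lowering and its read amplification.

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a 2-D input to a patch matrix so a kh x kw filter becomes a
    single GEMM row. Each interior pixel is copied kh*kw times; this
    duplication is exactly the redundant-fetch overhead that TrIM-style
    dataflows avoid by reusing inputs inside the array."""
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow), dtype=x.dtype)
    r = 0
    for i in range(kh):
        for j in range(kw):
            cols[r] = x[i:i + oh, j:j + ow].reshape(-1)
            r += 1
    return cols

def conv2d_gemm(x, w):
    """Convolution as one GEMM over the im2col patch matrix."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (w.reshape(1, -1) @ im2col(x, kh, kw)).reshape(oh, ow)
```

For a 5×5 input and 3×3 kernel, the patch matrix holds 81 values drawn from only 25 distinct pixels, i.e., more than 3× read amplification.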
3. Power, Area, and Energy Efficiency Techniques
Modern SA designs prioritize power and area efficiency due to deployment in energy-constrained edge and data center environments.
- In-Array Data, Register, and Buffer Minimization: TrIM (Sestito et al., 2024) and DiP (Abdelmaksoud et al., 2024) eliminate large local SRAMs and FIFOs, synchronizing instead with minimal shift-register chains, which reduces both area and energy.
- Precise and Approximate Arithmetic: Custom PE designs using efficient adder/multiplier trees or approximate arithmetic (PPC/NPPC) yield 22–32% energy savings with negligible accuracy impact for image-processing tasks (Jaswal et al., 31 Aug 2025).
- Bus-Invert Coding and Zero-Value Clock Gating: Selectively encoding high-activity mantissa fields of weights and exploiting zeros in activations reduce switching activity by ~29%, yielding up to 9.4% overall power savings in common CNNs such as ResNet-50 (Peltekis et al., 2023).
- Fine-Grained Pipelining: Deep and transparent pipelining enables higher sustained clock rates and throughput without increased area (e.g., a 400 MHz, 3-bit integerized transformer array at 219 GOPS/W (Lin et al., 28 Aug 2025)), balanced with layer-specific pipeline-depth adjustment (ArrayFlex) for optimal energy-delay product.
- Structured Vectorization: Systolic Tensor Arrays (STA) (Liu et al., 2020) internally vectorize PEs (“Tensor-PEs”), amortizing register and logic overhead, leading to 2.08× area and 1.36× power reduction in dense mode and up to 3.14× and 1.97× with block-structured sparsity support (DBB).
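Bus-invert coding, mentioned above as a switching-activity reducer, is a simple algorithm worth spelling out: if transmitting the next word would toggle more than half the bus lines relative to the current bus state, transmit its complement along with a one-bit invert flag. A behavioral sketch:

```python
def bus_invert_encode(words, width=16):
    """Bus-invert coding: cap per-transfer toggles at width//2 + 1 lines
    by optionally complementing each word (invert flag sent alongside)."""
    mask = (1 << width) - 1
    prev = 0
    encoded = []
    for w in words:
        flips = bin((w ^ prev) & mask).count("1")
        if flips > width // 2:
            w_tx, inv = (w ^ mask), 1  # complement toggles fewer lines
        else:
            w_tx, inv = w, 0
        encoded.append((w_tx, inv))
        prev = w_tx  # receiver/bus state is the transmitted word
    return encoded

def bus_invert_decode(encoded, width=16):
    mask = (1 << width) - 1
    return [w ^ mask if inv else w for w, inv in encoded]
```

The cited design applies this idea selectively to high-activity mantissa fields of weights rather than to the whole bus.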
4. Advances in Functionality: Nonlinear, State-Space, and Emerging Model Support
Recent SAs move beyond classic matrix-multiplication to support new classes of models and on-array nonlinear computation.
- In-Array Nonlinear Functionality: ONE-SA (Sun et al., 2024) integrates nonlinear operations (e.g., ReLU, piecewise-linear approximations) via in-place Hadamard and IPF units. This yields throughput-per-watt improvements of up to 25.7× over CPUs and up to 135.8% of model-specific FPGA SAs, handling DNNs with negligible accuracy loss (3%).
- State-Space Model (SSM) Acceleration: EpochCore (Raja et al., 29 Jul 2025) introduces the LIMA-PE, supporting traditional MAC operations alongside elementwise recurrence (FRI) and time-varying (TRI) integration. Coupled with the programmable ProDF dataflow, this structure achieves three orders of magnitude acceleration and two orders of magnitude energy savings on long-sequence SSMs compared to prior SAs and GPUs.
- Spline-Based and Structured-Nonlinear Networks: KAN-SAs (Errabii et al., 20 Nov 2025) enable acceleration of Kolmogorov–Arnold networks by embedding tabulated, single-cycle, non-recursive B-spline units and vector PEs that leverage structured sparsity, raising utilization from 30% to near 100% and halving inference time.
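A generic illustration of the table-based piecewise-linear units that such in-array nonlinear PEs rely on (a sketch of the general technique, not the exact ONE-SA or KAN-SA design): the function is precomputed into breakpoints and slopes, so evaluation reduces to one lookup plus one multiply-add per element.

```python
import numpy as np

def build_pwl_table(fn, lo, hi, segments):
    """Precompute breakpoints, values, and slopes for a piecewise-linear
    approximation of fn over [lo, hi]."""
    xs = np.linspace(lo, hi, segments + 1)
    ys = fn(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    return xs, ys, slopes

def pwl_eval(x, xs, ys, slopes):
    """Evaluate the table: clamp to range, find the segment, then apply
    one multiply-add (the per-element hardware cost)."""
    x = np.clip(x, xs[0], xs[-1])
    i = np.clip(np.searchsorted(xs, x) - 1, 0, len(slopes) - 1)
    return ys[i] + slopes[i] * (x - xs[i])
```

With 64 segments over [-8, 8], a sigmoid is approximated to well under 1% absolute error, which is consistent with the negligible accuracy losses reported above.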
5. Reliability, Reconfigurability, and Robustness
As SAs are increasingly deployed in safety-critical contexts, reliability-aware architectural enhancements are required.
- Run-Time Fault Tolerance: FORTALESA (Cherezova et al., 6 Mar 2025) demonstrates a reconfigurable architecture that offers on-demand spatial redundancy (dual- or triple-modular) per layer, balancing area, throughput, and Architectural Vulnerability Factor (AVF). Reconfigurable modes deliver up to 3× speedup versus static TMR, with 6× less resource overhead (area ratio 1:6), and AVF reduction by up to 2× (DMR) or to zero (TMR).
- Reliability Assessment Tools: SAFFIRA (Taheri et al., 2024) introduces an analytical, hardware-aware, hierarchical fault-injection framework for DNN SAs. It quantifies error propagation, silent-data-corruption (SDC) rates, and FIT/MTTF metrics, and guides the design of error-detection and correction codes, favoring 16-bit quantization for lower AVF and recommending targeted accumulator protection.
- Adaptive Mapping and Scheduling: SOSA (Yüzügüler et al., 2022) and FlexSA combine physical array pod-sizing and run-time mode-switching with layer- and workload-aware mapping to maximize effective throughput per watt and maintain robust utilization under load and pruning-induced irregularities.
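The core of modular redundancy is a bitwise majority voter. This minimal sketch shows how TMR masks a single-replica fault (a toy model of the mechanism, not the FORTALESA implementation):

```python
def flip_bit(x, bit):
    """Inject a single-bit fault, modeling a transient upset in one replica."""
    return x ^ (1 << bit)

def tmr_vote(a, b, c):
    """Bitwise majority vote across three redundant copies: each output
    bit is 1 iff at least two of the three replica bits are 1, so any
    single corrupted replica is outvoted."""
    return (a & b) | (a & c) | (b & c)
```

DMR, by contrast, can only detect a mismatch (a != b), not correct it, which is why the AVF reductions quoted above differ between the two modes.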
6. Scalability, Integration, and System-Level Considerations
System-level organization—scalability across array and chip, interconnect, and tiling strategies—has a first-order impact on real-world utilization and throughput.
- Pod-Based and Multi-Array Scaling: SOSA (Yüzügüler et al., 2022) identifies 32×32 pods as optimal for single- and multi-tenancy, outperforming 128×128 commercial designs by up to 1.5× in effective throughput per watt. At this granularity, 256 pods yield 317 TOPS at 40% utilization for modern DNNs, using efficient Butterfly-2 interconnect topologies.
- Hierarchical Buffering: Three-level buffer hierarchies (global shared, local row/column, per-PE) balance memory latency, bandwidth, and reuse, minimizing external data movement.
- Hardware/Software Co-design: FlexSA’s compiler tiling matches pruning patterns and tile sizes to mode-switching hardware, maximizing utilization and reuse, while Sense (Sun et al., 2022) co-designs pruning schedules and channel clustering for online balance across sparsity patterns.
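The hierarchical-buffering idea can be illustrated with a blocked GEMM (a generic sketch, with the tile size standing in for local buffer capacity): each output tile stays in a local accumulator while operand tiles are staged from the global buffer, so each operand element is fetched once per tile rather than once per scalar MAC.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked GEMM modeling a two-level buffer hierarchy: `acc` plays
    the role of per-PE/local storage holding one output block, while A-
    and B-tiles are streamed in from the global (shared) buffer."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                acc += A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
            C[i:i + tile, j:j + tile] = acc
    return C
```

Matching `tile` to the physical array and buffer sizes is exactly the kind of decision a co-designed compiler (as in FlexSA) makes per layer.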
7. Comparative Analysis and Design Guidelines
Emerging designs emphasize the following trends:
- Maximizing Data Reuse: Local shift-registers, diagonally or triangularly patterned flows (TrIM/Axon/DiP), and vectorization (STA) are efficient substitutes for on-chip SRAM, providing high compute intensity and reduced DRAM traffic.
- Sparsity Exploitation: Virtual upscaling, clustering, uniform-pruned DBBs, and structured-sparsity vector PEs match computation to the nonzero distribution, yielding significant gains in area, power, and utilization for structured and unstructured sparse models.
- Energy-Delay Product and Latency: Fine-grained and transparent pipelining, approximate compute logic, and workload-aware dataflows (TrIM, DiP) deliver substantial reductions in EDP and application latency, especially in high-throughput, scale-out settings.
- Configurability and Model Versatility: Multi-mode arrays (FlexSA, ArrayFlex), nonlinear PE capabilities (ONE-SA), and programmable dataflows (EpochCore) ensure broad applicability across model classes, from standard CNNs to state-space models and spline-based architectures.
- System-Level Robustness: Fault-tolerant mappings (FORTALESA), reliability assessment (SAFFIRA), and dynamic reconfiguration (SOSA) are increasingly critical as array sizes and deployment scale rise.
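The N:M structured sparsity exploited by the vector-PE designs above can be sketched in a few lines (e.g., 2:4, where hardware can statically skip half the MACs in every group; the pruning policy shown, largest-magnitude-keep, is the common baseline and an assumption here):

```python
import numpy as np

def prune_n_m(w, n=2, m=4):
    """N:M structured pruning: in every contiguous group of m weights,
    keep only the n largest-magnitude entries and zero the rest, giving
    the hardware a fixed, predictable fraction of skippable MACs."""
    assert w.size % m == 0
    groups = w.reshape(-1, m).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]  # smallest m-n per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)
```

Because every group has exactly n nonzeros, utilization stays uniform across the array, unlike unstructured sparsity, which requires the dynamic balancing schemes (VUSA, Sense) described earlier.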
The current landscape demonstrates that the major research focus is on removing synchronization bottlenecks, maximizing local reuse, and tightly coupling architecture and dataflow design to model, sparsity, and workload structure, while minimizing area, power, and reliability overheads. Systolic array research is thus tightly interlinked with model co-design, process technology, compiler toolchains, and application requirements (Sestito et al., 2024, Abdelmaksoud et al., 2024, Raja et al., 29 Jul 2025, Yüzügüler et al., 2022, Sun et al., 2022, Nayan et al., 10 Jan 2025, Errabii et al., 20 Nov 2025).