Matrix Processing Units (MPUs)
- Matrix Processing Units (MPUs) are specialized hardware accelerators optimized for executing small, dense matrix operations using dedicated systolic arrays.
- They integrate with CPUs, GPUs, and ASICs to accelerate deep learning, scientific simulations, and edge-AI workloads through custom ISAs and dataflow techniques.
- MPUs leverage mixed precision, tailored architectures, and innovative algorithms to maximize throughput while managing data movement and energy efficiency.
Matrix Processing Units (MPUs) are specialized hardware accelerators designed to execute small, dense matrix operations—especially matrix-matrix multiplications—at high throughput and efficiency, using dedicated microarchitectures that are distinct from general-purpose CPUs, classical vector units, and large-scale NUMA (Non-Uniform Memory Access) systems. These units, present in both embedded and high-performance domains, underlie the performance of modern deep learning, scientific simulation, and edge-AI workloads, and are now integrated across CPUs, GPUs, and custom ASICs. MPUs enable not only acceleration of canonical General Matrix-Matrix Multiplication (GEMM) but also a spectrum of computational primitives via architectural and algorithmic reformulation, as evidenced by their widespread utilization in AI, distributed HPC, and domain-specific scientific simulations.
1. Architectural Principles and Microarchitectural Design
MPUs are typically realized as compact systolic arrays of multiply-accumulate (MAC) processing elements (PEs), arranged as 2D tile grids and optimized for regular weight-stationary or output-stationary dataflows. Input matrices are partitioned into fixed-size tiles, which are loaded into dedicated register files (e.g., the Arm SME's ZA register array or Quadrilatero's Matrix Register File) or local on-chip SRAM. The core computational unit in most commercial MPUs is a highly parallel matrix-multiply-accumulate datapath that fuses multiple FMAs per cycle—e.g., Apple's M4 SME issues an FP32 outer product (512 flops) per instruction, and NVIDIA Tensor Cores implement tiled FP16→FP32 accumulation (Domke et al., 2020, Remke et al., 2024, Cammarata et al., 10 Apr 2025).
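The outer-product-accumulate pattern described above can be sketched in a few lines. The following NumPy model (illustrative only; tile size and trip count are arbitrary choices here, not hardware parameters) shows how a tile GEMM decomposes into k successive FMOPA-style outer-product accumulations into a ZA-like accumulator tile:

```python
import numpy as np

def fmopa_tile(za, x_col, y_row):
    """One FMOPA-style step: accumulate the outer product of a column
    operand and a row operand into the ZA-like tile accumulator."""
    za += np.outer(x_col, y_row)  # tile_n x tile_n MACs per "instruction"
    return za

# A full tile GEMM is k successive outer-product accumulations.
tile_n, k = 4, 8  # arbitrary sizes for illustration
rng = np.random.default_rng(0)
A = rng.standard_normal((tile_n, k)).astype(np.float32)
B = rng.standard_normal((k, tile_n)).astype(np.float32)
za = np.zeros((tile_n, tile_n), dtype=np.float32)
for kk in range(k):
    za = fmopa_tile(za, A[:, kk], B[kk, :])
assert np.allclose(za, A @ B, atol=1e-5)
```

The same loop structure underlies weight-stationary dataflows: the accumulator tile stays resident while operand panels stream past it.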
These systolic or outer-product pipelines are tightly integrated near the memory controller and cache hierarchy. The units expose custom ISA extensions—such as FMOPA for ARM SME, mmac in Quadrilatero, or matrix-oriented instructions in Intel AMX—providing single-instruction throughput for entire tiles. Hardware often enables double-buffered or pipelined operand loading, leveraging local register file multi-porting to sustain maximal pipeline occupancy under tight area and energy constraints (Remke et al., 2024).
Notably, MPUs can be tailored for mixed precision (e.g., INT8, FP16, BF16, FP32), integer-only computation (IMMUs), or even processing-in-memory (PIM) paradigms leveraging memristive crossbars for in-situ MACs (Ootomo et al., 2023, Leitersdorf et al., 2022). The choice of architecture critically affects throughput, energy-per-op, and suitability for high-precision (FP64) workloads.
2. Algorithmic Mapping and Theoretical Models
MPU-centric architectures inspire new algorithmic paradigms that exploit the unit's characteristics. The MMV-RAM model formalizes a system with both an AC-restricted vector unit (VCU) and a matrix unit (MMU), showing that primitives such as segmented prefix-scan, segmented sum, and elementwise vector multiplication achieve optimal depth via recursive block decomposition and speculative block-level scans, beating the circuit depth lower bounds that apply to pure-vector methods (Sobczyk et al., 30 Jun 2025). This result highlights the depth advantage of block-wise matrix units when combined with AC vector logic.
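A small sketch of the kind of reformulation involved: the inclusive prefix sum of a block equals a single matrix-vector product with a lower-triangular ones matrix, so a matrix unit can scan a whole block in one multiplication (a simplified illustration of the block-decomposition idea, not the full MMV-RAM construction):

```python
import numpy as np

b = 8  # block size handled by one matrix multiply (arbitrary here)
L = np.tril(np.ones((b, b)))  # lower-triangular ones matrix

def block_scan(x):
    """Inclusive prefix sum of one block via a single matrix-vector
    product -- the block-level scan a matrix unit executes in one op."""
    return L @ x

x = np.arange(1, b + 1, dtype=float)
assert np.array_equal(block_scan(x), np.cumsum(x))
```

A full scan then stitches block results together with the vector unit, which is where the recursive decomposition enters.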
Computation patterns with high arithmetic intensity (AI), such as GEMM, QR factorization, and matrix-function evaluation (polynomial iterations), are especially well suited for MPUs, which can saturate theoretical throughput when the input is appropriately tiled and fed from high-bandwidth on-chip or stacked memory. The roofline model bounds achievable performance as P = min(P_peak, AI × B_mem)—MPUs achieve the compute roof for high-AI kernels but are bandwidth-bound for memory-limited workloads (Domke et al., 2020).
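The roofline bound is simple enough to compute directly. The sketch below uses the 312 TFLOPS A100 figure cited in this article and a representative ~1.5 TB/s HBM bandwidth (an assumption for illustration):

```python
def roofline(peak_flops, mem_bw, ai):
    """Attainable performance P = min(P_peak, AI * B_mem)."""
    return min(peak_flops, ai * mem_bw)

# A100-class figures: 312 TFLOP/s peak, ~1.5 TB/s memory bandwidth (assumed).
peak, bw = 312e12, 1.5e12
ridge = peak / bw  # AI at which kernels cross from bandwidth- to compute-bound
assert roofline(peak, bw, 10) == 10 * bw   # low-AI kernel: bandwidth-bound
assert roofline(peak, bw, 1000) == peak    # high-AI GEMM: hits the compute roof
assert ridge == 208.0
```

The ridge point (here 208 flops/byte) explains why memory-limited kernels leave the matrix unit idle regardless of its peak rate.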
In the PIM context, mMPUs implement matrix-vector and convolutional primitives in-place via row/partition parallelism, leveraging memristive stateful logic and crossbar arrays to minimize off-array data movement. Algorithms exploit block partitioning, parallel reduction trees, and popcount-based circuits for efficient binary MVM and convolution (Leitersdorf et al., 2022).
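The popcount reformulation for binary MVM rests on a standard identity: for ±1 vectors encoded as bits, dot(w, x) = n − 2·popcount(w XOR x). A minimal sketch (software emulation only; real mMPUs compute the popcount in-array with stateful logic):

```python
import numpy as np

def binary_mvm(W_bits, x_bits, n):
    """Binary matrix-vector multiply over {+1,-1} values packed as bits:
    dot(w, x) = n - 2 * popcount(w XOR x), the popcount reformulation
    that in-memory (mMPU) designs evaluate inside the crossbar."""
    out = []
    for w in W_bits:
        hamming = bin(w ^ x_bits).count("1")
        out.append(n - 2 * hamming)
    return out

# Check against a direct +/-1 computation.
n = 8
rng = np.random.default_rng(1)
W = rng.integers(0, 2, size=(3, n))   # bit rows (1 -> +1, 0 -> -1)
x = rng.integers(0, 2, size=n)
pack = lambda bits: int("".join(map(str, bits)), 2)
ref = (2 * W - 1) @ (2 * x - 1)
assert binary_mvm([pack(r) for r in W], pack(x), n) == list(ref)
```

Row/partition parallelism then evaluates many such popcounts concurrently across crossbar partitions.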
3. Software Ecosystem, ISA Extensions, and Practical Programming
MPUs are surfaced to software via dedicated ISA extensions, low-level libraries, and JIT microkernel generation frameworks. Key examples include:
- ARM SME (Scalable Matrix Extension): FMOPA (floating-point matrix outer-product accumulate), BFMOPA (BF16), SMOPA (INT8), with a hierarchical ZA register array and explicit control of tile partitioning and predication (Remke et al., 2024).
- RISC-V (Quadrilatero): exposes mmac, mz, mld.w, mst.w for matrix zeroing, load, store, and MAC; tiles are managed through a tight coupling of controller, LSU, permutation unit, and systolic array (Cammarata et al., 10 Apr 2025).
- Intel AMX, IBM Power10 MMA, NVIDIA cuBLAS/cublasGemmEx, or custom microkernels JIT-assembled for small GEMMs.
The programming interface typically falls into two categories: (1) callable library routines (e.g., cuBLAS, MKL) which auto-dispatch to MPU/matrix units when appropriate; (2) explicit vector-matrix code utilizing intrinsics or assembly, often facilitated by JIT code generation for shape-specific microkernels, as seen in LIBXSMM and “Hello SME!” (Remke et al., 2024). Developers must consider tile size, register bandwidth, scratchpad fit, and instruction scheduling for maximal utilization.
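The tiling concerns above — tile size, accumulator residency, loop order — are visible even in a toy blocked GEMM. This NumPy sketch is illustrative only (real microkernels are shape-specialized JIT assembly, as in LIBXSMM); the `mt`/`nt`/`kt` tile sizes are arbitrary stand-ins for hardware tile dimensions:

```python
import numpy as np

def tiled_gemm(A, B, mt=4, nt=4, kt=4):
    """Three-level tiled GEMM: the outer loops walk fixed-size tiles that a
    matrix unit would hold in its register file; the innermost update
    models the single tile-MAC instruction (here an np.dot)."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % mt == 0 and N % nt == 0 and K % kt == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, mt):
        for j in range(0, N, nt):
            acc = np.zeros((mt, nt), dtype=A.dtype)  # accumulator tile stays resident
            for k in range(0, K, kt):
                acc += A[i:i+mt, k:k+kt] @ B[k:k+kt, j:j+nt]
            C[i:i+mt, j:j+nt] = acc
    return C

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8)); B = rng.standard_normal((8, 8))
assert np.allclose(tiled_gemm(A, B), A @ B)
```

Keeping `acc` resident across the k-loop is the software analogue of the register-file accumulation that the ISA extensions above expose directly.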
Matrix ISAs are now expanding to support sparsity-tailored instructions, gather/scatter, and “densifying” fields to map irregular accesses and boost PE utilization for sparse GEMM and DNNs. DARE’s densifying ISA and filtered runahead execution exemplify co-designed ISA/microarchitecture solutions for irregular, memory- and compute-bound regimes (Yang et al., 19 Nov 2025).
4. Application Domains and Performance Impact
MPUs’ dominant use case is dense GEMM, enabling state-of-the-art throughput in deep neural networks (CNNs, Transformers, RNNs), dense linear algebra (HPL, LAPACK QR/solve), and matrix-centric scientific kernels including particle-in-cell (PIC), stencil, and PDE solvers (Domke et al., 2020, Lewis et al., 2021, Rao et al., 13 Jan 2026, Zhao et al., 2023). System-level implications are profound: in DNNs where GEMM dominates wall time, end-to-end speedups approach the matrix unit's kernel speedup. For classic HPC, where GEMM is a modest fraction (~3.5%) of wall time, the impact is smaller—MPUs are idle much of the time, and the speedup is Amdahl-limited (Domke et al., 2020).
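The asymmetry between DNN and HPC workloads follows directly from Amdahl's law; a quick calculation (the 90% GEMM fraction and 8× kernel speedup are hypothetical inputs, the ~3.5% HPC fraction is from the text):

```python
def amdahl(f, s):
    """End-to-end speedup when only a fraction f of wall time is
    accelerated by factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / s)

# DNN-like workload: 90% of time in GEMM, 8x matrix-unit speedup (assumed).
assert round(amdahl(0.90, 8.0), 2) == 4.71
# Classic HPC mix: GEMM is ~3.5% of wall time, so the asymptotic ceiling
# is 1 / (1 - 0.035) ~= 1.036x no matter how fast the matrix unit is.
assert amdahl(0.035, 1e12) < 1.037
```

This is why the article's later observation of <3% end-to-end speedup on GEMM-light systems is a property of the workload mix, not the hardware.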
Cutting-edge codes such as MatrixPIC demonstrate how non-GEMM scientific kernels can be algorithmically reformulated—e.g., expressing scatter-adds as block matrix outer products mapped to MOPA units, paired with VPU-based control and lightweight incremental sorting structures for data localization. In these co-designed pipelines, >80% of theoretical peak efficiency is observed, with substantially faster kernel runtimes than the corresponding CUDA kernels and 2–9x end-to-end speedups for complex kernel chains (Rao et al., 13 Jan 2026). High-order stencils mapped to outer-product pipelines yield 2–4x speedups over SIMD implementations, given sufficient register and cache utilization (Zhao et al., 2023).
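The scatter-add-as-outer-product idea can be seen in miniature with standard linear-shape (cloud-in-cell) charge deposition: a particle's 2×2 grid update is exactly its charge times the outer product of its 1D weight vectors. This toy single-particle sketch is an assumption-laden simplification — MatrixPIC batches many particles into block outer products per tile operation:

```python
import numpy as np

def deposit(rho, x, y, q):
    """Linear-shape charge deposition: the 2x2 scatter-add around one
    particle equals q * outer(wx, wy), the structure that lets deposition
    map onto an outer-product (MOPA) unit."""
    i, j = int(x), int(y)
    fx, fy = x - i, y - j
    wx = np.array([1 - fx, fx])   # 1D weights along x
    wy = np.array([1 - fy, fy])   # 1D weights along y
    rho[i:i+2, j:j+2] += q * np.outer(wx, wy)

rho = np.zeros((8, 8))
deposit(rho, 2.25, 3.5, 1.0)
assert np.isclose(rho.sum(), 1.0)  # total deposited charge is conserved
```

The incremental-sorting structures mentioned above serve to gather particles whose updates land in the same tile, so the batched outer products stay dense.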
In edge/IoT scenarios, design constraints emphasize sub-mm² area, single-cycle FPU/MPU integration, and energy-per-MAC. Quadrilatero achieves 99.4% FPU utilization and up to 77% higher area efficiency than comparable vector designs, making such units practical for embedded AI workloads under severe power budgets (Cammarata et al., 10 Apr 2025).
For large-scale distributed linear algebra, Google TPUs (viewed as MPUs) deliver O(20 PFLOPS) on petascale matrix multiplications using MXU-bound SUMMA algorithms and 2D interconnects—the block-checkerboard mapping of panels to MXUs keeps the computation compute-bound for sufficiently large matrices, with the communication complexity characteristic of 2D SUMMA (Lewis et al., 2021).
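SUMMA's structure is easy to show without MPI: at each panel step, the owners of a column panel of A and a row panel of B broadcast along grid rows and columns, and every processor applies a local block outer-product update. This single-process simulation (grid size and panel width are arbitrary here) captures that dataflow:

```python
import numpy as np

def summa(A, B, p=2):
    """SUMMA on a simulated p x p processor grid: one broadcast step per
    panel, then every (i, j) processor performs a local rank-b update --
    the pattern mapped onto TPU MXUs over the 2D interconnect."""
    n = A.shape[0]
    b = n // p
    C = np.zeros_like(A)
    for k in range(p):                   # panel step k
        Acol = A[:, k*b:(k+1)*b]         # broadcast along grid rows
        Brow = B[k*b:(k+1)*b, :]         # broadcast along grid columns
        for i in range(p):               # each grid processor's local
            for j in range(p):           # block outer-product update
                C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                    Acol[i*b:(i+1)*b] @ Brow[:, j*b:(j+1)*b])
    return C

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8)); B = rng.standard_normal((8, 8))
assert np.allclose(summa(A, B), A @ B)
```

Because each processor only ever sees panels, not whole matrices, communication per step is proportional to panel size, which is what keeps the MXUs compute-bound at scale.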
5. Specialized MPU Classes: Integer and In-memory Processing
MPUs are increasingly typified as integer-only matrix engines (IMMUs) in AI hardware. These units generalize sparse/dense INT8/MMMA accelerators for low-precision inference (e.g., quantized deep learning) but, through error-free splitting techniques (Ozaki scheme), can perform high-precision FP64 matrix multiplications and matrix-vector products at 3–6x the throughput of classical FP64 DGEMM, with IEEE-compliant accuracy (Ootomo et al., 2023). The precision-performance tradeoff is governed by the number of splits and accumulator mantissa bits.
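The error-free splitting idea can be demonstrated with the classic two-way Dekker/Veltkamp split, a drastically simplified stand-in for the Ozaki scheme (which uses multiple slices and integer accumulation on the matrix engine):

```python
import numpy as np

def dekker_split(A, s=27):
    """Two-way error-free split: A = A_hi + A_lo exactly, with each part
    carrying at most 53 - s significant bits, so elementwise products of
    parts incur no rounding. The Ozaki scheme generalizes this to several
    narrow slices that fit a low-precision/integer matrix engine."""
    factor = 2.0 ** s + 1.0
    c = factor * A
    A_hi = c - (c - A)
    A_lo = A - A_hi
    return A_hi, A_lo

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4)); B = rng.standard_normal((4, 4))
Ah, Al = dekker_split(A); Bh, Bl = dekker_split(B)
assert np.all(Ah + Al == A)                  # the split itself is exact
C = Ah @ Bh + Ah @ Bl + Al @ Bh + Al @ Bl    # four partial GEMMs
assert np.allclose(C, A @ B)
```

The precision-performance tradeoff mentioned above corresponds to the number of such slices: more slices mean more partial GEMMs on the integer unit in exchange for tighter accuracy.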
Memristive MPUs (mMPUs) unify storage and logic with stateful gate arrays. Blocked partitioning, in-place popcount accumulations, and crossbar partitioning enable order-of-magnitude speedups for binary and full-precision MVM/convolution, with energy and area benefits that scale to high problem sizes by stacking/partitioning arrays (Leitersdorf et al., 2022).
6. Limitations, Challenges, and Evolving Directions
Despite their architectural efficiency, MPUs are limited by input arithmetic intensity, data-movement bottlenecks (memory-bound cases), and the lack of cross-vendor portability in ISAs and kernel compilation. A system spending only a few percent of its wall time in GEMM sees <3% end-to-end speedup, highlighting the importance of algorithmic affinity. “Dark silicon” (power-envelope competition between FPUs and MPUs) and hardware/software co-design barriers persist (Domke et al., 2020).
Sparse and irregular workloads require new ISAs (densifying opcodes), scheduling techniques (filtered runahead), and tailored kernel mapping (coalescing strided accesses) to achieve high PE utilization and prefetching efficiency, as in DARE, which achieves 1–4× performance gains and up to 22.8× energy improvements in sparse DNN kernels (Yang et al., 19 Nov 2025).
Separately, the abbreviation MPU also denotes matrix-product unitaries in quantum simulation and tensor-network mathematics, used to represent and evolve 1D quantum chains, with deep connections to quantum cellular automata, graded tensor networks, and classification via index theorems (Piroli et al., 2020).
7. Summary Table: Representative MPU Architectures
| System | Tile Size / PE | Precision | Peak Throughput | Key Features | Ref. |
|---|---|---|---|---|---|
| NVIDIA Tensor Core | — | FP16→FP32 | 312 TFLOPS (A100) | Hybrid-precision MAC, systolic engine | (Domke et al., 2020) |
| Apple M4 SME | FP32 outer product | FP32, INT8, BF16 | — | FMOPA, ZA matrix registers, two-step load/store | (Remke et al., 2024) |
| Quadrilatero | — | FP32 | 2.24 GFLOPS | RISC-V ISA, <1 mm², 99.4% FPU utilization | (Cammarata et al., 10 Apr 2025) |
| INT8 IMMU | — | INT8→INT32 | — | Error-free partitioned FP64 via Ozaki scheme | (Ootomo et al., 2023) |
| TPU v3 MXU | — | FP32 | 21 PFLOPS (2048 cores) | Large-scale distributed linear algebra | (Lewis et al., 2021) |
| MatPIM mMPU | Crossbar, partition-parallel | Full/bit precision | — | In-memory MAC, popcount, tree reduction | (Leitersdorf et al., 2022) |
MPUs thus span a spectrum from embedded AI to exascale HPC, with architectures tuned for area, energy, throughput, or irregularity tolerance as dictated by target workloads. Their continuing evolution is driven by advances in microarchitecture, co-designed ISAs, compiler frameworks, and cross-domain algorithmic innovation.