Matrix-Free (PA) Methods

Updated 20 January 2026

Matrix-free (PA) methods are computational techniques that bypass explicit matrix assembly by recomputing local contributions on the fly, reducing memory and bandwidth requirements.
They leverage sum-factorization and on-the-fly quadrature to achieve high arithmetic intensity and efficient performance in high-order and multiphysics simulations.
Integrating advanced code generation and robust multigrid solvers, these methods enable significant speedups and scalability in solving complex PDEs.

A matrix-free method, often referred to as partial assembly (PA), is a computational strategy in which the global system matrix—ubiquitous in finite element, finite difference, or optimization-based solvers—is never formed explicitly. Instead, only the operator's action on a vector is evaluated by recomputing the necessary local or pointwise contributions on the fly, thereby avoiding the extreme storage and memory bandwidth costs associated with assembling and accessing large global matrices.

1. Concept and Mathematical Principles

Matrix-free methods perform operator application directly, exploiting the bilinear or variational structure present in most PDE discretizations or optimization problems. For a typical finite element problem where the weak form defines an operator $A$ , a matrix-based code computes and stores each entry $A_{ij}$ , then assembles the global sparse matrix and performs $y = A x$ . Matrix-free approaches eschew this: the residual or Jacobian action is recomputed using local quadrature or domain-decomposed contributions, followed by efficient summation (often via cache-friendly routines).

For a weak form

$y_i = \sum_{K \in \mathcal{T}_h} \int_{K} (\nabla u \cdot \nabla \phi_i + u \phi_i) \, dx$

the matrix-free method loops over mesh cells $K$ , computes contributions (using only the required vector data and geometric metrics), and scatters results to the global vector $y$ (Sun et al., 2019).

This general template holds for nonlinear and time-dependent systems, for linear and nonlinear residuals, and extends to operator evaluation in optimization and inverse problems—a notable example being convex cone programs where $A$ is represented as a composition of fast linear transforms instead of an explicit matrix (Diamond et al., 2015).

2. Matrix-Free Algorithms for High-Order Discretizations

Matrix-free methods achieve their most significant computational advantage in high-order (large $p$ ) finite element and spectral discretizations. Key enabling algorithms include:

Sum-Factorization and Tensor-Product Structure: For cuboidal or hexahedral elements, multidimensional shape function evaluation and quadrature can be recursively reduced to 1D operations by exploiting tensor-product bases. The per-element work drops from $O(p^{2d})$ (naïve matvec) to $O(d p^{d+1})$ , with memory and compute costs reduced by orders of magnitude (Müthing et al., 2017, Chang et al., 13 Jan 2026).
On-The-Fly Quadrature Kernels: For nonlinear elasticity or hyperelastic models, the action of the tangent operator at each quadrature point is generated automatically and evaluated on the fly, often using tools like AceGen (symbolic/AD code generation) to avoid explicit formation of fourth-order tensors. This leads to compute-intensity far exceeding that of sparse matrix–vector products and enables high exploitation of SIMD or GPU resources (Wichrowski et al., 21 May 2025).
Partial Assembly on Non-standard Topologies: For unstructured tetrahedral grids or hybrid meshes, matrix-free evaluation proceeds by pre-tabulating basis derivatives and mass/laplacian-like forms, then applying local dense/matrix-matrix contractions, with explicit vectorization and data locality optimizations to reach high node-level throughput (Still et al., 12 Sep 2025, Böhm et al., 2024).

Algorithmic Kernel	Asymptotic Work per Element	Limiting Regime
Full matrix assembly	$O(p^{2d})$	Memory/bandwidth bound
Matrix-free sum-factorized	$O(d\,p^{d+1})$	Compute bound, high AI

*AI: arithmetic intensity

3. Implementation Strategies and Code Generation

Matrix-free PA methods rely on sophisticated software architecture for optimal efficiency. Salient strategies include:

Automatic Code Generation: Use of symbolic or automatic differentiation tools (e.g., AceGen, Enzyme, Firedrake’s TSFC/Loopy, HyTeG Operator Generator) which generate local quadrature kernels, possibly fusing loops and exploiting model-specific symmetries for optimal hardware utilization (Wichrowski et al., 21 May 2025, Sun et al., 2019, Böhm et al., 2024).
Data Layout and Vectorization: Inter-element vectorization (dispatching SIMD lanes across cells), structure-of-arrays (SoA) memory alignment, blocking strategies to maximize cache reuse, and careful ordering of assembly/evaluation loops (such as cubes/block-strategies for tetrahedral grids) all elevate arithmetic intensity and minimize memory bandwidth (Sun et al., 2019, Böhm et al., 2024).
Cross-Element and Cross-Kernel Fusion: Loop-fusing all local operations (geometry, metric evaluation, constitutive law, test/trial contractions) into a single kernel drastically improves temporal data locality compared to kernel-splitting with global intermediates (Chang et al., 13 Jan 2026). Macro-kernel fusion has led to order-of-magnitude kernel speedups (e.g., up to $83\times$ in 3D elasticity for $p=8$ (Chang et al., 13 Jan 2026)).

Code-Gen/Optimization	Example Tool/Method	Impact
Loop fusion	Macro-kernel PAop	7–83x kernel speedup ( $p=8$ )
Cross-element SIMD	Firedrake, HOG	$>50$ \% peak, 2–11x vs. scalar
Symbolic AD	AceGen, Enzyme	Zero-diff overhead, maintenance-free

4. Matrix-Free Solvers and Preconditioning

Matrix-free operator application is most impactful when coupled with solvers that do not require an explicit matrix:

Krylov Subspace Methods: GMRES, BiCGSTAB, and CG can be implemented in a strictly matrix-free manner, requiring only operator-vector and preconditioner-vector products (Benedusi et al., 2021, Geranmayeh, 2017). PETSc, Trilinos, and related packages expose shell/implicit matrix APIs to facilitate this.
Matrix-Free Multigrid: Geometric and $p$ -multigrid hierarchies can be constructed by recomputing the local operator at each level with (possibly) different quadrature or mesh realizations; the smoothers (e.g., Chebyshev-Jacobi, matrix-free ILU surrogates) are implemented via direct, pointwise actions (Jodlbauer et al., 2019, Brown et al., 2022, Drzisga et al., 2022, Schussnig et al., 2024, Jodlbauer et al., 2022).
Matrix-Free Preconditioning and Smoothers: For structured or block-diagonal components (e.g., block-Jacobi, Chebyshev, ILU(0) on local patches), surrogate polynomial approximations and direct evaluation replace full storage. For domain decomposition (e.g., Schur-complement/Steklov-Poincaré methods), on-the-fly ILU(0) surrogates on macro-elements provide efficient, memory-lean local smoothers (Drzisga et al., 2022).

Solver/Preconditioner	Matrix-Free Integration Strategy
Krylov (CG, GMRES, etc.)	Operator via mat-vec API only
Geometric or $p$ -multigrid	On-the-fly local operator, transfer op
Smoothers (Jacobi, ILU)	Diagonal/MF-ILU surrogate, Chebyshev

5. Quantitative Performance and Scaling

Extensive benchmarks demonstrate that modern matrix-free implementations, properly optimized, can attain:

High Arithmetic Intensity and Node-Level Peak: Achieving $30\%$ – $70\%$ of theoretical node peak for high-order element operators, often outpacing assembled mat-vec by $10\times$ – $100\times$ as soon as $p\geq3$ in 3D (Müthing et al., 2017, Wichrowski et al., 21 May 2025, Chang et al., 13 Jan 2026, Böhm et al., 2024, Still et al., 12 Sep 2025).
Memory Reduction: Memory per DoF for storing operator data drops from $>100$ doubles/DoF (sparse matrix) to $<1$ double/DoF, expanding feasible problem size by an order of magnitude (Wichrowski et al., 21 May 2025, Brown et al., 2022).
End-to-End Solver Speedup: End-to-end time-to-solution speedup for large-scale, high-order problems often exceeds $10$x—e.g., $80$x faster for 3D nonlinear elasticity at $p=4$ (Newton+CG), $17$–$28$x for $hp$ -MG preconditioned hyperelastic simulations (Wichrowski et al., 21 May 2025, Schussnig et al., 2024).
Strong/Weak Scaling: Perfect or near-perfect parallel efficiency on thousands of cores, especially as arithmetic intensity lifts compute-bound kernels out of the memory-bound regime. Extreme-scale solves (e.g., $10^{12}$ DoF in $<50$ s on $>20,000$ ranks) have been demonstrated for hybrid tetrahedral grids (Böhm et al., 2024).

Metric	Matrix-Based (SpMV)	Matrix-Free (PA)
FLOP/Byte (3D, $p=3$ )	$\sim 0.15$	$4.8$–$17$
Memory per DoF	$100$–$1000$ doubles	$0.5$–$3$ doubles
Solver speedup ( $p=3$ –$5$)	baseline	$5$– $30\times$
Node throughput	$0.5$–$2$ MDoF/s	$1$–$2$ GDoF/s

6. Application Domains and Extensions

Matrix-free/PA methods are now foundational in several domains:

Nonlinear/Finite-Strain Elasticity: Automatic code generation with AD enables robust, high-performance Newton–Krylov solvers with per-quadrature-point tangent actions, outperforming matrix-based algorithms for complex constitutive laws (Wichrowski et al., 21 May 2025, Schussnig et al., 2024).
DG and Hybrid Discretizations: High-order DG methods, CutFEM, and unfitted/ghost penalty stabilization can all be implemented efficiently with matrix-free local quadrature; new tensor-product-based PA algorithms yield $O(k^{d+1})$ complexity for stabilization terms (Wichrowski, 28 Feb 2025, Bergbauer et al., 2024).
Optimization and Inverse Problems: Fast transform (FFT, convolutional, Kronecker product, etc.) based operators are natively modeled and solved matrix-free via composition DAGs (Forward-Adjoint Oracle), enabling scaling to $n>10^7$ (Diamond et al., 2015).
Multiphysics and Multiscale: PA techniques extend to coupled PDE systems (e.g., Stokes, hyperelasticity, phase-field/fracture), incompressible flows (Uzawa/Chebyshev-Jacobi smoothed multigrid), and time-dependent or implicitly-coupled multiphysics (Jodlbauer et al., 2022, Jodlbauer et al., 2019).

Future directions include mixed-precision strategies (single/double-precision inner smoothers, double-precision outer solves), further exploitation of problem structure (Voigt symmetry, low-rank blocks), and performance portability across CPU, GPU, and emerging accelerators (Chang et al., 13 Jan 2026, Brown et al., 2022, Schussnig et al., 2024).

7. Best Practices and Guidance

Matrix-free methods are universally advantageous for high-order ( $p\ge 3$ ) and multi-dimensional ( $d\ge 2$ ) problems; even for $p=1$ in large-scale, parallel environments PA is often competitive (Wichrowski et al., 21 May 2025).
For unfitted and DG discretizations, sum-factorization and tensor-structure must be systematically exploited; in low-order or non-tensor-product settings, dense-matrix kernels and cross-element vectorization can still yield $>50\%$ peak throughput (Still et al., 12 Sep 2025, Sun et al., 2019, Böhm et al., 2024).
Hardware trends—CPU/GPU flops outpacing bandwidth—prioritize compute-bound, on-the-fly local kernels over storage/caching precomputation except for extremely bandwidth-sensitive architectures, especially in 3D (Chang et al., 13 Jan 2026, Bergbauer et al., 2024).
Integration with robust multigrid preconditioners (geometric, $hp$ , hybrid, or algebraic for coarse levels) is essential for scalable, robust solver performance.
Automatic code generation facilitates rapid development for new physics, maintains maintainability, and provides hardware adaptability without sacrificing performance.

In summary, matrix-free (PA) methods are now a mature, central technology driving the efficiency, scalability, and extensibility of modern high-order, high-dimensional, and multiphysics computational science and engineering applications across academic and applied domains (Wichrowski et al., 21 May 2025, Müthing et al., 2017, Schussnig et al., 2024, Chang et al., 13 Jan 2026, Böhm et al., 2024, Diamond et al., 2015).