Tensor-Decomposition Block Multiplication

Updated 27 January 2026
  • Tensor-decomposition-based block multiplication is a method that leverages CP decomposition of the matrix multiplication tensor to decouple block operations and reduce computational complexity.
  • It uses low-rank and sparsification techniques to lower arithmetic costs, enhance data locality, and enable recursive schemes with sub-cubic performance.
  • The approach generalizes to arbitrary block shapes and structured formats, offering efficient implementations for scientific computing and high-dimensional data applications.

Tensor-decomposition-based block multiplication refers to algorithms that accelerate or structure matrix multiplication at the block or submatrix level by leveraging low-rank or structured decompositions of the matrix multiplication tensor. These techniques generalize the philosophy of Strassen-type bilinear algorithms—from the basic matrix-matrix product to arbitrary block shapes, higher-order tensors, and even non-square or structured (e.g., triangular) formats. The central tool is the canonical polyadic (CP) decomposition of the matrix multiplication tensor, which allows the computation of block products via sequences of smaller, decoupled multiplications, often with enhanced sparsity, improved data locality, and reduced arithmetic complexity, particularly when applied recursively.

1. Matrix Multiplication Tensor and Its CP Decomposition

Matrix multiplication for blocks $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{n \times p}$ can be equivalently described as a bilinear map

$$\varphi: \mathbb{R}^{mn} \times \mathbb{R}^{np} \rightarrow \mathbb{R}^{mp}, \qquad \varphi(\text{vec}\,X, \text{vec}\,Y) = \text{vec}(XY)$$

encoded by a third-order tensor $T_{m,n,p} \in \mathbb{R}^{(mn)\times (np) \times (mp)}$ with

Tm,n,p[(i1)n+j,(j1)p+k,(k1)m+i]=1T_{m,n,p}[(i-1)n + j,\, (j-1)p + k,\, (k-1)m + i] = 1

for all $i=1,\ldots,m$, $j=1,\ldots,n$, $k=1,\ldots,p$. All other entries are zero.

A rank-$R$ canonical polyadic (CP) decomposition is given by

$$T_{m,n,p} = \sum_{r=1}^R a_r \otimes b_r \otimes c_r$$

with factor matrices

$$A = [a_1\,\cdots\,a_R] \in \mathbb{R}^{(mn)\times R}, \quad B = [b_1\,\cdots\,b_R] \in \mathbb{R}^{(np)\times R}, \quad C = [c_1\,\cdots\,c_R] \in \mathbb{R}^{(mp)\times R}.$$

This enables the block-matrix multiplication to be computed as

$$\text{vec}\,Z = C \left[(A^\top x) * (B^\top y)\right]$$

where $x = \text{vec}(X)$, $y = \text{vec}(Y)$, and $*$ denotes elementwise (Hadamard) multiplication (Tichavsky, 2021).
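As a concrete check of these definitions, the following NumPy sketch builds $T_{m,n,p}$ from the index formula above (converted to 0-based indices) and evaluates a block product through the trivial rank-$mnp$ decomposition that places one rank-one term on each nonzero of the tensor. Function names are illustrative, not from the cited papers:

```python
import numpy as np

def matmul_tensor(m, n, p):
    """Multiplication tensor T_{m,n,p}: nonzero at
    [i*n + j, j*p + k, k*m + i] (0-based), i.e. x and y index
    row-major vectorizations of X and Y, z a column-major one of Z."""
    T = np.zeros((m * n, n * p, m * p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                T[i * n + j, j * p + k, k * m + i] = 1.0
    return T

def trivial_cp_factors(m, n, p):
    """Rank-mnp CP decomposition: one rank-one term per nonzero of T."""
    R = m * n * p
    A = np.zeros((m * n, R)); B = np.zeros((n * p, R)); C = np.zeros((m * p, R))
    for r, (i, j, k) in enumerate(np.ndindex(m, n, p)):
        A[i * n + j, r] = B[j * p + k, r] = C[k * m + i, r] = 1.0
    return A, B, C

m, n, p = 2, 3, 4
X, Y = np.random.randn(m, n), np.random.randn(n, p)
A, B, C = trivial_cp_factors(m, n, p)
x = X.reshape(-1)                        # row-major vec of X
y = Y.reshape(-1)                        # row-major vec of Y
z = C @ ((A.T @ x) * (B.T @ y))          # vec Z = C[(A^T x) * (B^T y)]
Z = z.reshape(m, p, order="F")           # undo column-major vec of Z
assert np.allclose(Z, X @ Y)
```

The trivial decomposition saves nothing over the naive product; its purpose here is only to make the vectorization conventions and the three-factor evaluation formula concrete.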

2. Signature Vectors, Decomposition Classification, and Sparsification

Key to the study of CP decompositions for block multiplication is the concept of the decomposition signature $s \in \mathbb{R}^R$, which aggregates how much each rank-one component “covers” the target tensor:

$$s_r = \mathbf{1}^\top \left[F(:,r) * C(:,r)\right], \qquad F(:,r) = \text{vec}\!\left(B(:,r)\,A(:,r)^\top\right)$$

with the total sum $\sum_{r=1}^R s_r = mnp$. The signature is invariant under De Groote group transformations (invertible slice-wise basis changes), classifying decompositions up to these equivalences (Tichavsky, 2021). This is crucial for identifying genuinely distinct algorithms.
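Under one consistent reading of this definition, $s_r$ is the inner product of the target tensor with the $r$-th rank-one term, $s_r = \langle T_{m,n,p},\, a_r \circ b_r \circ c_r\rangle$; for an exact decomposition of the 0/1 tensor these values do sum to $mnp$. A minimal NumPy sketch under that assumption, using the trivial rank-$mnp$ decomposition (one rank-one term per nonzero) purely as an illustration:

```python
import numpy as np

m, n, p = 2, 2, 2
# Multiplication tensor T_{m,n,p} (0-based version of the index formula).
T = np.zeros((m * n, n * p, m * p))
for i in range(m):
    for j in range(n):
        for k in range(p):
            T[i * n + j, j * p + k, k * m + i] = 1.0

# Trivial rank-8 decomposition: one rank-one term per nonzero of T.
idx = np.argwhere(T == 1.0)
R = len(idx)
A = np.zeros((m * n, R)); B = np.zeros((n * p, R)); C = np.zeros((m * p, R))
for r, (a, b, c) in enumerate(idx):
    A[a, r] = B[b, r] = C[c, r] = 1.0

# Signature: s_r = <T, a_r o b_r o c_r> (assumed interpretation).
s = np.einsum('abc,ar,br,cr->r', T, A, B, C)
assert np.isclose(s.sum(), m * n * p)   # signatures sum to mnp
```

For this trivial decomposition every $s_r = 1$; nontrivial decompositions such as Strassen's distribute the total $mnp$ unevenly across terms.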

Sparsification and integerization via De Groote transforms or signature-enforced ALS make factor matrices sparser and entries smaller, reducing arithmetic operations and improving cache/TLB locality during block kernel executions (Tichavsky, 2021).

3. Algorithms: Block Computation via CP and Kronecker Structures

Given a CP decomposition of the multiplication tensor, the multiplication is carried out by:

  1. Computing $u = A^\top x$ (for the left block) and $v = B^\top y$ (for the right block).
  2. Elementwise multiplying $w = u * v$ (Hadamard product).
  3. Recovering the result as $z = Cw$.
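For the classic $2 \times 2$ case, the three steps can be sketched with Strassen's rank-7 factors. The coefficients below are the standard Strassen coefficients, arranged here for a row-major vectorization of $X$ and $Y$ and a column-major vectorization of $Z$ (matching the index convention of the multiplication tensor); this particular layout is our choice for the sketch, not prescribed by the cited papers:

```python
import numpy as np

# Columns of A, B, C encode Strassen's 7 bilinear products M1..M7.
# x = [X11, X12, X21, X22], y = [Y11, Y12, Y21, Y22] (row-major),
# z = [Z11, Z21, Z12, Z22] (column-major).
A = np.array([[1,  0, 1, 0, 1, -1,  0],
              [0,  0, 0, 0, 1,  0,  1],
              [0,  1, 0, 0, 0,  1,  0],
              [1,  1, 0, 1, 0,  0, -1]], dtype=float)
B = np.array([[1,  1, 0, -1, 0, 1, 0],
              [0,  0, 1,  0, 0, 1, 0],
              [0,  0, 0,  1, 0, 0, 1],
              [1,  0, -1, 0, 1, 0, 1]], dtype=float)
C = np.array([[1,  0, 0, 1, -1, 0, 1],
              [0,  1, 0, 1,  0, 0, 0],
              [0,  0, 1, 0,  1, 0, 0],
              [1, -1, 1, 0,  0, 1, 0]], dtype=float)

X, Y = np.random.randn(2, 2), np.random.randn(2, 2)
u = A.T @ X.reshape(-1)                # step 1: 7 linear forms in X
v = B.T @ Y.reshape(-1)                # step 1: 7 linear forms in Y
w = u * v                              # step 2: Hadamard product (7 multiplies)
Z = (C @ w).reshape(2, 2, order="F")   # step 3: recover Z from vec Z
assert np.allclose(Z, X @ Y)
```

Only 7 genuine multiplications occur in the Hadamard stage, versus 8 in the naive product; the savings in $A$, $B$, $C$ are pure additions.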

In practical scenarios, such as block kernels for $3 \times 3$ by $3 \times 6$ multiplication, sparse integer-weighted $A$, $B$, $C$ matrices yield highly efficient computation with precomputed and compressed storage. For sufficiently large matrices, recursive application using block partitioning admits sub-cubic asymptotic complexity with exponent $\omega < 3$ (Tichavsky, 2021).

The Kronecker-CP (“KCP”) decomposition further exploits block structure by hierarchically combining CP decompositions over Kronecker products. Fast algorithms contract the tensor factors either strictly in sequence (Algorithm 1) or in a grouped/pairwise way (Algorithm 2), achieving formal arithmetic costs competitive with or superior to traditional TT, BT, TR, or HT decompositions for high-dimensional blocks (Wang et al., 2020).

4. Theoretical and Computational Complexity

The naive $m \times n \times p$ block matrix multiply performs $mnp$ multiplications and on the order of $mnp$ additions. CP-based schemes have operation counts determined by the nonzero patterns of $A$, $B$, and $C$:

  • For the $3 \times 3$ by $3 \times 6$ rank-40 kernel, approximately $45\%$ of the arithmetic is saved versus the full dense computation, even before recursion.
  • When used recursively, block schemes yield sub-cubic complexity (e.g., $\omega \approx 2.81$ for Strassen-type recursions; $\omega = \log_2 7 \approx 2.807$ for the original $2 \times 2$ Strassen kernel).
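The recursive mechanism behind these exponents can be sketched for the Strassen kernel (power-of-two sizes only; the `cutoff` parameter, an assumption of this sketch, falls back to a dense kernel on small blocks). Seven recursive calls per level give $T(n) = 7\,T(n/2) + O(n^2)$, hence $O(n^{\log_2 7})$:

```python
import numpy as np

def strassen(X, Y, cutoff=64):
    """Recursive Strassen multiply for n x n matrices, n a power of two.
    Illustrative sketch, not a tuned implementation."""
    n = X.shape[0]
    if n <= cutoff:
        return X @ Y                     # dense base-case kernel
    h = n // 2
    X11, X12, X21, X22 = X[:h, :h], X[:h, h:], X[h:, :h], X[h:, h:]
    Y11, Y12, Y21, Y22 = Y[:h, :h], Y[:h, h:], Y[h:, :h], Y[h:, h:]
    # Seven block products instead of eight:
    M1 = strassen(X11 + X22, Y11 + Y22, cutoff)
    M2 = strassen(X21 + X22, Y11, cutoff)
    M3 = strassen(X11, Y12 - Y22, cutoff)
    M4 = strassen(X22, Y21 - Y11, cutoff)
    M5 = strassen(X11 + X12, Y22, cutoff)
    M6 = strassen(X21 - X11, Y11 + Y12, cutoff)
    M7 = strassen(X12 - X22, Y21 + Y22, cutoff)
    Z = np.empty((n, n))
    Z[:h, :h] = M1 + M4 - M5 + M7
    Z[:h, h:] = M3 + M5
    Z[h:, :h] = M2 + M4
    Z[h:, h:] = M1 - M2 + M3 + M6
    return Z
```

The same recursion template applies to any rectangular base kernel; a rank-$R$ kernel for $m \times n$ by $n \times p$ blocks yields the corresponding sub-cubic exponent.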

For evenly blocked prime matrices satisfying block commutativity, the “$T_1$-decomposition” reduces the computational cost from $O(n^3)$ to $O(n^{5/2})$, provided the structure matches the algebraic criteria (Wang, 2024).

In structured block multiplication (e.g., general, symmetric, triangular), custom flip-graph searches identify explicit low-rank decompositions, and the resulting schemes yield improved leading constants in recursive block-based algorithms, outperforming general-purpose schemes for the relevant structure (Khoruzhii et al., 13 Nov 2025).

5. Generalizations: Block-Convolutional, t-Product, and Higher-Order Schemes

The block multiplication paradigm generalizes to block-convolutional and t-product based tensor operations:

  • The classical “t-product” realizes multiplication as block-circulant convolution, efficiently diagonalizable via FFT, and admits SVD-like decompositions (t-SVD).
  • The $\star_c$-product introduces block-reflective convolution, yielding block Toeplitz-plus-Hankel multiplication structures that are diagonalizable via the real DCT. Reflective boundary conditions via DCT lead to faster, real-valued arithmetic and better boundary handling in many applications (Molavi et al., 2023, Xu et al., 2019).
  • Both schemes allow direct, block-matrix interpretations and admit fast (quasi-linear) implementations, which are especially effective for domains such as image and signal processing.

The associated block Toeplitz-plus-Hankel (BTPH) matrices are diagonalized via DCT, formalizing the equivalence between the DCT-version of the t-product and a single large block-matrix operation (Xu et al., 2019).
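The FFT diagonalization of the t-product can be sketched in a few lines of NumPy; the direct block-circulant loop below serves only as a correctness reference, and both function names are illustrative:

```python
import numpy as np

def t_product(A, B):
    """t-product of third-order tensors via FFT along the tube (3rd) mode:
    block-circulant multiplication diagonalizes into independent
    facewise matrix products in the Fourier domain."""
    Ah = np.fft.fft(A, axis=2)
    Bh = np.fft.fft(B, axis=2)
    Ch = np.einsum('ijk,jlk->ilk', Ah, Bh)   # frontal-slice products
    return np.real(np.fft.ifft(Ch, axis=2))

def t_product_direct(A, B):
    """Reference: circular convolution of frontal slices along the tube mode."""
    n3 = A.shape[2]
    C = np.zeros((A.shape[0], B.shape[1], n3))
    for t in range(n3):
        for s in range(n3):
            C[:, :, t] += A[:, :, s] @ B[:, :, (t - s) % n3]
    return C
```

Replacing the FFT by a real DCT (with the matching Toeplitz-plus-Hankel algebra) gives the $\star_c$-product variant described above.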

6. Structural Constraints, Limitations, and Practical Implementation

The existence of an efficient tensor-decomposition-based multiplication depends on algebraic properties:

  • For blocked matrices, simultaneous diagonalizability or commutativity of diagonal blocks is generally required to admit efficient CP- or Kronecker-based decomposition (Wang, 2024).
  • Integer signatures in the decomposition support block-sparsity and hardware-friendly kernels.
  • Some decompositions (e.g., the $T_1$, $T_2$, $T_3$ classes) provide sufficient but not necessary conditions for efficiency; such structure may not exist for arbitrary block shapes or for all $n$.

Hidden constants in the $O(n^{5/2})$ or $\Theta(n^\omega)$ scalings depend on block sizes, decomposition rank, eigendecomposition overhead, and data locality (Tichavsky, 2021, Wang, 2024). Not every matrix or block structure admits a useful tensor decomposition, and checking preconditions (e.g., commutativity) carries its own computational cost.

Block kernel implementation leverages compressed-sparse representations, vectorized instructions for the Hadamard stage, and careful scheduling for data reuse (Tichavsky, 2021, Wang et al., 2020). Parallelism is accessible at the granularity of independent rank-one or Kronecker terms.

7. Applications, General Block-Size Extension, and Recent Advances

Tensor-decomposition-based block multiplication is foundational in fast matrix multiplication, scientific computing, compressed RNNs, and high-dimensional data analysis. In RNNs, KCP decompositions deliver high compression and speedup ratios while maintaining accuracy, with parallelization potential due to independent branches (Wang et al., 2020).

For general block size extension:

  1. Existing CP or Kronecker-structured decompositions are studied for low rank and integer signatures.
  2. Factor signatures and their rank profiles are computed and integerized as needed.
  3. De Groote-based sparsification drives implementation efficiency and memory footprint reduction.
  4. Block-convolutional and t-product schemes support generalization to arbitrary tensor orders or additional dimensions (Tichavsky, 2021, Molavi et al., 2023, Xu et al., 2019).

Recent advances include lifting explicit low-rank schemes for structured blocks from finite fields to rational or integer-valued decompositions, recursion using non-square kernels, and optimization-driven discovery of fast block-multiplication schemes via flip-graph search and quantum-annealing-based techniques (Khoruzhii et al., 13 Nov 2025, Uotila, 2024).

