Nonlinear Matrix Decompositions (NMD)
- Nonlinear Matrix Decompositions are methods that approximate a data matrix by applying nonlinear functions to low-rank factors, enabling enhanced modeling for sparse and nonnegative data.
- The approach employs various algorithmic paradigms such as block coordinate descent, ADMM, and gradient-based methods, which improve convergence and computational efficiency.
- Empirical studies show that NMD techniques yield lower reconstruction errors and better memory efficiency than traditional linear SVD methods in diverse applications.
Nonlinear Matrix Decompositions (NMD) generalize classical low-rank matrix approximations by introducing nonlinearity between latent factors and the data reconstruction. In the canonical setting, given a target rank $r$ and data matrix $X \in \mathbb{R}^{m \times n}$, NMD seeks factors $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that $X \approx f(WH)$, where $f$ is an element-wise nonlinear function. This abstraction subsumes models with diverse application domains, including data compression, manifold learning, matrix completion, nonnegative and structured data modeling, and dynamical system analysis. Recent years have seen focused methodological development, particularly for nonlinearities related to ReLU (rectified linear unit), radial basis functions, and broader activation functions, as well as algorithmic advances facilitating large-scale and robust computation.
1. Mathematical Frameworks for NMD
The defining property of NMD is the incorporation of a nonlinear function $f$ in the reconstruction:
$$\min_{\Theta \,:\, \operatorname{rank}(\Theta) \le r} \ \ell\big(X, f(\Theta)\big),$$
where $\ell$ is a loss, usually $\ell_2$ (Frobenius), $\ell_1$, or a divergence (e.g., KL) (Awari et al., 19 Dec 2025, Seraghiti et al., 2023). Choices for $f$ commonly found in the literature include:
- ReLU: $f(x) = \max(0, x)$, suited for nonnegative and sparse matrices.
- Elementwise square: $f(x) = x^2$, applicable to probabilistic and circuit representations.
- Bounded (MinMax) transform: $f(x) = \min(b, \max(a, x))$ for bounds $a < b$, as in recommender systems.
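The entrywise maps above are straightforward to implement; the following minimal numpy sketch shows all three (function names are illustrative, not taken from the cited works):

```python
import numpy as np

def relu(theta):
    """ReLU: f(x) = max(0, x)."""
    return np.maximum(0.0, theta)

def elementwise_square(theta):
    """Elementwise square: f(x) = x**2."""
    return theta ** 2

def minmax(theta, lo=1.0, hi=5.0):
    """Bounded (MinMax) transform: clip entries into [lo, hi],
    e.g. a 1-5 rating scale in a recommender system."""
    return np.clip(theta, lo, hi)

Theta = np.array([[-2.0, 0.5], [3.0, 7.0]])
print(relu(Theta))                # negative entries zeroed
print(elementwise_square(Theta))  # entrywise squares
print(minmax(Theta))              # entries clipped into [1, 5]
```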
An alternative family represents each matrix component with a parameterized nonlinear kernel, e.g., radial basis function (RBF) components,
$$X \approx \sum_{k=1}^{r} \Phi_k \qquad \text{and} \qquad (\Phi_k)_{ij} = \phi\big(\|u_i^{(k)} - v_j^{(k)}\|\big),$$
where $\phi$ can be Gaussian, multiquadric, etc. (Rebrova et al., 2021).
Formulations may be further modified to address explicit rank constraints ($\operatorname{rank}(\Theta) \le r$), regularization (nuclear norm, Tikhonov), or robustness to missing/outlying data (Seraghiti et al., 2023, Wang et al., 2024).
2. Principal Algorithmic Paradigms
A variety of iterative optimization methods are proposed for NMD, each tailored to the structure induced by the chosen nonlinearity.
Block Coordinate and Alternating Minimization
Block coordinate descent (BCD) schemes exploit the separable structure of auxiliary variables:
- Alternating update of nonlinear latent variable $Z$ and (possibly low-rank) $\Theta$, e.g., for ReLU models:
- $Z_{ij} = X_{ij}$ for nonzero entries, else $Z_{ij} = \min(0, \Theta_{ij})$.
- Projection of $Z$ to low rank via truncated SVD (Seraghiti et al., 2023, Gillis et al., 31 Mar 2025).
- Three-block factorization: parameterize $\Theta = WH$ and alternately update $Z$, $W$, $H$ through projection and least squares (Seraghiti et al., 2023, Gillis et al., 31 Mar 2025).
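The alternating scheme above can be sketched in a few lines of numpy. This is a simplified two-block variant (latent $Z$ update plus truncated-SVD projection), with illustrative names and no stopping criterion or extrapolation:

```python
import numpy as np

def bcd_relu_nmd(X, r, iters=300, seed=0):
    """Simplified BCD for ReLU-NMD: alternate the closed-form Z update
    with a rank-r truncated-SVD projection onto Theta."""
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal(X.shape)
    pos = X > 0                                   # support of the data
    for _ in range(iters):
        # Z update: Z = X on the support, min(0, Theta) off the support.
        Z = np.where(pos, X, np.minimum(0.0, Theta))
        # Theta update: best rank-r approximation of Z.
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Theta = (U[:, :r] * s[:r]) @ Vt[:r]
    return Theta

# Synthetic sparse nonnegative data realizable as X = max(0, WH), rank 3.
rng = np.random.default_rng(1)
X = np.maximum(0.0, rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20)))
Theta = bcd_relu_nmd(X, r=3)
err = np.linalg.norm(np.maximum(0.0, Theta) - X) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.2e}")
```

In the three-block variant, the per-iteration truncated SVD is replaced by alternating least-squares updates of $W$ and $H$, avoiding a full SVD at every step.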
Extrapolation and Momentum
Adaptive extrapolation (e.g., Nesterov, blockwise positive/negative momentum) is incorporated for acceleration:
- Extrapolate both $Z$ and $\Theta$ (or $W$ and $H$), tuning the momentum parameter adaptively or with split signs for stability (Seraghiti et al., 2023, Wang et al., 2024, Gillis et al., 31 Mar 2025).
- Residual-corrected schemes (e.g., eBCD-NMD) use past iterates and current residuals to adaptively extrapolate, yielding faster convergence (Gillis et al., 31 Mar 2025).
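As a minimal illustration of the extrapolation step (a generic Nesterov-style update, not the adaptive rule of any one cited scheme):

```python
import numpy as np

def extrapolate(curr, prev, beta=0.6):
    """Momentum extrapolation: step past the current iterate along the
    most recent update direction. beta is the momentum weight, which the
    cited schemes tune adaptively or give split signs per block."""
    return curr + beta * (curr - prev)

Theta_prev = np.zeros((2, 2))
Theta_curr = np.ones((2, 2))
print(extrapolate(Theta_curr, Theta_prev))  # every entry 1 + 0.6*(1 - 0) = 1.6
```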
Proximal and ADMM Approaches
Alternating Direction Method of Multipliers (ADMM) approaches handle a wide array of nonlinearities and losses:
- Introduce an auxiliary variable $Z = WH$, decoupling the nonlinear loss $\ell(X, f(Z))$ from the factorization.
- Minimize the augmented Lagrangian with respect to $W$, $H$, $Z$, and the dual variable $\Lambda$ in an alternating scheme (Awari et al., 19 Dec 2025).
- Proximal updates for $Z$ can be computed in closed form for typical choices of $f$ (ReLU, square, MinMax) and $\ell$ (Frobenius, $\ell_1$, KL divergence).
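For intuition, a sketch of one such closed-form update under the Frobenius loss with ReLU nonlinearity: the entrywise subproblem $\min_z (\max(0,z) - x)^2 + \tfrac{\rho}{2}(z - t)^2$, where $t$ collects the factorization and dual terms, splits into two smooth branches. This is an assumed simplified form of the subproblem, not copied from the cited work:

```python
import numpy as np

def prox_relu_frobenius(x, t, rho):
    """Minimize (max(0, z) - x)**2 + (rho/2)*(z - t)**2 entrywise by
    solving each smooth branch in closed form and keeping the better one."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    z_pos = np.maximum(0.0, (2.0 * x + rho * t) / (2.0 + rho))  # branch z >= 0
    z_neg = np.minimum(0.0, t)                                  # branch z <  0
    obj = lambda z: (np.maximum(0.0, z) - x) ** 2 + 0.5 * rho * (z - t) ** 2
    return np.where(obj(z_pos) <= obj(z_neg), z_pos, z_neg)

print(prox_relu_frobenius(0.5, -1.0, 2.0))  # negative branch wins here: -1.0
```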
Gradient-based RBF Decomposition
Gradient descent and stochastic variants (e.g., Adam) are employed for RBF-based NMD:
- Each parameter vector in the RBF kernel is updated via backpropagated gradients of Frobenius loss (Rebrova et al., 2021).
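A toy numpy sketch of this gradient-based fitting, using a single Gaussian component $X_{ij} \approx \exp(-\gamma (u_i - v_j)^2)$ with scalar parameters and analytic gradients. This is a simplified stand-in for the parameterized RBF models of the cited work, using plain gradient descent rather than Adam:

```python
import numpy as np

def fit_rbf(X, gamma=1.0, lr=0.005, iters=500, seed=0):
    """Gradient descent on the Frobenius loss of a one-component
    Gaussian-RBF model F[i, j] = exp(-gamma * (u[i] - v[j])**2)."""
    rng = np.random.default_rng(seed)
    u = 0.5 * rng.standard_normal(X.shape[0])
    v = 0.5 * rng.standard_normal(X.shape[1])
    losses = []
    for _ in range(iters):
        D = u[:, None] - v[None, :]
        F = np.exp(-gamma * D ** 2)                   # current reconstruction
        losses.append(np.sum((F - X) ** 2))
        G = 2.0 * (F - X) * F * (-2.0 * gamma * D)    # per-entry grad wrt u[i]
        u = u - lr * G.sum(axis=1)                    # sum over columns j
        v = v + lr * G.sum(axis=0)                    # grad wrt v[j]: opposite sign
    return u, v, losses

# Fit a synthetic matrix generated by the same model.
rng = np.random.default_rng(3)
X = np.exp(-(rng.standard_normal((15, 1)) - rng.standard_normal((1, 10))) ** 2)
u, v, losses = fit_rbf(X)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```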
3. Theoretical Properties and Well-Posedness
NMD problems are generally nonconvex and, for many nonlinearities (notably ReLU), nonsmooth. The following properties are established:
- For ReLU-NMD, the direct formulation and the latent-variable formulation (with auxiliary $Z$) can yield different solutions; the latent formulation may be ill-posed even when the original is not, as shown via explicit matrix examples (Gillis et al., 31 Mar 2025).
- BCD and ADMM schemes for factorized three-block models have convergence guarantees under the boundedness of iterates and mild regularity assumptions (Kurdyka–Łojasiewicz property, subanalyticity) (Wang et al., 2024, Gillis et al., 31 Mar 2025, Awari et al., 19 Dec 2025).
- Momentum-accelerated algorithms can be globally convergent to critical points, and blockwise use of positive or negative momentum enhances stability and acceleration (Wang et al., 2024).
- For RBF decompositions, the non-convex loss landscape typically results in multiple critical points—many random restarts are often used in practice to find a near-optimal solution (Rebrova et al., 2021).
4. Initialization Strategies and Computational Considerations
Initialization is critical due to the nonconvexity of NMD.
- Nuclear-norm minimization under affine constraints is used for ReLU-based problems: solve $\min_{\Theta} \|\Theta\|_*$ subject to $\Theta_{ij} = X_{ij}$ for $X_{ij} > 0$ and $\Theta_{ij} \le 0$ for $X_{ij} = 0$, followed by rank-$r$ SVD truncation (Seraghiti et al., 2023).
- For RBF-NMD, small random initializations of parameter vectors are employed, with parallel random restarts (Rebrova et al., 2021).
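The rank-$r$ truncation common to these initializations (the final step after the nuclear-norm solve, and a reasonable generic warm start on its own) is a one-liner with numpy's SVD. The symmetric split of singular values between the factors below is one conventional choice, not prescribed by the cited works:

```python
import numpy as np

def tsvd_init(A, r):
    """Best rank-r Frobenius approximation of A via truncated SVD,
    returned as factors W0 (m x r) and H0 (r x n) with the singular
    values split symmetrically between the two factors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    W0 = U[:, :r] * np.sqrt(s[:r])
    H0 = np.sqrt(s[:r])[:, None] * Vt[:r]
    return W0, H0

A = np.arange(12.0).reshape(3, 4)   # a rank-2 matrix
W0, H0 = tsvd_init(A, r=2)
print(np.linalg.norm(A - W0 @ H0))  # ~0: A is exactly rank 2
```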
Per-iteration complexities:
- For three-block NMD (with the factorized model), each iteration scales as $O(mnr)$, dominated by matrix products with the thin factors (Seraghiti et al., 2023).
- For ADMM, the total per-iteration cost is likewise $O(mnr)$, dominated by matrix multiplications and small $r \times r$ matrix inversions (Awari et al., 19 Dec 2025).
- For RBF decompositions, parameter storage is $O((m+n)r)$, with a factor of $2$ or more memory reduction versus linear SVD at matched error (Rebrova et al., 2021).
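The storage accounting behind these memory claims is simple arithmetic; the helper below (illustrative only) compares dense storage $mn$ against roughly $(m+n)r$ factored parameters:

```python
def compression_ratio(m, n, r, params_per_component=1):
    """Dense storage m*n versus factored storage ~ (m + n) * r entries;
    params_per_component absorbs any extra per-component RBF parameters."""
    return (m * n) / ((m + n) * r * params_per_component)

# A 1000 x 1000 matrix at rank 50 stores 10x fewer numbers when factored.
print(compression_ratio(1000, 1000, 50))  # 10.0
```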
5. Empirical Performance and Comparative Studies
Extensive experiments benchmark NMD schemes across synthetic, image, text, and kernel data:
- ReLU-NMD (including accelerated, three-block, and momentum schemes) consistently outperforms linear SVD and TSVD in reconstructing sparse nonnegative data, with reductions in error and computational time (Seraghiti et al., 2023, Gillis et al., 31 Mar 2025, Wang et al., 2024).
- In image data (e.g., MNIST, CBCL faces), three-block NMD and momentum-accelerated NMD provide the best memory–accuracy trade-offs; NMF basis compression via NMD achieves lower error than TSVD (Seraghiti et al., 2023, Wang et al., 2024).
- RBF-based NMD yields a factor of $2$ or more reduction in memory relative to SVD for a fixed error across Gaussian, graph, and kernel matrices, and outperforms SVD visually and quantitatively for edge preservation in images (Rebrova et al., 2021).
- ADMM-based NMD unifies diverse loss/nonlinearity combinations and achieves lower error and greater robustness to outliers and Poisson noise compared to classical weighted low-rank approximation, and it is up to eight times faster than existing coordinate descent alternatives on benchmark datasets (Awari et al., 19 Dec 2025).
Sample Table: Comparative Performance of ReLU-NMD Algorithms on MNIST (Wang et al., 2024):
| Method | Relative Error (Tol) | Time (s) |
|---|---|---|
| EM-NMD | 0.120 | 20.0 |
| A-EM | 0.105 | 20.0 |
| 3B-NMD | 0.085 | 15.2 |
| A-NMD | 0.078 | 18.1 |
| NMD-TM (momentum) | 0.080 | 12.5 |
6. Model Variants and Extensions
NMD models are not restricted to the Frobenius objective or standard nonlinearities:
- Loss variants: $\ell_1$ and KL divergence adapt NMD to robustness and probabilistic settings (Awari et al., 19 Dec 2025).
- Nonlinearities: the ADMM-based approach can accommodate arbitrary entrywise functions (e.g., softplus, modulus, Huber, logistic sigmoid) as long as scalar proximal updates are available (Awari et al., 19 Dec 2025).
- Tensor decompositions: CP and Tucker analogues under ReLU and other nonlinearities are suggested as future work (Wang et al., 2024).
- RBF-NMD offers a proximity-based intrinsic geometry, beneficial for manifold learning and unsupervised structure discovery (Rebrova et al., 2021).
7. Related Applications and Broader Context
NMD methods are deployed across a range of scientific and engineering domains:
- Data compression: compact encoding of sparse or nonnegative matrices, often used for large-scale vision or text datasets (Seraghiti et al., 2023, Rebrova et al., 2021).
- Matrix completion: robust handling of missing-not-at-random (MNAR) entries, with ReLU sampling offering superior recovery in challenging regimes (Gillis et al., 31 Mar 2025).
- Recommender systems: bounded nonlinearities (MinMax) provide a natural fit for preference matrices with explicit upper/lower limits (Awari et al., 19 Dec 2025).
- Robust PCA and circuit representations: square nonlinearity supports modeling of physical systems with quadratic activations (Awari et al., 19 Dec 2025).
- Power systems: the acronym NMD is also used for Nonlinear Modal Decoupling, a technique for dynamical stability analysis via coordinate transformation and Lyapunov-theoretic analysis; it is unrelated to matrix factorization and instead concerns modal transformation of ODE systems (Wang et al., 2018).
In summary, Nonlinear Matrix Decomposition encompasses a family of models and algorithms with demonstrated superiority over linear counterparts for structured data modeling, robust compression, and recovery in the presence of nonlinearity or sparsity. The field remains active, especially in adapting flexible optimization schemes (e.g., ADMM) for broader classes of nonlinearities and losses, integrating acceleration strategies, and extending theory on global and local optimality.