Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models

Published 16 Jun 2025 in stat.ML and cs.LG | (2506.13139v1)

Abstract: Modern Machine Learning (ML) and Deep Neural Networks (DNNs) often operate on high-dimensional data and rely on overparameterized models, where classical low-dimensional intuitions break down. In particular, the proportional regime where the data dimension, sample size, and number of model parameters are all large and comparable, gives rise to novel and sometimes counterintuitive behaviors. This paper extends traditional Random Matrix Theory (RMT) beyond eigenvalue-based analysis of linear models to address the challenges posed by nonlinear ML models such as DNNs in this regime. We introduce the concept of High-dimensional Equivalent, which unifies and generalizes both Deterministic Equivalent and Linear Equivalent, to systematically address three technical challenges: high dimensionality, nonlinearity, and the need to analyze generic eigenspectral functionals. Leveraging this framework, we provide precise characterizations of the training and generalization performance of linear models, nonlinear shallow networks, and deep networks. Our results capture rich phenomena, including scaling laws, double descent, and nonlinear learning dynamics, offering a unified perspective on the theoretical understanding of deep learning in high dimensions.

Summary

  • The paper introduces High-dimensional Equivalents that extend traditional RMT to analyze nonlinear deep learning models.
  • The methodology employs deterministic equivalents and linearization techniques to capture training errors and double descent phenomena.
  • The framework uses resolvent and kernel analyses to derive performance metrics for both linear and deep neural network models.

This paper, "Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models" (2506.13139), provides an overview of recent advances in Random Matrix Theory (RMT) for analyzing modern Deep Learning (DL) models, particularly in the high-dimensional proportional regime where sample size nn, input dimension pp, and model parameters dd are all large and comparable. The core argument is that traditional RMT, focused on eigenvalue distributions of linear models, is insufficient for understanding complex Deep Neural Networks (DNNs). The paper introduces a framework to extend RMT to analyze performance metrics (beyond eigenvalues) of nonlinear and structured models like DNNs.

A central concept introduced is the High-dimensional Equivalent (HiE). Given a random matrix model $\mathcal{M}_{\phi}(X)$ (possibly nonlinear, with $\phi$ an entrywise function) and a scalar performance metric $f(\cdot)$, a model $\tilde{\mathcal{M}}_{\phi}(X)$ (random or deterministic) is a High-dimensional Equivalent of $\mathcal{M}_{\phi}(X)$ with respect to $f(\cdot)$ if their performance metrics converge: $f(\mathcal{M}_{\phi}(X)) - f(\tilde{\mathcal{M}}_{\phi}(X)) \to 0$ as $n, p \to \infty$ with $p/n \to c \in (0, \infty)$. This is denoted $\mathcal{M}_{\phi}(X) \overset{f}{\leftrightarrow} \tilde{\mathcal{M}}_{\phi}(X)$.

The paper discusses two main special cases of HiE:

  1. Deterministic Equivalent (DE): For a linear model ($\mathcal{M}_{\phi}(X) = X$), if $\tilde{X}$ is deterministic and $f(X) - f(\tilde{X}) \to 0$, then $\tilde{X}$ is a DE of $X$.
  2. Linear Equivalent (LE): For a nonlinear model $\mathcal{M}_{\phi}(X)$, if $\tilde{\mathcal{M}}_{\phi}(X)$ is a linear model and $f(\mathcal{M}_{\phi}(X)) - f(\tilde{\mathcal{M}}_{\phi}(X)) \to 0$, then $\tilde{\mathcal{M}}_{\phi}(X)$ is an LE of $\mathcal{M}_{\phi}(X)$.

Deterministic Equivalent for Resolvent

The paper emphasizes that while high-dimensional random vectors and matrices do not concentrate around their means in norm, their scalar functionals often do. This justifies the use of DEs. For analyzing scalar eigenspectral functionals, which depend on both eigenvalues and eigenvectors (e.g., $f(X) = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} f(\lambda_i(X))\, \mathbf{a}^\top u_i u_i^\top \mathbf{b}$), the paper leverages the resolvent matrix $Q_X(z) = (X - z I_n)^{-1}$. Theorem 1 (Scalar eigenspectral functional via contour integration) states that such functionals can be expressed as a contour integral of the resolvent: $f(X) = -\frac{1}{2\pi\jmath\, |\mathcal{I}|} \oint_{\Gamma_{\mathcal{I}}} f(z)\, \mathbf{a}^\top Q_X(z)\, \mathbf{b}\, dz$. The analysis of $f(X)$ thus reduces to finding a DE for the bilinear form $\mathbf{a}^\top Q_X(z)\, \mathbf{b}$.
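As a quick numerical sanity check of this identity (not from the paper; the matrix size, the test vectors, and the choice $f(z) = z^2$ are arbitrary), one can compare the eigen-decomposition form of a scalar eigenspectral functional with its contour-integral form, discretizing a circle $\Gamma$ that encloses all eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
Z = rng.standard_normal((n, n))
X = (Z + Z.T) / np.sqrt(2 * n)            # symmetric random matrix
a, b = rng.standard_normal(n), rng.standard_normal(n)
f = lambda z: z ** 2                      # analytic test function (arbitrary choice)

# Direct form: (1/|I|) sum_i f(lambda_i) a^T u_i u_i^T b, with I = all eigenvalues
lam, U = np.linalg.eigh(X)
direct = np.mean(f(lam) * (a @ U) * (U.T @ b))

# Contour form: -(1 / (2 pi j |I|)) \oint_Gamma f(z) a^T Q_X(z) b dz,
# with Gamma a circle of radius R enclosing all eigenvalues
R = 1.1 * np.max(np.abs(lam))
N = 2000
integral = 0.0 + 0.0j
for theta in np.linspace(0.0, 2 * np.pi, N, endpoint=False):
    z = R * np.exp(1j * theta)
    Qb = np.linalg.solve(X - z * np.eye(n), b)      # Q_X(z) b
    integral += f(z) * (a @ Qb) * (1j * z) * (2 * np.pi / N)
contour = (-integral / (2j * np.pi * n)).real

print(direct, contour)    # the two values should agree to high accuracy
```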

High-dimensional Linearization

To handle nonlinearity in ML models, the paper introduces linearization techniques based on two scaling regimes for a scalar functional $f(x)$ of a high-dimensional random vector $x \in \mathbb{R}^n$:

  1. LLN regime: $f(x) - \mathbb{E}[f(x)] \to 0$. Here, Taylor's theorem can be used to linearize $\phi(f(x))$ around $\mathbb{E}[f(x)]$.
  2. CLT regime: $\sqrt{n}\,(f(x) - \mathbb{E}[f(x)])$ converges to a non-degenerate distribution (e.g., Gaussian). Here, $\mathbb{E}[\phi(f(x))]$ can be analyzed using orthogonal polynomial expansions (e.g., Hermite polynomials when $f(x)$ is Gaussian).

These linearization techniques help find a Linear Equivalent for nonlinear models.
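The Hermite coefficients used in the CLT-regime expansion (and reused in Theorems 4 and 5 below) are easy to compute numerically. The sketch below is a minimal illustration, not code from the paper: it assumes the coefficients are defined with respect to probabilists' Hermite polynomials normalized by $\sqrt{k!}$, takes $\nu_\phi = \mathbb{E}[\phi(\xi)^2]$ for $\xi \sim \mathcal{N}(0,1)$, and uses ReLU purely as an example activation.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite (probabilists') quadrature: integral g(x) exp(-x^2/2) dx ~ sum_i w_i g(x_i)
x, w = hermegauss(80)
w = w / np.sqrt(2 * np.pi)                  # turn the rule into an expectation under N(0, 1)

phi = lambda t: np.maximum(t, 0.0)          # ReLU, used here only as an example

# First three probabilists' Hermite polynomials He_0, He_1, He_2 (orthogonal under N(0,1))
He = [np.ones_like(x), x, x ** 2 - 1.0]

# Assumed convention: a_{phi;k} = E[phi(xi) He_k(xi)] / sqrt(k!), nu_phi = E[phi(xi)^2]
a = [np.sum(w * phi(x) * He[k]) / math.sqrt(math.factorial(k)) for k in range(3)]
nu = np.sum(w * phi(x) ** 2)

print("a0, a1, a2 =", np.round(a, 4))       # for ReLU: about 0.3989, 0.5, 0.2821
print("nu_phi     =", round(float(nu), 4))  # for ReLU: 0.5
```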

Applications to ML Models

1. Linear Random Matrix Model:

The paper analyzes the Sample Covariance Matrix (SCM) $\hat{C} = \frac{1}{n} X X^\top$ and the Gram matrix $G = \frac{1}{n} X^\top X$. Theorem 2 (Deterministic Equivalents for SCM and Gram resolvents) provides DEs for their resolvents. For $X = C^{1/2} Z$ (where $Z$ has i.i.d. sub-Gaussian entries and $C$ is deterministic), $Q_{\hat{C}}(z) \leftrightarrow \tilde{Q}_{\hat{C}}(z) = \left( \frac{C}{1+\delta(z)} - z I_p \right)^{-1}$, where $\delta(z) = \frac{1}{n} \operatorname{tr}\big(\tilde{Q}_{\hat{C}}(z)\, C\big)$. The special case $C = I_p$ recovers the Marchenko–Pastur law.
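To see Theorem 2 in action, the following sketch (a minimal illustration with arbitrarily chosen dimensions and covariance, not code from the paper) solves the fixed-point equation for $\delta(z)$ at a point $z < 0$ and compares the normalized resolvent traces $\frac{1}{p}\operatorname{tr} Q_{\hat{C}}(z)$ and $\frac{1}{p}\operatorname{tr} \tilde{Q}_{\hat{C}}(z)$; with $C = I_p$ this quantity is the Stieltjes transform of the Marchenko–Pastur law.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 1000
C = np.diag(np.linspace(1.0, 3.0, p))            # an arbitrary deterministic covariance
X = np.linalg.cholesky(C) @ rng.standard_normal((p, n))
C_hat = X @ X.T / n                              # sample covariance matrix
z = -0.5                                         # evaluation point on the negative real axis

# Empirical side: (1/p) tr Q_{C_hat}(z)
emp = np.trace(np.linalg.inv(C_hat - z * np.eye(p))) / p

# Deterministic equivalent: iterate delta = (1/n) tr(Q_tilde(z) C),
# with Q_tilde(z) = (C / (1 + delta) - z I_p)^{-1}
delta = 0.0
for _ in range(200):
    Q_tilde = np.linalg.inv(C / (1.0 + delta) - z * np.eye(p))
    delta = np.trace(Q_tilde @ C) / n
det_eq = np.trace(Q_tilde) / p

print(emp, det_eq)    # should be close for large p, n; with C = I_p this recovers
                      # the Stieltjes transform of the Marchenko-Pastur law
```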

This framework is applied to linear least squares regression. Proposition 1 (Risk of linear ridge regression) characterizes in-sample and out-of-sample risks. In the proportional regime, RMT accurately predicts these risks, revealing phenomena like:

  • Scaling law of the in-sample risk: for ridgeless regression with $n > p$, $R_{\text{in}} \approx \sigma^2 \frac{p}{n}$.
  • Double descent of the out-of-sample risk: the risk can increase with $n$ when $n \approx p$, peaking at $n = p$, unlike the monotonic decrease of the classical $n \gg p$ regime (see the simulation sketch below).
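A minimal simulation of the second phenomenon is sketched below (not from the paper; the dimension, noise level, and use of the pseudo-inverse for the minimum-norm ridgeless solution are arbitrary choices): the out-of-sample risk is estimated as the sample size $n$ sweeps through the interpolation threshold $n = p$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma = 100, 0.5
beta = rng.standard_normal(p) / np.sqrt(p)        # ground-truth coefficients

def out_of_sample_risk(n, n_test=2000, n_rep=20):
    risks = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y          # ridgeless (minimum-norm) least squares
        X_test = rng.standard_normal((n_test, p))
        y_test = X_test @ beta + sigma * rng.standard_normal(n_test)
        risks.append(np.mean((X_test @ beta_hat - y_test) ** 2))
    return np.mean(risks)

for n in (25, 50, 75, 90, 100, 110, 150, 300, 1000):
    print(f"n = {n:4d}   out-of-sample risk ~ {out_of_sample_risk(n):.3f}")
# The estimated risk typically spikes near the interpolation threshold n = p = 100
# and decreases again for larger n: the double descent profile described above.
```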

2. Single-hidden-layer NN Model:

The model output is $\hat{y}(x) = \alpha^\top \phi(W x)$. The analysis focuses on the nonlinear Gram matrix $\frac{1}{n} \Phi^\top \Phi$, where $\Phi = \phi(W X)$ with $W \in \mathbb{R}^{d \times p}$. Theorem 3 (Deterministic Equivalent for nonlinear resolvent) extends the DE concept to the resolvent $Q(z) = \left( \frac{1}{n} \Phi^\top \Phi - z I_n \right)^{-1}$, showing that $Q(z) \leftrightarrow \tilde{Q}(z) = \left( \frac{d}{n} \frac{K}{1+\delta(z)} - z I_n \right)^{-1}$, where $K = \mathbb{E}_w[\phi(X^\top w)\, \phi(w^\top X)]$ is the neural network kernel and $\delta(z) = \frac{1}{n} \operatorname{tr}\big(K \tilde{Q}(z)\big)$. Proposition 2 (Asymptotic training and test MSEs) uses this DE to derive expressions for the MSEs, which can also exhibit double descent. The paper also discusses the scaling law of the training MSE, which depends on the eigenspectrum of $K$ and on the activation function.
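The sketch below gives a rough numerical check of Theorem 3 (not code from the paper): the kernel $K$ is estimated by Monte Carlo over many independent weight vectors, the fixed point for $\delta(z)$ is iterated, and the normalized traces of $Q(z)$ and $\tilde{Q}(z)$ are compared; the activation, dimensions, and data scaling are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, d = 60, 300, 900
X = rng.standard_normal((p, n)) / np.sqrt(p)      # data matrix (arbitrary scaling)
phi = np.tanh                                     # activation (arbitrary choice)
z = -0.1                                          # evaluation point, z < 0

# Empirical resolvent of the nonlinear Gram matrix (1/n) Phi^T Phi with Phi = phi(W X)
W = rng.standard_normal((d, p))
Phi = phi(W @ X)
emp = np.trace(np.linalg.inv(Phi.T @ Phi / n - z * np.eye(n))) / n

# Monte Carlo estimate of the kernel K = E_w[phi(X^T w) phi(w^T X)], w ~ N(0, I_p)
m = 20000
Pm = phi(rng.standard_normal((m, p)) @ X)
K = Pm.T @ Pm / m

# Fixed point of Theorem 3: delta = (1/n) tr(K Q_tilde), Q_tilde = (d/n * K/(1+delta) - z I_n)^{-1}
delta = 0.0
for _ in range(200):
    Q_tilde = np.linalg.inv((d / n) * K / (1.0 + delta) - z * np.eye(n))
    delta = np.trace(K @ Q_tilde) / n

print(emp, np.trace(Q_tilde) / n)    # should be close when p, n, d are large and comparable
```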

High-dimensional linearization of the kernel matrix $K$:

Theorem 4 shows that, for random spherical inputs and an activation with Hermite coefficients $a_{\phi;0}, a_{\phi;1}, a_{\phi;2}$ and $\nu_{\phi}$: $K \leftrightarrow \tilde{K}_\phi = a_{\phi;0}^2\, \mathbf{1}\mathbf{1}^\top + a_{\phi;1}^2\, X^\top X + a_{\phi;2}^2 \cdot \frac{1}{p} \mathbf{1}\mathbf{1}^\top + (\nu_\phi - a_{\phi;0}^2 - a_{\phi;1}^2)\, I_n$. This implies that, for unstructured data, the eigenvalue distribution of $K$ (up to scaling and shifting) is often similar to that of $X^\top X$ (Marchenko–Pastur-like), largely independent of $\phi$.
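As a rough illustration of Theorem 4 (again not code from the paper; it assumes inputs normalized to the unit sphere, standard Gaussian weights, and the Hermite conventions of the earlier quadrature sketch), one can compare a scalar eigenspectral functional and the top eigenvalue of a Monte Carlo estimate of $K$ against those of the linear equivalent $\tilde{K}_\phi$:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

rng = np.random.default_rng(4)
p, n = 100, 200
X = rng.standard_normal((p, n))
X /= np.linalg.norm(X, axis=0)                    # columns placed on the unit sphere
phi = lambda t: np.maximum(t, 0.0)                # ReLU, arbitrary choice

# Monte Carlo estimate of K = E_w[phi(X^T w) phi(w^T X)], w ~ N(0, I_p), in chunks
K, m, chunks = np.zeros((n, n)), 200_000, 10
for _ in range(chunks):
    P = phi(rng.standard_normal((m // chunks, p)) @ X)
    K += P.T @ P / m

# Hermite coefficients and nu_phi = E[phi^2], as in the earlier quadrature sketch
xq, wq = hermegauss(80)
wq = wq / np.sqrt(2 * np.pi)
He = [np.ones_like(xq), xq, xq ** 2 - 1.0]
a = [np.sum(wq * phi(xq) * He[k]) / math.sqrt(math.factorial(k)) for k in range(3)]
nu = np.sum(wq * phi(xq) ** 2)

# Linear equivalent K_tilde of Theorem 4
J = np.ones((n, n))
K_tilde = (a[0] ** 2) * J + (a[1] ** 2) * X.T @ X + (a[2] ** 2 / p) * J \
          + (nu - a[0] ** 2 - a[1] ** 2) * np.eye(n)

# Compare a scalar eigenspectral functional (normalized resolvent trace at z = -0.5)
# and the largest eigenvalue of the two matrices; they should be roughly close
z = -0.5
for M in (K, K_tilde):
    print(np.trace(np.linalg.inv(M - z * np.eye(n))) / n, np.linalg.eigvalsh(M)[-1])
```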

The learning dynamics (gradient flow) of the second-layer weights $\alpha(t)$ can also be analyzed using the resolvent of $\frac{1}{n} \Phi^\top \Phi$.

3. Beyond Single-hidden-layer NN Models (DNNs):

For an $L$-layer DNN, the paper analyzes the Conjugate Kernel (CK) matrix at layer $\ell$, $K_\ell = \mathbb{E}[\Phi_\ell^\top \Phi_\ell]$. Theorem 5 (High-dimensional linearization of CK matrices for DNN) shows that, under certain conditions on the activations ($a_{\phi_\ell;0} = 0$, $\nu_{\phi_\ell} = 1$) and for random spherical inputs: $K_\ell \leftrightarrow \tilde{K}_{\phi,\ell} = \alpha_{\ell,1}^2\, X^\top X + \alpha_{\ell,2}^2 \cdot \frac{1}{p} \mathbf{1}\mathbf{1}^\top + (1 - \alpha_{\ell,1}^2)\, I_n$. The coefficients $\alpha_{\ell,1}, \alpha_{\ell,2}$ evolve recursively:

$\alpha_{\ell,1} = a_{\phi_\ell;1} \cdot \alpha_{\ell-1,1}, \qquad \alpha_{\ell,2} = \sqrt{a_{\phi_\ell;1}^2 \cdot \alpha_{\ell-1,2}^2 + a_{\phi_\ell;2}^2 \cdot \alpha_{\ell-1,1}^4}.$

This can lead to a "curse of depth": if $a_{\phi_\ell;1} < 1$, then $K_L \to I_n$, so the output of a deep random network with unstructured inputs becomes effectively data-independent.
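The recursion is easy to iterate directly. The sketch below is a minimal illustration (not from the paper): it uses a centered, variance-normalized ReLU so that $a_{\phi_\ell;0} = 0$ and $\nu_{\phi_\ell} = 1$, and assumes the input layer can be initialized with $(\alpha_{0,1}, \alpha_{0,2}) = (1, 0)$, consistent with the single-layer formula above. Because $a_{\phi_\ell;1} < 1$ here, the data-dependent coefficient $\alpha_{\ell,1}$ decays geometrically with depth, illustrating the curse of depth.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Activation: centered, variance-normalized ReLU, so that a_{phi;0} = 0 and nu_phi = 1
xq, wq = hermegauss(80)
wq = wq / np.sqrt(2 * np.pi)
relu = np.maximum(xq, 0.0)
mu = np.sum(wq * relu)
phi_x = (relu - mu) / np.sqrt(np.sum(wq * (relu - mu) ** 2))

# Hermite coefficients a_{phi;1} and a_{phi;2} of the normalized activation
a1 = np.sum(wq * phi_x * xq)
a2 = np.sum(wq * phi_x * (xq ** 2 - 1.0)) / math.sqrt(2)
print(f"a1 = {a1:.3f}, a2 = {a2:.3f}")            # roughly 0.857 and 0.483

# Recursion of Theorem 5 with identical activations in every layer,
# starting from (alpha_{0,1}, alpha_{0,2}) = (1, 0) for the input layer (an assumption)
al1, al2 = 1.0, 0.0
for layer in range(1, 11):
    al1, al2 = a1 * al1, np.sqrt(a1 ** 2 * al2 ** 2 + a2 ** 2 * al1 ** 4)
    print(f"layer {layer:2d}:  alpha_1 = {al1:.4f}   alpha_2 = {al2:.4f}")
# With a1 < 1 the coefficient alpha_1 of the data term X^T X decays geometrically,
# so K_tilde_{phi, ell} drifts toward I_n: the "curse of depth" described above.
```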

The paper also discusses the learning dynamics of DNNs via the Neural Tangent Kernel (NTK). In the ultra-wide limit, $K_{\text{NTK}}(t)$ is nearly constant during training and remains close to $K_{\text{NTK}}(t=0)$. The NTK is related to the CKs recursively: $K_{\text{NTK},\ell} = K_\ell + K_{\text{NTK},\ell-1} \circ K_\ell'$. Linearizing the NTK within the RMT framework is proposed as a direction for future work.

Conclusion

The paper successfully demonstrates that RMT can be extended beyond its traditional confines to provide powerful analytical tools for understanding complex, nonlinear DL models in high-dimensional settings. The introduced concepts of High-dimensional Equivalent, Deterministic Equivalent for Resolvent, and High-dimensional Linearization allow for precise characterizations of training error, generalization performance (including double descent), and learning dynamics. The authors suggest that integrating these methods with non-asymptotic RMT will be crucial for future progress.
