Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime

Published 28 May 2026 in cs.LG, cond-mat.dis-nn, and stat.ML | (2605.29684v1)

Abstract: The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and $P\sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper presents a finite-width Bayesian DNN analysis using the Equivalent Wishart Ansatz to capture kernel fluctuations via scalar order parameters.
The methodology leverages large deviation principles and Wishart random matrices to accurately predict training loss and generalization errors across various architectures.
Results reveal smooth error accumulation with deeper networks and emergent metastability, highlighting the need for further theoretical refinement.

Kernel Renormalization in Bayesian Deep Neural Networks: The Equivalent Wishart Ansatz in the Proportional Regime

Introduction and Theoretical Motivation

This paper introduces a unifying, statistically grounded low-dimensional framework for analyzing finite-width Bayesian deep neural networks (DNNs) in the proportional regime, where both data sample size $P$ and network width $N$ scale proportionally. The focus is on the simultaneous presence of three critical features: finite width, nonlinear activation functions, and nontrivial network depth, bridging significant gaps left by previous work restricted to "lazy" infinite-width limits or shallow architectures.

The main results are built upon the Equivalent Wishart Ansatz (EWA), which leverages large deviation principles and properties of Wishart random matrix ensembles to approximate the empirical kernel fluctuations layer-wise in nonlinear DNNs. This framework substantially reduces the complexity of the otherwise high-dimensional posterior integration by mapping the stochastic dynamics of empirical kernels into a finite set of scalar order parameters, essentially retaining only the dominant (large deviations) fluctuations relevant for learning in the proportional limit.

Equivalent Wishart Ansatz: From Hierarchical Random Matrices to Low-dimensional Effective Theory

The foundational insight is that, due to the Gaussianity of weight priors, the empirical kernel at each layer—composed as the normalized outer product of post-activations—exhibits fluctuations well-approximated by a Wishart distribution in the proportional regime. Iteratively, this results in the hierarchical propagation of kernel fluctuations, but the critical realization is that for most directions relevant to the Bayesian integral, each contraction with a test vector is approximately $\chi^2$ -distributed (generalizing to noncentrality for activations with nonzero mean, e.g., ReLU).

Consequently, instead of needing to track $L$ high-dimensional $P \times P$ kernel matrices, the main body of macroscopic statistics relevant for learning is encoded in $L$ scalar random variables. The resulting partition function and predictions for observables such as training loss and generalization error are thus reformulated in terms of a renormalized kernel. The new kernel is a data-dependent scaled version of the large-width Neural Network Gaussian Process (NNGP) kernel, where the scale is self-consistently set by these order parameters.

The strong agreement between theory and exact sampling can be seen in learning curves for both training and generalization errors, with the EWA matching empirical Bayesian results across multiple architectures, datasets, and activation functions over a broad range of depth and sample complexity.

Figure 1: The EWA (green) provides precise predictions for both training loss and generalization error as a function of depth for zero-mean activation functions (Erf) on MNIST, clearly surpassing infinite-width approximations (gray).

Numerical Validation and Large Deviation Principle

A main technical ingredient and point of empirical validation is the large deviation principle (LDP) for the intensive variables $q_\ell = Q_\ell / N_\ell$ associated with each layer's kernel contraction. Both empirical and theoretical rate functions for these variables are compared in a variety of settings (architectures, datasets, activation functions, scaling ratios), consistently supporting the validity of the EWA across depth and width and showing vanishing deviations as $N, P \to \infty$ at fixed $\alpha = P/N$ .

Figure 2: Empirical rate functions for the product random variable $\mathcal{Q}$ match the EWA prediction for both ReLU and Erf activations, supporting large deviation principle applicability.

Figure 3: The empirical rate function matches theoretical predictions for various $N$ 0 and datasets, confirming the EWA’s universality in the proportional regime.

Extension to Convolutional Architectures and Output Dimensions

The analysis is further extended to multiple output DNNs and deep convolutional networks (CNNs). The kernel renormalization mechanism adapts by introducing hierarchical and spatially local order parameters mapped to patchwise correlations in CNNs. The stacked EWA derivation allows the effective kernel for CNNs to capture more structured, data-dependent renormalizations involving multi-patch order parameters, as opposed to scalar rescaling in fully-connected MLPs.

Figure 4: EWA-based kernel renormalization accurately traces the finite-width generalization behavior of shallow CNNs, highlighting the validity of local patch-by-patch renormalization.

Strong Empirical Results, Metastability, and Systematic Deviations

The agreement between EWA and numerically exact sampling methods (MCMC/Langevin/Hamiltonian algorithms) is documented for both zero-mean (Erf) and nonzero-mean (ReLU) activations and across datasets (MNIST, CIFAR-10, random Gaussian data), for depth up to $N$ 1 and sample sizes $N$ 2. EWA accurately predicts both training and generalization error, with deviations from infinite-width (NNGP) theory increasing with layer depth and $N$ 3.

Figure 5: EWA learns curves closely match Bayesian sampling for zero-mean activations on CIFAR-10, even beyond large-width approximations.

Figure 6: Finite-width effects (here for nonzero-mean activation on Gaussian data) are captured nonperturbatively by the EWA; the NNGP limit fails for increasing depth.

An important observation is the identification of two kinds of systematic deviations appearing for sufficiently large depth ( $N$ 4) and large load ( $N$ 5):

Smooth errors accumulation: A progressive breakdown of EWA accuracy as $N$ 6 grows, attributed to cumulative layer-wise approximation errors.
Metastability and transitions: Abrupt performance improvements linked to an emergent metastable regime, not explained by either large-width NNGP theory or EWA, and robust to choice of sampling algorithm.
Figure 7: Emergent metastability at large depth and load ( $N$ 7) is observed in numerical sampling, not accounted for by kernel renormalization theory.

Standard vs. Mean Field Parametrization and Adaptive Kernel Comparisons

By analyzing both standard (SP) and mean-field ( $N$ 8P) parametrizations, the work demonstrates that, in the proportional regime, improvements in generalization seen in rich learning scenarios are primarily attributed to variance suppression of the predictor, with little change in bias. Crucially, the EWA captures these macroscopic effects without needing high-dimensional vector-valued order parameters or adaptive kernel self-consistency equations, providing a tractable baseline for further adaptive kernel theories.

Figure 8: SP and $N$ 9P (mean-field) parametrizations display distinct generalization characteristics; EWA captures the effect, showing variance as the main source of improvements in $\chi^2$ 0P.

Practical and Theoretical Implications

Practically, the EWA offers a computationally tractable, physically interpretable, and broadly predictive baseline theory for finite-width Bayesian DNN performance in the proportional regime. Theoretically, it represents a significant dimensional reduction from adaptive kernel frameworks constrained by $\chi^2$ 1 matrix equations, to $\chi^2$ 2 scalar order parameters, while still accounting nonperturbatively for finite- $\chi^2$ 3 corrections.

This result suggests that, for many real-world scenarios (large but not infinite $\chi^2$ 4, $\chi^2$ 5), generalization can be understood largely through this low-dimensional random matrix structure, with more intricate feature adaptation or explicit high-dimensional kernel learning manifesting only outside the proportional regime (e.g., $\chi^2$ 6).

Future Directions

Any observed discrepancies in regimes of high depth and load—most notably the abrupt metastable transitions—indicate that further theory refinements and the development of new combinatorial or dynamical phenomena are needed for a complete description. Extensions to other architectures (e.g., max-pooling, transformers), larger datasets, and more intricate forms of feature adaptation remain as rich avenues for exploration, with the EWA as a baseline reference.

Conclusion

The Equivalent Wishart Ansatz forms a rigorously motivated, empirically validated, and computationally efficient core for the analysis of generalization in Bayesian deep (and convolutional) neural networks in the proportional regime. By reducing kernel adaptation to a handful of self-consistent scalar order parameters, it organizes much of finite-width DNN learning into a tractable and predictive statistical framework, and is vital for distinguishing genuinely high-dimensional adaptive feature learning from macroscopic finite-size effects. The theory offers immediate implications for interpreting overparameterized scaling laws in practical AI, provides baseline results against which more complex adaptive kernel theories can be compared, and clarifies the role of scaling, depth, width, and data complexity in modern deep learning.

Markdown Report Issue