
Integral Infinite Width Neural Representation

Updated 26 December 2025
  • Integral infinite width neural representation is a framework that expresses neural networks as integrals over parameter space, capturing their limit as the number of units grows indefinitely.
  • The approach rigorously connects finite sum approximations to continuous integral formulations with explicit error bounds under uniform continuity assumptions.
  • It unifies perspectives from random feature models, kernel methods, Gaussian processes, and operator learning, providing a solid theoretical foundation for modern deep learning.

An integral infinite width neural representation characterizes the limit of neural network architectures as the number of units per layer (and, in some generalizations, the number of layers) tends to infinity. In this regime, neural networks can be described as nonlinear integral operators over parameter space, unifying perspectives from random feature models, kernel machines, Gaussian processes, neural tangent kernels, neural fields, and continuous-depth models. This representation is foundational in modern theoretical analyses of deep learning and gives rise to nontrivial connections with approximation theory, infinite-dimensional statistics, and efficient kernel-based computation.

1. Mathematical Formulation of Integral Infinite Width Representations

Consider a single-hidden-layer neural network with input $x \in \mathbb{R}^p$ and output

$$f_N(x) = \sum_{i=1}^N \alpha_i\, \sigma(w_i \cdot x + b_i),$$

where $\sigma$ is the activation and $(\alpha_i, w_i, b_i)$ are the output weights, input weights, and biases. In the infinite-width limit ($N \to \infty$), under suitable regularity and scaling assumptions, this sum converges to the integral operator

$$f(x) = \int_{\Omega} \alpha(\theta)\, \sigma(w(\theta) \cdot x + b(\theta))\, d\mu(\theta),$$

with $\Omega$ a compact parameter domain (e.g., $\Omega = [0,1]$), $\mu$ a finite Borel (often Lebesgue) measure, and $(w(\theta), b(\theta))$, $\alpha(\theta)$ continuous functions of the parameter $\theta \in \Omega$ (Prieur et al., 19 Dec 2025). For standard random initialization ($\theta_i$ i.i.d. from a probability measure $\mu_0$), this is the mean-field limit in the law-of-large-numbers sense (Bahri et al., 2023, Hajjar et al., 2021).
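As a minimal numerical illustration of the finite-width form above (the function names and parameter values are ours, not from the cited papers):

```python
import numpy as np

def finite_width_net(x, alpha, W, b, sigma=np.tanh):
    """Single-hidden-layer network f_N(x) = sum_i alpha_i * sigma(w_i . x + b_i).

    x:     input vector, shape (p,)
    alpha: output weights, shape (N,)
    W:     input weights, shape (N, p)
    b:     biases, shape (N,)
    """
    return float(alpha @ sigma(W @ x + b))

# Example: N = 3 hidden units in p = 2 dimensions, random parameters.
rng = np.random.default_rng(0)
N, p = 3, 2
alpha = rng.normal(size=N)
W = rng.normal(size=(N, p))
b = rng.normal(size=N)
x = np.array([1.0, -0.5])
y = finite_width_net(x, alpha, W, b)
```

In the integral representation, the triple $(\alpha_i, w_i, b_i)$ becomes a curve $\theta \mapsto (\alpha(\theta), w(\theta), b(\theta))$ over $\Omega$, and the sum becomes the integral against $\mu$.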

For deep networks with $L$ hidden layers, the representation becomes a nested or recursive integral:

$$f(x) = \int_{\Theta_{L+1}} a\, \phi\left(w^{L+1} \cdot U^{L}(x) + b\right) d\mu_{L+1}(\theta^{L+1}),$$

where each $U^\ell(x)$ is a Gaussian random field generated as a function of the previous layer's parameters and fields, with integration over the joint parameter spaces of all layers (Bahri et al., 2023, Prieur et al., 19 Dec 2025). Equivalently, this can be expressed as a high-dimensional compositional integral over all network parameters.

2. Emergence from Finite-Width Networks and Error Bounds

The passage from a finite network to the integral form relies on identifying the finite sum as a Riemann approximation to the integral. Given a partition $\{\theta_i\}_{i=1}^N \subset \Omega$ with mesh $\Delta\theta \sim N^{-1}$, and piecewise-constant $\alpha_i \approx \alpha(\theta_i)\, \Delta\theta$, $(w_i, b_i) \approx (w(\theta_i), b(\theta_i))$,

$$f_N(x) \longrightarrow \int_{\Omega} \alpha(\theta)\, \sigma(w(\theta) \cdot x + b(\theta))\, d\mu(\theta)$$

as $N \to \infty$, provided the involved functions are uniformly continuous and of bounded variation (Prieur et al., 19 Dec 2025).

Explicit error bounds have been established: under uniform continuity and bounded-variation assumptions on $\alpha, w, b, \sigma$ over $\Omega$, and for inputs $x$ bounded in norm, the discretization error satisfies

$$\|f(x) - f_N(x)\| \le \frac{C}{N},$$

uniformly for $\|x\| \le r$, where $C$ depends on the modulus of continuity and the parameters (Prieur et al., 19 Dec 2025). The proof decomposes the discrepancy into pointwise approximation errors and standard Riemann-sum errors.
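The $O(1/N)$ behavior can be observed numerically. The sketch below picks smooth parameter curves on $\Omega = [0,1]$ (our choices, purely illustrative) and compares an $N$-point Riemann sum against a very fine quadrature standing in for the exact integral:

```python
import numpy as np

# Illustrative parameter functions on Omega = [0, 1] (not from the cited papers).
alpha = lambda t: np.sin(2 * np.pi * t)
w     = lambda t: np.cos(2 * np.pi * t)
b     = lambda t: t
sigma = np.tanh

def f_N(x, N):
    """N-point midpoint Riemann sum approximating the integral representation."""
    theta = (np.arange(N) + 0.5) / N          # midpoints of a mesh of width 1/N
    return np.sum(alpha(theta) * sigma(w(theta) * x + b(theta))) / N

x = 0.7
f_ref = f_N(x, 1_000_000)                     # fine quadrature as a proxy for f(x)
errs = {N: abs(f_N(x, N) - f_ref) for N in (10, 100, 1000)}
```

The recorded errors shrink as $N$ grows, consistent with the $C/N$ bound (and faster here, since midpoint quadrature on smooth integrands converges at a higher rate).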

For single-layer random feature models, the law of large numbers ensures convergence of the empirical sum to the corresponding integral representation.

3. Connections to Kernels, Gaussian Processes, and Kernel Regimes

In the infinite width limit, neural networks at initialization are closely connected to reproducing kernel Hilbert spaces (RKHS), Gaussian processes, and classical kernels. The covariance kernel for the network output in the single-layer case,

$$k(x, x') = \int a^2\, \phi(w \cdot x + b)\, \phi(w \cdot x' + b)\, d\mu_0(\theta),$$

reduces to

$$k(x, x') = \mathbb{E}_{(w, b)}\left[ \phi(w \cdot x + b)\, \phi(w \cdot x' + b)\right]$$

for $a \sim N(0,1)$ (Bahri et al., 2023, Prieur et al., 19 Dec 2025, Arora et al., 2019).
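For ReLU activation with $w \sim N(0, I)$ and no bias (assumptions made for this sketch), the expectation above has the well-known arc-cosine closed form of Cho & Saul, which a direct Monte Carlo estimate of the integral representation reproduces:

```python
import numpy as np

def nngp_relu_closed_form(x, xp):
    """Arc-cosine kernel (order 1): E_w[relu(w.x) relu(w.x')] for w ~ N(0, I)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_t = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    t = np.arccos(cos_t)
    return nx * nxp * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

rng = np.random.default_rng(1)
x  = np.array([1.0, 0.5, -0.3])
xp = np.array([0.2, -1.0, 0.7])

# Monte Carlo estimate of the kernel's integral representation.
relu = lambda z: np.maximum(z, 0.0)
W = rng.normal(size=(500_000, 3))
k_mc = np.mean(relu(W @ x) * relu(W @ xp))
k_cf = nngp_relu_closed_form(x, xp)
```

With enough samples the Monte Carlo average matches the closed form to a few decimal places, illustrating the law-of-large-numbers route from finite random-feature sums to the integral kernel.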

For deeper networks, the equivalent recursion for the so-called NNGP kernel is

$$K^{\ell+1}(x,x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{(u,v)\sim N(0, \Sigma^\ell)}[\phi(u)\,\phi(v)],$$

with $\Sigma^\ell$ constructed from the covariance values of $K^\ell$ (Bahri et al., 2023, Arora et al., 2019).
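The recursion can be iterated in closed form for ReLU, since the bivariate Gaussian expectation is again the arc-cosine formula. A sketch (input-layer kernel $\sigma_w^2\, x \cdot x' / p + \sigma_b^2$ is a common convention, assumed here):

```python
import numpy as np

def relu_gauss_expect(kxx, kxy, kyy):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[kxx, kxy], [kxy, kyy]])."""
    c = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    t = np.arccos(c)
    return np.sqrt(kxx * kyy) * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

def nngp_kernel(x, xp, depth, sw2=2.0, sb2=0.0):
    """Iterate K^{l+1}(x,x') = sb2 + sw2 * E[phi(u) phi(v)] for ReLU phi."""
    p = len(x)
    kxx = sw2 * (x @ x) / p + sb2
    kxy = sw2 * (x @ xp) / p + sb2
    kyy = sw2 * (xp @ xp) / p + sb2
    for _ in range(depth):
        exx = relu_gauss_expect(kxx, kxx, kxx)   # equals kxx / 2 for ReLU
        eyy = relu_gauss_expect(kyy, kyy, kyy)
        exy = relu_gauss_expect(kxx, kxy, kyy)
        kxx, kxy, kyy = sb2 + sw2 * exx, sb2 + sw2 * exy, sb2 + sw2 * eyy
    return kxy
```

With $\sigma_w^2 = 2$ (He-style scaling) the diagonal $K^\ell(x,x)$ is preserved across layers, which is why this choice is standard for deep ReLU NNGP kernels.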

Under full (all-layer) training by gradient flow, the empirical “neural tangent kernel” (NTK)

$$\Theta(x, x') = \sum_\mu \frac{\partial f(x)}{\partial \theta_\mu}\, \frac{\partial f(x')}{\partial \theta_\mu} = \int \left\langle \nabla_\theta \phi(w \cdot x + b),\, \nabla_\theta \phi(w \cdot x' + b) \right\rangle d\mu_0(\theta)$$

remains fixed at its initialization value (Bahri et al., 2023, Arora et al., 2019). The learning dynamics of such infinite-width networks reduce to kernel regression with this kernel. This establishes a formal equivalence with kernel gradient flow and gives rise to non-asymptotic equivalence theorems for finite but wide networks (Arora et al., 2019).
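In this regime, the trained network's predictions reduce to kernel regression with a fixed kernel. The sketch below shows that computation with a generic PSD kernel; the RBF kernel here is a stand-in for illustration only, where the NTK $\Theta$ would be plugged in:

```python
import numpy as np

def kernel_regression(K_train, K_test, y_train, ridge=1e-8):
    """Converged gradient-flow prediction of an infinite-width net:
    f(x*) = Theta(x*, X) Theta(X, X)^{-1} y  (kernel regression)."""
    n = len(y_train)
    return K_test @ np.linalg.solve(K_train + ridge * np.eye(n), y_train)

def rbf(A, B, gamma=1.0):
    """Stand-in PSD kernel (RBF); replace with the NTK in practice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))          # training inputs
y = np.sin(X[:, 0])                   # training targets
Xs = rng.normal(size=(5, 2))          # test inputs
preds = kernel_regression(rbf(X, X), rbf(Xs, X), y)
```

The small ridge term is a numerical safeguard; the infinite-width equivalence theorems concern the ridgeless limit.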

Convolutional neural tangent kernels (CNTK) and their associated architectures also admit exact integral representations, with closed-form expressions for the corresponding kernels in certain cases (e.g., ReLU activation) (Arora et al., 2019).

4. Generalizations: Deep, Residual, and Continuous-Depth Architectures

The integral infinite width perspective extends directly to deep and residual networks, neural ODEs, and neural fields. For deep networks with a finite number of integral hidden layers, the nested-integral structure is preserved, and the continuum approximation arises as both width and (possibly) depth tend to infinity (Bahri et al., 2023, Prieur et al., 19 Dec 2025). Approximation errors can then be analyzed via discretization techniques from numerical analysis, relating depth/width growth to error rates.

The DiPaNet ("Distributed Parameter neural Network") framework generalizes these ideas, creating a unified representation encompassing finite, infinite-width, and continuous-depth (neural ODE) limits, as well as neural field models common in computational neuroscience. Different neural architectures—such as shallow Barron-space networks, continuous-width networks, and integro-differential neural fields—are linked via appropriate choices of parameter domain, measure, and kernel structure in the integral representation (Prieur et al., 19 Dec 2025).

Residual architectures and operator-learning neural field models can be subsumed by letting the integration domain correspond to spatial or manifold coordinates, and the kernel and activation functions be suitably chosen nonlinear or operator-valued maps.

5. Representation Theory and Connection to Function Classes

The infinite-width integral representation for networks characterizes precisely the class of functions that can be realized as superpositions of nonlinear ridge functions with weights prescribed by signed (possibly vector-valued) measures on parameter space. In the ReLU case, a shallow network representation

$$f(x) = \int_{\mathbb{R}^n \times \mathbb{R}} [w^T x + b]_+\, d\mu(w, b) + c_0$$

describes all representations over finite signed measures $\mu$ (McCarty, 2023). For ReLU activations, the induced function class is continuous and (countably) piecewise linear, with "creases" along the hyperplanes given by the support of $\mu$. The total variation $\|\mu\|$ serves as a complexity measure akin to the "Barron norm."

It has been proven that if a finitely piecewise linear function has a finite-cost integral representation (finite total variation measure), then it is exactly realizable by a finite-width network. Thus, genuine benefit from infinitely wide networks arises only for function classes (e.g., infinite ridge combinations, analytic kernels) not finitely representable by finite-width architectures (McCarty, 2023).
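A concrete instance of finite realizability (our example): $f(x) = |x|$ corresponds to the two-atom measure $\mu = \delta_{(w=1,\,b=0)} + \delta_{(w=-1,\,b=0)}$, i.e., an exact width-2 ReLU network with total variation $\|\mu\| = 2$:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# |x| = relu(x) + relu(-x): the integral representation with a two-atom
# measure collapses to a finite-width (width-2) ReLU network.
f = lambda x: relu(x) + relu(-x)

xs = np.linspace(-3, 3, 7)
```

Both creases of $|x|$ sit on the hyperplane $x = 0$, matching the support of $\mu$.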

6. Gradient Flow, Learning Dynamics, and Limitations

Gradient flow in the infinite-width integral regime can be formalized as PDEs for parameter-distribution evolution. For two-layer models with integrable parameterizations (IP scaling), as $m \to \infty$, the empirical distribution of neuron parameters converges weakly to a deterministic law, and the network output is recovered via the integral

$$f(x) = \int a\, \sigma(w \cdot x)\, d\rho(w, a).$$

For multilayer networks, IP scaling aligns gradients, weights, and learning rates to control infinite-width limits (Hajjar et al., 2021). However, standard IP parameterizations with naive learning rates lead to trivial stationary points (no learning) for deep networks (depth $> 4$), under which the network function and gradients remain frozen at initialization (Hajjar et al., 2021). Remedies, such as large initial learning rates (IP-LLR) or equivalence to the $\mu$P parameterization, reintroduce nontrivial learning by appropriately scaling update magnitudes and preserving dynamical feature learning.

Empirically, IP-LLR recovers nontrivial learning dynamics, although the effect is sensitive to activation choice: for non-smooth activations such as ReLU, feature learning and test accuracy degrade, with deep layers showing rank collapse in learned features, whereas smoother activations (ELU, GeLU, tanh) maintain higher feature rank and generalization (Hajjar et al., 2021). Non-centered initialization and inappropriate bias scaling can also degenerate learning and representation.

7. Unification with Random Features, Kernels, and Operator Learning

The integral infinite width representation subsumes and unifies classical random feature models, kernel methods, operator learning, and neural field paradigms. For instance, random feature models with i.i.d. draws from a base measure, with equal weights, yield random approximations to the integral representation via Monte Carlo sums (Prieur et al., 19 Dec 2025). Kernel ridge regression and random Fourier feature approximations correspond to specific choices of $\sigma$ and parameter measure.
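Random Fourier features are the canonical example: choosing $\sigma = \cos$ and the Gaussian spectral measure yields a Monte Carlo approximation of the RBF kernel. A sketch (the function name and defaults are ours):

```python
import numpy as np

def rff_features(X, n_features, gamma=0.5, rng=None):
    """Random Fourier features (Rahimi & Recht): Monte Carlo approximation of
    exp(-gamma ||x - x'||^2) via k(x,x') = 2 E_{w,b}[cos(w.x + b) cos(w.x' + b)],
    with w ~ N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi]."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
Z = rff_features(X, 50_000, gamma=0.5, rng=rng)
K_approx = Z @ Z.T                                  # finite random-feature kernel
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * d2)                         # target integral kernel
err = np.max(np.abs(K_approx - K_exact))
```

The entrywise error shrinks at the Monte Carlo rate $O(n_\text{features}^{-1/2})$, mirroring the law-of-large-numbers passage from finite random-feature sums to the integral representation.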

Operator learning and neural field models emerge by letting the domain of integration be a spatial or manifold variable, with kernels or activation functions as continuous operators. This connects to continuous neural networks in the sense of Le Roux & Bengio (2007) and general integro-differential neural fields, unifying perspectives from machine learning, control, and computational neuroscience (Prieur et al., 19 Dec 2025).

The integral viewpoint thus offers a rigorous mathematical foundation for analyzing deep, wide, and continuous neural networks, as well as a comprehensive bridge connecting contemporary learning theory, functional analysis, and stochastic approximation in high dimensions.
