- The paper demonstrates that one-layer neural networks with fixed biases achieve universal L2 approximation under suitable activation functions.
- It rigorously proves gradient descent convergence and quantifies spectral bias using operator-theoretic methods for both ReLU and the proposed FReX activation.
- The study introduces FReX as a mathematically justified alternative to ReLU, offering faster spectral decay and improved localized kernel properties.
Mathematical Analysis of One-Layer Neural Networks with Fixed Biases and a New Activation Function
Overview
This paper provides a rigorous functional-analytic study of one-hidden-layer neural networks with fixed biases in both continuous and discretized settings, targeting one-dimensional function approximation with the ReLU activation. It delivers detailed convergence proofs for gradient descent with quadratic loss and a spectral bias analysis grounded in the operator spectrum, deduces precise criteria for activation-function suitability, and ultimately proposes and analyzes the "full-wave rectified exponential" (FReX) activation as an alternative to ReLU. Both empirical and spectral properties of these models are elucidated.
Model Formulation and Main Reductions
The authors consider two primary models:
- Continuous model: The standard two-layer neural network is shown, via a change of variables and parameter absorption, to reduce to a one-layer network with fixed first-layer weights and biases:
f(x) = ∫_{−∞}^{∞} w(z) σ(x − z) dz + b + cx,
where σ is an activation (initially ReLU), w(z) is a trainable second-layer weighting function, and b,c are bias and linear parameters.
- Discrete model: The input domain [0, 1] is partitioned into N intervals, with corresponding weights w(t_j) and prescribed biases. The function space X comprises continuous, piecewise-linear functions on the grid. The network representation involves summation over shifted activations.
The main theoretical reduction demonstrates that in the continuous setting, the network with prescribed first-layer biases and a single trainable weighting function is sufficient for universal L2 approximation (on [0,1]), provided the activation function is a fundamental solution to an appropriate second-order operator.
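A minimal numerical sketch of this reduction (grid sizes and the target function are illustrative, not from the paper): with fixed, regularly spaced biases, fitting only the second-layer weights (plus bias and linear terms) by least squares already approximates a smooth target closely.

```python
import numpy as np

# Reduced one-layer model: f(x) = sum_j w_j ReLU(x - t_j) + b + c x,
# with the biases t_j fixed on a regular grid and only w, b, c trained.
N = 32
t = np.linspace(0.0, 1.0, N, endpoint=False)   # fixed, regularly spaced biases
x = np.linspace(0.0, 1.0, 400)                 # evaluation grid on [0, 1]
target = np.sin(2 * np.pi * x) + 0.5 * x**2    # arbitrary smooth target

# Design matrix: one column per shifted ReLU, plus constant and linear terms.
Phi = np.maximum(x[:, None] - t[None, :], 0.0)
Phi = np.hstack([Phi, np.ones((x.size, 1)), x[:, None]])

coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)
err = np.max(np.abs(Phi @ coef - target))
print(f"max approximation error with {N} fixed biases: {err:.2e}")
```

The error is on the order of the piecewise-linear interpolation error, h²·max|f″|/8 with h = 1/N, consistent with the claim that fixed biases suffice for expressivity in this setting.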
Rigorous Gradient Descent Convergence
Leveraging the linear dependence of network outputs on parameters and using operator-theoretic methods, the paper proves strong convergence of full-batch gradient descent for quadratic loss, both in the continuous and discrete formulations:
- Continuous: For target functions f* in L²([0, 1]), the gradient-descent iterates f_t converge strongly (in norm) to the unique minimizer representing f* via the neural network. Smoothness of f* governs the convergence rate: the smoother the target, the faster the polynomial decay of the error in the iteration count.
- Discrete: In the discretized setting on a grid, analogous convergence results hold. The critical operator is finite-dimensional, enabling spectral gap arguments and uniform operator-norm decay bounds.
The proofs utilize the self-adjointness and positivity of the induced integral operator (resp. its finite-dimensional counterpart), together with the compactness properties of convolution with the activation.
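The mechanism behind these convergence results can be sketched in a toy discrete setting (dimensions and targets are illustrative): because the network output is linear in the trainable weights, the quadratic loss is convex with a fixed positive-semidefinite Hessian, so full-batch gradient descent with a stable step size monotonically drives the loss toward its minimum.

```python
import numpy as np

# Toy discrete model: shifted-ReLU features with fixed biases, quadratic loss.
N = 16
t = np.linspace(0.0, 1.0, N, endpoint=False)
x = np.linspace(0.0, 1.0, 200)
Phi = np.maximum(x[:, None] - t[None, :], 0.0)   # design matrix (linear in w)
y = np.sin(2 * np.pi * x)                        # target samples

w = np.zeros(N)
H = Phi.T @ Phi                                  # fixed Hessian of the loss
eta = 1.0 / np.linalg.eigvalsh(H).max()          # step size <= 1/L for stability
losses = []
for _ in range(20000):
    r = Phi @ w - y                              # residual
    losses.append(np.mean(r**2))
    w -= eta * Phi.T @ r                         # full-batch gradient step
print(f"initial MSE {losses[0]:.3e} -> final MSE {losses[-1]:.3e}")
```

With η below the inverse of the largest Hessian eigenvalue, each eigenmode of the error contracts by a fixed factor per step, which is exactly the structure the paper's spectral arguments exploit.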
Spectral Bias and Frequency Principle
The authors conduct a thorough operator spectral analysis, quantifying the "spectral bias" in learning:
- The error after t iterations decomposes as
‖f_t − f*‖² = Σ_k (1 − η λ_k)^{2t} |⟨f_0 − f*, φ_k⟩|²,
where η is the step size and λ_k, φ_k are the eigenvalues and eigenfunctions of the network operator.
- For the ReLU network, the relevant integral operator inverts a fourth-order differential operator, yielding eigenvalues λ_k ∼ k^{−4}, so the learned frequency scales as t^{1/4} with iteration count t. This aligns precisely with observed frequency dynamics in the neural network training literature (the so-called "frequency principle").
The frequency-wise learning rates are thus determined by the spectral decay of the associated functional-analytic operator.
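The spectral bias is easy to observe in a toy version of this setup (an illustrative experiment, not the paper's): after a fixed budget of gradient-descent iterations, the low-frequency component of a mixed target is essentially learned while the high-frequency component is not, reflecting the rapid decay of the operator's eigenvalues.

```python
import numpy as np

# Target mixing a low and a high frequency; shifted-ReLU features, fixed biases.
N = 64
t = np.linspace(0.0, 1.0, N, endpoint=False)
x = np.linspace(0.0, 1.0, 512, endpoint=False)
Phi = np.maximum(x[:, None] - t[None, :], 0.0)
y = np.sin(2 * np.pi * x) + np.sin(2 * np.pi * 8 * x)

w = np.zeros(N)
eta = 1.0 / np.linalg.eigvalsh(Phi.T @ Phi).max()
for _ in range(2000):                            # fixed iteration budget
    w -= eta * Phi.T @ (Phi @ w - y)

# Residual spectrum: bin 1 = frequency 1, bin 8 = frequency 8.
resid = np.fft.rfft(Phi @ w - y) / x.size
low_err, high_err = abs(resid[1]), abs(resid[8])
print(f"residual at frequency 1: {low_err:.3f}, at frequency 8: {high_err:.3f}")
```

The frequency-1 residual is near zero while the frequency-8 residual remains close to its initial amplitude of 0.5, matching the λ_k ∼ k^{−4} learning-rate separation.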
Activation Function Properties: From ReLU to FReX
A significant conceptual outcome is the identification of a key mathematical property for activation functions: being the fundamental solution of a second-order differential operator. For ReLU,
(d²/dx²) ReLU(x) = δ(x),
implying that any sufficiently smooth (e.g., twice-differentiable) function can be represented by taking w = f″, and the network is exactly parameterized (i.e., not overparameterized as with standard wide networks).
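The fundamental-solution property of ReLU can be checked numerically: the centered second difference of ReLU on a grid is a discrete Dirac delta (mass 1/h at the origin, zero elsewhere).

```python
import numpy as np

# Second finite difference of ReLU: should be a discrete delta at the kink.
h = 1e-3
x = np.arange(-5, 6) * h                          # grid straddling the kink at 0
relu = np.maximum(x, 0.0)
d2 = (relu[2:] - 2 * relu[1:-1] + relu[:-2]) / h**2

# Multiplying by h discretely "integrates" d2: all mass sits at the origin
# and sums to 1, exactly as a delta function should.
print(np.round(d2 * h, 9))
```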
The paper proposes the FReX activation,
FReX(x) = e^{−|x|},
which (up to a constant factor) is the Green's function for the operator 1 − d²/dx². Critically, FReX is continuous and non-differentiable at zero (mirroring ReLU), but exponentially decaying and integrable, yielding a localized kernel with faster spectral decay.
Analysis of the corresponding network yields similar representability and convergence properties. The spectral bias calculations can be performed explicitly in Fourier space: since the Fourier transform of e^{−|x|} is 2/(1 + ξ²), learning a component at frequency ξ slows as (1 + ξ²)² grows, but the influence is more localized due to FReX's decay.
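A quick Fourier-space check, assuming the form FReX(x) = e^{−|x|} (the even, exponentially decaying Green's function described above): its Fourier transform equals the closed form 2/(1 + ξ²), which is the source of the (1 + ξ²)² slowdown per frequency.

```python
import numpy as np

# Numerically verify the Fourier transform of e^{-|x|} against 2/(1 + xi^2).
x = np.linspace(-40.0, 40.0, 80001)              # wide grid; tails ~ e^{-40}
dx = x[1] - x[0]
frex = np.exp(-np.abs(x))

results = {}
for xi in (0.0, 1.0, 4.0):
    # e^{-|x|} is even, so the transform reduces to a cosine integral.
    results[xi] = np.sum(frex * np.cos(xi * x)) * dx
    print(f"xi = {xi}: numeric FT = {results[xi]:.6f}, "
          f"closed form = {2 / (1 + xi**2):.6f}")
```

The quadratic decay of the symbol in ξ (versus the slower algebraic decay of the ReLU kernel's long-range terms) is what makes the FReX kernel localized.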
Empirical and Structural Observations
- Bias parameterization: The analysis and empirical investigations suggest that fixed, regularly distributed biases suffice for expressivity in these one-layer models, as bias learning in overparameterized regimes redistributes biases uniformly.
- Comparison to other activations: FReX, unlike sigmoidal or softplus-type activations, is even, non-monotonic, and non-smooth at the origin, with good performance in MNIST classification, matching or outperforming sigmoid and approaching ReLU.
- Implications for SGD: FReX’s localized kernel structure is theoretically superior for the analysis of SGD, due to reduced long-range dependence and faster decay of off-diagonal kernel terms.
Implications and Directions for Future Research
This work supplies theoretical foundations for understanding the representational and optimization properties of shallow neural networks with pre-fixed biases and particular classes of activation functions. Several implications and directions arise:
- Design criteria for activations: Second-order fundamental solution status is a mathematical criterion for universal approximation and convergence guarantees.
- Operator-theoretic learning analysis: The spectral structure of the induced kernel governs learning dynamics, providing quantifiable error decay for all frequencies.
- Potential for extending to higher dimensions: While the analysis is currently one-dimensional, extensions to multivariate settings (with Laplace or Helmholtz Green's functions as activations) are a natural theoretical trajectory, albeit technically challenging.
- Connections to overparameterization and NTK theory: The exact parameterization shown here contrasts with NTK-like overparameterized networks; the transition between these regimes requires further study.
- Practical architectures: Fixing biases and using FReX-like activations could enable more efficient and theoretically grounded architectures, especially in settings where memory or parameter count is at a premium.
Conclusion
This work rigorously characterizes one-layer neural networks with fixed biases through the lens of functional analysis, making explicit the roles of the bias and activation in expressivity and learning convergence. The introduction and mathematical validation of the FReX activation, with detailed frequency-dependent convergence guarantees, represent a conceptual advance in the theoretical understanding of activation selection and training dynamics for shallow networks. This paradigm opens numerous avenues for further mathematical and practical developments in network architecture and optimization (2604.07715).