- The paper demonstrates that one-layer neural networks with fixed biases achieve universal L2 approximation under suitable activation functions.
- It rigorously proves gradient descent convergence and quantifies spectral bias using operator-theoretic methods for both ReLU and the proposed FReX activation.
- The study introduces FReX as a mathematically justified alternative to ReLU, offering faster spectral decay and improved localized kernel properties.
Mathematical Analysis of One-Layer Neural Networks with Fixed Biases and a New Activation Function
Overview
This paper provides a rigorous functional-analytic study of one-hidden-layer neural networks with fixed biases in both continuous and discretized settings, targeting one-dimensional function approximation with the ReLU activation. It delivers detailed convergence proofs for gradient descent with quadratic loss and a spectral bias analysis grounded in the operator spectrum, deduces precise criteria for activation-function suitability, and ultimately proposes and analyzes the "full-wave rectified exponential" (FReX) activation as an alternative to ReLU. Both empirical and spectral properties of these models are elucidated.
Model Formulation and Main Reductions
The authors consider two primary models:
- Continuous model: The standard two-layer neural network is shown, via a change of variables and parameter absorption, to reduce to a one-layer network with fixed first-layer weights and biases:
f(x) = ∫_{−∞}^{∞} w(z) σ(x − z) dz + b + cx,
where σ is an activation (initially ReLU), w(z) is a trainable second-layer weighting function, and b,c are bias and linear parameters.
- Discrete model: The input domain [0, 1] is partitioned into N intervals, with corresponding weights w(t_j) and prescribed biases. The function space X comprises continuous, piecewise-linear functions on the grid. The network representation involves summation over shifted activations.
The main theoretical reduction demonstrates that in the continuous setting, the network with prescribed first-layer biases and a single trainable weighting function is sufficient for universal L2 approximation (on [0,1]), provided the activation function is a fundamental solution to an appropriate second-order operator.
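A minimal numerical sketch of this reduction (grid sizes and the target function are illustrative, not from the paper): with fixed, regularly spaced biases, fitting only the second-layer weights (plus bias and linear terms) by least squares already approximates a smooth target closely.

```python
import numpy as np

# Reduced one-layer model: f(x) = sum_j w_j ReLU(x - t_j) + b + c x,
# with the biases t_j fixed on a regular grid and only w, b, c trained.
N = 32
t = np.linspace(0.0, 1.0, N, endpoint=False)   # fixed, regularly spaced biases
x = np.linspace(0.0, 1.0, 400)                 # evaluation grid on [0, 1]
target = np.sin(2 * np.pi * x) + 0.5 * x**2    # arbitrary smooth target

# Design matrix: one column per shifted ReLU, plus constant and linear terms.
Phi = np.maximum(x[:, None] - t[None, :], 0.0)
Phi = np.hstack([Phi, np.ones((x.size, 1)), x[:, None]])

coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)
err = np.max(np.abs(Phi @ coef - target))
print(f"max approximation error with {N} fixed biases: {err:.2e}")
```

The error is on the order of the piecewise-linear interpolation error, h²·max|f″|/8 with h = 1/N, consistent with the claim that fixed biases suffice for expressivity in this setting.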
Rigorous Gradient Descent Convergence
Leveraging the linear dependence of network outputs on parameters and using operator-theoretic methods, the paper proves strong convergence of full-batch gradient descent for quadratic loss, both in the continuous and discrete formulations:
- Continuous: For target functions f* in L²([0, 1]), the gradient-descent iterates f_t converge strongly (in norm) to the unique minimizer representing f* via the neural network. Smoothness of f* governs the convergence rate: the smoother the target, the faster the polynomial decay of the error in the iteration count.
- Discrete: In the discretized setting on a grid, analogous convergence results hold. The critical operator is finite-dimensional, enabling spectral gap arguments and uniform operator-norm decay bounds.
The proofs utilize the self-adjointness and positivity of the induced integral operator (resp. its finite-dimensional counterpart), together with the compactness properties of convolution with the activation.
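The mechanism behind these convergence results can be sketched in a toy discrete setting (dimensions and targets are illustrative): because the network output is linear in the trainable weights, the quadratic loss is convex with a fixed positive-semidefinite Hessian, so full-batch gradient descent with a stable step size monotonically drives the loss toward its minimum.

```python
import numpy as np

# Toy discrete model: shifted-ReLU features with fixed biases, quadratic loss.
N = 16
t = np.linspace(0.0, 1.0, N, endpoint=False)
x = np.linspace(0.0, 1.0, 200)
Phi = np.maximum(x[:, None] - t[None, :], 0.0)   # design matrix (linear in w)
y = np.sin(2 * np.pi * x)                        # target samples

w = np.zeros(N)
H = Phi.T @ Phi                                  # fixed Hessian of the loss
eta = 1.0 / np.linalg.eigvalsh(H).max()          # step size <= 1/L for stability
losses = []
for _ in range(20000):
    r = Phi @ w - y                              # residual
    losses.append(np.mean(r**2))
    w -= eta * Phi.T @ r                         # full-batch gradient step
print(f"initial MSE {losses[0]:.3e} -> final MSE {losses[-1]:.3e}")
```

With η below the inverse of the largest Hessian eigenvalue, each eigenmode of the error contracts by a fixed factor per step, which is exactly the structure the paper's spectral arguments exploit.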
Spectral Bias and Frequency Principle
The authors conduct a thorough operator spectral analysis, quantifying the "spectral bias" in learning:
- The error after t iterations decomposes as
‖f_t − f*‖² = Σ_k (1 − η λ_k)^{2t} |⟨f_0 − f*, φ_k⟩|²,
where η is the step size and λ_k, φ_k are the eigenvalues and eigenfunctions of the network operator.
- For the ReLU network, the relevant integral operator inverts a fourth-order differential operator, yielding eigenvalues λ_k ∼ k^{−4}, so the learned frequency scales as t^{1/4} with iteration count t. This aligns precisely with observed frequency dynamics in the neural network training literature (the so-called "frequency principle").
The frequency-wise learning rates are thus determined by the spectral decay of the associated functional-analytic operator.
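The spectral bias is easy to observe in a toy version of this setup (an illustrative experiment, not the paper's): after a fixed budget of gradient-descent iterations, the low-frequency component of a mixed target is essentially learned while the high-frequency component is not, reflecting the rapid decay of the operator's eigenvalues.

```python
import numpy as np

# Target mixing a low and a high frequency; shifted-ReLU features, fixed biases.
N = 64
t = np.linspace(0.0, 1.0, N, endpoint=False)
x = np.linspace(0.0, 1.0, 512, endpoint=False)
Phi = np.maximum(x[:, None] - t[None, :], 0.0)
y = np.sin(2 * np.pi * x) + np.sin(2 * np.pi * 8 * x)

w = np.zeros(N)
eta = 1.0 / np.linalg.eigvalsh(Phi.T @ Phi).max()
for _ in range(2000):                            # fixed iteration budget
    w -= eta * Phi.T @ (Phi @ w - y)

# Residual spectrum: bin 1 = frequency 1, bin 8 = frequency 8.
resid = np.fft.rfft(Phi @ w - y) / x.size
low_err, high_err = abs(resid[1]), abs(resid[8])
print(f"residual at frequency 1: {low_err:.3f}, at frequency 8: {high_err:.3f}")
```

The frequency-1 residual is near zero while the frequency-8 residual remains close to its initial amplitude of 0.5, matching the λ_k ∼ k^{−4} learning-rate separation.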
Activation Function Properties: From ReLU to FReX
A significant conceptual outcome is the identification of a key mathematical property for activation functions: being the fundamental solution of a second-order differential operator. For ReLU,
(d²/dx²) ReLU(x) = δ(x),
implying that any sufficiently smooth (e.g., twice-differentiable) function can be represented by taking w = f″, and the network is exactly parameterized (i.e., not overparameterized as with standard wide networks).
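The fundamental-solution property of ReLU can be checked numerically: the centered second difference of ReLU on a grid is a discrete Dirac delta (mass 1/h at the origin, zero elsewhere).

```python
import numpy as np

# Second finite difference of ReLU: should be a discrete delta at the kink.
h = 1e-3
x = np.arange(-5, 6) * h                          # grid straddling the kink at 0
relu = np.maximum(x, 0.0)
d2 = (relu[2:] - 2 * relu[1:-1] + relu[:-2]) / h**2

# Multiplying by h discretely "integrates" d2: all mass sits at the origin
# and sums to 1, exactly as a delta function should.
print(np.round(d2 * h, 9))
```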
The paper proposes the FReX activation,
FReX(x) = e^{−|x|},
which (up to a constant factor) is the Green's function for the operator 1 − d²/dx². Critically, FReX is continuous and non-differentiable at zero (mirroring ReLU), but exponentially decaying and integrable, yielding a localized kernel with faster spectral decay.
Analysis of the corresponding network yields similar representability and convergence properties. The spectral bias calculations can be performed explicitly in Fourier space: since the Fourier transform of e^{−|x|} is 2/(1 + ξ²), learning a component at frequency ξ slows as (1 + ξ²)² grows, but the influence is more localized due to FReX's decay.
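A quick Fourier-space check, assuming the form FReX(x) = e^{−|x|} (the even, exponentially decaying Green's function described above): its Fourier transform equals the closed form 2/(1 + ξ²), which is the source of the (1 + ξ²)² slowdown per frequency.

```python
import numpy as np

# Numerically verify the Fourier transform of e^{-|x|} against 2/(1 + xi^2).
x = np.linspace(-40.0, 40.0, 80001)              # wide grid; tails ~ e^{-40}
dx = x[1] - x[0]
frex = np.exp(-np.abs(x))

results = {}
for xi in (0.0, 1.0, 4.0):
    # e^{-|x|} is even, so the transform reduces to a cosine integral.
    results[xi] = np.sum(frex * np.cos(xi * x)) * dx
    print(f"xi = {xi}: numeric FT = {results[xi]:.6f}, "
          f"closed form = {2 / (1 + xi**2):.6f}")
```

The quadratic decay of the symbol in ξ (versus the slower algebraic decay of the ReLU kernel's long-range terms) is what makes the FReX kernel localized.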
Empirical and Structural Observations
- Bias parameterization: The analysis and empirical investigations suggest that fixed, regularly distributed biases suffice for expressivity in these one-layer models, as bias learning in overparameterized regimes redistributes biases uniformly.
- Comparison to other activations: FReX, unlike sigmoidal or softplus-type activations, is even, non-monotonic, and non-smooth at the origin, with good performance in MNIST classification, matching or outperforming sigmoid and approaching ReLU.
- Implications for SGD: FReX’s localized kernel structure is theoretically superior for the analysis of SGD, due to reduced long-range dependence and faster decay of off-diagonal kernel terms.
Implications and Directions for Future Research
This work supplies theoretical foundations for understanding the representational and optimization properties of shallow neural networks with pre-fixed biases and particular classes of activation functions. Several implications and directions arise:
- Design criteria for activations: Second-order fundamental solution status is a mathematical criterion for universal approximation and convergence guarantees.
- Operator-theoretic learning analysis: The spectral structure of the induced kernel governs learning dynamics, providing quantifiable error decay for all frequencies.
- Potential for extending to higher dimensions: While the analysis is currently one-dimensional, extensions to multivariate settings (with Laplace or Helmholtz Green's functions as activations) are a natural theoretical trajectory, albeit technically challenging.
- Connections to overparameterization and NTK theory: The exact parameterization shown here contrasts with NTK-like overparameterized networks; the transition between these regimes requires further study.
- Practical architectures: Fixing biases and using FReX-like activations could enable more efficient and theoretically grounded architectures, especially in settings where memory or parameter count is at a premium.
Conclusion
This work rigorously characterizes one-layer neural networks with fixed biases through the lens of functional analysis, making explicit the roles of the bias and activation in expressivity and learning convergence. The introduction and mathematical validation of the FReX activation, with detailed frequency-dependent convergence guarantees, represent a conceptual advance in the theoretical understanding of activation selection and training dynamics for shallow networks. This paradigm opens numerous avenues for further mathematical and practical developments in network architecture and optimization (2604.07715).