- The paper introduces a trainable scaling parameter in activation functions to accelerate convergence and improve approximation accuracy in both DNNs and PINNs.
- It demonstrates enhanced early learning of high-frequency components and noise robustness across various PDE applications, including Burgers and Helmholtz equations.
- Empirical results show significant reductions in L2 errors and faster convergence times, underscoring the practical impact of adaptive activation in scientific computing.
Introduction
The paper "Adaptive activation functions accelerate convergence in deep and physics-informed neural networks" (1906.01170) systematically investigates the impact of introducing a scalable hyper-parameter into activation functions for both standard deep neural networks (DNNs) and physics-informed neural networks (PINNs). The primary focus is on regression of smooth and discontinuous functions, as well as solution inference (forward problems) and parameter identification (inverse problems) for a range of linear and nonlinear PDEs, including the nonlinear Klein-Gordon, Burgers, and Helmholtz equations.
The work situates itself within a rapidly growing literature on utilizing DNNs as ansatz spaces for numerical solution of PDEs, and extends existing methodologies by providing a general, dynamically tunable mechanism for adjusting the activation function across arbitrary network depths. The authors provide empirical evidence of substantial improvements in convergence rate and approximation accuracy achieved by optimizing a scaling parameter in the activation function during training, with implications for both forward and inverse PDE problems.
Methodology
The central methodological contribution is the adaptive activation function, parameterized as σ(n·α·L_k(x_{k−1})), where α is a layer-wise (potentially global) trainable scaling parameter and n ≥ 1 is a fixed multiplicative factor that modulates optimization sensitivity. This framework generalizes typical fixed choices like tanh, sigmoid, or ReLU by embedding an additional scaling degree of freedom into the nonlinearity, which is co-optimized alongside network weights and biases via stochastic gradient-based methods (specifically Adam). This approach sidesteps manual tuning of the activation function, enabling the network to self-adjust its nonlinearity in response to the problem's complexity and the evolving loss landscape.
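The mechanism can be sketched as a single hidden layer whose nonlinearity is scaled by n·α before the tanh is applied. This is a minimal illustration of the parameterization above, not the authors' code; function and variable names are ours.

```python
import math

def adaptive_tanh_layer(x, weights, biases, alpha, n=10.0):
    """One hidden layer with the adaptive activation tanh(n * alpha * L(x)),
    where L(x) = W x + b. `alpha` is the trainable slope parameter (shared
    layer-wise here) and `n` is the fixed scale factor. Illustrative sketch;
    names and shapes are not from the paper's implementation."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(wi * xi for wi, xi in zip(w_row, x)) + b  # linear part L_k(x_{k-1})
        out.append(math.tanh(n * alpha * z))              # scaled nonlinearity
    return out

# With alpha = 1/n the layer reduces to a standard tanh layer.
x = [0.5, -0.2]
weights = [[0.3, 0.1], [-0.4, 0.2]]
biases = [0.0, 0.1]
y = adaptive_tanh_layer(x, weights, biases, alpha=0.1, n=10.0)
```

During training, α would be updated by the same Adam steps as the weights and biases, so the slope of the nonlinearity adapts to the loss landscape rather than being fixed a priori.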
For PINN-based approaches, the loss function J(Θ) includes both data-driven error (MSE on observed values or BC/IC) and physics-residual terms (MSE on PDE residuals at collocation points). Both forward (solution inference) and inverse (parameter identification within the PDE) problems are handled within this loss framework. The trainable parameters Θ are thus augmented to include α, which, crucially, can control the slope and sensitivity of the activation function during learning, directly influencing gradient flow, conditioning, and capacity to represent spectral components of the solution.
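The composite objective described above can be written schematically as a sum of two mean-squared-error terms. This is a generic sketch of the loss form, with no problem-specific weighting assumed.

```python
def pinn_loss(u_pred_data, u_data, residuals):
    """Composite PINN objective J(Θ) = MSE_data + MSE_residual.
    `u_pred_data`/`u_data`: network predictions vs. observed values (or BC/IC);
    `residuals`: PDE residual evaluated at collocation points. A generic
    sketch of the loss structure; term weighting is omitted."""
    mse_data = sum((p - d) ** 2 for p, d in zip(u_pred_data, u_data)) / len(u_data)
    mse_res = sum(r ** 2 for r in residuals) / len(residuals)
    return mse_data + mse_res

# Toy values: one exact data fit, one off by 1; two residuals of ±0.5.
loss = pinn_loss([1.0, 2.0], [1.0, 1.0], [0.5, -0.5])
```

For inverse problems, the unknown PDE parameters simply join Θ alongside the weights, biases, and α, and are recovered by minimizing the same objective.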
Experimental Results
Neural Network Regression
The adaptive activation mechanism is tested for function approximation tasks involving (i) highly oscillatory smooth functions and (ii) piecewise discontinuous functions. In all cases, networks with adaptive activation achieve lower loss faster, and require fewer training steps to capture high-frequency spectral content, as demonstrated via Fourier analysis of the network outputs throughout training.
For example, for u(x) = (x³ − x)·sin(7x) + sin(12x) defined on x ∈ [−3, 3], the adaptive-activation model captures all relevant frequency bands in approximately 22,000 iterations, whereas the fixed-activation model fails to capture the highest-frequency components even after comparable training. For a discontinuous function, the benefits are even more pronounced, corroborating the hypothesis that the additional trainable degree of freedom enables more rapid representation of non-smooth features.
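The frequency-band analysis above amounts to sampling the benchmark function (or the network output at a given training stage) and inspecting its discrete Fourier spectrum. A minimal sketch, using a naive O(N²) DFT; the sampling grid and resolution are our choices, not the paper's.

```python
import math

def target(x):
    """The oscillatory regression benchmark u(x) = (x^3 - x) sin(7x) + sin(12x)."""
    return (x ** 3 - x) * math.sin(7 * x) + math.sin(12 * x)

def dft_magnitudes(samples):
    """Naive DFT magnitude spectrum. Comparing such spectra of the network
    output across training iterations shows which frequency bands have been
    captured so far (sketch only)."""
    N = len(samples)
    mags = []
    for k in range(N // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / N) for i, s in enumerate(samples))
        im = sum(-s * math.sin(2 * math.pi * k * i / N) for i, s in enumerate(samples))
        mags.append(math.hypot(re, im))
    return mags

# Sample u on [-3, 3] and compute its magnitude spectrum.
xs = [-3 + 6 * i / 256 for i in range(256)]
spectrum = dft_magnitudes([target(x) for x in xs])
```

Applied to the network output at successive checkpoints, the spectrum reveals whether the high-frequency bands of the target have been learned yet; the paper's comparison shows these bands appearing much earlier under adaptive activation.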
PDE Solution: Forward and Inverse Problems
Burgers Equation
Networks with adaptive activation are compared for solving the 1D nonlinear Burgers equation with vanishing viscosity, a prototypical PDE that develops steep gradients and shocks. The relative L2 error after 2,000 iterations decreases from 1.91×10⁻¹ (fixed activation) to 9.52×10⁻² (adaptive, n = 10). Notably, convergence accelerates as the scale factor n increases, up to a point beyond which the loss oscillates due to optimizer sensitivity. Faster learning of frequency components under adaptive activation is repeatedly verified, both quantitatively and in the frequency domain.
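The errors quoted throughout these benchmarks are relative L2 errors, which can be computed as follows. A straightforward sketch of the metric, evaluated here on toy vectors rather than actual PDE solutions.

```python
import math

def relative_l2_error(u_pred, u_true):
    """Relative L2 error ||u_pred - u_true||_2 / ||u_true||_2, the accuracy
    metric reported for the Burgers, Klein-Gordon, and Helmholtz benchmarks."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(u_pred, u_true)))
    den = math.sqrt(sum(t * t for t in u_true))
    return num / den

# Toy example: a prediction off by 0.1 in one component.
err = relative_l2_error([1.1, 2.0], [1.0, 2.0])
```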
Klein-Gordon and Helmholtz Equations
Similar performance improvements are documented for the nonlinear Klein-Gordon equation and the 2D Helmholtz problem. For the Klein-Gordon case (with quadratic nonlinearity), adaptive activation reduces the L2 error by over 50% (from 1.95×10⁻¹ to 9.06×10⁻²), alongside a several-fold speedup in convergence. On the Helmholtz equation, adaptive activation reduces the error from 1.06×10⁻¹ to 7.19×10⁻² at 3,600 iterations.
Inverse Problem: Sine-Gordon PDE Identification
For parametric discovery and solution identification in the 2D sine-Gordon equation under data noise, adaptive activation yields striking improvements. With 500 collocation data points, the maximum error in the identified parameters drops from nearly 9% (fixed) to 1.24% (adaptive), with significantly faster loss decay. The approach remains robust under up to 2% added noise, and the identified operator is consistently closer to the ground truth. The authors provide both operator-identification and solution-error metrics that demonstrate adaptive activation's advantage in ill-conditioned and data-limited inverse settings.
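The parameter-identification figures above correspond to a maximum relative error across the recovered PDE coefficients, which can be sketched as below. The example values are illustrative placeholders, not the paper's identified parameters.

```python
def max_parameter_error_percent(identified, true_params):
    """Maximum relative error (%) over identified PDE parameters — the kind
    of metric behind the reported ~9% (fixed) vs 1.24% (adaptive) comparison.
    Example inputs are illustrative, not taken from the paper."""
    return max(abs(i - t) / abs(t) * 100.0 for i, t in zip(identified, true_params))

# Two hypothetical coefficients recovered with 2% and 5% error respectively.
err_pct = max_parameter_error_percent([0.98, 1.05], [1.0, 1.0])
```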
Analysis and Implications
This study provides clear evidence that adaptive activation functions confer marked improvements in network trainability and final accuracy when approximating solutions to PDEs—particularly those with sharp gradients, discontinuities, or complex spectral content.
A key finding is that models with adaptive activation consistently learn higher-frequency components earlier in the training process (in line with but significantly enhancing observations from the Frequency Principle). This effect translates to both faster convergence in terms of epochs and lower final L2 errors, observable across different activation bases (tanh, sin), problem types, and even in the presence of noise.
The introduction of an extra trainable parameter α does not appreciably increase model or optimization complexity, but dramatically reshapes the geometry of the loss landscape. However, care is required: excessive scaling (large n) can lead to optimizer instability, underlining the importance of properly balancing scaling sensitivity and regularization.
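One way to see why large n risks instability: the weight gradient of a scaled activation grows with the product n·α. The sketch below checks this numerically via a central finite difference; it is our illustration of the scaling effect, not an analysis from the paper.

```python
import math

def grad_wrt_weight(w, x, alpha, n, eps=1e-6):
    """Central-difference d/dw of tanh(n * alpha * w * x). At w = 0 this equals
    n * alpha * x exactly, so the effective weight-gradient magnitude grows
    linearly with n * alpha — sketching why overly large n can destabilize
    the optimizer (illustrative, not from the paper)."""
    f = lambda wv: math.tanh(n * alpha * wv * x)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# Same alpha, two values of n: the gradient scales by the same factor of 10.
g_small = grad_wrt_weight(0.0, 0.5, alpha=0.1, n=1.0)   # effective scale n*alpha = 0.1
g_large = grad_wrt_weight(0.0, 0.5, alpha=0.1, n=10.0)  # effective scale n*alpha = 1.0
```

This linear amplification of gradient magnitudes is consistent with the observed behavior: moderate n accelerates convergence, while excessive n makes the optimization oscillate.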
These results have several implications:
- PINN applications to high-gradient, highly oscillatory, or discontinuous PDEs can be rendered dramatically more efficient by employing adaptive activation.
- The adaptive mechanism is general and compatible with current architectures and optimizers; it does not presuppose particular problem structure.
- For inverse problems, improved loss geometry via adaptive activation results in enhanced noise robustness and parameter identifiability, widening the practical applicability of neural operator discovery frameworks.
- The method offers a simple, broadly applicable alternative to manual activation or architecture engineering for accelerating convergence in deep learning applied to computational physics.
Future Directions
Potential avenues for further investigation include:
- Formal analysis of the impact of adaptive activation on the geometry (conditioning, sharpness) of the loss landscape and the associated optimization trajectories.
- Exploration of multi-parameter and neuron-wise adaptivity within activation functions.
- Integration with more advanced physics-informed frameworks, including transfer learning for PINNs, or in the context of operator learning architectures (e.g., DeepONet or Fourier Neural Operators).
- Extension to large-scale, high-dimensional problems with more complex boundary conditions and data types (e.g., multi-fidelity, stochastic PDEs).
- Study of adaptive activation’s interaction with various regularization and normalization schemes, especially regarding generalization capacity and robustness.
Conclusion
The introduction of adaptive activation functions with optimized scale parameters yields significant improvement in convergence rates and approximation fidelity for both DNNs and PINNs applied to regression, forward PDE, and inverse PDE problems. The results establish adaptive activation as a powerful, architecture-agnostic tool for enhancing the efficiency and robustness of neural network-based scientific computing across a wide spectrum of physical and engineering problems (1906.01170).