- The paper introduces a trainable scaling parameter in activation functions to accelerate convergence and improve approximation accuracy in both DNNs and PINNs.
- It demonstrates enhanced early learning of high-frequency components and noise robustness across various PDE applications, including Burgers and Helmholtz equations.
- Empirical results show significant reductions in L2 errors and faster convergence times, underscoring the practical impact of adaptive activation in scientific computing.
Introduction
The paper "Adaptive activation functions accelerate convergence in deep and physics-informed neural networks" (1906.01170) systematically investigates the impact of introducing a scalable hyper-parameter into activation functions for both standard deep neural networks (DNNs) and physics-informed neural networks (PINNs). The primary focus is on regression of smooth and discontinuous functions, as well as solution inference (forward problems) and parameter identification (inverse problems) for a range of linear and nonlinear PDEs, including the nonlinear Klein-Gordon, Burgers, and Helmholtz equations.
The work situates itself within a rapidly growing literature on utilizing DNNs as ansatz spaces for numerical solution of PDEs, and extends existing methodologies by providing a general, dynamically tunable mechanism for adjusting the activation function across arbitrary network depths. The authors provide empirical evidence of substantial improvements in convergence rate and approximation accuracy achieved by optimizing a scaling parameter in the activation function during training, with implications for both forward and inverse PDE problems.
Methodology
The central methodological contribution is the adaptive activation function, parameterized as σ(n·α·L_k(x_{k−1})), where α is a layer-wise (potentially global) trainable scaling parameter and n ≥ 1 is a fixed multiplicative factor that modulates optimization sensitivity. This framework generalizes typical fixed choices like tanh, sigmoid, or ReLU by embedding an additional scaling degree of freedom into the nonlinearity, which is co-optimized alongside network weights and biases via stochastic gradient-based methods (specifically Adam). This approach sidesteps manual tuning of the activation function, enabling the network to self-adjust its nonlinearity in response to the problem's complexity and the evolving loss landscape.
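The mechanism can be sketched as a single hidden layer whose nonlinearity is scaled by n·α before the tanh is applied. This is a minimal illustration of the parameterization above, not the authors' code; function and variable names are ours.

```python
import math

def adaptive_tanh_layer(x, weights, biases, alpha, n=10.0):
    """One hidden layer with the adaptive activation tanh(n * alpha * L(x)),
    where L(x) = W x + b. `alpha` is the trainable slope parameter (shared
    layer-wise here) and `n` is the fixed scale factor. Illustrative sketch;
    names and shapes are not from the paper's implementation."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(wi * xi for wi, xi in zip(w_row, x)) + b  # linear part L_k(x_{k-1})
        out.append(math.tanh(n * alpha * z))              # scaled nonlinearity
    return out

# With alpha = 1/n the layer reduces to a standard tanh layer.
x = [0.5, -0.2]
weights = [[0.3, 0.1], [-0.4, 0.2]]
biases = [0.0, 0.1]
y = adaptive_tanh_layer(x, weights, biases, alpha=0.1, n=10.0)
```

During training, α would be updated by the same Adam steps as the weights and biases, so the slope of the nonlinearity adapts to the loss landscape rather than being fixed a priori.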
For PINN-based approaches, the loss function J(Θ) includes both data-driven error (MSE on observed values or BC/IC) and physics-residual terms (MSE on PDE residuals at collocation points). Both forward (solution inference) and inverse (parameter identification within the PDE) problems are handled within this loss framework. The trainable parameters Θ are thus augmented to include α, which, crucially, can control the slope and sensitivity of the activation function during learning, directly influencing gradient flow, conditioning, and capacity to represent spectral components of the solution.
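The composite objective described above can be written schematically as a sum of two mean-squared-error terms. This is a generic sketch of the loss form, with no problem-specific weighting assumed.

```python
def pinn_loss(u_pred_data, u_data, residuals):
    """Composite PINN objective J(Θ) = MSE_data + MSE_residual.
    `u_pred_data`/`u_data`: network predictions vs. observed values (or BC/IC);
    `residuals`: PDE residual evaluated at collocation points. A generic
    sketch of the loss structure; term weighting is omitted."""
    mse_data = sum((p - d) ** 2 for p, d in zip(u_pred_data, u_data)) / len(u_data)
    mse_res = sum(r ** 2 for r in residuals) / len(residuals)
    return mse_data + mse_res

# Toy values: one exact data fit, one off by 1; two residuals of ±0.5.
loss = pinn_loss([1.0, 2.0], [1.0, 1.0], [0.5, -0.5])
```

For inverse problems, the unknown PDE parameters simply join Θ alongside the weights, biases, and α, and are recovered by minimizing the same objective.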
Experimental Results
Neural Network Regression
The adaptive activation mechanism is tested for function approximation tasks involving (i) highly oscillatory smooth functions and (ii) piecewise discontinuous functions. In all cases, networks with adaptive activation achieve lower loss faster, and require fewer training steps to capture high-frequency spectral content, as demonstrated via Fourier analysis of the network outputs throughout training.
For example, for u(x) = (x³ − x)·sin(7x) + sin(12x) defined on x ∈ [−3, 3], the adaptive-activation model captures all relevant frequency bands in approximately 22,000 iterations, whereas the fixed-activation model fails to capture the highest-frequency components even after comparable training. For a discontinuous function, the benefits are even more pronounced, corroborating the hypothesis that the additional trainable degree of freedom enables more rapid representation of non-smooth features.
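The frequency-band analysis above amounts to sampling the benchmark function (or the network output at a given training stage) and inspecting its discrete Fourier spectrum. A minimal sketch, using a naive O(N²) DFT; the sampling grid and resolution are our choices, not the paper's.

```python
import math

def target(x):
    """The oscillatory regression benchmark u(x) = (x^3 - x) sin(7x) + sin(12x)."""
    return (x ** 3 - x) * math.sin(7 * x) + math.sin(12 * x)

def dft_magnitudes(samples):
    """Naive DFT magnitude spectrum. Comparing such spectra of the network
    output across training iterations shows which frequency bands have been
    captured so far (sketch only)."""
    N = len(samples)
    mags = []
    for k in range(N // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / N) for i, s in enumerate(samples))
        im = sum(-s * math.sin(2 * math.pi * k * i / N) for i, s in enumerate(samples))
        mags.append(math.hypot(re, im))
    return mags

# Sample u on [-3, 3] and compute its magnitude spectrum.
xs = [-3 + 6 * i / 256 for i in range(256)]
spectrum = dft_magnitudes([target(x) for x in xs])
```

Applied to the network output at successive checkpoints, the spectrum reveals whether the high-frequency bands of the target have been learned yet; the paper's comparison shows these bands appearing much earlier under adaptive activation.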
PDE Solution: Forward and Inverse Problems
Burgers Equation
Networks with adaptive activation are compared for solving the 1D nonlinear Burgers equation with vanishing viscosity, a prototypical PDE that develops steep gradients and shocks. The relative L2 error after 2,000 iterations decreases from 1.91×10⁻¹ (fixed activation) to 9.52×10⁻² (adaptive, n = 10). Notably, convergence accelerates as the scale factor n increases, up to a point beyond which the loss oscillates due to optimizer sensitivity. Faster learning of frequency components under adaptive activation is repeatedly verified, both quantitatively and in the frequency domain.
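The errors quoted throughout these benchmarks are relative L2 errors, which can be computed as follows. A straightforward sketch of the metric, evaluated here on toy vectors rather than actual PDE solutions.

```python
import math

def relative_l2_error(u_pred, u_true):
    """Relative L2 error ||u_pred - u_true||_2 / ||u_true||_2, the accuracy
    metric reported for the Burgers, Klein-Gordon, and Helmholtz benchmarks."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(u_pred, u_true)))
    den = math.sqrt(sum(t * t for t in u_true))
    return num / den

# Toy example: a prediction off by 0.1 in one component.
err = relative_l2_error([1.1, 2.0], [1.0, 2.0])
```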
Klein-Gordon and Helmholtz Equations
Similar performance improvements are documented for the nonlinear Klein-Gordon equation and the 2D Helmholtz problem. For the Klein-Gordon case (with quadratic nonlinearity), adaptive activation reduces the L2 error by over 50% (from 1.95×10⁻¹ to 9.06×10⁻²), alongside a several-fold speedup in convergence. On the Helmholtz equation, adaptive activation reduces the error from 1.06×10⁻¹ to 7.19×10⁻² at 3,600 iterations.
Inverse Problem: Sine-Gordon PDE Identification
For parametric discovery and solution identification in the 2D sine-Gordon equation under data noise, adaptive activation yields striking improvements. With 500 collocation data points, the maximum error in the identified parameters drops from nearly 9% (fixed) to 1.24% (adaptive), with significantly faster loss decay. The approach remains robust under up to 2% added noise, and the identified operator is consistently closer to the ground truth. The authors provide both operator-identification and solution-error metrics that demonstrate adaptive activation's advantage in ill-conditioned and data-limited inverse settings.
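The parameter-identification figures above correspond to a maximum relative error across the recovered PDE coefficients, which can be sketched as below. The example values are illustrative placeholders, not the paper's identified parameters.

```python
def max_parameter_error_percent(identified, true_params):
    """Maximum relative error (%) over identified PDE parameters — the kind
    of metric behind the reported ~9% (fixed) vs 1.24% (adaptive) comparison.
    Example inputs are illustrative, not taken from the paper."""
    return max(abs(i - t) / abs(t) * 100.0 for i, t in zip(identified, true_params))

# Two hypothetical coefficients recovered with 2% and 5% error respectively.
err_pct = max_parameter_error_percent([0.98, 1.05], [1.0, 1.0])
```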
Analysis and Implications
This study provides clear evidence that adaptive activation functions confer marked improvements in network trainability and final accuracy when approximating solutions to PDEs—particularly those with sharp gradients, discontinuities, or complex spectral content.
A key finding is that models with adaptive activation consistently learn higher-frequency components earlier in the training process (in line with but significantly enhancing observations from the Frequency Principle). This effect translates to both faster convergence in terms of epochs and lower final L2 errors, observable across different activation bases (tanh, sin), problem types, and even in the presence of noise.
The introduction of an extra trainable parameter α does not appreciably increase model or optimization complexity, but dramatically reshapes the geometry of the loss landscape. However, care is required: excessive scaling (large n) can lead to optimizer instability, underlining the importance of properly balancing scaling sensitivity and regularization.
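One way to see why large n risks instability: the weight gradient of a scaled activation grows with the product n·α. The sketch below checks this numerically via a central finite difference; it is our illustration of the scaling effect, not an analysis from the paper.

```python
import math

def grad_wrt_weight(w, x, alpha, n, eps=1e-6):
    """Central-difference d/dw of tanh(n * alpha * w * x). At w = 0 this equals
    n * alpha * x exactly, so the effective weight-gradient magnitude grows
    linearly with n * alpha — sketching why overly large n can destabilize
    the optimizer (illustrative, not from the paper)."""
    f = lambda wv: math.tanh(n * alpha * wv * x)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

# Same alpha, two values of n: the gradient scales by the same factor of 10.
g_small = grad_wrt_weight(0.0, 0.5, alpha=0.1, n=1.0)   # effective scale n*alpha = 0.1
g_large = grad_wrt_weight(0.0, 0.5, alpha=0.1, n=10.0)  # effective scale n*alpha = 1.0
```

This linear amplification of gradient magnitudes is consistent with the observed behavior: moderate n accelerates convergence, while excessive n makes the optimization oscillate.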
These results have several implications:
- PINN applications to high-gradient, highly oscillatory, or discontinuous PDEs can be rendered dramatically more efficient by employing adaptive activation.
- The adaptive mechanism is general and compatible with current architectures and optimizers; it does not presuppose particular problem structure.
- For inverse problems, improved loss geometry via adaptive activation results in enhanced noise robustness and parameter identifiability, widening the practical applicability of neural operator discovery frameworks.
- The method offers a simple, broadly applicable alternative to manual activation or architecture engineering for accelerating convergence in deep learning applied to computational physics.
Future Directions
Potential avenues for further investigation include:
- Formal analysis of the impact of adaptive activation on the geometry (conditioning, sharpness) of the loss landscape and the associated optimization trajectories.
- Exploration of multi-parameter and neuron-wise adaptivity within activation functions.
- Integration with more advanced physics-informed frameworks, including transfer learning for PINNs, or in the context of operator learning architectures (e.g., DeepONet or Fourier Neural Operators).
- Extension to large-scale, high-dimensional problems with more complex boundary conditions and data types (e.g., multi-fidelity, stochastic PDEs).
- Study of adaptive activation’s interaction with various regularization and normalization schemes, especially regarding generalization capacity and robustness.
Conclusion
The introduction of adaptive activation functions with optimized scale parameters yields significant improvement in convergence rates and approximation fidelity for both DNNs and PINNs applied to regression, forward PDE, and inverse PDE problems. The results establish adaptive activation as a powerful, architecture-agnostic tool for enhancing the efficiency and robustness of neural network-based scientific computing across a wide spectrum of physical and engineering problems (1906.01170).