- The paper introduces a novel method of injecting noise into activation functions to alleviate the vanishing gradient problem.
- It demonstrates improved performance across tasks like machine translation, language modeling, and image captioning, with gains such as over 2 BLEU points.
- The method enables smoother optimization and faster convergence by replacing traditional saturating functions with their noisy counterparts.
An Analysis of Noisy Activation Functions in Neural Networks
The paper "Noisy Activation Functions" investigates injecting noise into the activation functions of neural networks. The central aim is to improve training efficiency and final performance by addressing the problematic saturation behavior of conventional nonlinear activations.
Motivation and Proposed Method
Saturating activation functions such as sigmoid and tanh suffer from vanishing gradients, which hamper training with gradient-based methods such as stochastic gradient descent (SGD). The paper proposes injecting noise into these activations precisely in the regimes where their gradients vanish. The injected noise allows a gradient signal to propagate through otherwise saturated units, helping the optimizer explore the loss landscape.
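The core idea can be illustrated with a minimal sketch, assuming a hard-tanh base nonlinearity and Gaussian noise scaled by the degree of saturation. This is an illustrative simplification, not the paper's exact formulation (which involves additional terms such as learned noise scales):

```python
import numpy as np

def hard_tanh(x):
    """Linearized (hard) tanh: identity on [-1, 1], saturating outside."""
    return np.clip(x, -1.0, 1.0)

def noisy_hard_tanh(x, noise_std=0.5, rng=None):
    """Illustrative noisy saturating activation.

    Gaussian noise is added in proportion to how deep the
    pre-activation lies in the saturated regime, so a stochastic
    signal still flows where the deterministic gradient is zero.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = hard_tanh(x)
    # |x - h(x)| measures saturation: zero in the linear region,
    # growing as x moves further past the clipping thresholds.
    saturation = np.abs(x - h)
    noise = rng.standard_normal(x.shape) * noise_std * saturation
    return h + noise
```

In the linear region the unit behaves exactly like its deterministic counterpart; noise appears only where the base function has saturated.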
The technique is framed as a form of annealing, akin to simulated annealing: the noise magnitude is gradually reduced over training, so large noise early on encourages broad exploration, and the shrinking noise later permits fine-tuning. The methodology connects implicitly to continuation methods, softening the optimization problem during training while preserving deterministic decision-making at inference.
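The annealing behavior described above can be sketched as a schedule that decays the noise scale over training. The linear decay below is a hypothetical choice for illustration; the paper's actual schedule may differ:

```python
def annealed_noise_std(step, total_steps, initial_std=1.0):
    """Illustrative linear annealing schedule for the noise scale.

    Large noise early in training encourages broad exploration of
    the loss landscape; the scale tapers toward zero, leaving a
    nearly deterministic activation for fine-tuning.
    """
    frac = min(step / total_steps, 1.0)
    return initial_std * (1.0 - frac)
```

At inference time the noise can simply be switched off, recovering a fully deterministic network.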
Experimental Evidence and Results
Experimental validation of this approach was conducted across several domains including machine translation, language modeling, and image caption generation. Notably, the paper reports competitive, and at times superior, results against the state-of-the-art. For instance, in the context of neural machine translation on the Europarl dataset, the introduction of noisy activations yielded a notable improvement of over 2 BLEU points—a significant gain for this benchmark.
Furthermore, the experiments confirm that substituting conventional saturating functions with their noisy counterparts can substantially improve model convergence rates and final accuracy in challenging tasks, as demonstrated in the "Learning to Execute" task.
Theoretical and Practical Implications
The incorporation of noise into activation functions facilitates smoother optimization in tasks involving complex non-linear decision-making processes without explicitly altering the network's topology or the broader training regime. The technique advances the theoretical understanding of the balance between exploration and exploitation in SGD-driven learning processes in high-dimensional spaces.
From a practical perspective, the method offers a straightforward strategy for improving existing architectures: it merely requires replacing standard activation functions with noisy variants. This simplicity in implementation is an attractive aspect for practitioners looking to enhance model performance with minimal overhead.
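This drop-in quality can be shown with a small PyTorch sketch, assuming a custom module (the `NoisyHardTanh` name and its fixed noise scale are illustrative, not from the paper). Noise is applied only in training mode, matching the deterministic-inference behavior described earlier:

```python
import torch
import torch.nn as nn

class NoisyHardTanh(nn.Module):
    """Illustrative noisy activation as a drop-in nn.Module."""
    def __init__(self, noise_std=0.5):
        super().__init__()
        self.noise_std = noise_std

    def forward(self, x):
        h = torch.clamp(x, -1.0, 1.0)  # hard-tanh base function
        if self.training:
            # Noise scaled by degree of saturation, training only.
            sat = (x - h).abs()
            h = h + torch.randn_like(x) * self.noise_std * sat
        return h

# Swapping it into an existing architecture is a one-line change:
model = nn.Sequential(nn.Linear(8, 16), NoisyHardTanh(), nn.Linear(16, 1))
```

In `eval()` mode the module reduces to a plain hard-tanh, so deployed behavior is fully deterministic.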
Future Directions
This research opens several pathways for future exploration. Further work could extend the investigation to additional types of noise distributions and evaluate the potential of dynamically adaptive noise parameters in response to the learning process. Additionally, exploring the integration of noisy activations within other paradigms of neural networks, including convolutional and graph neural networks, may provide insights into broader applicability and benefits.
In sum, the paper makes a compelling case for noisy activation functions as a straightforward yet effective means of overcoming saturation-induced optimization challenges. In doing so, it contributes to the ongoing effort to improve the robustness and efficiency of neural network training, with implications for both theoretical and applied AI research.