
Smooth activations and reproducibility in deep networks

Published 20 Oct 2020 in cs.LG, cs.NE, and stat.ML | (2010.09931v2)

Abstract: Deep networks are gradually penetrating almost every domain in our lives due to their amazing success. However, with substantive performance accuracy improvements comes the price of \emph{irreproducibility}. Two identical models, trained on the exact same training dataset may exhibit large differences in predictions on individual examples even when average accuracy is similar, especially when trained on highly distributed parallel systems. The popular Rectified Linear Unit (ReLU) activation has been key to recent success of deep networks. We demonstrate, however, that ReLU is also a catalyzer to irreproducibility in deep networks. We show that not only can activations smoother than ReLU provide better accuracy, but they can also provide better accuracy-reproducibility tradeoffs. We propose a new family of activations; Smooth ReLU (\emph{SmeLU}), designed to give such better tradeoffs, while also keeping the mathematical expression simple, and thus implementation cheap. SmeLU is monotonic, mimics ReLU, while providing continuous gradients, yielding better reproducibility. We generalize SmeLU to give even more flexibility and then demonstrate that SmeLU and its generalized form are special cases of a more general methodology of REctified Smooth Continuous Unit (RESCU) activations. Empirical results demonstrate the superior accuracy-reproducibility tradeoffs with smooth activations, SmeLU in particular.


Summary

  • The paper proposes SmeLU, a novel activation function with a quadratic transitional segment for smoother gradients and enhanced reproducibility.
  • The paper demonstrates through empirical tests that SmeLU significantly lowers prediction difference metrics compared to ReLU across diverse datasets.
  • The improved reproducibility with minimal computational overhead has practical implications for deploying reliable deep learning models in sensitive applications.

Smooth Activations and Reproducibility in Deep Networks

Introduction

The paper "Smooth Activations and Reproducibility in Deep Networks" presents a detailed study of how conventional activation functions, specifically the Rectified Linear Unit (ReLU), contribute to the irreproducibility problem in deep learning models. The authors propose a new family of activation functions, Smooth ReLU (SmeLU), that not only improve predictive accuracy but also enhance reproducibility by ensuring smoother gradient transitions. SmeLU and its generalization aim to offer a better trade-off between accuracy and reproducibility when deploying models in distributed systems.

Problem Statement

Deep networks have shown significant success across various domains, often outperforming classical methods. However, this performance gain comes with a cost: model irreproducibility. When models are retrained with the same data and configuration, the results can vary significantly at an individual prediction level. This issue is critical in sensitive applications like medical diagnoses or online learning environments, where consistent results are crucial.

ReLU activations, despite their efficacy in achieving convergence, exacerbate irreproducibility: the ReLU gradient is discontinuous at the origin, which can steer optimization toward different local minima depending on data ordering and initialization.

Proposed Solution: SmeLU Activation Function

The authors introduce the SmeLU activation function, which is designed to mimic the simplicity of ReLU while ensuring a smooth and continuous gradient. SmeLU is characterized by a quadratic transitional region, making it inherently smoother and more reproducible. It consists of two linear segments joined by a quadratic segment that ensures continuity in both the function and its gradient.
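The basic form can be sketched in a few lines of NumPy: zero to the left of −β, identity to the right of β, and a quadratic in between that matches both value and slope at the joints (β is the half-width of the transition region; a minimal sketch, and the paper's notation may differ in detail):

```python
import numpy as np

def smelu(x, beta=1.0):
    """Smooth ReLU (SmeLU) sketch.

    Zero for x <= -beta, identity for x >= beta, and the quadratic
    (x + beta)^2 / (4 * beta) in between. At x = +/-beta both the
    function value and the gradient are continuous, unlike ReLU,
    whose gradient jumps from 0 to 1 at the origin.
    """
    x = np.asarray(x, dtype=float)
    return np.where(
        x <= -beta, 0.0,
        np.where(x >= beta, x, (x + beta) ** 2 / (4.0 * beta)),
    )
```

At x = β the quadratic evaluates to (2β)²/(4β) = β, matching the identity branch, and its derivative (x + β)/(2β) runs from 0 to 1 across the transition, which is exactly the continuity ReLU lacks.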

The general SmeLU formulation allows flexibility in defining the shape of the activation function through the parameters {α, β, g−, g+, t}, where α and β define the transition regions, and g− and g+ define the gradients on either side of this region. This generalization allows for the creation of even more sophisticated activations under the RESCU framework.
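One way to realize such a generalization is to blend two linear pieces of slopes g− and g+, meeting at a knee t, with a quadratic over a half-width β. This is a sketch under those assumptions, not the paper's exact parameterization (in particular, α is omitted here); note that it reduces to basic SmeLU for g− = 0, g+ = 1, t = 0:

```python
import numpy as np

def gen_smelu(x, beta=1.0, g_minus=0.0, g_plus=1.0, t=0.0):
    """Generalized SmeLU sketch: slope g_minus left of the knee t,
    slope g_plus right of it, quadratically blended over [t-beta, t+beta]
    so that both value and gradient stay continuous at the joints.
    With g_minus=0, g_plus=1, t=0 this reduces to basic SmeLU.
    """
    x = np.asarray(x, dtype=float)
    left = g_minus * (x - t)
    right = g_plus * (x - t)
    quad = left + (g_plus - g_minus) * (x - t + beta) ** 2 / (4.0 * beta)
    return np.where(x <= t - beta, left,
                    np.where(x >= t + beta, right, quad))
```

At x = t − β the quadratic term vanishes and the slope is g−; at x = t + β the value and slope both equal those of the right linear piece, so the blend is C¹ everywhere.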

Experimental Evaluation

Prediction Difference (PD) Metrics: The authors measure model irreproducibility using PD metrics across multiple models. The PD is calculated based on differences in predictions between models trained with identical setups but with varied training order or initialization.
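A PD metric of this kind can be computed directly from the two models' scores on a shared evaluation set. The sketch below uses one common relative form, mean absolute difference normalized by the mean prediction of the two models; the paper's exact normalization may differ:

```python
import numpy as np

def prediction_difference(p1, p2):
    """Relative prediction difference (PD) between two models scored
    on the same examples: mean absolute per-example difference,
    normalized by the average of the two models' mean predictions.
    Returns 0 for identical predictions; larger values indicate
    less reproducible models.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return 2.0 * np.mean(np.abs(p1 - p2)) / (np.mean(p1) + np.mean(p2))
```

Note that two models can have identical average accuracy yet a large PD, which is precisely the failure mode the paper targets: the metric is per-example, so disagreements do not cancel out the way they do in aggregate accuracy.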

Empirical Results: Experiments on the Criteo Display Advertising Challenge dataset and a proprietary large-scale advertisement dataset show that SmeLU consistently yields lower PD values compared to ReLU, demonstrating enhanced reproducibility. The results indicate that SmeLU reduces prediction variance without compromising accuracy.

Comparative Analysis: Compared to other smooth activation functions such as Swish, GELU, Mish, and TanhExp, SmeLU offers a better trade-off between computational efficiency and model performance. While these functions ensure continuity and differentiability, their computational complexity can limit hardware deployment, unlike SmeLU's straightforward implementation.
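The computational contrast is visible from the standard definitions of these baselines, each of which requires at least one transcendental function per element, whereas SmeLU's transition is a single quadratic (definitions below are the commonly published forms, not taken from this paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); one exp per element.
    return x * sigmoid(beta * x)

def gelu(x):
    # GELU, tanh approximation of x * Phi(x); tanh plus a cubic.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def mish(x):
    # Mish: x * tanh(softplus(x)); both a log1p/exp and a tanh.
    return x * np.tanh(np.log1p(np.exp(x)))

def tanhexp(x):
    # TanhExp: x * tanh(exp(x)); exp followed by tanh.
    return x * np.tanh(np.exp(x))
```

All four are smooth everywhere, but the per-element exp/tanh evaluations are what the paper argues can be costly on some hardware, motivating SmeLU's piecewise-polynomial alternative.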

Implications and Conclusion

The introduction of SmeLU highlights the importance of smooth activations in addressing the reproducibility issues prevalent with ReLU. This advancement is particularly significant for applications requiring consistency and reliability in predictions. Additionally, the minimal computational overhead makes SmeLU suitable for deployment across a range of hardware environments.

The findings encourage future exploration into further generalizing activation functions within the RESCU framework, potentially leading to new activations that could cater to specific domain requirements. The study not only contributes to better model reliability but also opens new avenues for reducing the computational burden without sacrificing performance.

Overall, SmeLU and its theoretical underpinnings make a compelling case for rethinking the choice of activation functions in modern deep learning architectures to balance the dual objectives of performance and predictability.
