- The paper identifies a 'variance shift' that disrupts neural network performance when combining Dropout with Batch Normalization.
- It employs robust theoretical analysis and experiments on models like DenseNet, ResNet, and Wide ResNet to confirm the phenomenon.
- The study proposes mitigation techniques, including repositioning Dropout layers and a variance-stable Dropout, to enhance model accuracy.
Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift
The paper "Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift" presents a rigorous investigation into the performance degradation observed when Dropout and Batch Normalization (BN) are used together in neural networks. The authors approach the issue through a theoretical framework supported by empirical analysis, focusing on a phenomenon they term "variance shift."
Core Findings
A central thesis of the paper is the identification of "variance shift," which occurs when Dropout is applied before BN in a network. During training, inverted Dropout rescales retained activations by the reciprocal of the retain ratio, inflating the variance of neural responses by roughly that factor; at test time Dropout becomes an identity map and this inflation disappears. BN, however, normalizes test-time activations using the mean and variance statistics accumulated during training, under the implicit assumption that the variance distribution is stable between the two phases. The resulting train-test mismatch in normalization statistics is what the authors call "variance shift," and it degrades predictions at inference.
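The mechanism can be seen in a minimal NumPy simulation of the simplest (zero-mean, single-feature) case the paper analyzes, where the test-to-train variance ratio works out to the retain ratio p. The variable names and the choice p = 0.5 here are illustrative, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # Dropout retain ratio
x = rng.normal(0.0, 1.0, size=1_000_000)   # zero-mean, unit-variance responses

# Training: inverted Dropout keeps each unit with probability p
# and rescales the survivors by 1/p.
mask = rng.random(x.shape) < p
train_out = x * mask / p

# Inference: Dropout is an identity map.
test_out = x

var_train = train_out.var()  # ≈ Var(x)/p = 2.0 for zero-mean inputs
var_test = test_out.var()    # ≈ Var(x)  = 1.0

# A BN layer placed after Dropout accumulates var_train into its moving
# statistics during training, then at test time normalizes activations whose
# true variance is var_test. Their ratio is the variance shift (≈ p here).
print(var_test / var_train)
```

Running this prints a value close to 0.5, i.e. the downstream BN layer divides test activations by a variance estimate roughly twice the true one.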
Experimental Verification
The authors substantiate their theoretical propositions through methodical experiments on representative architectures such as DenseNet, ResNet, ResNeXt, and Wide ResNet. Empirical results verify the variance shift and its detrimental impact on performance, demonstrated by elevated error rates across various settings. Notably, networks with greater channel dimensions (e.g., Wide ResNet) suffer less from variance shift, likely because averaging over many channels attenuates the statistical mismatch as it propagates.
Proposed Mitigations
In response to the identified problem, the authors propose two mitigation strategies. First, they suggest repositioning Dropout so that it is applied only after the last BN layer, rather than before any of them, thereby eliminating the direct variance interaction between the two components. Second, they propose a variance-stable form of Dropout that retains the regularization benefit while keeping the train-test variance ratio close to one. Both strategies yield measurable improvements in model accuracy, corroborating the theoretical analysis with experimental evidence.
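The second mitigation can be sketched as multiplicative uniform noise, x * (1 + r) with r ~ U(-beta, beta), in the spirit of the variance-stable Dropout the paper describes (the function name `uout` and the value beta = 0.1 below are illustrative assumptions, not the paper's exact configuration). For zero-mean inputs the training variance is inflated only by a factor of 1 + beta^2/3, so the shift stays near one.

```python
import numpy as np

def uout(x, beta=0.1, training=True, rng=None):
    """Variance-stable Dropout sketch: multiplicative uniform noise.

    Training: returns x * (1 + r) with r ~ U(-beta, beta).
    Inference: identity map, like standard Dropout.
    """
    if not training:
        return x
    rng = rng or np.random.default_rng()
    r = rng.uniform(-beta, beta, size=x.shape)
    return x * (1.0 + r)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1_000_000)

var_train = uout(x, beta=0.1, rng=rng).var()  # ≈ Var(x) * (1 + beta^2/3)
var_test = uout(x, training=False).var()      # ≈ Var(x)

print(var_test / var_train)  # stays close to 1, unlike standard Dropout
```

Compared with the retain-ratio-sized shift of standard Dropout, the ratio here deviates from one only by a term of order beta squared, which is why the noise can sit before BN without destabilizing its accumulated statistics.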
Implications and Future Directions
The findings outlined in this paper have significant implications for neural network design, particularly in the integration of regularization techniques in conjunction with standardized normalization processes. The research highlights the necessity for a nuanced understanding and application of common machine learning strategies to avert unintended performance penalties. Future research could further explore adaptive strategies for dynamic variance adjustment tailored to specific architectures or consider a hybrid approach that leverages the unique advantages of both Dropout and BN in unified models.
Conclusion
The paper provides a meticulous analysis of the interaction between Dropout and BN, offering practical resolutions to the challenges of their combined use. This work contributes substantially to the robustness and efficacy of deep neural networks, a cornerstone technology in modern AI-driven applications. The proposed methodologies not only offer practical solutions but also pave the way for further inquiries into optimizing model training and generalization.