- The paper introduces a novel quantization strategy that trains CNNs with low bitwidth weights, activations, and gradients.
- It employs bit convolution kernels and a straight-through estimator to enable efficient fixed-point computation during both forward and backward passes.
- Experimental results on SVHN and ImageNet demonstrate that low bitwidth configurations achieve competitive accuracy with reduced computational and memory requirements.
Analyzing DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
The paper "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients" by Shuchang Zhou et al. presents a method for training Convolutional Neural Networks (CNNs) with low bitwidth weights, activations, and gradients. The method targets the computational inefficiency and high storage requirements of contemporary deep learning models, which are significant bottlenecks when deploying them in resource-constrained environments such as embedded systems.
Methodology and Contributions
- Low Bitwidth Weights, Activations, and Gradients:
- The authors extend the concept of Binarized Neural Networks (BNN) by generalizing quantization to weights, activations, and, notably, gradients. Gradients are stochastically quantized to low bitwidth during backpropagation, so that both the forward and backward passes can be accelerated with low bitwidth operations.
- Bit Convolution Kernels:
- Leveraging bit convolution kernels, DoReFa-Net can perform convolutions using fixed-point integers, significantly reducing computational complexity. The paper presents mathematical formulations to enable dot products and convolutions using low bitwidth fixed-point integers.
- Straight-Through Estimator (STE):
- The implementation utilizes STEs for the quantization of continuous functions, a key to circumventing the inherent challenge of zero gradients in discrete approximations.
- Experiments and Performance:
- Experimental validation on the SVHN and ImageNet datasets demonstrates that DoReFa-Net achieves prediction accuracy close to that of 32-bit models while reducing computational and storage overhead. Notably, a DoReFa-Net version of AlexNet with 1-bit weights and 2-bit activations, trained from scratch with 6-bit gradients, achieves 46.1% top-1 accuracy on the ImageNet validation set.
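To make the quantization step above concrete, here is a minimal NumPy sketch of a k-bit quantizer and the tanh-based weight transform along the lines the paper describes. Function names and the exact normalization are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def quantize_k(x, k):
    """Quantize x in [0, 1] to k bits: k=1 gives {0, 1}, k=2 gives 4 levels, etc."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def quantize_weights(w, k):
    """k-bit weight quantization in the DoReFa style:
    map weights into [0, 1] via tanh and an affine rescale,
    quantize, then map back to [-1, 1]."""
    t = np.tanh(w)
    x = t / (2 * np.max(np.abs(t))) + 0.5   # affine map into [0, 1]
    return 2 * quantize_k(x, k) - 1

def quantize_activations(a, k):
    """k-bit activation quantization; assumes activations are clipped to [0, 1]."""
    return quantize_k(np.clip(a, 0.0, 1.0), k)
```

Note that for k = 1 this degenerates toward binarization, while larger k yields a uniform grid of 2^k levels.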
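The bit convolution kernel idea — computing a dot product of low bitwidth vectors using only AND and popcount over bit planes — can be sketched as follows. This is a pure-Python illustration over short lists of unsigned integers; the helper names are assumptions for exposition.

```python
def bit_planes(values, bits):
    """Pack the m-th bit of each element into one Python int per plane
    (bit i of plane m is the m-th bit of values[i])."""
    planes = []
    for m in range(bits):
        plane = 0
        for i, v in enumerate(values):
            plane |= ((v >> m) & 1) << i
        planes.append(plane)
    return planes

def bit_dot(x, y, bits_x, bits_y):
    """Dot product of two low-bitwidth unsigned integer vectors using only
    bitwise AND and popcount, in the spirit of a bit convolution kernel."""
    px, py = bit_planes(x, bits_x), bit_planes(y, bits_y)
    total = 0
    for m, cm in enumerate(px):
        for k, ck in enumerate(py):
            # Weight each plane pair by 2^(m+k) and count coincident set bits.
            total += (1 << (m + k)) * bin(cm & ck).count("1")
    return total
```

For example, `bit_dot([1, 2, 3], [3, 2, 1], 2, 2)` returns 10, matching the ordinary dot product 1·3 + 2·2 + 3·1; the cost scales with the product of the two bitwidths, which is why low bitwidth configurations are cheap.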
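The straight-through estimator pairs a non-differentiable quantizer in the forward pass with an identity (pass-through) gradient in the backward pass, sidestepping the zero derivative of rounding. A hypothetical minimal sketch:

```python
import numpy as np

def ste_quantize_forward(x, k):
    """Forward pass: k-bit quantization of x in [0, 1] (non-differentiable)."""
    n = 2 ** k - 1
    return np.round(x * n) / n

def ste_quantize_backward(grad_output):
    """Backward pass: the straight-through estimator passes the incoming
    gradient through unchanged, ignoring the zero derivative of round()."""
    return grad_output
```

In an autograd framework this pair would be registered as a custom op; the sketch only shows the forward/backward contract that makes training through the quantizer possible.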
Experimental Insights
The exploration of the configuration space of bitwidths for weights, activations, and gradients reveals several key observations:
- Sensitivity: Gradients are more sensitive to bitwidth reduction than weights and activations, so they generally require higher bitwidths; this necessitates careful balancing of accuracy against computational savings.
- Model Complexity: More complex models with higher channel counts are less sensitive to the bitwidth reduction of weights and activations.
- Initialization: Models initialized with pre-trained 32-bit weights outperform those trained from scratch under low bitwidth configurations, highlighting an area for potential performance enhancement.
Implications and Future Work
The implementation of low bitwidth CNNs has profound implications:
- Practical Deployment:
- Models like DoReFa-Net are highly relevant for applications in edge computing and Internet of Things (IoT) devices, where computational resources and memory are limited.
- The potential reduction in energy consumption makes such models attractive for deploying on battery-powered devices.
- Theoretical Contributions:
- The quantization methodologies and STE formulations contribute significantly to the understanding of training neural networks under constrained precision.
For future research, optimizing FPGA implementations for DoReFa-Net is suggested as a promising direction, given the resource efficiency of low bitwidth computation on such hardware.
Summary
In summary, this paper presents a robust methodology to train low bitwidth CNNs with low bitwidth gradients, addressing both the forward and backward passes. The DoReFa-Net models maintain competitive accuracy on large datasets while significantly reducing computational complexity and memory footprint. This work opens new avenues for deploying efficient neural networks in real-world scenarios where resource constraints are a critical consideration.