Neural Network Compression Framework for fast model inference

Published 20 Feb 2020 in cs.CV and eess.IV | arXiv:2002.08679v4

Abstract: In this work we present a new framework for neural network compression with fine-tuning, which we call the Neural Network Compression Framework (NNCF). It leverages recent advances in various network compression methods and implements some of them, such as sparsity, quantization, and binarization. These methods produce more hardware-friendly models that can be run efficiently on general-purpose hardware (CPU, GPU) or specialized deep learning accelerators. We show that the developed methods can be successfully applied to a wide range of models to accelerate inference while keeping the original accuracy. The framework can be used through the training samples supplied with it, or as a standalone package that can be seamlessly integrated into existing training code with minimal adaptations. A PyTorch version of NNCF is currently available as part of OpenVINO Training Extensions at https://github.com/openvinotoolkit/nncf.

Citations (33)

Summary

  • The paper introduces a compression framework that applies quantization, binarization, sparsity, and filter pruning to optimize DNN inference.
  • It reports up to 3.11x faster INT8 inference with negligible accuracy loss, and mixed-precision quantization that stays within 1% of full-precision accuracy.
  • The framework integrates with existing PyTorch training code and exports to ONNX, making it well suited for deployment on mobile and embedded systems.

Neural Network Compression Framework for Fast Model Inference

This paper introduces the Neural Network Compression Framework (NNCF), a PyTorch-based tool for improving neural network efficiency through a range of compression techniques. Given the escalating computational requirements of deep neural networks (DNNs), the framework aims to enable faster model inference, particularly on resource-constrained hardware, by implementing quantization, sparsity, filter pruning, and binarization.

Framework Features

The authors highlight several key features of NNCF:

  • Quantization: Both symmetric and asymmetric quantization schemes are supported, along with optional mixed-precision strategies. The framework automatically inserts fake-quantization operations into the model graph, which helps preserve accuracy during fine-tuning while preparing the model for low-precision execution (a minimal sketch of the underlying operation follows this list).
  • Binarization: Binarization of weights and activations is supported via techniques such as XNOR and DoReFa, yielding a significant reduction in model complexity, albeit with some accuracy trade-offs.
  • Sparsity and Pruning: Both magnitude-based and regularization-based sparsification are implemented and can reduce network complexity while preserving accuracy (the magnitude criterion is also sketched below). Filter pruning is integrated as well, removing less salient filters to streamline model execution.
  • Model Transformation and Stacking: NNCF automatically transforms the model by inserting compression layers and supports stacking multiple compression methods for compounded benefits.
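To make the two core mechanisms concrete, the sketch below shows a symmetric per-tensor fake-quantization (quantize-dequantize) step and a magnitude-based sparsity mask in plain PyTorch. This is an illustrative simplification, not NNCF's internal implementation: NNCF initializes and learns quantization ranges rather than taking the tensor maximum, and it propagates gradients through the rounding step with a straight-through estimator.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization: snap values to a signed
    integer grid, then map them back to float so the surrounding graph
    keeps running in floating point during fine-tuning."""
    levels = 2 ** (num_bits - 1) - 1        # 127 for INT8
    scale = x.abs().max().clamp(min=1e-8)   # crude max-based range; NNCF learns/initializes ranges instead
    q = torch.round(torch.clamp(x / scale, -1.0, 1.0) * levels)
    return q / levels * scale               # dequantize; round() would need a straight-through estimator to train

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask that zeroes out the smallest-magnitude fraction of weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

w = torch.randn(64, 64)
w_q = fake_quantize(w)                  # quantize-dequantize round trip
mask = magnitude_mask(w, sparsity=0.5)  # roughly half the entries are zeroed
print((w - w_q).abs().max().item(), mask.mean().item())
```

In NNCF itself these operations are inserted automatically into the model graph during the transformation step, so user code never calls them directly.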

Numerical Results and Claims

The paper reports strong numerical results across diverse model types and use cases, including image classification, object detection, and natural language processing:

  • INT8 quantization achieved up to 3.11x speedups with negligible accuracy drops on well-known models such as ResNet-50 and MobileNet variants.
  • Mixed-precision quantization preserved accuracy within 1% of the full-precision baseline, indicating a viable path for applications demanding aggressive compression.
  • Combining sparsity with quantization consistently produced models with competitive accuracy and improved runtime efficiency.

Practical Implications

The practical implications of this work are most significant in domains where reduced model latency and size translate directly into better system performance, such as mobile and embedded devices. By integrating with existing PyTorch codebases and supporting export to ONNX for subsequent inference via OpenVINO, NNCF offers an end-to-end path for deploying compressed models in real-world applications; a condensed integration sketch follows.
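The sketch below follows the usage pattern documented in the NNCF repository: wrap an ordinary PyTorch model with a compression controller built from a JSON-style config that stacks magnitude sparsity with INT8 quantization, fine-tune with the compression loss added, then export to ONNX. Entry-point locations and config keys have shifted across NNCF releases, and quantization-range initialization from a calibration loader is omitted for brevity, so treat the exact names as indicative rather than definitive.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

from nncf import NNCFConfig
from nncf.torch import create_compressed_model  # older releases expose this from `nncf` directly

# Stacked compression: magnitude sparsity together with INT8 quantization.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": [
        {"algorithm": "magnitude_sparsity"},
        {"algorithm": "quantization"},
    ],
})

model = resnet18()
compression_ctrl, model = create_compressed_model(model, nncf_config)

# Stand-in fine-tuning loop on random data; the compression loss term
# and the scheduler step are the only NNCF-specific additions.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):
    images = torch.randn(1, 3, 224, 224)
    targets = torch.randint(0, 1000, (1,))
    loss = F.cross_entropy(model(images), targets) + compression_ctrl.loss()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    compression_ctrl.scheduler.step()

# Export for inference with the OpenVINO toolchain.
compression_ctrl.export_model("compressed_model.onnx")
```

Because compression is expressed as automatic graph transformations plus an extra loss term, the same training script can switch between algorithms, or stack them, by editing only the config.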

Theoretical Implications

On a theoretical level, combining several compression techniques and being able to stack them within a single framework may inspire new research into the interplay and optimal configuration of different compression strategies. In addition, aligning compression methods with hardware-specific capabilities (e.g., fixed-point arithmetic) raises important considerations for architecture design and optimization.

Future Directions

Future extensions of NNCF might include refined algorithms for ultra-low-precision quantization, broader model compatibility, or real-time learning schemes that adapt to changing hardware conditions. As AI models become more pervasive, automated or AI-driven selection of compression strategies could further improve usability and deployment effectiveness.

In conclusion, NNCF presents a robust toolset for neural network compression that balances performance gains with accuracy preservation, making it a valuable resource for researchers and practitioners aiming to optimize DNN inference.
