Pruning vs Quantization: Which is Better?
Abstract: Neural network pruning and quantization techniques are almost as old as neural networks themselves. However, to date, only ad-hoc comparisons between the two have been published. In this paper, we set out to answer the question of which is better: neural network quantization or pruning? By answering this question, we hope to inform design decisions made on neural network hardware going forward. We provide an extensive comparison between the two techniques for compressing deep neural networks. First, we give an analytical comparison of the expected quantization and pruning error for general data distributions. Then, we provide lower bounds on the per-layer pruning and quantization error in trained networks and compare these to the empirical error after optimization. Finally, we provide an extensive experimental comparison, training 8 large-scale models on 3 tasks. Our results show that in most cases quantization outperforms pruning; only in some scenarios with very high compression ratios might pruning be beneficial from an accuracy standpoint.
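To make the kind of comparison described above concrete, the following is a minimal sketch (not the paper's actual protocol) that contrasts the mean-squared error of symmetric uniform quantization against magnitude pruning on synthetic Gaussian weights. The pairing of bit-widths with sparsity levels (b-bit quantization vs. keeping b/16 of the weights) is an arbitrary assumption made only for illustration.

```python
import numpy as np

# Illustrative sketch: compare the MSE of uniform quantization vs. magnitude
# pruning on synthetic Gaussian "weights". The bit-width / sparsity pairing
# below is an assumption for illustration, not the paper's exact setup.

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=100_000)  # stand-in weight tensor

def quantization_mse(w, bits):
    """Asymmetric uniform quantization with round-to-nearest."""
    levels = 2 ** bits
    scale = (w.max() - w.min()) / (levels - 1)
    w_q = np.round((w - w.min()) / scale) * scale + w.min()
    return np.mean((w - w_q) ** 2)

def pruning_mse(w, sparsity):
    """Magnitude pruning: zero out the smallest-magnitude fraction of weights."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w))[k - 1] if k > 0 else -np.inf
    w_p = np.where(np.abs(w) > threshold, w, 0.0)
    return np.mean((w - w_p) ** 2)

# Roughly matched compression: b-bit quantization vs. keeping b/16 of the weights.
for bits in [2, 3, 4, 8]:
    sparsity = 1.0 - bits / 16.0
    print(f"{bits}-bit quant MSE = {quantization_mse(w, bits):.5f}  |  "
          f"{sparsity:.0%} pruning MSE = {pruning_mse(w, sparsity):.5f}")
```

On a roughly Gaussian tensor, a sketch like this tends to show quantization incurring lower MSE than pruning at moderate compression, which is consistent in spirit with the abstract's claim; the paper's analytical and per-layer results make this precise.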