Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

Published 4 Apr 2024 in cs.LG and cs.CL | arXiv:2404.03605v2

Abstract: We consider the problem of accurate quantization for LLMs, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that LLMs contain outlier channels whose values are, on average, orders of magnitude higher than those of other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy that regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing the model from "migrating" the difficulty of input quantization to the weights, which would make post-training quantization (PTQ) of the weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.
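As a concrete illustration of the two mechanisms the abstract names, the following minimal PyTorch sketch shows (a) uniform symmetric 4-bit fake quantization with a straight-through estimator, the core operation in quantization-aware training, and (b) a per-channel kurtosis penalty that pushes output activations toward a Gaussian shape. This is not the authors' code: the function names, the naive per-tensor scale, and the squared-deviation-from-3 penalty form are illustrative assumptions, and the paper's exact formulation may differ.

```python
import torch

def fake_quantize_4bit(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Uniform symmetric INT4 grid: integers in [-8, 7].
    q = torch.clamp(torch.round(x / scale), -8, 7) * scale
    # Straight-through estimator: the forward pass uses the quantized value,
    # while the backward pass treats rounding as identity so gradients flow to x.
    return x + (q - x).detach()

def kurtosis_penalty(a: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Empirical kurtosis per channel (dim 0 = tokens, dim 1 = channels).
    mu = a.mean(dim=0, keepdim=True)
    var = a.var(dim=0, unbiased=False)
    kurt = ((a - mu) ** 4).mean(dim=0) / (var ** 2 + eps)
    # Assumed penalty form: squared deviation from the Gaussian kurtosis of 3.
    return ((kurt - 3.0) ** 2).mean()

# Usage sketch: fake-quantize a layer's input during QAT and regularize
# its output, discouraging outlier channels on both sides of the layer.
x = torch.randn(32, 512, requires_grad=True)   # (tokens, channels)
w = torch.randn(512, 512, requires_grad=True)
scale = x.detach().abs().amax() / 7            # naive per-tensor scale
y = fake_quantize_4bit(x, scale) @ w
loss = y.pow(2).mean() + 0.1 * kurtosis_penalty(y)   # stand-in task loss
loss.backward()
```

Kurtosis measures tail heaviness, and a channel dominated by outliers has kurtosis far above the Gaussian value of 3; penalizing the deviation therefore discourages exactly the outlier channels that make low-bitwidth activation quantization hard.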
