"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Published 4 Nov 2024 in cs.LG and cs.AI | (2411.02355v3)

Abstract: Quantization is a powerful tool for accelerating LLM inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, ensuring the best balance between speed, efficiency, and accuracy.

Summary

  • The paper finds that FP8 quantization (W8A8-FP) is effectively lossless across all Llama-3.1 model scales.
  • It shows that well-tuned INT8 quantization (W8A8-INT) incurs only 1-3% accuracy degradation, which supports its practical viability for inference.
  • It finds that INT4 weight-only quantization (W4A16-INT) rivals 8-bit quantization in accuracy and is the most cost-efficient format for synchronous deployments.

Performance-Accuracy Trade-Offs in LLM Quantization

The ongoing evolution of LLMs has been accompanied by significant computational and operational challenges, particularly at inference time. The paper "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization addresses this challenge by examining model quantization as a means to speed up inference while limiting accuracy loss. This empirical study covers three quantization formats (FP8, INT8, and INT4), evaluated on academic and real-world benchmarks using the entire Llama-3.1 model family.

Central to the study are the accuracy-performance trade-offs inherent in model quantization. The paper reports over 500,000 evaluations and draws three main findings:

  1. FP8 Quantization Efficacy: The study finds that FP8 quantization (W8A8-FP) is effectively lossless across all model scales, so the quantized model retains the original model's accuracy while halving the bit width of weights and activations relative to BF16.
  2. INT8 Performance: Well-tuned INT8 quantization (W8A8-INT) shows surprisingly low accuracy degradation of just 1-3% on average. This is notable because INT8-quantized activations were previously expected to cause significant losses.
  3. Competitive INT4 Quantization: INT4 weight-only quantization (W4A16-INT) is more competitive than expected and rivals 8-bit quantization, challenging the earlier view that lower-bit quantization entails considerable accuracy sacrifices.
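
To make the bit widths in these format names concrete, the following is a minimal sketch of symmetric round-to-nearest integer quantization at 8 and 4 bits. It is an illustration only, not the tuned procedure the paper evaluates: the function names, the single per-tensor scale, and the sample weights are all assumptions chosen for brevity.

```python
# Minimal sketch (assumption): plain round-to-nearest symmetric quantization
# with one scale per tensor. The sample weights are made up for illustration.

def quantize_symmetric(values, bits):
    """Map floats to signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between the values and their round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.02, -0.51, 0.33, 0.90, -0.07, 0.64, -0.88, 0.15]  # hypothetical
err_int8 = max_abs_error(weights, 8)
err_int4 = max_abs_error(weights, 4)
print(f"INT8 max round-trip error: {err_int8:.4f}")
print(f"INT4 max round-trip error: {err_int4:.4f}")
```

With only 15 usable levels at 4 bits versus 255 at 8 bits, the round-trip error grows noticeably, which is why a competitive INT4 result is a notable finding.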

Beyond accuracy, the paper analyzes inference performance using the vLLM framework across various GPU architectures. The analysis shows that the best quantization format depends on the deployment setting: W4A16 is the most cost-efficient choice for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case.
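
Part of the cost picture can be approximated with simple arithmetic on weight storage. The sketch below estimates weight memory per format for nominal Llama-3.1 sizes. It assumes nominal parameter counts and ignores quantization scales, activations, and the KV cache, so the figures are rough lower bounds rather than measurements from the paper.

```python
# Rough sketch (assumption): weight memory = parameters * bits per weight / 8.
# Quantization metadata, activations, and KV cache are deliberately ignored.

BITS_PER_WEIGHT = {"BF16": 16, "W8A8": 8, "W4A16": 4}
PARAM_COUNTS = {"8B": 8e9, "70B": 70e9, "405B": 405e9}  # nominal sizes


def weight_gigabytes(num_params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


for model, num_params in PARAM_COUNTS.items():
    row = ", ".join(
        f"{fmt}: {weight_gigabytes(num_params, bits):.1f} GB"
        for fmt, bits in BITS_PER_WEIGHT.items()
    )
    print(f"Llama-3.1-{model} -> {row}")
```

Smaller weights mean fewer bytes to read per generated token, which is one commonly cited reason weight-only 4-bit formats can help at low batch sizes. That explanation is general background and is not taken from the summary above.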

By connecting measured accuracy with practical deployment performance, the study provides data-driven guidelines for deploying quantized LLMs. The key takeaway is that carefully chosen quantization strategies can yield significant computational savings with little or no loss in output quality.
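
The deployment recommendations stated in the abstract can be written down as a small lookup. This is a hypothetical helper for illustration: the enum and function names are assumptions, and the mixed-workload case intentionally returns no default because the paper says the choice depends on the use case.

```python
# Hypothetical helper (assumption): encodes only the recommendations stated
# in the abstract; it is not code from the paper or from vLLM.
from enum import Enum


class Deployment(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"  # continuous batching
    MIXED = "mixed"


def recommended_format(deployment):
    """Return the suggested format name, or None when no single answer applies."""
    if deployment is Deployment.SYNCHRONOUS:
        return "W4A16"
    if deployment is Deployment.ASYNCHRONOUS:
        return "W8A8"
    return None  # mixed workloads: benchmark the specific use case


for mode in Deployment:
    choice = recommended_format(mode)
    print(f"{mode.value}: {choice if choice else 'depends on the use case'}")
```

Returning None for mixed workloads keeps the helper from suggesting a default the source does not support.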

Implications and Future Directions

The findings underscore the potential of model quantization for broad applications, especially in democratizing access to LLM capabilities by reducing inference costs. The demonstrated efficacy of these quantization approaches could inspire further advancements in inference acceleration and reduced resource consumption, likely stimulating new research into compression algorithms.

Future work may explore more complex deployment scenarios, including multi-modal tasks and hardware beyond GPUs. Furthermore, as LLMs continue to grow in size and in range of applications, there may be a need for more nuanced quantization strategies that adapt to task-specific requirements or switch between precision levels dynamically.

In summary, this study provides a comprehensive benchmark of quantization methods, offering a detailed reference that practitioners and researchers can use to optimize LLM deployments. It also lays a foundation for future work on improving quantization techniques and extending them to other machine learning and artificial intelligence domains.
