INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation

Published 13 Jun 2023 in cs.CL, cs.AI, and cs.LG | (2306.08162v1)

Abstract: We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized LLMs. First, we develop an extremely memory-efficient fine-tuning (EMEF) method for quantized models using Low-Rank Adaptation (LoRA), and drawing upon it, we construct an error-correcting algorithm designed to minimize errors induced by the quantization process. Our method reduces the memory requirements by up to 5.6 times, which enables fine-tuning a 7 billion parameter LLM on consumer laptops. At the same time, we propose a Low-Rank Error Correction (LREC) method that exploits the added LoRA layers to ameliorate the gap between the quantized model and its float point counterpart. Our error correction framework leads to a fully functional INT2 quantized LLM with the capacity to generate coherent English text. To the best of our knowledge, this is the first INT2 LLM that has been able to reach such a performance. The overhead of our method is merely a 1.05 times increase in model size, which translates to an effective precision of INT2.1. Also, our method readily generalizes to other quantization standards, such as INT3, INT4, and INT8, restoring their lost performance, which marks a significant milestone in the field of model quantization. The strategies delineated in this paper hold promising implications for the future development and optimization of quantized models, marking a pivotal shift in the landscape of low-resource machine learning computations.


Summary

  • The paper introduces a dual-framework—EMEF for fine-tuning via LoRA and LREC for error correction—achieving up to 5.6× VRAM reduction.
  • It employs a quantization-agnostic error correction mechanism that restores coherence across INT2, INT3, and INT4 standards.
  • The method enables fine-tuning on consumer-grade hardware, significantly lowering resource constraints and broadening LLM accessibility.

Summary of "INT2.1: Towards Fine-Tunable Quantized LLMs with Error Correction through Low-Rank Adaptation"

The paper "INT2.1: Towards Fine-Tunable Quantized LLMs with Error Correction through Low-Rank Adaptation" (2306.08162) introduces a memory-efficient approach to fine-tuning widely used quantized LLMs. Two significant innovations underpin the study: an Extremely Memory-Efficient Finetuning (EMEF) method based on Low-Rank Adaptation (LoRA), and a quantization-agnostic error correction framework, Low-Rank Error Correction (LREC). These innovations address the prominent challenges associated with LLM quantization, such as balancing memory efficiency against model performance.

Methodological Advances

The paper's primary methodological contributions address memory consumption while mitigating quantization errors in LLMs. The first component, EMEF, uses LoRA to inject a small set of floating-point parameters into an otherwise quantized model. Because only these injected parameters are trained, fine-tuning is decoupled from the full model, making it feasible on consumer-grade hardware such as personal laptops. The resulting paradigm preserves model expressiveness through the additional learned parameters while reducing VRAM requirements by up to 5.6 times compared to conventional fine-tuning.
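As a concrete illustration (a minimal sketch, not the authors' code), the idea can be expressed as a linear layer whose low-bit weights stay frozen while a small floating-point low-rank branch is the only part that receives gradients; the `dequantize` step below is a placeholder for whatever kernel the chosen quantization scheme uses.

```python
import torch
import torch.nn as nn

class QuantizedLinearWithLoRA(nn.Module):
    """Frozen low-bit linear layer plus a trainable floating-point LoRA branch."""

    def __init__(self, q_weight, scales, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Quantized codes and scales are buffers: no gradients, no optimizer state.
        self.register_buffer("q_weight", q_weight)   # packed low-bit weight codes
        self.register_buffer("scales", scales)        # per-row / per-group scales
        # The LoRA matrices are the only trainable (floating-point) parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def dequantize(self):
        # Placeholder: expand the low-bit codes back to fp16 with their scales.
        return self.q_weight.to(torch.float16) * self.scales

    def forward(self, x):
        base = x @ self.dequantize().t()                  # frozen quantized path
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()    # trainable correction path
        return base + self.scaling * lora
```

Only `lora_A` and `lora_B`, a tiny fraction of the 7 billion parameters, require gradients and optimizer state, which is where the bulk of the reported VRAM savings comes from.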

In particular, the study frames error correction as minimizing the statistical distance between the output distributions of the pre-quantization and quantized models, using a combination of KL divergence and cross-entropy objectives. Low-Rank Error Correction (LREC) trains the LoRA-injected parameters to compensate for quantization-induced inaccuracies, restoring performance across quantization standards such as INT2, INT3, and INT4 (Figure 1).

Figure 1: Flowchart of the forward and backward path, designed for a quantized model with injected LoRA parameters. In the diagram, the yellow blocks represent matrices with quantized parameters, while the green blocks represent matrices with floating point parameters. MM stands for matrix multiplication.
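A hedged sketch of what such an objective could look like (the exact loss weighting and temperature used in the paper are not reproduced here): the quantized-plus-LoRA student is pulled toward the full-precision teacher's token distribution via KL divergence, while a standard cross-entropy term keeps it anchored to the language-modeling targets.

```python
import torch.nn.functional as F

def error_correction_loss(student_logits, teacher_logits, labels,
                          kl_weight=1.0, ce_weight=1.0, temperature=1.0):
    """Distillation-style objective: match the full-precision model's output
    distribution while still fitting the next-token targets.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) ground-truth token ids
    """
    # KL divergence between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Standard next-token cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return kl_weight * kl + ce_weight * ce
```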

Experimental Framework and Results

The experiments validate the practical applicability of these methods in memory-constrained settings. Using the 7-billion-parameter LLaMA model, the authors show that fine-tuning can run on hardware with 8GB or less of VRAM. Under this framework, the quantized INT2 model maintains coherent text generation with minimal memory overhead (a 1.05x increase in model size, hence the "INT2.1" designation), the first INT2 LLM reported to do so.
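A rough back-of-the-envelope check of why this fits in 8GB (illustrative arithmetic only; apart from the 1.05x overhead reported in the abstract, the numbers are assumptions):

```python
params = 7e9          # LLaMA-7B parameter count

fp16_gb  = params * 16 / 8 / 1e9    # ~14.0 GB of weights at full fp16 precision
int2_gb  = params * 2  / 8 / 1e9    # ~1.75 GB of weights at INT2
int21_gb = int2_gb * 1.05           # ~1.84 GB with the 1.05x LoRA overhead ("INT2.1")

print(f"fp16: {fp16_gb:.1f} GB, INT2: {int2_gb:.2f} GB, INT2.1: {int21_gb:.2f} GB")
```

With weights under 2 GB, the remaining VRAM is available for activations and the optimizer state of the LoRA parameters alone, which is consistent with fine-tuning fitting on an 8GB device.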

From an evaluation standpoint, LREC significantly reduces perplexity on standard benchmarks such as WikiText and the Penn Treebank (PTB), outperforming competitive baselines. For instance, the INT2 LLM corrected with LREC achieves coherence unattainable by direct quantization, with perplexity dropping from several thousand to nearly 12, confirming the efficacy of the error correction (Figure 2).

Figure 2: Perplexity comparison of quantized 7B LLaMA models with and without using our Low-Rank Error Correction method at different parameter precisions ranging from INT2 to INT4.
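For reference, perplexity on these corpora is typically computed with a sliding-window evaluation like the following (schematic; the model is assumed to be a HuggingFace-style causal LM returning `.logits`, and the paper's exact context length and stride are not reproduced):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids, context_len=2048, device="cuda"):
    """Sliding-window perplexity over a long 1-D tensor of token ids
    (e.g. a tokenized WikiText or PTB test split)."""
    nll_sum, token_count = 0.0, 0
    for start in range(0, token_ids.size(0) - 1, context_len):
        chunk = token_ids[start : start + context_len + 1].to(device)
        if chunk.size(0) < 2:
            break
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        logits = model(inputs).logits
        nll_sum += F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
        ).item()
        token_count += targets.numel()
    return math.exp(nll_sum / token_count)
```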

Implications and Future Directions

This study paves the way for democratizing access to state-of-the-art LLMs by enabling performance recovery within memory-constrained environments. Integrating independent error-correcting parameters into quantized models not only provides capacity for correcting quantization error but also offers a degree of controllability that prior methods did not afford.

Further research might explore the method's generalizable applicability across larger models and diverse tasks, potentially refining injection mechanisms or exploring hybrid quantization paradigms. Given the results, continued exploration into the scalability of LoRA parameters and their interactions within differently-quantized architectures would be valuable for maintaining robust performance across larger parameter spaces without sacrificing resource efficiency.

Conclusion

The research delineates a strategic direction for deploying fine-tunable, quantized LLMs with superior memory efficiency and low error margins. Combining EMEF and LREC, it establishes a framework under which quantized models retain their practical deployment characteristics, offering utility in environments without high-end computational resources. This approach positions INT2.1 as a template for future quantization strategies, directly improving the accessibility and deployment of sophisticated NLP models.
