INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation

Published 13 Jun 2023 in cs.CL, cs.AI, and cs.LG | (2306.08162v1)

Abstract: We introduce a method that dramatically reduces fine-tuning VRAM requirements and rectifies quantization errors in quantized LLMs. First, we develop an extremely memory-efficient fine-tuning (EMEF) method for quantized models using Low-Rank Adaptation (LoRA), and drawing upon it, we construct an error-correcting algorithm designed to minimize errors induced by the quantization process. Our method reduces the memory requirements by up to 5.6 times, which enables fine-tuning a 7 billion parameter LLM on consumer laptops. At the same time, we propose a Low-Rank Error Correction (LREC) method that exploits the added LoRA layers to ameliorate the gap between the quantized model and its float point counterpart. Our error correction framework leads to a fully functional INT2 quantized LLM with the capacity to generate coherent English text. To the best of our knowledge, this is the first INT2 LLM that has been able to reach such a performance. The overhead of our method is merely a 1.05 times increase in model size, which translates to an effective precision of INT2.1. Also, our method readily generalizes to other quantization standards, such as INT3, INT4, and INT8, restoring their lost performance, which marks a significant milestone in the field of model quantization. The strategies delineated in this paper hold promising implications for the future development and optimization of quantized models, marking a pivotal shift in the landscape of low-resource machine learning computations.


Summary

  • The paper introduces a dual-framework—EMEF for fine-tuning via LoRA and LREC for error correction—achieving up to 5.6× VRAM reduction.
  • It employs a quantization-agnostic error correction mechanism that restores coherence across INT2, INT3, and INT4 standards.
  • The method enables fine-tuning on consumer-grade hardware, significantly lowering resource constraints and broadening LLM accessibility.

Summary of "INT2.1: Towards Fine-Tunable Quantized LLMs with Error Correction through Low-Rank Adaptation"

The paper "INT2.1: Towards Fine-Tunable Quantized LLMs with Error Correction through Low-Rank Adaptation" (2306.08162) introduces a memory-efficient approach to fine-tuning widely used quantized LLMs. Two significant innovations underpin the study: an Extremely Memory-Efficient Finetuning (EMEF) method based on Low-Rank Adaptation (LoRA), and a quantization-agnostic error correction framework, Low-Rank Error Correction (LREC). These innovations address the prominent challenges associated with LLM quantization, such as balancing memory efficiency against model performance.

Methodological Advances

The paper's primary methodological contributions address memory consumption while mitigating quantization errors in LLMs. The first component, EMEF, uses LoRA to inject a small set of floating-point parameters into an otherwise quantized model. Because only these injected parameters are trained, fine-tuning is decoupled from the full model, making it feasible on consumer-grade hardware such as personal laptops. The resulting paradigm preserves model expressiveness through the additional learned parameters while reducing VRAM requirements by up to 5.6 times compared to conventional fine-tuning.
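As a concrete illustration (a minimal sketch, not the authors' code), the idea can be expressed as a linear layer whose low-bit weights stay frozen while a small floating-point low-rank branch is the only part that receives gradients; the `dequantize` step below is a placeholder for whatever kernel the chosen quantization scheme uses.

```python
import torch
import torch.nn as nn

class QuantizedLinearWithLoRA(nn.Module):
    """Frozen low-bit linear layer plus a trainable floating-point LoRA branch."""

    def __init__(self, q_weight, scales, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Quantized codes and scales are buffers: no gradients, no optimizer state.
        self.register_buffer("q_weight", q_weight)   # packed low-bit weight codes
        self.register_buffer("scales", scales)        # per-row / per-group scales
        # The LoRA matrices are the only trainable (floating-point) parameters.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def dequantize(self):
        # Placeholder: expand the low-bit codes back to fp16 with their scales.
        return self.q_weight.to(torch.float16) * self.scales

    def forward(self, x):
        base = x @ self.dequantize().t()                  # frozen quantized path
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()    # trainable correction path
        return base + self.scaling * lora
```

Only `lora_A` and `lora_B`, a tiny fraction of the 7 billion parameters, require gradients and optimizer state, which is where the bulk of the reported VRAM savings comes from.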

In particular, the study frames error correction as minimizing the statistical distance between the output distributions of the pre-quantization and quantized models, using a combination of KL divergence and cross-entropy objectives. Low-Rank Error Correction (LREC) trains the LoRA-injected parameters to compensate for quantization-induced inaccuracies, restoring performance across quantization standards such as INT2, INT3, and INT4 (Figure 1).

Figure 1: Flowchart of the forward and backward path, designed for a quantized model with injected LoRA parameters. In the diagram, the yellow blocks represent matrices with quantized parameters, while the green blocks represent matrices with floating point parameters. MM stands for matrix multiplication.
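A hedged sketch of what such an objective could look like (the exact loss weighting and temperature used in the paper are not reproduced here): the quantized-plus-LoRA student is pulled toward the full-precision teacher's token distribution via KL divergence, while a standard cross-entropy term keeps it anchored to the language-modeling targets.

```python
import torch.nn.functional as F

def error_correction_loss(student_logits, teacher_logits, labels,
                          kl_weight=1.0, ce_weight=1.0, temperature=1.0):
    """Distillation-style objective: match the full-precision model's output
    distribution while still fitting the next-token targets.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) ground-truth token ids
    """
    # KL divergence between teacher and student token distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Standard next-token cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return kl_weight * kl + ce_weight * ce
```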

Experimental Framework and Results

The experiments validate the practical applicability of these methods in memory-constrained settings. Using the 7-billion-parameter LLaMA model, the authors show that fine-tuning can run on hardware with 8GB or less of VRAM. Under this framework, the quantized INT2 model maintains coherent text generation with minimal memory overhead (a 1.05x increase in model size, hence the "INT2.1" designation), the first INT2 LLM reported to do so.
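A rough back-of-the-envelope check of why this fits in 8GB (illustrative arithmetic only; apart from the 1.05x overhead reported in the abstract, the numbers are assumptions):

```python
params = 7e9          # LLaMA-7B parameter count

fp16_gb  = params * 16 / 8 / 1e9    # ~14.0 GB of weights at full fp16 precision
int2_gb  = params * 2  / 8 / 1e9    # ~1.75 GB of weights at INT2
int21_gb = int2_gb * 1.05           # ~1.84 GB with the 1.05x LoRA overhead ("INT2.1")

print(f"fp16: {fp16_gb:.1f} GB, INT2: {int2_gb:.2f} GB, INT2.1: {int21_gb:.2f} GB")
```

With weights under 2 GB, the remaining VRAM is available for activations and the optimizer state of the LoRA parameters alone, which is consistent with fine-tuning fitting on an 8GB device.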

From an evaluation standpoint, LREC significantly reduces perplexity on standard benchmarks such as WikiText and the Penn Treebank (PTB), outperforming competitive baselines. For instance, the INT2 LLM corrected with LREC achieves coherence unattainable by direct quantization, with perplexity dropping from several thousand to nearly 12, confirming the efficacy of the error correction (Figure 2).

Figure 2: Perplexity comparison of quantized 7B LLaMA models with and without using our Low-Rank Error Correction method at different parameter precisions ranging from INT2 to INT4.
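For reference, perplexity on these corpora is typically computed with a sliding-window evaluation like the following (schematic; the model is assumed to be a HuggingFace-style causal LM returning `.logits`, and the paper's exact context length and stride are not reproduced):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids, context_len=2048, device="cuda"):
    """Sliding-window perplexity over a long 1-D tensor of token ids
    (e.g. a tokenized WikiText or PTB test split)."""
    nll_sum, token_count = 0.0, 0
    for start in range(0, token_ids.size(0) - 1, context_len):
        chunk = token_ids[start : start + context_len + 1].to(device)
        if chunk.size(0) < 2:
            break
        inputs, targets = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        logits = model(inputs).logits
        nll_sum += F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
        ).item()
        token_count += targets.numel()
    return math.exp(nll_sum / token_count)
```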

Implications and Future Directions

This study paves the way for democratizing access to state-of-the-art LLMs by enabling performance recovery within memory-constrained environments. Integrating independent error-correcting parameters into quantized models not only provides capacity for correcting quantization error but also offers a degree of controllability that prior methods did not afford.

Further research might explore the method's generalizable applicability across larger models and diverse tasks, potentially refining injection mechanisms or exploring hybrid quantization paradigms. Given the results, continued exploration into the scalability of LoRA parameters and their interactions within differently-quantized architectures would be valuable for maintaining robust performance across larger parameter spaces without sacrificing resource efficiency.

Conclusion

The research delineates a strategic direction for deploying fine-tunable, quantized LLMs with superior memory efficiency and low error margins. Combining EMEF and LREC, it establishes a framework under which quantized models retain their practical deployment characteristics, offering utility in environments without high-end computational resources. This approach positions INT2.1 as a template for future quantization strategies, directly improving the accessibility and deployment of sophisticated NLP models.
