LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Abstract: Quantization is an indispensable technique for serving LLMs and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together to a pre-trained model. In such cases, a consistent gap in downstream-task performance is commonly observed between full fine-tuning and the quantization-plus-LoRA-fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision models and significantly improves generalization on downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed-precision regimes. The code is available at https://github.com/yxli2123/LoftQ.
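The core idea — jointly choosing the quantized weights Q and the LoRA factors A, B so that Q + A·B stays close to the full-precision weights W — can be sketched with alternating quantization and truncated SVD steps that approximately minimize ||W − Q − A·B||_F. The sketch below is illustrative, not the paper's implementation: it uses a toy symmetric uniform quantizer (the actual method targets low-bit formats such as NF2/NF4), and the rank, bit-width, and iteration count are arbitrary choices for demonstration.

```python
import numpy as np

def uniform_quantize(w, bits=2):
    # Toy symmetric uniform quantizer; a stand-in for the low-bit
    # quantizers used in practice (e.g. NormalFloat).
    levels = 2 ** bits
    scale = np.abs(w).max() / (levels / 2)
    if scale == 0:
        return np.zeros_like(w)
    q = np.clip(np.round(w / scale), -levels // 2, levels // 2 - 1)
    return q * scale

def loftq_style_init(W, rank=8, bits=2, n_iter=5):
    """Alternate (1) quantizing the residual W - A@B and (2) fitting a
    rank-r SVD approximation of W - Q, so that Q + A@B tracks W."""
    A = np.zeros((W.shape[0], rank))
    B = np.zeros((rank, W.shape[1]))
    for _ in range(n_iter):
        Q = uniform_quantize(W - A @ B, bits=bits)          # quantization step
        U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]                          # rank-r low-rank step
        B = Vt[:rank]
    return Q, A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Q, A, B = loftq_style_init(W, rank=8, bits=2)
err_init = np.linalg.norm(W - Q - A @ B)       # LoftQ-style initialization
err_naive = np.linalg.norm(W - uniform_quantize(W, bits=2))  # plain quantization
print(err_init < err_naive)
```

The low-rank step never increases the residual for a given Q (truncated SVD is the best rank-r fit), which is why the joint initialization leaves a smaller discrepancy than quantizing W alone and initializing A, B at zero, as plain QLoRA-style setups do.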