LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Abstract: Large language models (LLMs) such as LLaMA and T5 have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaptation (LoRA) has emerged as a cheap way to fine-tune these LLMs on downstream tasks, their deployment is still hindered by their vast scale and computational cost. Post-training pruning offers a way to compress LLMs, but current pruning methods designed for LLMs are not compatible with LoRA: they either apply unstructured pruning, which prevents merging the LoRA weights back into the model, or rely on the gradients of the pre-trained weights to guide pruning, which imposes significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate structured-pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion that uses the weights and gradients of LoRA, rather than the gradients of the pre-trained weights, for importance estimation. We then integrate this criterion into an iterative pruning procedure that effectively removes redundant channels and heads. Extensive experiments demonstrate the superior performance of LoRAPrune over existing approaches on the LLaMA model series. At a 50% compression rate, LoRAPrune outperforms LLM-Pruner, reducing perplexity by 4.81 on WikiText2 and 3.46 on PTB while also decreasing memory usage by 52.6%. Moreover, LoRAPrune matches the performance of semi-structured pruning across multiple LLMs, demonstrating its wide applicability. The code is available at https://github.com/aim-uofa/LoRAPrune.
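The LoRA-guided criterion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes a first-order Taylor-style saliency in which the gradient of the merged weight W + BA is approximated from the LoRA gradients dL/dB and dL/dA alone, so the full gradient of the frozen pre-trained weight is never materialized. All shapes, the channel grouping, and the pruning ratio are hypothetical.

```python
import numpy as np

# Hypothetical shapes (not from the paper): a single linear layer.
d_out, d_in, r = 8, 16, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # LoRA down-projection
B = rng.standard_normal((d_out, r)) * 0.01  # LoRA up-projection
grad_A = rng.standard_normal((r, d_in))     # dL/dA from backprop
grad_B = rng.standard_normal((d_out, r))    # dL/dB from backprop

# Assumed approximation: reconstruct a gradient signal for the merged
# weight W + B @ A from the LoRA gradients only, avoiding the memory
# cost of storing dL/dW for the full pre-trained matrix.
G_hat = grad_B @ A + B @ grad_A             # (d_out, d_in)

# Taylor-style importance of each *input channel* (a structured group):
# squared first-order saliency, summed over the output dimension.
merged = W + B @ A
importance = ((merged * G_hat) ** 2).sum(axis=0)   # (d_in,)

# Remove the k least important channels via a binary mask; in an
# iterative scheme this would alternate with LoRA fine-tuning steps.
k = 4
pruned = np.argsort(importance)[:k]
mask = np.ones(d_in, dtype=bool)
mask[pruned] = False
print(int(mask.sum()))  # keeps 12 of 16 channels
```

Because the groups are whole channels (or, analogously, attention heads), the surviving weights remain a dense matrix after masking, so the LoRA update can still be merged into the pruned model — the property the abstract notes is lost under unstructured pruning.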
- PIQA: Reasoning about physical commonsense in natural language. In Proc. AAAI Conf. on Arti. Intel., volume 34, pp. 7432–7439, 2020.
- What is the state of neural network pruning? Proc. Int. Conf. Mach. Learn. and Syst., 2:129–146, 2020.
- One-for-All: Generalized LoRA for parameter-efficient fine-tuning. arXiv preprint arXiv:2306.07967, 2023.
- AdaptFormer: Adapting vision transformers for scalable visual recognition. Proc. Adv. Neural Inf. Process. Syst., 2022.
- An empirical study of training self-supervised vision transformers. In Proc. IEEE Int. Conf. Comp. Vis., pp. 9640–9649, 2021.
- LongLoRA: Efficient fine-tuning of long-context large language models, 2023.
- Chinese-Vicuna: A Chinese instruction-following LLaMA-based model. 2023. URL https://github.com/Facico/Chinese-Vicuna.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Learning to prune deep neural networks via layer-wise optimal brain surgeon. Proc. Adv. Neural Inf. Process. Syst., 30, 2017.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 320–335, 2022.
- Lottery tickets in linear models: An analysis of iterative magnitude pruning. arXiv preprint arXiv:2007.08243, 2020.
- DepGraph: Towards any structural pruning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 16091–16101, June 2023.
- SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023.
- Bootstrap your own latent: A new approach to self-supervised learning. Proc. Adv. Neural Inf. Process. Syst., 33:21271–21284, 2020.
- AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning, 2023.
- Learning both weights and connections for efficient neural network. Proc. Adv. Neural Inf. Process. Syst., 28, 2015.
- Optimal brain surgeon and general network pruning. In Proc. IEEE Conf. on Neural Networks, pp. 293–299, 1993.
- Sensitivity-aware visual parameter-efficient tuning. In Proc. IEEE Int. Conf. Comp. Vis., 2023.
- Masked autoencoders are scalable vision learners. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 16000–16009, 2022.
- Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2019.
- Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2009–2018, 2020.
- LoRA: Low-rank adaptation of large language models. In Proc. Int. Conf. Learn. Repren., 2022.
- Visual prompt tuning. In Proc. Eur. Conf. Comp. Vis., 2022.
- Optimal brain damage. Proc. Adv. Neural Inf. Process. Syst., 2, 1989.
- Layer-adaptive sparsity for the magnitude-based pruning. arXiv preprint arXiv:2010.07611, 2020.
- SNIP: Single-shot network pruning based on connection sensitivity. In Proc. Int. Conf. Learn. Repren., 2019.
- Optimization based layer-wise magnitude-based pruning for dnn compression. In Int. Joi. Conf. on Artificial Intelligence, pp. 2383–2389, 2018.
- Pruning filters for efficient convnets. In Proc. Int. Conf. Learn. Repren., 2017.
- Revisiting random channel pruning for neural network compression. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 191–201, 2022a.
- Parameter-efficient sparsity for large language models fine-tuning. arXiv preprint arXiv:2205.11005, 2022b.
- Towards efficient visual adaption via structural re-parameterization. arXiv preprint arXiv:2302.08106, 2023.
- LLM-Pruner: On the structural pruning of large language models. Proc. Adv. Neural Inf. Process. Syst., 2023.
- Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Pruning convolutional neural networks for resource efficient inference. In Proc. Int. Conf. Learn. Repren., 2017.
- Importance estimation for neural network pruning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 11264–11272, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1):5485–5551, 2020.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Movement pruning: Adaptive sparsity by fine-tuning. Proc. Adv. Neural Inf. Process. Syst., 33:20378–20389, 2020.
- A simple and effective pruning approach for large language models, 2023.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020.
- Generative visual prompt: Unifying distributional control of pre-trained generative models. Proc. Adv. Neural Inf. Process. Syst., 35:22422–22437, 2022.
- ViTCoD: Vision transformer acceleration via dedicated algorithm and accelerator co-design. In Proc. IEEE Int. Sym. on High-Perf. Comp. Arch., pp. 273–286. IEEE, 2023.
- Width & depth pruning for vision transformers. In Proc. AAAI Conf. on Arti. Intel., volume 36, pp. 3143–3151, 2022a.
- The combinatorial brain surgeon: Pruning weights that cancel one another in neural networks. In Proc. Int. Conf. Mach. Learn., pp. 25668–25683, 2022b.
- BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL, pp. 1–9, 2022.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
- A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
- PLATON: Pruning large transformer models with upper confidence bound of weight importance. In Proc. Int. Conf. Mach. Learn., pp. 26809–26823, 2022.
- TransPIM: A memory-based acceleration via software-hardware co-design for transformer. In Proc. IEEE Int. Sym. on High-Perf. Comp. Arch., pp. 1071–1085. IEEE, 2022.