
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Published 3 Apr 2024 in cs.LG and cs.AI (arXiv:2404.02948v4)

Abstract: To parameter-efficiently fine-tune (PEFT) LLMs, the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the "Noise & Zero" adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adapter matrices $A$ and $B$ with the principal components of the original matrix $W$, and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$ which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B parameters and encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. Due to the same architecture, PiSSA is also compatible with quantization to further reduce the memory requirements of fine-tuning. Compared to QLoRA, QPiSSA exhibits smaller quantization errors in the initial stages. Fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding the performance of QLoRA at 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA. Code is available at https://github.com/GraphPKU/PiSSA.


Summary

  • The paper introduces PiSSA, a method that initializes LoRA-style adapters with the principal singular values and vectors of the pretrained weights for efficient fine-tuning of large language models.
  • The approach converges faster than LoRA and improves GSM8K accuracy by 5.16 percentage points with Mistral-7B.
  • Its integration with quantization (QPiSSA) reduces initial quantization error by 19% relative to QLoRA, underscoring its effectiveness in low-memory scenarios.

PiSSA: Principal Singular Values and Singular Vectors Adaptation of LLMs

Introduction

The computational burden associated with fine-tuning LLMs often becomes prohibitive as the parameter count increases. To address this, a parameter-efficient fine-tuning method named Principal Singular values and Singular vectors Adaptation (PiSSA) has been developed. This method optimizes a reduced parameter space while maintaining or exceeding the performance of full-parameter fine-tuning by utilizing the low intrinsic dimension of pre-trained, over-parameterized models.

Methodology

PiSSA applies Singular Value Decomposition (SVD) to each weight matrix $W$ of an LLM and splits it into a low-rank principal part, expressed as the product of two much smaller matrices $A$ and $B$, plus a residual. Specifically, PiSSA represents $W$ as:

$$W \approx AB + W^{res}$$

where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are initialized from the principal singular values and vectors, while $W^{res} \in \mathbb{R}^{m \times n}$ captures the remaining components and stays frozen during training. Fine-tuning thus concentrates on the most significant directions of the model while leaving the less significant ones untouched (Figure 1).

Figure 1: Full Fine-tuning.
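To make the construction above concrete, the following is a minimal NumPy sketch of the PiSSA split under the notation used in this summary. The function and variable names are ours, not from the official GraphPKU/PiSSA repository, and a real implementation would apply this per linear layer of the model; splitting $\sqrt{S}$ evenly between $A$ and $B$ is one natural choice for this factorization.

```python
import numpy as np

def pissa_init(W: np.ndarray, r: int):
    """Split W (m x n) into a rank-r adapter (A, B) built from the
    principal singular triples plus a frozen residual, so that
    W = A @ B + W_res holds exactly at initialization."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_s            # (m, r): principal left vectors, scaled
    B = sqrt_s[:, None] * Vt[:r]     # (r, n): principal right vectors, scaled
    W_res = W - A @ B                # residual, frozen during fine-tuning
    return A, B, W_res

# Toy check: the split reproduces the original weights at step 0.
W = np.random.randn(64, 128)
A, B, W_res = pissa_init(W, r=8)
assert np.allclose(W_res + A @ B, W)
```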

Comparison with LoRA

PiSSA shares its architecture with Low-Rank Adaptation (LoRA) but differs in its initialization strategy. LoRA initializes $A$ with random Gaussian noise and $B$ with zeros, which can waste early gradient steps and slow convergence; PiSSA instead starts from the principal singular components of $W$. This better-informed initialization enables faster convergence and improved final performance (Figure 2).

Figure 2: Original matrix W.
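To illustrate the difference in initialization, here is a small sketch contrasting the two schemes. It re-derives the PiSSA split inline rather than calling any official code, and it only demonstrates that both parameterizations leave the model's output unchanged at step 0 while exposing different parts of $W$ to the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 128, 8
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# LoRA-style init: A is Gaussian noise, B is zeros; W itself stays frozen.
A_lora = 0.01 * rng.standard_normal((m, r))
B_lora = np.zeros((r, n))
y_lora = W @ x + A_lora @ (B_lora @ x)        # equals W @ x because B is zero

# PiSSA-style init: the adapter carries the principal components of W,
# and the frozen part is the residual W_res = W - A @ B.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A_pissa = U[:, :r] * np.sqrt(S[:r])
B_pissa = np.sqrt(S[:r])[:, None] * Vt[:r]
W_res = W - A_pissa @ B_pissa
y_pissa = W_res @ x + A_pissa @ (B_pissa @ x)

# Both reproduce the original forward pass at initialization; they differ in
# which directions of W the optimizer is allowed to update during fine-tuning.
assert np.allclose(y_lora, W @ x)
assert np.allclose(y_pissa, W @ x)
```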

Empirical evaluations across multiple benchmarks demonstrate PiSSA's faster convergence and stronger final performance relative to LoRA. Fine-tuning Mistral-7B with PiSSA on the GSM8K benchmark yields an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16 percentage points (Figure 3).

Figure 3: Comparing the quantization error, the fine-tuning loss on the MetaMathQA and the accuracy on the GSM8K and MATH validation sets.

Integration with Quantization

When combined with quantization (QPiSSA), PiSSA also reduces the initial quantization error substantially, by roughly 19% compared to QLoRA. This further strengthens PiSSA's appeal in scenarios demanding low memory consumption (Figure 4).

Figure 4: Variation of loss with respect to rank 1 throughout the training phase. Additional ranks are depicted.
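The mechanism behind the smaller quantization error can be sketched as follows: QPiSSA quantizes only the residual $W^{res}$, whose largest directions have been moved into the full-precision adapter, so the matrix being quantized has a narrower value range. The snippet below uses a crude symmetric absmax quantizer on synthetic weights with a skewed singular-value spectrum as a stand-in for the NF4 quantization actually used by QLoRA and QPiSSA; the numbers are illustrative only.

```python
import numpy as np

def absmax_quant(M: np.ndarray, bits: int = 4) -> np.ndarray:
    """Crude symmetric absmax quantization (illustrative stand-in for NF4);
    returns the dequantized matrix so errors can be measured directly."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / levels
    return np.round(M / scale) * scale

rng = np.random.default_rng(0)
m = n = 256
r = 16

# Synthetic weights with a few dominant singular directions, loosely mimicking
# the skewed spectrum of real pretrained layers.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = np.concatenate([np.full(r, 50.0), np.ones(m - r)])
W = (U * S) @ V.T

# PiSSA split: principal part into the adapter, residual stays frozen.
Uw, Sw, Vtw = np.linalg.svd(W, full_matrices=False)
A = Uw[:, :r] * np.sqrt(Sw[:r])
B = np.sqrt(Sw[:r])[:, None] * Vtw[:r]
W_res = W - A @ B

# QLoRA-style: quantize all of W (adapters start at zero, so they add nothing).
err_qlora = np.linalg.norm(W - absmax_quant(W))
# QPiSSA-style: quantize only the residual; the principal part stays full precision.
err_qpissa = np.linalg.norm(W - (absmax_quant(W_res) + A @ B))

print(f"QLoRA-style error:  {err_qlora:.2f}")
print(f"QPiSSA-style error: {err_qpissa:.2f}")  # smaller when the spectrum is skewed
```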

Practical Implications and Future Work

By maintaining compatibility with LoRA's architectural framework, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and straightforward integration with quantization. Future work could assess PiSSA's performance on larger models and a wider range of tasks, and could incorporate techniques from LoRA's successors to further improve performance.

Conclusion

PiSSA offers a robust and efficient method for fine-tuning LLMs. By leveraging the principal singular values and vectors of the pretrained weights, it improves convergence and accuracy compared to existing approaches such as LoRA while keeping computational overhead low. This provides a compelling direction for the continued advancement of parameter-efficient fine-tuning methods (Figure 5).

Figure 5: Initializing with principal, medium, and minor singular values and vectors, the training loss on the MetaMathQA and the accuracy on the GSM8K and MATH validation sets are reported, respectively, for three models.
