
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Published 3 Apr 2024 in cs.LG and cs.AI (arXiv:2404.02948v4)

Abstract: To parameter-efficiently fine-tune (PEFT) LLMs, the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the "Noise & Zero" adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adapter matrices $A$ and $B$ with the principal components of the original matrix $W$, and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$ which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B parameters and encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. Due to the same architecture, PiSSA is also compatible with quantization to further reduce the memory requirements of fine-tuning. Compared to QLoRA, QPiSSA exhibits smaller quantization errors in the initial stages. Fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding the performance of QLoRA at 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA. Code is available at https://github.com/GraphPKU/PiSSA.


Summary

  • The paper introduces PiSSA, a method that initializes LoRA-style adapters with the principal singular values and vectors of the pretrained weights for efficient fine-tuning of large language models.
  • The approach converges faster than LoRA and improves GSM8K accuracy by 5.16 percentage points with Mistral-7B.
  • Its integration with quantization (QPiSSA) reduces initial quantization error by 19% relative to QLoRA, underscoring its effectiveness in low-memory scenarios.

PiSSA: Principal Singular Values and Singular Vectors Adaptation of LLMs

Introduction

The computational burden associated with fine-tuning LLMs often becomes prohibitive as the parameter count increases. To address this, a parameter-efficient fine-tuning method named Principal Singular values and Singular vectors Adaptation (PiSSA) has been developed. This method optimizes a reduced parameter space while maintaining or exceeding the performance of full-parameter fine-tuning by utilizing the low intrinsic dimension of pre-trained, over-parameterized models.

Methodology

PiSSA applies Singular Value Decomposition (SVD) to each weight matrix $W$ of an LLM and splits it into a low-rank principal part, expressed as the product of two much smaller matrices $A$ and $B$, plus a residual. Specifically, PiSSA represents $W$ as:

$$W \approx AB + W^{res}$$

where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are initialized from the principal singular values and vectors, while $W^{res} \in \mathbb{R}^{m \times n}$ captures the remaining components and stays frozen during training. Fine-tuning thus concentrates on the most significant directions of the model while leaving the less significant ones untouched (Figure 1).

Figure 1: Full Fine-tuning.
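To make the construction above concrete, the following is a minimal NumPy sketch of the PiSSA split under the notation used in this summary. The function and variable names are ours, not from the official GraphPKU/PiSSA repository, and a real implementation would apply this per linear layer of the model; splitting $\sqrt{S}$ evenly between $A$ and $B$ is one natural choice for this factorization.

```python
import numpy as np

def pissa_init(W: np.ndarray, r: int):
    """Split W (m x n) into a rank-r adapter (A, B) built from the
    principal singular triples plus a frozen residual, so that
    W = A @ B + W_res holds exactly at initialization."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_s            # (m, r): principal left vectors, scaled
    B = sqrt_s[:, None] * Vt[:r]     # (r, n): principal right vectors, scaled
    W_res = W - A @ B                # residual, frozen during fine-tuning
    return A, B, W_res

# Toy check: the split reproduces the original weights at step 0.
W = np.random.randn(64, 128)
A, B, W_res = pissa_init(W, r=8)
assert np.allclose(W_res + A @ B, W)
```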

Comparison with LoRA

PiSSA shares its architecture with Low-Rank Adaptation (LoRA) but differs in its initialization strategy. LoRA initializes $A$ with random Gaussian noise and $B$ with zeros, which can waste early gradient steps and slow convergence; PiSSA instead starts from the principal singular components of $W$. This better-informed initialization enables faster convergence and improved final performance (Figure 2).

Figure 2: Original matrix W.
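To illustrate the difference in initialization, here is a small sketch contrasting the two schemes. It re-derives the PiSSA split inline rather than calling any official code, and it only demonstrates that both parameterizations leave the model's output unchanged at step 0 while exposing different parts of $W$ to the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 128, 8
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# LoRA-style init: A is Gaussian noise, B is zeros; W itself stays frozen.
A_lora = 0.01 * rng.standard_normal((m, r))
B_lora = np.zeros((r, n))
y_lora = W @ x + A_lora @ (B_lora @ x)        # equals W @ x because B is zero

# PiSSA-style init: the adapter carries the principal components of W,
# and the frozen part is the residual W_res = W - A @ B.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A_pissa = U[:, :r] * np.sqrt(S[:r])
B_pissa = np.sqrt(S[:r])[:, None] * Vt[:r]
W_res = W - A_pissa @ B_pissa
y_pissa = W_res @ x + A_pissa @ (B_pissa @ x)

# Both reproduce the original forward pass at initialization; they differ in
# which directions of W the optimizer is allowed to update during fine-tuning.
assert np.allclose(y_lora, W @ x)
assert np.allclose(y_pissa, W @ x)
```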

Empirical evaluations across multiple benchmarks demonstrate PiSSA's faster convergence and stronger final performance relative to LoRA. Fine-tuning Mistral-7B with PiSSA on the GSM8K benchmark yields an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16 percentage points (Figure 3).

Figure 3: Comparing the quantization error, the fine-tuning loss on the MetaMathQA and the accuracy on the GSM8K and MATH validation sets.

Integration with Quantization

When combined with quantization (QPiSSA), PiSSA also reduces the initial quantization error substantially, by roughly 19% compared to QLoRA. This further strengthens PiSSA's appeal in scenarios demanding low memory consumption (Figure 4).

Figure 4: Variation of loss with respect to rank 1 throughout the training phase. Additional ranks are depicted.
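The mechanism behind the smaller quantization error can be sketched as follows: QPiSSA quantizes only the residual $W^{res}$, whose largest directions have been moved into the full-precision adapter, so the matrix being quantized has a narrower value range. The snippet below uses a crude symmetric absmax quantizer on synthetic weights with a skewed singular-value spectrum as a stand-in for the NF4 quantization actually used by QLoRA and QPiSSA; the numbers are illustrative only.

```python
import numpy as np

def absmax_quant(M: np.ndarray, bits: int = 4) -> np.ndarray:
    """Crude symmetric absmax quantization (illustrative stand-in for NF4);
    returns the dequantized matrix so errors can be measured directly."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(M).max() / levels
    return np.round(M / scale) * scale

rng = np.random.default_rng(0)
m = n = 256
r = 16

# Synthetic weights with a few dominant singular directions, loosely mimicking
# the skewed spectrum of real pretrained layers.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = np.concatenate([np.full(r, 50.0), np.ones(m - r)])
W = (U * S) @ V.T

# PiSSA split: principal part into the adapter, residual stays frozen.
Uw, Sw, Vtw = np.linalg.svd(W, full_matrices=False)
A = Uw[:, :r] * np.sqrt(Sw[:r])
B = np.sqrt(Sw[:r])[:, None] * Vtw[:r]
W_res = W - A @ B

# QLoRA-style: quantize all of W (adapters start at zero, so they add nothing).
err_qlora = np.linalg.norm(W - absmax_quant(W))
# QPiSSA-style: quantize only the residual; the principal part stays full precision.
err_qpissa = np.linalg.norm(W - (absmax_quant(W_res) + A @ B))

print(f"QLoRA-style error:  {err_qlora:.2f}")
print(f"QPiSSA-style error: {err_qpissa:.2f}")  # smaller when the spectrum is skewed
```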

Practical Implications and Future Work

By maintaining compatibility with LoRA's architectural framework, PiSSA inherits many of LoRA's advantages, such as parameter efficiency and straightforward integration with quantization. Future work could assess PiSSA's performance on larger models and a wider range of tasks, and could incorporate techniques from LoRA's successors to further improve performance.

Conclusion

PiSSA offers a robust and efficient method for fine-tuning LLMs. By leveraging the principal singular values and vectors of the pretrained weights, it improves convergence and accuracy compared to existing approaches such as LoRA while keeping computational overhead low. This provides a compelling direction for the continued advancement of parameter-efficient fine-tuning methods (Figure 5).

Figure 5: Initializing with principal, medium, and minor singular values and vectors, the training loss on the MetaMathQA and the accuracy on the GSM8K and MATH validation sets are reported, respectively, for three models.
