
The LLM Surgeon

Published 28 Dec 2023 in cs.LG and cs.CL (arXiv:2312.17244v2)

Abstract: State-of-the-art LLMs are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to LLMs. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of LLMs.


Summary

  • The paper introduces the LLM Surgeon framework, employing Kronecker-factored curvature approximations for efficient pruning of large language models.
  • It leverages block-diagonal Fisher information matrices to overcome Hessian computation challenges, enabling both structured and unstructured pruning.
  • Empirical results demonstrate a 20%-30% compression in models like OPT and Llama v2 with negligible performance loss.

Introduction

The paper examines the challenges of deploying LLMs within computational and environmental constraints as transformer architectures continue to grow. Instead of training smaller models from scratch, the authors propose data-driven compression of existing pretrained models. The core contribution is the LLM Surgeon framework, which uses Kronecker-factored curvature approximations to prune large models efficiently, achieving notable compression rates with minimal impact on performance.

Figure 1: LLM Surgeon allows interpolation of model size between existing pretrained models.

Pruning as Constrained Optimization

The paper builds on earlier work on network pruning, notably Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS), which treat pruning as a constrained optimization problem. The major limitation of these past approaches is the impracticality of computing the Hessian, whose size scales quadratically with the number of parameters. The LLM Surgeon addresses this by employing block-diagonal approximations through Kronecker factorization, making pruning practical even for networks as large as LLMs.

Figure 2: Pruning as equality constrained optimization of quadratic approximation of the loss landscape (left), or equivalently, maximising the likelihood under a Laplace approximation (right).
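The Kronecker factorization at the heart of this approach can be illustrated for a single linear layer. The sketch below (a simplification, not the paper's implementation) estimates the two factors of the empirical Fisher from calibration data: the covariance of the layer's inputs and the covariance of the gradients at its outputs, so that the full Fisher is approximated as their Kronecker product without ever being materialized.

```python
import numpy as np

def kfac_fisher_factors(acts, grads):
    """Kronecker factors of the empirical Fisher for one linear layer.

    acts:  (N, d_in)  layer inputs over N calibration samples
    grads: (N, d_out) loss gradients w.r.t. the layer outputs

    The full Fisher (d_in*d_out x d_in*d_out) is approximated as
    F ~ A kron G, which is never formed explicitly -- only the two
    small factors are stored.
    """
    n = acts.shape[0]
    G = acts.T @ acts / n      # (d_in, d_in)  input covariance
    A = grads.T @ grads / n    # (d_out, d_out) gradient covariance
    return A, G
```

Because only the two small factors are kept, memory scales with d_in² + d_out² instead of (d_in · d_out)², which is what makes curvature estimation feasible at LLM scale.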

Methodology

The LLM Surgeon provides a general framework that supports unstructured, semi-structured, and structured pruning of LLMs. Through detailed exploration of the Kronecker-factored empirical Fisher information matrix, the method estimates the loss landscape's curvature efficiently. It derives OBS-like weight pruning costs and updates that capture correlations between weights, offering substantial improvements over previous unstructured approaches that ignore gradients.
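The OBS-style costs and updates mentioned above can be sketched for a single weight vector. This toy version uses a dense inverse Hessian for clarity (the paper works with Kronecker factors instead): the saliency of each weight is its squared value scaled by the corresponding diagonal entry of the inverse Hessian, and removing the cheapest weight triggers a compensating update of all remaining weights.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """Classic Optimal Brain Surgeon step, simplified to a dense H^-1.

    Returns the index of the cheapest weight to remove and the updated
    weight vector that compensates for its removal.
    """
    diag = np.diag(H_inv)
    costs = w**2 / (2.0 * diag)                   # saliency of each weight
    q = int(np.argmin(costs))                     # cheapest weight to prune
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # compensating update
    w_new = w + delta
    w_new[q] = 0.0                                # enforce exact removal
    return q, w_new
```

With an identity Hessian the update reduces to simple magnitude pruning; the off-diagonal terms of H⁻¹ are what let correlated weights absorb the error introduced by the removed one.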

Results

Empirical evaluations demonstrate the efficacy of LLM Surgeon in compressing OPT models and Llama v2 7B by 20%-30% while maintaining negligible performance loss. The method achieves state-of-the-art results in unstructured and semi-structured pruning of LLMs. Notably, LLM Surgeon is the first method reported to perform structured pruning of LLMs effectively, marking a significant advancement in the field.

Implementation and Performance

LLM Surgeon operates in multiple shots, interleaving weight updates and curvature re-estimation between pruning iterations. This keeps each pruning step within the local region where the quadratic Taylor approximation of the loss remains reliable. The method can additionally apply optional low-rank first-order updates through interleaved LoRA steps between shots, which improves performance in certain cases.

Figure 3: Sparsity levels obtained with structured pruning on OPT-125m by layer depth and type.
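The multi-shot procedure described above can be sketched as a simple loop. This is an illustrative outline only: the helper functions (`estimate_kfac_curvature`, `pruning_costs`, `prune_and_update`, `lora_finetune`) are hypothetical placeholders, not the paper's actual API, and the per-shot schedule shown is one plausible choice of equal multiplicative steps.

```python
def per_shot_fraction(target_sparsity, shots):
    """Fraction of remaining weights to prune at each shot so that
    `shots` equal multiplicative steps reach the overall target."""
    return 1.0 - (1.0 - target_sparsity) ** (1.0 / shots)

def multi_shot_prune(model, data, target_sparsity, shots=5, use_lora=False):
    """Multi-shot pruning loop: curvature is re-estimated between shots
    so each step stays within the local region where the quadratic
    approximation of the loss is trustworthy. Helper functions are
    hypothetical placeholders."""
    frac = per_shot_fraction(target_sparsity, shots)
    for _ in range(shots):
        factors = estimate_kfac_curvature(model, data)  # Kronecker factors per layer
        costs = pruning_costs(model, factors)           # OBS-style removal costs
        prune_and_update(model, factors, costs, frac)   # remove structures + compensate
        if use_lora:
            lora_finetune(model, data)                  # interleaved low-rank update
    return model
```

Pruning a little at every shot and re-linearizing is what distinguishes this from one-shot approaches: after five shots at roughly 6.9% each, the model reaches 30% overall sparsity.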

Conclusion

LLM Surgeon addresses the practical challenges of deploying large pretrained models by offering a robust framework for model compression without the need to retrain smaller models from scratch. It provides a significant improvement over existing pruning techniques, leveraging advanced Fisher information matrix approximations to maintain high model performance post-pruning. The framework's adaptability to different pruning structures underscores its versatility and efficacy in real-world applications, making it a valuable tool for practitioners working with LLMs.
