
The LLM Surgeon

Published 28 Dec 2023 in cs.LG and cs.CL (arXiv:2312.17244v2)

Abstract: State-of-the-art LLMs are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to LLMs. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of LLMs.


Summary

  • The paper introduces the LLM Surgeon framework, employing Kronecker-factored curvature approximations for efficient pruning of large language models.
  • It leverages block-diagonal Fisher information matrices to overcome Hessian computation challenges, enabling both structured and unstructured pruning.
  • Empirical results demonstrate a 20%-30% compression in models like OPT and Llama v2 with negligible performance loss.

Introduction

The paper examines the challenges of deploying LLMs within computational and environmental constraints as transformer architectures continue to grow. Instead of training smaller models from scratch, the authors propose data-driven compression of existing pretrained models. The core contribution is the LLM Surgeon framework, which uses Kronecker-factored curvature approximations to prune large models efficiently, achieving notable compression rates with minimal impact on performance.

Figure 1: LLM Surgeon allows interpolation of model size between existing pretrained models.

Pruning as Constrained Optimization

The paper builds on earlier work on network pruning, notably Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS), which treat pruning as a constrained optimization problem. The major limitation of these past approaches is the impracticality of computing the Hessian, whose size scales quadratically with the number of parameters. The LLM Surgeon addresses this by employing block-diagonal approximations through Kronecker factorization, making pruning practical even for networks as large as LLMs.

Figure 2: Pruning as equality constrained optimization of quadratic approximation of the loss landscape (left), or equivalently, maximising the likelihood under a Laplace approximation (right).
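The Kronecker factorization at the heart of this approach can be illustrated for a single linear layer. The sketch below (a simplification, not the paper's implementation) estimates the two factors of the empirical Fisher from calibration data: the covariance of the layer's inputs and the covariance of the gradients at its outputs, so that the full Fisher is approximated as their Kronecker product without ever being materialized.

```python
import numpy as np

def kfac_fisher_factors(acts, grads):
    """Kronecker factors of the empirical Fisher for one linear layer.

    acts:  (N, d_in)  layer inputs over N calibration samples
    grads: (N, d_out) loss gradients w.r.t. the layer outputs

    The full Fisher (d_in*d_out x d_in*d_out) is approximated as
    F ~ A kron G, which is never formed explicitly -- only the two
    small factors are stored.
    """
    n = acts.shape[0]
    G = acts.T @ acts / n      # (d_in, d_in)  input covariance
    A = grads.T @ grads / n    # (d_out, d_out) gradient covariance
    return A, G
```

Because only the two small factors are kept, memory scales with d_in² + d_out² instead of (d_in · d_out)², which is what makes curvature estimation feasible at LLM scale.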

Methodology

The LLM Surgeon provides a general framework that supports unstructured, semi-structured, and structured pruning of LLMs. Through detailed exploration of the Kronecker-factored empirical Fisher information matrix, the method estimates the loss landscape's curvature efficiently. It derives OBS-like weight pruning costs and updates that capture correlations between weights, offering substantial improvements over previous unstructured approaches that ignore gradients.
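The OBS-style costs and updates mentioned above can be sketched for a single weight vector. This toy version uses a dense inverse Hessian for clarity (the paper works with Kronecker factors instead): the saliency of each weight is its squared value scaled by the corresponding diagonal entry of the inverse Hessian, and removing the cheapest weight triggers a compensating update of all remaining weights.

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """Classic Optimal Brain Surgeon step, simplified to a dense H^-1.

    Returns the index of the cheapest weight to remove and the updated
    weight vector that compensates for its removal.
    """
    diag = np.diag(H_inv)
    costs = w**2 / (2.0 * diag)                   # saliency of each weight
    q = int(np.argmin(costs))                     # cheapest weight to prune
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]   # compensating update
    w_new = w + delta
    w_new[q] = 0.0                                # enforce exact removal
    return q, w_new
```

With an identity Hessian the update reduces to simple magnitude pruning; the off-diagonal terms of H⁻¹ are what let correlated weights absorb the error introduced by the removed one.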

Results

Empirical evaluations demonstrate the efficacy of LLM Surgeon in compressing OPT models and Llama v2 7B by 20%-30% while maintaining negligible performance loss. The method achieves state-of-the-art results in unstructured and semi-structured pruning of LLMs. Notably, LLM Surgeon is the first method reported to perform structured pruning of LLMs effectively, marking a significant advancement in the field.

Implementation and Performance

LLM Surgeon operates in multiple shots, interleaving weight updates and curvature re-estimation between pruning iterations. This keeps each pruning step within the local region where the quadratic Taylor approximation of the loss remains reliable. The method can additionally apply optional low-rank first-order updates through interleaved LoRA steps between shots, which improves performance in certain cases.

Figure 3: Sparsity levels obtained with structured pruning on OPT-125m by layer depth and type.
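The multi-shot procedure described above can be sketched as a simple loop. This is an illustrative outline only: the helper functions (`estimate_kfac_curvature`, `pruning_costs`, `prune_and_update`, `lora_finetune`) are hypothetical placeholders, not the paper's actual API, and the per-shot schedule shown is one plausible choice of equal multiplicative steps.

```python
def per_shot_fraction(target_sparsity, shots):
    """Fraction of remaining weights to prune at each shot so that
    `shots` equal multiplicative steps reach the overall target."""
    return 1.0 - (1.0 - target_sparsity) ** (1.0 / shots)

def multi_shot_prune(model, data, target_sparsity, shots=5, use_lora=False):
    """Multi-shot pruning loop: curvature is re-estimated between shots
    so each step stays within the local region where the quadratic
    approximation of the loss is trustworthy. Helper functions are
    hypothetical placeholders."""
    frac = per_shot_fraction(target_sparsity, shots)
    for _ in range(shots):
        factors = estimate_kfac_curvature(model, data)  # Kronecker factors per layer
        costs = pruning_costs(model, factors)           # OBS-style removal costs
        prune_and_update(model, factors, costs, frac)   # remove structures + compensate
        if use_lora:
            lora_finetune(model, data)                  # interleaved low-rank update
    return model
```

Pruning a little at every shot and re-linearizing is what distinguishes this from one-shot approaches: after five shots at roughly 6.9% each, the model reaches 30% overall sparsity.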

Conclusion

LLM Surgeon addresses the practical challenges of deploying large pretrained models by offering a robust framework for model compression without the need to retrain smaller models from scratch. It provides a significant improvement over existing pruning techniques, leveraging advanced Fisher information matrix approximations to maintain high model performance post-pruning. The framework's adaptability to different pruning structures underscores its versatility and efficacy in real-world applications, making it a valuable tool for practitioners working with LLMs.
