AA-SVD : Anchored and Adaptive SVD for Large Language Model Compression

Published 2 Apr 2026 in cs.LG | (2604.02119v1)

Abstract: We introduce a fast low-rank factorization-based framework for compressing LLMs that enables rapid compression of billion-parameter models without retraining. Unlike existing factorization-based approaches that optimize only on the original inputs, ignoring distribution shifts from upstream compression and thus propagating errors forward, or those that rely only on shifted inputs and risk drifting away from the original outputs, our approach accounts for both. Beyond individual layer compression, we further refine each transformer block end-to-end, minimizing block-level output distortion and allowing compressed layers to jointly compensate for accumulated errors. By anchoring each compressed layer to the original outputs while explicitly modeling input distribution shifts, our method finds a low-rank approximation that maintains functional equivalence with the original model. Experiments on LLMs show that our method consistently outperforms existing SVD-based baselines across compression ratios, with the advantage becoming increasingly pronounced at aggressive compression budgets, where competing methods degrade substantially or collapse entirely, offering a practical solution for efficient, large-scale model deployment.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper demonstrates that integrating anchored and adaptive SVD with block-level joint refinement significantly enhances LLM compression performance.
It introduces a novel two-stage approach that minimizes output drift from distributional shifts, ensuring lower MSE and robust downstream accuracy.
Empirical results on LLaMA and Qwen models show that AA-SVD outperforms traditional SVD methods by achieving superior perplexity and accuracy at high compression ratios.

AA-SVD: Anchored and Adaptive SVD for LLM Compression

Introduction and Motivation

The rapid proliferation of large-scale pretrained LLMs has exacerbated their computational and memory demands, making resource-efficient deployment an immediate research priority. While structured pruning, quantization, and low-rank factorization have each contributed to this goal, post-training SVD-based factorization methods are attractive for their deployment practicality and accelerator efficiency. However, classical SVD-based approaches typically ignore the compound distributional shifts induced by upstream compression, and either anchor exclusively on original or shifted activations. This leads to error accumulation and degraded downstream fidelity, especially at aggressive compression ratios.

Methodological Contributions

AA-SVD proposes a principled two-stage framework for block-structured transformer compression, synthesizing the benefits of activation-aware and shift-aware objectives, and introducing block-level joint refinement.

Anchored and Adaptive Compression Objective

Unlike input-agnostic (Eckart–Young–Mirsky, $\|\mathbf{W} - \mathbf{W}'\|_F^2$ ) or input-aware ( $\|\mathbf{WY} - \mathbf{W}'\mathbf{Y}\|_F^2$ ) SVD approaches, AA-SVD minimizes the distance between original outputs and the outputs of the compressed operator acting on post-compression-activation distributions:

$\min_{\mathrm{rank}(\mathbf{W}') = k} \|\mathbf{WY} - \mathbf{W}' \mathbf{Y}'\|_F^2$

where $\mathbf{Y}$ are reference activations and $\mathbf{Y}'$ are activations from the (partially) compressed upstream model. The solution is provided in closed form, only requiring the relevant cross-covariances:

$\mathbf{W}'^\star = \operatorname{SVD}_k( \mathbf{W} \mathbf{Y} \mathbf{Y}'^\top (\mathbf{Y}'\mathbf{Y}'^\top)^{-1} \mathbf{L} ) \mathbf{L}^{-1}$

where $\mathbf{L}$ is a Cholesky/EVD decomposition of $\mathbf{Y}' \mathbf{Y}'^\top$ .

This blends the advantages of preserving original-model fidelity while explicitly mitigating the drift induced by sequential compression.

Block-Level Joint Optimization

AA-SVD accounts for error interactions among linear submodules within a transformer block by introducing post-layerwise block-local refinement:

After initializing linear layers with the above SVD solution, all low-rank factors in a block (and local normalization/bias) are further optimized to minimize the MSE between original and compressed block outputs on contemporaneously compressed activations.
This approach allows within-block error compensation and substantially suppresses output drift as shown empirically.

Empirical Results

AA-SVD is benchmarked on LLaMA and Qwen model families (across 1B–13B parameters), evaluated on both standard language modeling datasets (WikiText2, PTB, C4) and a suite of zero-shot commonsense reasoning tasks.

Compression vs. performance trade-off: AA-SVD consistently outperforms SVD-LLM, Dip-SVD, ASVD, and SAES-SVD on perplexity and downstream accuracy at 0.8, 0.6, and 0.4 compression ratios.
Error Robustness: Under aggressive ratios (0.4–0.6), prior SVD methods exhibit catastrophic degradation or model collapse. AA-SVD maintains functional accuracy and perplexity margins.
Calibration budget sensitivity: Figure 1 demonstrates that perplexity saturates with 64–256 calibration samples, but zero-shot accuracy benefits from modestly larger calibration sets, suggesting that block-level structure is robust but downstream generalization is modestly sample-dependent.
Figure 1: AA-SVD outperforms baselines across a range of calibration set sizes on both perplexity (WikiText2, C4) and zero-shot accuracy, with diminishing returns beyond 256 samples.
Layerwise error analysis: AA-SVD suppresses cosine distance and MSE growth across model depth compared to SVD-LLM and input-agnostic baselines, especially for attention and MLP-down projections (see Figure 2).

Figure 2: Layerwise MSE for output projection (O-proj) shows joint block-level AA-SVD keeps per-layer distortion and drift lower, mitigating error compounding through depth.

Ablation: Without block-level refinement, input-agnostic or shift-aware SVD collapses at moderate ratios. With block-level joint optimization, AA-SVD's initialization becomes critical for final quality, affirming the benefit of the anchored + adaptive objective.
Comparison to structured pruning: AA-SVD with the Dobi-style remapping consistently delivers lower perplexity and higher accuracy at fixed parameter/memory budgets compared to SliceGPT, LLM-Pruner, and Bonsai, especially at high compression.

Practical and Theoretical Implications

The study identifies two critical insights for post-training LLM compression:

Compound error sensitivity: Layerwise SVD is insufficient; interacting compression errors must be jointly accounted for at block granularity.
Distribution-shift anchoring: Assigning the anchoring reference to original outputs while projecting compressed activations addresses both functional fidelity and the reality of test-time input distributions after compression.

These design choices result in models that:

Remain performant at aggressive compression ratios where all baselines fail.
Maintain efficient inference compatibility (FLOPs, KV-cache) and can be composed with quantization.
Are robust to modest calibration sizes, crucial for fast or data-limited deployment pipelines.

Future Directions

AA-SVD exposes several promising directions:

Adaptive rank/capacity allocation: Block-local optimization complicates naive per-layer sensitivity analysis. Developing globally optimal, refinement-aware capacity allocation strategies can further improve pareto efficiency.
Hybrid pipelines: SVD, quantization, and structured pruning target mostly orthogonal redundancy; joint pipelines with block-level objectives may unlock more aggressive yet robust model compression.
Theoretical bounds: Quantifying error compensation and drift for block-level refinement under various architectural regimes warrants further analysis.

Conclusion

AA-SVD delivers a robust, practical SVD-based compression protocol for large pretrained transformers that substantially outperforms state-of-the-art baselines at a range of compression ratios, especially in aggressive regimes. Its key innovation is block-level joint refinement initialized by an anchored and adaptive SVD solution, bridging the gap between distributional faithfulness and end-task utility. These results establish the block-wise, calibration-efficient, post-training paradigm as a scalable strategy for efficient, high-fidelity LLM deployment.

Markdown Report Issue