LinearlyCompressedGPT Compression Techniques

Updated 19 February 2026
  • LinearlyCompressedGPT is a family of methods that reduce computational and memory footprints in GPT-style models through linear modifications, factorization, or quantization without full retraining.
  • The approach employs techniques like blockwise quantization, Kronecker and tensor-train decompositions, and hierarchical grouping to achieve up to 10× reduction in model size with minimal loss in accuracy.
  • These methods enable efficient deployment and inference acceleration on hardware such as FPGAs, balancing trade-offs between compression and performance for real-world applications.

LinearlyCompressedGPT refers to a broad family of techniques for reducing the computational and memory footprint of GPT-style (decoder-only Transformer) LLMs through purely linear structural modifications, factorization, or quantization, usually without full retraining. Unlike distillation, pruning, or non-linear rewiring, these methods compress the large matrix multiplications and embeddings at the core of Transformer models using quantization, structured matrix approximations, hierarchical blockwise dimension reductions, or sparse and grouped computation. This enables deployment in resource-constrained environments while keeping the loss in performance acceptable. LinearlyCompressedGPT has been realized through a variety of frameworks; major lines of work include blockwise quantization, Kronecker and tensor-train (TT) decomposition, hierarchical dynamic grouping, and progressive depthwise linear projection.

1. Blockwise Quantization: The BCT Approach

Blockwise Compression of Transformers (BCT) implements LinearlyCompressedGPT through blockwise shift quantization on all linear matrices and bias vectors, without retraining (Dong et al., 2023). The method partitions each matrix into discrete B×B blocks and applies independent scale quantization within each. This sharply reduces quantization-induced distribution shift compared to per-layer schemes, eliminating retraining requirements.

Core Quantization Process:

Let $x\in\mathbb{R}^{B\times B}$ be a block, let $k$ denote the bit-width, and let $I_k=[-2^{k-1},2^{k-1}-1]$.

  • Compute the block’s shift:

$$\text{shift}(x) = \lfloor \log_2\,(\max_{i,j}|x_{i,j}|\,/\,2^{k-1}) \rfloor$$

  • Quantize:

$$x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)$$

  • Decompress:

$$Q^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}$$

At inference, GEMMs are computed in low-bit, exponent-aligned blocks.
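
As a minimal sketch of the three steps above for a single block, the following pure-Python functions apply the shift, quantize, and decompress formulas as written. The function names, the nested-list representation, and the handling of an all-zero block are assumptions for illustration, not BCT's implementation. Note that with the floor in the shift formula, a positive entry of largest magnitude scales to at least $2^{k-1}$ and is therefore clipped to $2^{k-1}-1$.

```python
import math

def block_shift(block, k):
    """shift(x) = floor(log2(max|x_ij| / 2^(k-1)))."""
    peak = max(abs(v) for row in block for v in row)
    if peak == 0.0:
        return 0  # assumption: an all-zero block needs no scaling
    return math.floor(math.log2(peak / 2 ** (k - 1)))

def quantize_block(block, k):
    """Return (integer codes, shift) for one block."""
    shift = block_shift(block, k)
    lo, hi = -(2 ** (k - 1)), 2 ** (k - 1) - 1
    codes = [[min(hi, max(lo, round(v * 2.0 ** -shift))) for v in row]
             for row in block]
    return codes, shift

def dequantize_block(codes, shift):
    """Q^{-1}(x_c; shift) = x_c * 2^shift."""
    return [[c * 2.0 ** shift for c in row] for row in codes]

# A hypothetical 2 x 2 block quantized to k = 8 bits.
block = [[0.50, -0.25], [0.125, 1.00]]
codes, shift = quantize_block(block, k=8)
restored = dequantize_block(codes, shift)
```

In this example the block maximum 1.0 gives a shift of -7; it is stored as the code 127 and decompresses to 0.9921875, while the other three entries round-trip exactly.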

Error Bound and Theoretical Guarantees:

For entries that are not clipped, elementwise quantization error is bounded by half a quantization bin; error per layer is $O(2^{-k})$. Residuals do not induce out-of-distribution behavior globally, and empirical boxplots demonstrate that per-block error does not propagate destructively.

Empirical Results:

  • BERT-base (as a stand-in for GPT): $4$-bit weights plus $8$-bit activations yield $7.988\times$ model size reduction; kk0 accuracy loss (e.g., kk1 GLUE SST-2). Pure kk2-bit quantization yields kk3 reduction with near-zero loss.
  • fp8 (8-bit float) BCT achieves kk4 size reduction with kk5 accuracy loss.

Block/Bit Parameterization and Trade-offs:

Block size kk6 is typical; kk7 for aggressive shrinkage, kk8 for zero-loss compression. A larger $B$ reduces metadata but coarsens the quantization; the smallest $k$ and largest $B$ that maintain acceptable perplexity are recommended.

2. Hierarchical Dynamic Grouping and Attention

Hierarchical models, exemplified by GPTHF, fundamentally restructure the Transformer, compressing token sequences into fixed-size sentence embeddings and operating subsequent transformations at the sentence level (Gu et al., 14 Mar 2025).

Architecture:

  • A word-level Transformer encoder processes tokens within a sentence using block-local self-attention.
  • Sentence-level representations are pooled, forming Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]2 PoolingIk=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]3(wlt_encoder(...)) for each sentence.
  • A second Transformer “body” operates causally over sentence embeddings.

Dynamic Sparse Attention:

  • During encoding, attention is masked to intra-sentence blocks.
  • At the sentence level, embeddings attend to all prior sentences.
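
The two masking rules above can be illustrated with plain boolean masks. This is a sketch under stated assumptions: attention inside a sentence is taken to be bidirectional, a sentence embedding is allowed to attend to itself as well as to earlier sentences, and the function names are hypothetical. GPTHF's actual masking may differ in these details.

```python
def word_level_mask(sentence_ids):
    """mask[i][j] is True when token i may attend to token j.

    Tokens only see tokens of the same sentence (block-local attention).
    """
    n = len(sentence_ids)
    return [[sentence_ids[i] == sentence_ids[j] for j in range(n)]
            for i in range(n)]

def sentence_level_mask(num_sentences):
    """Causal mask over sentence embeddings: sentence i sees sentences 0..i."""
    return [[j <= i for j in range(num_sentences)]
            for i in range(num_sentences)]

# Five tokens: the first three belong to sentence 0, the last two to sentence 1.
token_mask = word_level_mask([0, 0, 0, 1, 1])
sent_mask = sentence_level_mask(3)
```

The word-level mask is block-diagonal, so its cost grows with sentence length rather than with the full sequence length, while the sentence-level mask is the usual lower-triangular causal mask applied to a much shorter sequence.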

Inference and Caching Optimization:

By caching finished sentence embeddings, GPTHF reuses computation, yielding per-token complexity that scales as Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]4 (where Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]5 is sentences), rather than Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]6 tokenwise.

Empirical Trade-offs:

GPTHF achieves up to Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]7 reduction in FLOPs and Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]8 speedup on certain tasks, at the expense of Ik=[2k1,2k11]I_k=[-2^{k-1},2^{k-1}-1]9-point perplexity penalty. Sentence-splitting is critical; generation quality can be affected by sentence-boundary prediction (Gu et al., 14 Mar 2025).

3. Structured Linear Factorizations: Kronecker, TT, and Orthogonal Transformations

Several schemes target the replacement of dense matrices with mathematically structured, low-parametric forms:

3.1 Kronecker Products

Krony-PT and KnGPT2 both compress transformer and embedding matrices via Kronecker factorizations (Ayad et al., 2024, Edalati et al., 2021). For a weight $W\in\mathbb{R}^{m\times n}$, choose factor shapes with $m=m_1m_2$ and $n=n_1n_2$, then approximate $W\approx A\otimes B$ with $A\in\mathbb{R}^{m_1\times n_1}$ and $B\in\mathbb{R}^{m_2\times n_2}$, with storage dropping to $m_1n_1+m_2n_2$ from $mn$.

  • Krony-PT: Either single or multi-factor, leveraging Van Loan SVD initialization or pruning-based methods. Compression of the GPT-2 FFN from shift(x)=log2(maxi,jxi,j/2k1)\text{shift}(x) = \lfloor \log_2\,(\max_{i,j}|x_{i,j}|\,/\,2^{k-1}) \rfloor5 by factors shift(x)=log2(maxi,jxi,j/2k1)\text{shift}(x) = \lfloor \log_2\,(\max_{i,j}|x_{i,j}|\,/\,2^{k-1}) \rfloor6 and shift(x)=log2(maxi,jxi,j/2k1)\text{shift}(x) = \lfloor \log_2\,(\max_{i,j}|x_{i,j}|\,/\,2^{k-1}) \rfloor7 gives effective models of shift(x)=log2(maxi,jxi,j/2k1)\text{shift}(x) = \lfloor \log_2\,(\max_{i,j}|x_{i,j}|\,/\,2^{k-1}) \rfloor8M versus original shift(x)=log2(maxi,jxi,j/2k1)\text{shift}(x) = \lfloor \log_2\,(\max_{i,j}|x_{i,j}|\,/\,2^{k-1}) \rfloor9M—perplexity outperforms distillation baselines (Ayad et al., 2024).
  • KnGPT2: Compresses half of all linear layers and embedding via rank-1 factorizations, recovers performance with only minimal pretraining (intermediate-layer KD) (Edalati et al., 2021).
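
To make the SVD-based (Van Loan) initialization concrete, here is a minimal NumPy sketch of the nearest single-factor Kronecker product: the weight is rearranged so that each row holds one flattened block, and the best rank-1 approximation of that rearranged matrix yields the two factors. The function name, the factor shapes, and the test matrix are assumptions for illustration; this is not the Krony-PT or KnGPT2 code, and neither multi-factor variants nor the subsequent training are covered.

```python
import numpy as np

def nearest_kronecker(W, shape_a, shape_b):
    """Return A, B minimizing ||W - kron(A, B)||_F for the given shapes."""
    m1, n1 = shape_a
    m2, n2 = shape_b
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that row (i1, j1) holds the flattened (m2 x n2) block.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    # The best rank-1 approximation of R gives vec(A) and vec(B).
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

rng = np.random.default_rng(0)
A0, B0 = rng.standard_normal((4, 3)), rng.standard_normal((2, 5))
W = np.kron(A0, B0)  # an exactly Kronecker-structured 8 x 15 weight
A, B = nearest_kronecker(W, (4, 3), (2, 5))
```

Here the two factors hold 12 + 10 = 22 numbers in place of the 120 entries of `W`. A real pretrained weight is not exactly Kronecker-structured, so the approximation error is nonzero and depends on the chosen factor shapes.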

3.2 Tensor-Train (TT) Decomposition

TTD reshapes large matrices into high-order tensors and factors them into sequential “cores” (Huang et al., 31 Jan 2025, Xu et al., 2023). Given xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)0, tensorize xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)1 into xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)2-tuples, then represent xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)3 as multiplication through a chain of xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)4 TT-cores. Compression ratio is:

xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)5

  • TTD achieves layer-level compression up to xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)6 and whole-network xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)7–xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)8, with minimal loss: e.g., xc=clip(round(x2shift),2k1,2k11)x_c = \text{clip}\left(\text{round}(x \cdot 2^{-\text{shift}}), -2^{k-1},2^{k-1}-1\right)9 PPL, Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}0 C-EVal for ChatGLM3-6B, LLaMA2-7B (Huang et al., 31 Jan 2025).

TT is particularly effective for the embedding layer (e.g., experimentally, Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}1–Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}2 compression with negligible loss) (Xu et al., 2023).
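
The idea of factoring a tensorized weight into a chain of cores can be sketched with the standard TT-SVD procedure, which peels off one core at a time using truncated SVDs. This is a generic illustration under stated assumptions: the function names, the `max_rank` parameter, and the plain `reshape` tensorization are hypothetical choices, and the cited works may order indices and select ranks differently.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Factor a d-way tensor into TT-cores of shape (r_prev, n_k, r_next)."""
    dims = tensor.shape
    cores, r_prev = [], 1
    rest = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(rest, full_matrices=False)
        r = min(max_rank, len(s))  # truncate to the requested TT-rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        # Carry the remaining factor forward and expose the next mode.
        rest = (s[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(rest.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the chain of cores back into the full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6))   # stand-in for a tensorized weight
exact = tt_svd(T, max_rank=30)       # ranks large enough to be lossless
small = tt_svd(T, max_rank=2)        # truncated, lossy factorization
```

With `max_rank=2` the three cores hold 8 + 20 + 12 = 40 numbers instead of 120. A random tensor like this one compresses poorly in terms of error; the example only shows the mechanics and the parameter count.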

3.3 Orthogonal Transforms and Structured Projections

ProcrustesGPT rotates weights via orthogonal transformations to maximize compressibility under structured families such as Kronecker-sum or permutation-sparse matrices (Grishina et al., 3 Jun 2025). Layerwise alternating minimization is performed:

  • Step A: Project weights into the chosen structured class, given the current orthogonal transformation.
  • Step B: Solve a weighted Orthogonal Procrustes Problem to update the orthogonal transformation.
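
For orientation, the classical unweighted Orthogonal Procrustes Problem behind Step B has a closed-form solution via the SVD. The sketch below solves $\min_Q \|AQ - B\|_F$ subject to $Q^\top Q = I$; the symbols $A$, $B$, $Q$ and the function name are assumptions for illustration, and the weighted variant and the projection in Step A used by ProcrustesGPT are not implemented here.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Return the orthogonal Q minimizing ||A @ Q - B||_F."""
    # With A^T B = U S V^T, the minimizer is Q = U V^T.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a random orthogonal matrix
A = rng.standard_normal((10, 4))
B = A @ Q_true
Q = orthogonal_procrustes(A, B)
```

Because `B` is an exact rotation of `A` in this toy setup, the recovered `Q` matches `Q_true`. In an alternating scheme, a solve of this kind would be repeated after each projection step.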

Without fine-tuning, Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}6–Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}7 weight compression is attainable with consistently lower perplexity than other fine-tuning-free baselines (Grishina et al., 3 Jun 2025).

4. Architectural Linear Projection Variants

A distinct technique modifies the GPT stack by inserting linear dimensionality reductions between groups of layers. In the LinearGPT architecture (lc-gpt), after every two blocks, the hidden dimension is linearly halved, with intermediate linear layers Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}8 (Suresh et al., 2024).

Structural Recursion:

Q1(xc;shift)=xc2shiftQ^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}9

This reduces total parameter count by O(2k)O(2^{-k})0 and speeds up training by O(2k)O(2^{-k})1 with no measurable loss in task performance on code-completion objectives.
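
The halving rule described in this section (hidden dimension halved after every two blocks) implies a simple per-layer width schedule, sketched below. The function name, the `group` parameter, and the example sizes are hypothetical; the sketch only computes widths and says nothing about the resulting parameter counts or accuracy.

```python
def hidden_dims(d_model, n_layers, group=2):
    """Hidden width seen by each block when the width halves every `group` blocks."""
    dims, d = [], d_model
    for layer in range(n_layers):
        if layer > 0 and layer % group == 0:
            d //= 2  # a linear projection from 2d down to d would sit here
        dims.append(d)
    return dims

# Hypothetical 6-block stack starting at width 768.
schedule = hidden_dims(768, 6)
```

For this example the schedule is `[768, 768, 384, 384, 192, 192]`, so later blocks operate on progressively narrower activations.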

5. Vocabulary and Output Layer Compression

The vocabulary projection can dominate the memory and compute cost of the output head. A two-level grouping approach partitions the vocabulary using BPE merges, then applies shared per-group linear transformations with per-group scale and shift (Vennam et al., 2024).

  • For O(2k)O(2^{-k})2-way softmax, introduce O(2k)O(2^{-k})3 groups and O(2k)O(2^{-k})4 tokens per group. The softmax is decomposed as:

O(2k)O(2^{-k})5

  • Reduces activation memory up to O(2k)O(2^{-k})6 and speeds up throughput by up to O(2k)O(2^{-k})7, with negligible drop in human-rated TinyStories metrics.
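
One common way to factor a softmax over groups, shown here as a generic sketch rather than the cited method, is $P(\text{token}) = P(\text{group}) \cdot P(\text{token} \mid \text{group})$. The function names and the logit layout are assumptions, and the shared per-group linear transformations with scale and shift are not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def two_level_probs(group_logits, within_logits):
    """P(token) = P(group) * P(token | group), flattened in group order."""
    p_group = softmax(group_logits)
    return [pg * pt
            for pg, row in zip(p_group, within_logits)
            for pt in softmax(row)]

# Hypothetical vocabulary of 6 tokens split into 2 groups of 3.
probs = two_level_probs([1.0, -1.0], [[0.2, 0.1, 0.0], [2.0, 0.0, -2.0]])
uniform = two_level_probs([0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
```

The factored probabilities still sum to one over the whole vocabulary, while each normalization runs over a group-sized list instead of the full vocabulary.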

6. Hardware Mapping and Inference Acceleration

TTD-compressed models mapped to hardware such as FPGA via Group Vector Systolic Array (GVSA) architectures deliver further acceleration (Huang et al., 31 Jan 2025). Execution of TT-sharded matrix multiplies is serviced by parallel vector PEs, with pipelined partial sum reordering. ChatGLM3-6B and LLaMA2-7B deployed in this format achieved O(2k)O(2^{-k})8-O(2k)O(2^{-k})9 first-token delay reductions and throughput exceeding optimized GPU baselines.

7. Trade-Offs, Limitations, and Selection Guidelines

  • Compression vs. Accuracy: Aggressive quantization ($4$0) or deep low-rank factorization provides compression up to $4$1, typically at $4$2–$4$3 loss in perplexity or task accuracy, depending on the scheme (Dong et al., 2023, Gu et al., 14 Mar 2025, Ayad et al., 2024).
  • Block/Rank Choices: Empirical sweep of block size, bit-width, Kronecker rank, or TT-rank is essential—default recommendations include block size $4$4, $4$5 or $4$6, Kronecker rank $4$7–$4$8, TT ranks chosen to keep per-layer error within $4$9 PPL.
  • No Retraining: Methods such as BCT and ProcrustesGPT can be applied directly to a pretrained model with calibration data only, avoiding expensive retraining loops (Dong et al., 2023, Grishina et al., 3 Jun 2025).
  • Applicability: Structure-based methods (Kronecker, TT) are amenable both to encoder and decoder (GPT) architectures and can be combined with other compression regimes (pruning, quantization).
  • Limitations: Sentence-compression and hierarchical methods can induce sentence boundary and generation quality artifacts, especially on small models without auxiliary modeling (Gu et al., 14 Mar 2025).

LinearlyCompressedGPT frameworks thus offer a flexible design space, ranging from quantization to low-rank tensorization, for shrinking GPT-family models while retaining their essential generative capacity.