
Compression-Aware Scaling Law

Updated 2 January 2026
  • The compression-aware scaling law is a mathematical relation that extends classical scaling laws by explicitly incorporating compression parameters like quantization, sparsity, and encoding efficiency.
  • It quantifies how varying compression levels directly affect system performance, with methodologies validated by empirical fits (e.g., near-unity R² in multimodal models).
  • The law enables optimal resource allocation by balancing trade-offs between compression, model capacity, and speed across diverse systems from machine learning to physical applications.

A compression-aware scaling law is a mathematical relationship that explicitly incorporates the effects of data, model, or signal compression into a scaling law framework, quantifying how compression impacts resource–performance trade-offs in physical, information-theoretic, or machine-learning systems. Such laws generalize classical scaling theories by adding explicit dependency on compression parameters (e.g., quantization levels, sparsity, data encoding efficiency), enabling precise prediction and optimization of system behavior under resource constraints.

1. Theoretical Foundations and Mathematical Formulation

Compression-aware scaling laws emerge from the intersection of information theory, physical modeling, and statistical learning theory, extending baseline scaling laws to account for the effects of data or model compression. The key abstraction is the explicit inclusion of a compression parameter—such as a data compressibility metric, model sparsity, quantization granularity, or storage/bitrate—in the scaling function that relates system size, resources, or data volume to accuracy or loss.

A canonical example from multimodal foundation models is

$$\boxed{\;\mathrm{Perf}_{\mathrm{multi}}(P,\{T_i,C_i\}) \approx \alpha\Bigl[\sum_{i}\log\Bigl(\frac{T_i}{C_i}\Bigr)+\log P\Bigr]+\epsilon\;}$$

where $T_i$ is the raw data size, $C_i$ is the per-token compression cost for modality $i$, and $P$ is model size. This extends single-modality laws (e.g., bits-per-character vs. $\log N + \log P$) and demonstrates that compression efficiency directly modulates the effective data mass (Sun et al., 2024).
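As a concrete sketch, the multimodal law can be evaluated directly; the slope and offset below (`alpha`, `eps`) stand in for fitted constants and are not values from Sun et al.:

```python
import math

def perf_multi(P, modalities, alpha=1.0, eps=0.0):
    """Predicted performance under the multimodal compression-aware law.

    P          -- model parameter count
    modalities -- list of (T_i, C_i) pairs: raw data size and per-token
                  compression cost for each modality i
    alpha, eps -- fitted slope and offset (placeholders here)
    """
    effective_data = sum(math.log(T / C) for T, C in modalities)
    return alpha * (effective_data + math.log(P)) + eps

# Halving a modality's compression cost C_i adds log(2) to the effective
# data mass, the same gain as doubling that modality's raw data T_i.
base = perf_multi(1e9, [(1e10, 4.0), (1e9, 16.0)])
better_codec = perf_multi(1e9, [(1e10, 2.0), (1e9, 16.0)])
```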

In LLMs under weight sparsity and quantization, the loss scaling law takes the form

$$L(N,D,C) = \frac{a}{\bigl(N\,\mathrm{eff}(C)\bigr)^{b}} + \frac{c}{D^{d}} + e$$

where $N$ is the parameter count, $D$ the sample count, and $\mathrm{eff}(C)$ the product of compression multipliers for sparsity, weight quantization, and activation quantization, reducing the effective parameter count (Frantar et al., 23 Feb 2025, Panferov et al., 2 Jun 2025).
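A minimal sketch of this effective-parameter form, with placeholder fit constants (`a` through `e`) and illustrative compression multipliers:

```python
def loss(N, D, multipliers, a=400.0, b=0.3, c=1e5, d=0.3, e=1.7):
    """L(N, D, C) = a / N_eff^b + c / D^d + e, where N_eff is N times
    the product of compression multipliers (sparsity s, weight-quant
    q_w, activation-quant q_a).  Constants a..e are placeholder values,
    not fits from the cited papers."""
    n_eff = N
    for m in multipliers:
        n_eff *= m
    return a / n_eff ** b + c / D ** d + e

# A 50%-sparse model whose weight quantization costs q_w = 0.8 behaves
# like a dense model with 0.4x the parameters:
sparse_loss = loss(1e9, 1e11, [0.5, 0.8])
dense_equiv = loss(4e8, 1e11, [])
```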

In the context of lossy data compression—such as for image storage or physical measurement—the test error on a supervised task may satisfy

$$E(N,L) \approx E^{*} + A\,N^{-\alpha} + B\,L^{-\beta}$$

with $L$ the number of bits per sample and $\beta$ an exponent describing how compression quality influences task error, enabling optimization under storage constraints (Mentzer et al., 2024).

2. Modalities of Compression and Law Construction

The precise instantiation of a compression-aware scaling law depends on the mode of compression:

  • Data Compression: Quantifies information content remaining after encoding, often measured via explicit compressibility metrics—such as gzip bits-per-token for text—which enter scaling law parameters as predictors for irreducible error and exponent shifts (Pandey, 2024).
  • Model Compression: Incorporates parameter pruning or quantization by treating retained capacity as a multiplier on model size. For pruning, the effective model size is $N_{\text{eff}} = sN$ (density $s$); for quantization, $N_{\text{eff}} = q_w(b_w)\,N$, with $q_w$ the parameter efficiency factor at bit-width $b_w$ (Frantar et al., 23 Feb 2025, Rosenfeld, 2021).
  • Physical Compression: In soft contact mechanics or thin-shell elasticity, compression ratio or deformation alters the scaling law for energy, force, relaxation time, or buckling, often via a power-law or correction function in the normalized compression parameter (Mu et al., 23 Sep 2025, Tobasco, 2016, Bøhling et al., 2011).
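The data-compression entry above can be made concrete with a simple compressibility probe; this uses zlib (DEFLATE, the algorithm behind gzip) as a stand-in for the exact metric in Pandey (2024):

```python
import zlib

def bits_per_byte(text: str) -> float:
    """Compressibility proxy: DEFLATE-compressed bits per input byte.
    Lower values indicate more redundant (more compressible) data."""
    raw = text.encode("utf-8")
    return 8 * len(zlib.compress(raw, level=9)) / len(raw)

# Highly repetitive text compresses to well under 1 bit per byte,
# so it contributes little effective data mass per raw byte.
repetitive = bits_per_byte("abc" * 1000)
```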

A summary of compression modes and corresponding law forms:

| Compression Mode | Control Parameter(s) | Law Structure |
|---|---|---|
| Data (tokenization) | $C_i$, compressibility | $\mathrm{Perf} \sim \log(T_i/C_i) + \log P$ |
| Model sparsity/quant. | $s,\,q_w,\,q_a$ | $L \sim (N\,s\,q_w\,q_a)^{-b} + c\,D^{-d} + e$ |
| Image bitrate | $L$ (bits/sample) | $E(N,L) \sim N^{-\alpha} + L^{-\beta}$ |
| Physical (soft body) | $\delta/L$ (strain) | $F(\delta) \sim \delta^{n}\,[1 - k\,\delta/L]^{-(n+2)/n}$ |
| Compression ratio $\rho$ | $r$ (fraction removed) | $L(r) = L_0^{\alpha}(1+r)^{\beta}$, $P(r) = P_0^{\alpha}(1+r)^{\beta}$ |

3. Empirical Validation and Algorithmic Implications

Compression-aware scaling laws are supported by extensive empirical evidence across modalities and domains:

  • In multimodal LMs, performance plotted as a function of $\sum_i \log(T_i/C_i) + \log P$ collapses diverse modality mixes onto a single linear regime, spanning four orders of magnitude, with a near-unity linear fit ($R^2 \approx 0.98$) (Sun et al., 2024).
  • In deep learning under sparsity and quantization, effective parameter count models (using empirically measured multipliers for each compression type) recover scaling exponents and loss curves matching the dense case, and compositionally combine across hybrid compression schemes (Frantar et al., 23 Feb 2025, Panferov et al., 2 Jun 2025).
  • For image-based learning under bit-rate constraints, dual-exponent scaling in both number of images and bits per image accurately predicts error surfaces, and optimizing (N,L) given N·L = S gives measurable error reductions beyond naive allocation (Mentzer et al., 2024).
  • In LLMs, the post-training quantization loss penalty is accurately predicted by a second-order Taylor expansion, $\Delta L \approx \frac{1}{2}\operatorname{Tr}(H)\,\|W\|^2\,10^{-\mathrm{SQNR}/10}$, with capacity reductions from quantization and empirical fits generalizing across model families, bit-widths, and quantization algorithms (Xu et al., 2024).
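The Taylor-expansion penalty in the last bullet is straightforward to turn into a predictor; the Hessian trace and weight norm below are illustrative placeholders rather than measured model statistics:

```python
def quant_loss_penalty(trace_H, weight_norm_sq, sqnr_db):
    """Second-order estimate of the post-training-quantization loss
    penalty: dL ~ 0.5 * Tr(H) * ||W||^2 * 10^(-SQNR/10)."""
    return 0.5 * trace_H * weight_norm_sq * 10 ** (-sqnr_db / 10)

# Each additional 3 dB of SQNR roughly halves the predicted penalty:
at_30db = quant_loss_penalty(1e-3, 1e4, 30.0)
at_33db = quant_loss_penalty(1e-3, 1e4, 33.0)
```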

In compressive sensing, analytical scaling laws predict the stability penalty as one backs off from the phase transition: in $\ell_1$ minimization recovery, for example, the error constant increases as $1/\sqrt{1-\omega}$ for fractional backoff $\omega$ from the sparsity threshold (Xu et al., 2010).
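Numerically, the backoff penalty behaves as follows (with the omega-independent prefactor normalized to 1 for illustration):

```python
import math

def stability_penalty(omega):
    """l1-recovery error-constant scaling, 1 / sqrt(1 - omega), as one
    backs off a fraction omega from the sparsity phase transition
    (omega-independent prefactor normalized to 1)."""
    return 1.0 / math.sqrt(1.0 - omega)

# Backing off 75% of the way to the threshold doubles the constant:
doubled = stability_penalty(0.75)  # = 2.0
```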

Algorithmically, compression-aware laws enable principled selection of compression levels, model size, data composition, and storage/compute allocations to achieve target performance under resource constraints.

4. Trade-offs: Compression Parameter Effects

Compression-aware scaling laws enable transparent analysis of trade-offs:

  • Bit Allocation: Every bit decrease in tokenization cost, quantization width, or storage per sample conveys a quantitatively predictable gain, equivalent (often in log-space) to increased data, model size, or resource expenditure (Sun et al., 2024, Mentzer et al., 2024).
  • Modality Balance: In mixed-modality systems, highly compressible modalities such as text can compensate for less efficient modalities (e.g., video) under compute-limited budgets. Investing in more efficient codecs or learned tokenizers directly shifts the performance frontier (Sun et al., 2024).
  • Compression vs. Speed: In model pruning/quantization, the loss increase versus speedup is often linear or sublinear in the compression ratio in the moderate regime, with diminishing returns and sharp penalties below a certain precision threshold (Sengupta et al., 6 Apr 2025, Xu et al., 2024).

Optimization under a fixed storage or compute constraint can be formulated explicitly from the scaling law to yield closed-form or numerically tight optima for bit allocation, data count, or parameter count (Mentzer et al., 2024).
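For the image-bitrate law, the constrained optimum has a closed form obtained by substituting $L = S/N$ and zeroing the derivative in $N$; the constants below are illustrative, not fitted values:

```python
def optimal_allocation(S, A, alpha, B, beta):
    """Minimize A*N**-alpha + B*L**-beta subject to N * L = S, where S
    is the total storage budget, N the sample count, and L bits/sample.
    Substituting L = S/N and zeroing the derivative in N gives:
        N* = (alpha * A * S**beta / (beta * B)) ** (1 / (alpha + beta))
    """
    N = (alpha * A * S ** beta / (beta * B)) ** (1 / (alpha + beta))
    return N, S / N

def reducible_error(N, L, A, alpha, B, beta):
    return A * N ** -alpha + B * L ** -beta

# Any other split of the same budget S does at least as badly:
N, L = optimal_allocation(1e8, 5.0, 0.4, 2.0, 0.6)
best = reducible_error(N, L, 5.0, 0.4, 2.0, 0.6)
```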

5. Unified Capacity and Hybrid Compression

A recent advance is the unified capacity approach: for any compressed representation $R$, empirical scaling laws hold when model size is multiplied by a capacity factor $\rho(R)$, determined by the mean squared error of representing Gaussian random vectors (GMSE). Compositionally, multiple compressions (e.g., sparsity + quantization) simply multiply their capacities, unifying the scaling law across all formats with a single parametrization (Panferov et al., 2 Jun 2025).

$$\operatorname{Loss}(N,D;R) = A\,[N\rho(R)]^{-\alpha} + B\,D^{-\beta} + E$$

$$\rho(R) = L\,\bigl[\tanh\bigl(F \log_{1/4} \mathrm{GMSE}(R)\bigr)\bigr]^{C}$$

This universality enables direct comparison and optimization of compression strategies prior to training.
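A sketch of the capacity factor and its compositional use; the constants $L$, $F$, $C$ are fitted in the paper, so the values here are placeholders:

```python
import math

def capacity(gmse, L=1.0, F=0.3, C=1.0):
    """rho(R) = L * tanh(F * log_{1/4} GMSE(R)) ** C, with L, F, C
    standing in for fitted constants.  Smaller GMSE (a more faithful
    representation) drives rho toward its ceiling L."""
    log_quarter = math.log(gmse) / math.log(0.25)
    return L * math.tanh(F * log_quarter) ** C

# Capacities of composed compressions (e.g. sparsity + quantization)
# multiply, so hybrid formats can be ranked before any training run:
rho_hybrid = capacity(0.05) * capacity(0.01)
```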

6. Limitations, Deviations, and Domain-Specific Regimes

Not all domains admit a universal compression-aware scaling law. In time series forecasting, empirical evidence shows that scaling error with parameter count flattens rapidly, with model architectural innovations such as horizon-adaptive decomposition dominating over parameter count or raw compression (Li et al., 15 May 2025).

In structural mechanics, elasticity models for thin shells or soft contacts establish compression-aware scaling via explicit bounding of energy, force, or relaxation as a function of thickness, strain, and geometric confinement. Depending on parameter regime, transitions (“wrinkling regimes,” buckling thresholds) can result in different minimizers and nontrivial crossovers (Mu et al., 23 Sep 2025, Tobasco, 2016).

7. Practical Methodologies and Recommendations

Derivation and application of compression-aware scaling laws require:

  • Empirical measurement or estimation of compression parameters (e.g., per-modality tokenization efficiency, model effective capacity, storage bits/sample).
  • Fitting dual-parameter or multi-parameter power-law or log-linear models (joint in model size and compression parameter), with careful validation of linear regime and plateaus.
  • Algorithmic support for compositional compression, e.g., RMSE-based masking schemes to optimize effective capacity under sparsity and quantization (Panferov et al., 2 Jun 2025).
  • Caution for law breakdown: verify empirical fits in extrapolated regimes; double descent, overparameterization, or architectural phase transitions can invalidate naive power laws.
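For log-linear laws, the fitting step in the second bullet reduces to ordinary least squares on the combined predictor; a dependency-free sketch:

```python
def fit_log_linear(xs, ys):
    """Ordinary least squares for y = alpha * x + eps, where x is the
    law's combined predictor (e.g. sum_i log(T_i/C_i) + log P) and y
    the observed performance.  Returns (alpha, eps, R^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    alpha = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    eps = my - alpha * mx
    ss_res = sum((y - alpha * x - eps) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return alpha, eps, 1.0 - ss_res / ss_tot

# Synthetic check: points generated exactly on a line are recovered
# with R^2 ~ 1; real fits should also be validated out of regime.
xs = [10.0, 12.0, 14.0, 16.0]
alpha, eps, r2 = fit_log_linear(xs, [0.9 * x + 2.0 for x in xs])
```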

Compression-aware scaling enables deliberate resource–performance navigation in high-dimensional, resource-constrained systems, and is a critical design tool from machine learning model deployment to experimental physics and engineering.
