
Information-Theoretic Scaling Laws

Updated 6 February 2026
  • Information-Theoretic Scaling Laws are mathematical relationships quantifying how metrics like error, capacity, and mutual information scale with resources such as model size, data, and compute.
  • They offer a unified framework that explains power-law decays observed in deep learning, kernel regression, and communication networks by linking resource constraints to performance.
  • These principles guide optimal resource allocation and experimental design, helping to balance data quality and quantity in diverse systems from wireless networks to quantum error correction.

Information-theoretic scaling laws govern how metrics such as generalization error, capacity, mutual information, or error rates scale with key resources (model size, dataset size, compute, measurement quality) in systems constrained by information processing, learning, or communication. These laws provide quantifiable prescriptions for resource allocation, predict asymptotic behaviors, and explain empirical observations in a diverse range of settings—from deep learning theory, representation learning, and molecular communication to wireless and quantum networks.

1. Fundamental Information-Theoretic Frameworks

Information-theoretic scaling laws describe how the achievable performance of a system—measured via mutual information, entropy, channel capacity, or Bayes-optimal risk—scales as a function of resource axes such as sample number, model size, computational power, measurement fidelity, or environmental coupling. A foundational paradigm is to write the metric of interest (e.g., reducible loss L or risk) as

L = \sum_{k=1}^\infty p_k \, q_k,

where p_k is the frequency of "atomic patterns" in the data (often with a Zipfian or heavy tail), and q_k is the residual error per pattern. Sharp cutoff phenomena, "effective frontiers" k_*(R), and step-function approximations justify piecewise power-law decays in L(R), depending on the dominant limiting mechanism—model capacity, data coverage, or optimization (Zou et al., 1 Feb 2026).
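This pattern-sum picture can be sketched numerically. The snippet below is a minimal toy model, assuming a Zipfian pattern distribution p_k ∝ k^{-(1+tail)} and a hard step-function residual error (patterns below the frontier k_* are fully learned, those above are not); the tail value and cutoffs are illustrative, not taken from the cited work.

```python
import numpy as np

def reducible_loss(k_star, tail=1.0, K=1_000_000):
    """Residual pattern mass above a hard frontier k_*, assuming Zipfian
    pattern frequencies p_k ~ k**-(1 + tail) and a step-function residual
    error q_k (zero below the frontier, one above it)."""
    k = np.arange(1, K + 1, dtype=float)
    p = k ** -(1.0 + tail)
    p /= p.sum()                       # normalize to a distribution
    return float(p[k > k_star].sum())  # mass of patterns not yet resolved

# Pushing the frontier out by 10x cuts the loss by roughly a factor 10**tail,
# i.e. reducible loss ~ k_star**(-tail): a power law in the frontier position.
losses = [reducible_loss(ks) for ks in (100, 1_000, 10_000)]
```

Under these assumptions the residual mass behaves as k_*^{-tail}, which is the mechanism behind the piecewise power-law decays described above.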

In statistical learning and kernel regression, similar power-law scaling laws are provable under specific assumptions on the kernel spectrum (e.g., a polynomial tail index β) and target smoothness s. The excess risk decays as n^{-α}, where the exponent α = 2s/(2s + 1/β) relates to the redundancy index 1/β of the data (Bi et al., 25 Sep 2025).
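The exponent formula can be stated as a one-line function, which makes the qualitative effect of redundancy easy to check: smaller β (heavier spectral tail, more redundant data) yields a smaller α and hence slower learning. The parameter values below are illustrative.

```python
def risk_exponent(s, beta):
    """Excess-risk exponent alpha in E ~ n**(-alpha) for kernel regression,
    with source smoothness s and spectral tail index beta:
    alpha = 2s / (2s + 1/beta)."""
    return 2.0 * s / (2.0 * s + 1.0 / beta)

# More redundant data (smaller beta, heavier spectral tail) learns slower:
risk_exponent(1.0, 2.0)   # 0.8
risk_exponent(1.0, 0.5)   # 0.5
```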

Channel capacity in communication, including wireless and molecular-diffusion channels, is likewise governed by information-theoretic maximal mutual information, subject to constraints from noise, propagation, and channel structure (Eckford et al., 2014, 0804.3271).

2. Deep Learning, Model and Data Scaling Laws

Empirical neural scaling laws have established that for autoregressive generative models and Transformers, the test loss or cross-entropy decays with model size N and compute C as a power law plus constant:

L(N) = L_\infty + A N^{-\alpha},

where L_∞ is the irreducible entropy, and the decay exponent α is domain-dependent (typically 0.15 ≤ α ≤ 0.25 in vision, language, multimodal, and math domains) (Henighan et al., 2020). Optimal compute allocation yields N_opt(C) ∝ C^β with β ≈ 0.7, consistent across modalities.
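Extracting (L_∞, A, α) from measured losses is a small fitting exercise: one common approach is to grid-search the irreducible term and fit the remaining power law by least squares in log-log coordinates. The sketch below uses synthetic data with illustrative parameters, not values from any specific paper.

```python
import numpy as np

def fit_scaling_law(N, L, L_inf_grid):
    """Fit L(N) = L_inf + A * N**(-alpha): grid-search the irreducible loss
    L_inf, then solve the residual power law by log-log least squares."""
    best = None
    for L_inf in L_inf_grid:
        y = L - L_inf
        if np.any(y <= 0):
            continue                      # this L_inf exceeds observed losses
        slope, intercept = np.polyfit(np.log(N), np.log(y), 1)
        resid = np.log(y) - (slope * np.log(N) + intercept)
        sse = float(resid @ resid)
        if best is None or sse < best[0]:
            best = (sse, float(L_inf), float(np.exp(intercept)), -slope)
    _, L_inf, A, alpha = best
    return L_inf, A, alpha

# Synthetic loss curve with known (illustrative) parameters.
N = np.logspace(6, 10, 20)
L = 1.8 + 50.0 * N ** -0.2
L_inf_hat, A_hat, alpha_hat = fit_scaling_law(N, L, np.linspace(1.5, 2.1, 61))
```

Once fitted, the law can be inverted to forecast the model size needed for a target loss, as discussed in Section 6.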

Information-theoretic analyses now provide rigorous explanations of these laws. For two-layer infinite-width networks trained on Gaussian data and labels generated from wide ReLU networks, upper bounds on reducible error separate into estimation and misspecification terms, controlled by mutual information and KL divergence (Jeon et al., 2024, Jeon et al., 2022). Under a fixed compute budget C, the minimization over model size n and dataset size T yields an optimal linear allocation: n* ∼ T* ∼ √C, confirming the observed near-linear data:parameter tradeoff (Chinchilla law).
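The √C allocation follows from minimizing an additive two-term bound under the constraint nT = C. The toy bound a/n + b/T below is an illustrative stand-in for the actual estimation-plus-misspecification bound in the cited work, but it reproduces the same balanced-allocation conclusion.

```python
import math

def optimal_allocation(C, a=1.0, b=1.0):
    """Minimize a toy additive bound a/n + b/T over model size n and dataset
    size T subject to n * T = C. Substituting T = C/n and setting the
    derivative to zero gives n* = sqrt(a * C / b), so n* ~ T* ~ sqrt(C)
    whenever the two terms are comparable (a ~ b)."""
    n_star = math.sqrt(a * C / b)
    return n_star, C / n_star

n_star, T_star = optimal_allocation(1e18)  # both equal sqrt(C) = 1e9 here
```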

In hierarchical multi-index models, the statistical efficiency and feature recovery follow exact phase transitions in sample complexity, with information-theoretic optimal scaling rates and plateaus analyzable via spike-detection in the data covariance and random matrix inference (Defilippis et al., 5 Feb 2026).

A unification emerges from the "Effective Frontiers" perspective: learning resources (parameters N, data D, compute C) induce cutoffs in the pattern-frequency space, and overall reducible loss tracks the residual pattern mass above the resource-adapted frontier. The tightest bottleneck dominates scaling, reconciling Kaplan (param-centric) and Chinchilla (data-centric) regimes within a max-bottleneck optimization (Zou et al., 1 Feb 2026).
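The max-bottleneck logic can be made concrete with a toy model: each resource maps to a frontier in pattern space, the smallest frontier binds, and the loss is the Zipf tail mass above it. The power-law frontier maps below are illustrative assumptions, not exponents from the cited paper.

```python
def bottleneck_loss(N, D, C, tail=1.0):
    """Toy 'effective frontiers' loss: the tightest resource-induced frontier
    k_* binds, and the reducible loss is the residual Zipf tail mass above
    it, ~ k_star**(-tail). Frontier exponents are illustrative only."""
    k_N = N ** 0.75   # capacity frontier (illustrative map)
    k_D = D ** 0.5    # data-coverage frontier (illustrative map)
    k_C = C ** 0.25   # optimization frontier (illustrative map)
    k_star = min(k_N, k_D, k_C)
    return k_star ** -tail

# When data is the binding constraint, scaling parameters alone does nothing:
# bottleneck_loss(1e9, 1e6, 1e16) == bottleneck_loss(1e8, 1e6, 1e16)
```

This is exactly the reconciliation described above: which resource you should scale depends on which frontier currently binds.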

3. Redundancy, Spectrum, and Universality of Scaling Exponents

The origin of the power-law exponents in learning curves is traced to the decay of the data covariance or kernel spectrum (tail index β), the target smoothness s, and the redundancy 1/β. The generalization error for kernel regression under a polynomial spectral tail λ_i ∼ i^{-1/β} and source smoothness s is

E \sim n^{-2s/(2s + 1/\beta)}

(Bi et al., 25 Sep 2025). Redundancy laws unify random features, Transformers (in both NTK and feature-learning regimes), and domain-mixed settings, as spectral purification or an increased β steepens scaling.

This universality extends: representation-invariant transformations, linear mixtures, and featurization all preserve or bound the tail index, hence also α; data with worse redundancy (smaller β) slows learning. The mutual information between samples and function, I_n ∼ n^α, mediates the error decay via the Gaussian channel formula. Coding-theoretic analogies identify the excess risk's exponent with the redundancy cost of encoding function components in high-dimensional directions.

4. Noise, Measurement Quality, and Resource Axes Beyond N and D

Beyond model and data size, information-theoretic scaling laws rigorously quantify the impact of measurement noise or data quality. In cellular representation learning and image classification, the information extracted by the learned representations, measured via the mutual information I(f(Z_meas); Y) with a supervised target, deteriorates as a logarithmic function of the effective measurement sensitivity u:

I(u) = I_{\max} - \frac{1}{2}\log\left(1 + \frac{\bar u}{u}\right)

This result, confirmed across models and domains, is analytically derived via multivariate Gaussian channel formulas (Gowri et al., 4 Mar 2025). When measurement noise is large (u ≪ ū), the system is noise-limited: increased sampling depth or measurement precision yields diminishing logarithmic returns in extracted information.

This axis complements the power-law decay in loss with NN and DD. Practical implications include explicit inversion of the scaling law to determine the required measurement quality for a desired information threshold, and optimal allocation between improving data quality and data quantity.
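The inversion mentioned above is algebraically direct: solving I(u) = I_max − ½ log(1 + ū/u) for u gives u = ū / (exp(2(I_max − I_target)) − 1), taking the logarithm to be natural (information in nats); the numeric inputs below are illustrative.

```python
import math

def required_sensitivity(I_target, I_max, u_bar):
    """Invert I(u) = I_max - 0.5 * log(1 + u_bar / u) (information in nats)
    for the measurement sensitivity u needed to reach I_target."""
    if not 0.0 <= I_target < I_max:
        raise ValueError("need 0 <= I_target < I_max")
    gap = I_max - I_target
    return u_bar / math.expm1(2.0 * gap)  # exp(2*gap) - 1, computed stably

# Sensitivity needed to land within 0.5 nat of the ceiling, for u_bar = 1.0.
u = required_sensitivity(I_target=1.5, I_max=2.0, u_bar=1.0)
```

The exponential in the denominator is the diminishing-returns effect in reverse: each additional fraction of a nat near the ceiling I_max demands a multiplicatively better measurement.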

5. Network Information-Theoretic Scaling: Communication and Quantum Lifetime

Classical network information theory yields a spectrum of scaling laws as network topology, path-loss, and clustering are varied. For wireless and clustered ad hoc networks, sum capacity C(n) and its scaling exponent depend on geometrical and physical parameters:

  • Power-limited, bandwidth-limited, and hybrid regimes for large wireless networks are characterized by the relation between short- and long-distance SNR and the path-loss exponent α. Hierarchical cooperation, multihop, and hybrid schemes are asymptotically optimal in different regimes, with exponents e(α, β) classified precisely (0804.3271, 0809.1205).
  • In underwater acoustic networks, the exponential attenuation with distance (due to absorption a(f)^d) fundamentally drives capacity scaling, making nearest-neighbor multi-hop transmission order-optimal when the attenuation parameter grows exponentially with network size (Shin et al., 2010).
  • For molecular communication via Brownian motion, capacity scales logarithmically in time or molecule number when one resource is fixed, but linearly when both resources scale jointly, reflecting a transition from diminishing to sustained returns (Eckford et al., 2014).

Quantum information dynamics in monitored open systems exhibit a distinct dichotomy: conditioning on all measurement outcomes of the bath (environment) produces exponentially long information retention (lifetime scaling as e^{cN_A} with data register size), whereas discarding measurement outcomes reduces the lifetime to linear or constant in system size. This demonstrates an essential separation between trajectory-conditioned and averaged-state diagnostics—a fundamentally information-theoretic feature of monitored quantum dynamics (Zhang et al., 28 Jun 2025).

6. Practical Implications and Methodological Guidance

Information-theoretic scaling laws provide actionable prescriptions:

  • In deep learning, simultaneous scaling of model and dataset size is necessary for optimal generalization reduction under compute constraints, with the best tradeoff typically at N ∝ D (Jeon et al., 2024, Jeon et al., 2022).
  • In generative modeling, forecasts for required parameter count for a target KL-divergence or loss are enabled by empirical scaling parameters extracted from observed fits (Henighan et al., 2020).
  • In representation learning, embedding effective rank (Shannon entropy of normalized singular spectrum) predicts downstream linear-probe quality and absorbs the impact of model depth, width, and data, yielding a power-law plus saturation curve—universally across audio, vision, and text modalities (Deng et al., 13 Oct 2025).
  • Measurement design in experimental settings should balance sample number, data quality, and model size, guided by the analytic and empirical forms of scaling laws.
  • Communication and network engineering require careful attention to the physical scaling parameters (e.g., attenuation, SNR regimes) that fundamentally limit throughput, matching architecture to the correct operational regime for order-optimality (0804.3271, Shin et al., 2010).
  • In quantum information, recording and leveraging environment measurement outcomes can radically alter memory-time scaling, with direct implications for quantum computing and quantum error correction protocol design (Zhang et al., 28 Jun 2025).
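The effective-rank diagnostic mentioned in the representation-learning bullet is straightforward to compute. The sketch below uses one standard definition (exponential of the Shannon entropy of the normalized singular-value spectrum), which may differ in detail from the cited paper's variant; the example matrices are synthetic.

```python
import numpy as np

def effective_rank(E):
    """Effective rank of an embedding matrix E: the exponential of the
    Shannon entropy of the normalized singular-value spectrum (one standard
    definition of the quantity described above)."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                       # drop exact zeros before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
iso = effective_rank(rng.normal(size=(10_000, 64)))            # near 64
collapsed = effective_rank(np.outer(np.ones(10_000), np.ones(64)))  # near 1
```

An isotropic embedding uses nearly all 64 available dimensions, while a collapsed (rank-one) embedding scores near 1, matching the intuition that effective rank tracks usable representational capacity.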

7. Limitations, Assumptions, and Open Directions

The general validity of information-theoretic scaling laws requires:

  • Asymptotic resource limits (N, D, C → ∞) and often i.i.d., non-pathological data.
  • Accurate characterization of spectral tails or pattern-frequency distributions (often assumed Zipf-law or heavy-tailed).
  • Independence or weak correlation of atomic patterns—violations (strong correlations, hierarchical or compositional structure) may require refined frameworks (Zou et al., 1 Feb 2026).
  • In optimization scaling, self-similar bias kernels and monotonicity of error profiles are typically assumed; non-convex, multi-phase, or highly adaptive training procedures may not obey these simple laws.
  • For noise-scaling, the Gaussian channel idealization may break down in heavy-tailed or discrete-noise regimes.

Extensions to interacting features, multimodal scaling, compositional generalization, and non-asymptotic corrections remain active research areas. Finite-size and phase transition effects, especially in high-dimensional or hierarchically-structured learning problems, are significant in practice and an area of ongoing investigation (Defilippis et al., 5 Feb 2026, Zou et al., 1 Feb 2026).


In conclusion, information-theoretic scaling laws quantitatively unify deep learning, representation learning, communication, and quantum systems under a shared analytical framework that connects resource constraints, statistical redundancy, and operational asymptotics. Empirical laws observed across domains are now grounded in the language of information theory and spectral statistics, enabling principled design and analysis of complex learning and communication systems (Bi et al., 25 Sep 2025, Jeon et al., 2024, Henighan et al., 2020, Zou et al., 1 Feb 2026, Zhang et al., 28 Jun 2025, Gowri et al., 4 Mar 2025, Eckford et al., 2014, Deng et al., 13 Oct 2025).
