Information-Theoretic Scaling Laws
- Information-Theoretic Scaling Laws are mathematical relationships quantifying how metrics like error, capacity, and mutual information scale with resources such as model size, data, and compute.
- They offer a unified framework that explains power-law decays observed in deep learning, kernel regression, and communication networks by linking resource constraints to performance.
- These principles guide optimal resource allocation and experimental design, helping to balance data quality and quantity in diverse systems from wireless networks to quantum error correction.
Information-theoretic scaling laws govern how metrics such as generalization error, capacity, mutual information, or error rates scale with key resources (model size, dataset size, compute, measurement quality) in systems constrained by information processing, learning, or communication. These laws provide quantifiable prescriptions for resource allocation, predict asymptotic behaviors, and explain empirical observations in a diverse range of settings—from deep learning theory, representation learning, and molecular communication to wireless and quantum networks.
1. Fundamental Information-Theoretic Frameworks
Information-theoretic scaling laws describe how the achievable performance of a system—measured via mutual information, entropy, channel capacity, or Bayes-optimal risk—scales as a function of resource axes such as sample number, model size, computational power, measurement fidelity, or environmental coupling. A foundational paradigm is to write the metric of interest (e.g., reducible loss or risk) as
$$\mathcal{L}_{\mathrm{red}} \;\approx\; \sum_{i} p_i\, \epsilon_i,$$
where $p_i$ is the frequency of the $i$-th "atomic pattern" in the data (often with a Zipfian or heavy tail), and $\epsilon_i$ is the residual error per pattern. Sharp cutoff phenomena, resource-induced "effective frontiers" in pattern-rank space, and step-function approximations justify piecewise power-law decays in $\mathcal{L}_{\mathrm{red}}$, depending on the dominant limiting mechanism, whether model capacity, data coverage, or optimization (Zou et al., 1 Feb 2026).
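A minimal numerical sketch of this decomposition (assuming a Zipfian pattern distribution, a hard step-function frontier, and unit residual error above the frontier; all parameter names are illustrative) shows how the residual pattern mass decays as a power law in the frontier position:

```python
import numpy as np

def reducible_loss(num_patterns=10**6, zipf_s=1.5, frontier=10**3):
    """Residual pattern mass above a hard resource frontier.

    Patterns are ranked by frequency p_i ~ i**(-zipf_s); a resource level
    that covers the first `frontier` patterns leaves the tail mass as the
    reducible loss (step-function approximation to the residual error).
    """
    ranks = np.arange(1, num_patterns + 1)
    p = ranks ** (-zipf_s)
    p /= p.sum()               # normalize pattern frequencies
    return p[frontier:].sum()  # mass of uncovered (tail) patterns

# Growing the frontier by factors of 10 shows an approximate power-law
# decay of the reducible loss, with exponent roughly (zipf_s - 1).
for f in (10**2, 10**3, 10**4, 10**5):
    print(f, reducible_loss(frontier=f))
```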
In statistical learning and kernel regression, similar power-law scaling laws are provable under specific assumptions on the kernel spectrum (e.g., a polynomial tail index $\beta$) and the target smoothness $s$. The excess risk decays as a power law in the sample size $n$, with an exponent determined by $\beta$, $s$, and the redundancy index of the data (Bi et al., 25 Sep 2025).
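As a hedged illustration of this regime, the following toy simulation replaces kernel regression with a diagonal Gaussian-design ridge regression whose feature covariance has a polynomial spectral tail; the parameter names (`beta`, `smooth`) and the $1/n$ ridge schedule are illustrative choices, not the cited paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_risk(n, P=2000, beta=1.5, smooth=1.0, noise=0.1):
    """Excess risk of ridge regression in a diagonal Gaussian-design
    surrogate for kernel regression: feature-covariance eigenvalues decay
    as i**(-beta) (polynomial spectral tail) and the target coefficients
    decay according to a smoothness parameter."""
    idx = np.arange(1, P + 1, dtype=float)
    lam = idx ** -beta                                # kernel spectrum
    w_star = np.sqrt(lam) * idx ** -smooth            # smooth target coefficients
    Z = rng.standard_normal((n, P)) * np.sqrt(lam)    # cov(Z) = diag(lam)
    y = Z @ w_star + noise * rng.standard_normal(n)
    ridge = 1.0 / n                                   # simple n-dependent schedule
    w_hat = np.linalg.solve(Z.T @ Z + ridge * np.eye(P), Z.T @ y)
    d = w_hat - w_star
    return float(d @ (lam * d))                       # population excess risk

# The log-log slope of risk versus n approximates the power-law exponent.
for n in (50, 100, 200, 400, 800):
    print(n, excess_risk(n))
```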
Channel capacity in communication, including wireless and molecular-diffusion channels, is likewise governed by information-theoretic maximal mutual information, subject to constraints from noise, propagation, and channel structure (Eckford et al., 2014, 0804.3271).
2. Deep Learning, Model and Data Scaling Laws
Empirical neural scaling laws have established that for autoregressive generative models and Transformers, the test loss or cross-entropy decays with model size and compute as a power law plus constant:
$$L(x) \;=\; L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x},$$
where $x$ denotes the scaled resource (model size $N$ or compute $C$), $L_\infty$ is the irreducible entropy, and the decay exponent $\alpha_x$ is domain-dependent across vision, language, multimodal, and math domains (Henighan et al., 2020). Optimal compute allocation yields an optimal model size $N_{\mathrm{opt}} \propto C^{\beta}$ with $\beta \approx 0.7$, consistent across modalities.
Information-theoretic analyses now provide rigorous explanations of these laws. For two-layer infinite-width networks trained on Gaussian data and labels generated from wide ReLU networks, upper bounds on reducible error separate into estimation and misspecification terms, controlled by mutual information and KL divergence (Jeon et al., 2024, Jeon et al., 2022). Under a fixed compute budget $C$, minimization over model size $N$ and dataset size $D$ yields an optimal linear allocation, $D_{\mathrm{opt}} \propto N_{\mathrm{opt}}$, confirming the observed near-linear data:parameter tradeoff (Chinchilla law).
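A small sketch of the compute-constrained allocation (using an illustrative loss surface $L(N,D)=E+AN^{-a}+BD^{-b}$ and the standard $C \approx 6ND$ FLOPs approximation; the coefficient values are placeholders, not fitted constants from any real model family) recovers the near-linear split numerically:

```python
import numpy as np

# Illustrative loss surface L(N, D) = E + A*N**-a + B*D**-b.
E, A, B, a, b = 1.7, 400.0, 1000.0, 0.34, 0.28

def optimal_split(C, flops_per_token_param=6.0):
    """Grid-minimize L(N, D) subject to the budget C = 6 * N * D."""
    N = np.logspace(4, 14, 2000)             # candidate model sizes
    D = C / (flops_per_token_param * N)      # data size implied by the budget
    L = E + A * N**-a + B * D**-b
    i = int(np.argmin(L))
    return N[i], D[i]

for C in (1e19, 1e21, 1e23):
    N_opt, D_opt = optimal_split(C)
    print(f"C={C:.0e}  N_opt={N_opt:.2e}  D_opt={D_opt:.2e}  D/N={D_opt/N_opt:.1f}")

# Closed form: N_opt ~ C**(b/(a+b)) and D_opt ~ C**(a/(a+b)); when a and b are
# comparable this gives N_opt ~ D_opt ~ sqrt(C), the near-linear tradeoff.
```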
In hierarchical multi-index models, the statistical efficiency and feature recovery follow exact phase transitions in sample complexity, with information-theoretic optimal scaling rates and plateaus analyzable via spike-detection in the data covariance and random matrix inference (Defilippis et al., 5 Feb 2026).
A unification emerges from the "Effective Frontiers" perspective: learning resources (parameters $N$, data $D$, compute $C$) induce cutoffs in the pattern-frequency space, and overall reducible loss tracks the residual pattern mass above the resource-adapted frontier. The tightest bottleneck dominates scaling, reconciling Kaplan (param-centric) and Chinchilla (data-centric) regimes within a max-bottleneck optimization (Zou et al., 1 Feb 2026).
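The max-bottleneck picture can be sketched as follows; the frontier positions induced by each resource are hypothetical placeholders, chosen only to show that loss stops improving once a non-binding resource is scaled:

```python
import numpy as np

def bottleneck_loss(N, D, C, zipf_s=1.5, num_patterns=10**6,
                    kN=1.0, kD=0.5, kC=0.1):
    """Reducible loss when each resource induces its own frontier in
    pattern-rank space and the tightest frontier dominates (max-bottleneck).
    The frontier maps kN*N, kD*D, kC*sqrt(C) are illustrative placeholders."""
    ranks = np.arange(1, num_patterns + 1)
    p = ranks ** (-zipf_s)
    p /= p.sum()
    frontier = int(min(kN * N, kD * D, kC * np.sqrt(C)))
    return p[frontier:].sum()

# If data is the binding constraint, growing the model alone does not help:
print(bottleneck_loss(N=1e5, D=1e4, C=1e12))
print(bottleneck_loss(N=1e6, D=1e4, C=1e12))  # unchanged: data-limited regime
```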
3. Redundancy, Spectrum, and Universality of Scaling Exponents
The origin of the power-law exponents in learning curves is traced to the decay of the data covariance or kernel spectrum (tail index $\beta$), the target smoothness $s$, and the redundancy of the data. The generalization error for kernel regression under a polynomial spectral tail and source smoothness decays as a power law in the sample size, with an exponent set jointly by $\beta$, $s$, and the redundancy (Bi et al., 25 Sep 2025). Redundancy laws unify random features, Transformers (in both NTK and feature-learning regimes), and domain-mixed settings, as spectral purification or an increased tail index $\beta$ steepens scaling.
This universality extends: representation-invariant transformations, linear mixtures, and featurization all preserve or bound the tail index, and hence the learning-curve exponent; data with worse redundancy (a smaller tail index) slows learning. The mutual information between the samples and the target function mediates the error decay via the Gaussian channel formula. Coding-theoretic analogies identify the excess risk's exponent with the redundancy cost of encoding function components in high-dimensional directions.
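A brief numerical check of this invariance (with an illustrative spectrum $\lambda_i = i^{-\beta}$ and a well-conditioned random re-featurization; the fitting window is arbitrary) estimates the tail index before and after mixing:

```python
import numpy as np

rng = np.random.default_rng(1)

def tail_index(eigs, fit_range=slice(50, 500)):
    """Estimate the polynomial tail index from a log-log fit of the
    sorted eigenvalue decay lambda_i ~ i**(-beta)."""
    i = np.arange(1, len(eigs) + 1)
    slope, _ = np.polyfit(np.log(i[fit_range]), np.log(eigs[fit_range]), 1)
    return -slope

P, beta = 1000, 1.5
cov = np.diag(np.arange(1, P + 1, dtype=float) ** -beta)   # spectrum i**-beta

# A well-conditioned linear re-featurization (identity plus a small random
# mixing) perturbs individual eigenvalues but leaves the tail index intact.
M = np.eye(P) + 0.1 * rng.standard_normal((P, P)) / np.sqrt(P)
cov_mixed = M @ cov @ M.T

eigs = np.sort(np.linalg.eigvalsh(cov))[::-1]
eigs_mixed = np.sort(np.linalg.eigvalsh(cov_mixed))[::-1]
print(tail_index(eigs), tail_index(eigs_mixed))   # both approximately beta
```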
4. Noise, Measurement Quality, and Resource Axes Beyond $N$ and $D$
Beyond model and data size, information-theoretic scaling laws rigorously quantify the impact of measurement noise or data quality. In cellular representation learning and image classification, the information extracted by the learned representations, measured via mutual information with a supervised target, varies only logarithmically with the effective measurement sensitivity $s_{\mathrm{eff}}$:
$$I_{\mathrm{extracted}} \;\approx\; a \log s_{\mathrm{eff}} + b,$$
with empirically fitted constants $a$ and $b$. This result, confirmed across models and domains, is analytically derived via multivariate Gaussian channel formulas (Gowri et al., 4 Mar 2025). When measurement noise is large relative to the signal, the system is noise-limited: increased sampling depth or measurement precision yields diminishing logarithmic returns in extracted information.
This axis complements the power-law decay in loss with $N$ and $D$. Practical implications include explicit inversion of the scaling law to determine the required measurement quality for a desired information threshold, and optimal allocation between improving data quality and increasing data quantity.
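As a hedged sketch of this inversion, an isotropic Gaussian-channel surrogate (the dimensionality `d` and baseline SNR `snr0` are hypothetical, not the cited paper's fitted values) exhibits the logarithmic returns and can be inverted for the required sensitivity:

```python
import numpy as np

def extracted_info(sensitivity, d=32, snr0=0.05):
    """Illustrative isotropic Gaussian-channel model: information (bits)
    extracted from a d-dimensional signal when measurement sensitivity
    scales the per-dimension SNR."""
    return 0.5 * d * np.log2(1.0 + sensitivity * snr0)

def required_sensitivity(target_bits, d=32, snr0=0.05):
    """Invert the scaling law: sensitivity needed to reach target_bits."""
    return (2.0 ** (2.0 * target_bits / d) - 1.0) / snr0

for s in (1, 10, 100, 1000):
    print(s, round(extracted_info(s), 2))   # roughly a constant gain per decade
print(required_sensitivity(target_bits=64.0))
```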
5. Network Information-Theoretic Scaling: Communication and Quantum Lifetime
Classical network information theory yields a spectrum of scaling laws as network topology, path-loss, and clustering are varied. For wireless and clustered ad hoc networks, sum capacity and its scaling exponent depend on geometrical and physical parameters:
- Power-limited, bandwidth-limited, and hybrid regimes for large wireless networks are characterized by the relation between short- and long-distance SNR and the path-loss exponent $\alpha$. Hierarchical cooperation, multihop, and hybrid schemes are asymptotically optimal in different regimes, with exponents classified precisely (0804.3271, 0809.1205).
- In underwater acoustic networks, the exponential attenuation with distance (due to frequency-dependent absorption) fundamentally drives capacity scaling, making nearest-neighbor multi-hop transmission order-optimal when the attenuation parameter grows exponentially with network size (Shin et al., 2010).
- For molecular communication via Brownian motion, capacity scales logarithmically in time or molecule number when one resource is fixed, but linearly when both resources scale jointly, reflecting a transition from diminishing to sustained returns (Eckford et al., 2014).
Quantum information dynamics in monitored open systems exhibit a distinct dichotomy: conditioning on all measurement outcomes of the bath (environment) produces exponentially long information retention (lifetime scaling exponentially with the data-register size), whereas discarding measurement outcomes reduces the lifetime to linear or constant in system size. This demonstrates an essential separation between trajectory-conditioned and averaged-state diagnostics, a fundamentally information-theoretic feature of monitored quantum dynamics (Zhang et al., 28 Jun 2025).
6. Practical Implications and Methodological Guidance
Information-theoretic scaling laws provide actionable prescriptions:
- In deep learning, simultaneous scaling of model and dataset size is necessary for optimal generalization reduction under compute constraints, with the best tradeoff typically at a near-linear ratio $D \propto N$ (Jeon et al., 2024, Jeon et al., 2022).
- In generative modeling, the parameter count required to reach a target KL divergence or loss can be forecast from empirical scaling parameters extracted from observed fits (Henighan et al., 2020).
- In representation learning, embedding effective rank (derived from the Shannon entropy of the normalized singular spectrum) predicts downstream linear-probe quality and absorbs the impact of model depth, width, and data, yielding a power-law-plus-saturation curve universally across audio, vision, and text modalities (Deng et al., 13 Oct 2025); a minimal computation of effective rank is sketched after this list.
- Measurement design in experimental settings should balance sample number, data quality, and model size, guided by the analytic and empirical forms of scaling laws.
- Communication and network engineering require careful attention to the physical scaling parameters (e.g., attenuation, SNR regimes) that fundamentally limit throughput, matching architecture to the correct operational regime for order-optimality (0804.3271, Shin et al., 2010).
- In quantum information, recording and leveraging environment measurement outcomes can radically alter memory-time scaling, with direct implications for quantum computing and quantum error correction protocol design (Zhang et al., 28 Jun 2025).
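Referenced from the representation-learning item above, a minimal computation of embedding effective rank, here taken as the exponential of the Shannon entropy of the normalized singular-value spectrum (the standard Roy-Vetterli convention; the cited work's exact convention may differ, and the toy data below is purely illustrative):

```python
import numpy as np

def effective_rank(embeddings):
    """Effective rank of an (n_samples, dim) embedding matrix: the
    exponential of the Shannon entropy of its normalized singular values."""
    X = embeddings - embeddings.mean(axis=0)   # center the features
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()                            # normalized singular spectrum
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# Toy check: an isotropic cloud uses all directions, a low-rank one does not.
rng = np.random.default_rng(0)
print(effective_rank(rng.standard_normal((2048, 128))))   # near the ambient dim (128)
X_low = rng.standard_normal((2048, 4)) @ rng.standard_normal((4, 128))
print(effective_rank(X_low))                               # close to the latent rank (4)
```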
7. Limitations, Assumptions, and Open Directions
The general validity of information-theoretic scaling laws requires:
- Asymptotic resource limits ($N, D, C \to \infty$) and often i.i.d., non-pathological data.
- Accurate characterization of spectral tails or pattern-frequency distributions (often assumed Zipf-law or heavy-tailed).
- Independence or weak correlation of atomic patterns—violations (strong correlations, hierarchical or compositional structure) may require refined frameworks (Zou et al., 1 Feb 2026).
- In optimization scaling, self-similar bias kernels and monotonicity of error profiles are typically assumed; non-convex, multi-phase, or highly adaptive training procedures may not obey these simple laws.
- For noise-scaling, the Gaussian channel idealization may break down in heavy-tailed or discrete-noise regimes.
Extensions to interacting features, multimodal scaling, compositional generalization, and non-asymptotic corrections remain active research areas. Finite-size and phase transition effects, especially in high-dimensional or hierarchically-structured learning problems, are significant in practice and an area of ongoing investigation (Defilippis et al., 5 Feb 2026, Zou et al., 1 Feb 2026).
In conclusion, information-theoretic scaling laws quantitatively unify deep learning, representation learning, communication, and quantum systems under a shared analytical framework that connects resource constraints, statistical redundancy, and operational asymptotics. Empirical laws observed across domains are now grounded in the language of information theory and spectral statistics, enabling principled design and analysis of complex learning and communication systems (Bi et al., 25 Sep 2025, Jeon et al., 2024, Henighan et al., 2020, Zou et al., 1 Feb 2026, Zhang et al., 28 Jun 2025, Gowri et al., 4 Mar 2025, Eckford et al., 2014, Deng et al., 13 Oct 2025).