
Finite-Sample Redundancy Laws in Information Theory

Updated 26 September 2025
  • Finite-sample redundancy laws are quantitative relationships that characterize the excess inefficiency in algorithms due to limitations like finite precision, delay, and sample size.
  • They reveal how redundancy scales with specific parameters in contexts such as source coding, universal compression, and deep learning.
  • These laws guide practical design choices by balancing trade-offs in resource allocation, system robustness, and performance optimization across diverse applications.

Finite-sample redundancy laws refer to rigorous quantitative relationships that characterize the excess penalty—in terms of expected code length, risk, or representational inefficiency—incurred in various algorithms, models, and physical systems due to non-asymptotic, resource-constrained, or "imperfect" conditions such as finite precision, finite delay, finite sample size, finite blocklength, or structural limitations. These laws establish how redundancy scales with problem parameters, algorithmic choices, and implementation constraints, and are foundational for understanding optimality, robustness, and resource allocation in information theory, coding, compression, statistical inference, signal processing, and learning theory.

1. Precision–Redundancy Tradeoffs in Source Coding

Finite-precision representation of source probabilities directly leads to excess redundancy in classic source coding algorithms such as Shannon, Gilbert–Moore, Huffman, and arithmetic codes. For a source with alphabet size m, probabilities p_i are approximated by rationals f_i/t stored with W bits, yielding a redundancy R that satisfies the bound

W \lesssim \eta \log_2 \frac{m}{R},

where \eta is an implementation-dependent constant (1/2 for binary sources, m/(m+1) for optimized m-ary codes, 1 for general progressive update designs) (0712.0057). The Kullback–Leibler divergence D(p \| \hat{p}) is bounded via the maximal approximation error \delta^* as D(p \| \hat{p}) \lesssim m\delta^*/P_{\min}, translating the effect of the denominator t (and hence W) into residual redundancy. The binary case admits Diophantine-optimal approximations with redundancy decaying as 1/t^2 (halving the required W), while m-ary cases exhibit redundancy decay as 1/t^{1+1/m}, with practical code design implications for memory, hardware register width, and symbol grouping.
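
The effect of register width W on redundancy can be seen numerically. The sketch below uses simple rounding to denominator t = 2^W (an illustrative scheme, not the Diophantine-optimal approximations discussed above) and measures the resulting KL redundancy:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def quantize(p, W):
    """Approximate probabilities by rationals f_i / t with t = 2^W, renormalized.
    (Plain rounding for illustration; real designs use sharper approximations.)"""
    t = 2 ** W
    f = [max(1, round(pi * t)) for pi in p]
    s = sum(f)
    return [fi / s for fi in f]

p = [0.6, 0.25, 0.1, 0.05]
for W in (4, 8, 12, 16):
    print(W, kl_divergence(p, quantize(p, W)))  # redundancy shrinks as W grows
```

Even this naive scheme exhibits the qualitative law: each additional bit of precision drives the residual KL redundancy toward zero.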

2. Delay–Redundancy Laws in Lossless Source Coding

Imposing a finite decoding delay d on lossless source codes fundamentally affects the redundancy decay rate. In block/phrase-constrained coding (e.g., Huffman, Tunstall), redundancy decays polynomially with block/phrase length (O(1/d)); in contrast, delay-constrained sequential encoders (e.g., delay-limited arithmetic coding with bit flushing) achieve exponential decay:

R(P, d) \lesssim 2^{-d H_2(P)},

where H_2(P) is the Rényi entropy of order 2 of the source (Shayevitz et al., 2010). The redundancy–delay exponent E(P), defined as E(P) = \liminf_{d \to \infty} -\frac{1}{d}\log R(P,d), is lower-bounded by H_2(P), but for almost all sources it cannot exceed a bound depending on the minimal symbol probability and the alphabet size. This exponential scaling marks a qualitative improvement over classical codes, and optimal code design under delay constraints is inextricably linked to the fine-grained properties of P.
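
The exponential law above is easy to evaluate for a concrete source; this minimal sketch computes H_2(P) and the corresponding decay bound for a few delays (the distribution is illustrative, not from the cited paper):

```python
import math

def renyi2(p):
    """Renyi entropy of order 2, H_2(P) = -log2(sum_i p_i^2), in bits."""
    return -math.log2(sum(pi * pi for pi in p))

P = [0.5, 0.3, 0.2]
H2 = renyi2(P)
for d in (8, 16, 32):
    # exponential redundancy-delay bound R(P, d) <~ 2^{-d H_2(P)}
    print(d, 2.0 ** (-d * H2))
```

Doubling the allowed delay squares the bound, in stark contrast to the O(1/d) decay of block-constrained codes.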

3. Redundancy Laws in Universal Data Compression on Countable Alphabets

For universal coding over a countably infinite alphabet, the redundancy of a class \mathcal{P} depends crucially on tail behavior. Finite single-letter redundancy (i.e., existence of q with \sup_{p \in \mathcal{P}} D(p \| q) < \infty) implies tightness, but not necessarily diminishing per-symbol redundancy with blocklength (Hosseini et al., 2014, Hosseini et al., 2018). The asymptotic per-symbol redundancy R(\mathcal{P}^\infty) equals the tail redundancy

T(\mathcal{P}) = \lim_{m \to \infty} \inf_q \sup_{p \in \mathcal{P}} \sum_{x \geq m} p(x) \log \frac{p(x)}{q(x)},

revealing that the cost of compressing novel, "tail" symbols dominates as n grows: finite single-letter redundancy does not guarantee R_n(\mathcal{P})/n \to 0, and only classes with vanishing tail redundancy are strongly compressible. This formalism captures the essence of finite-sample redundancy in infinite-alphabet compression.
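
The inner tail sum can be illustrated for a single (p, q) pair; the full law takes an infimum over coding distributions q and a supremum over the class, but even one instance (a geometric source against a heavy-tailed q, both chosen here for illustration) shows the tail contribution vanishing with the cutoff m:

```python
import math

def tail_term(p, q, m, N=500):
    """Truncated tail sum  sum_{x >= m} p(x) log2(p(x) / q(x))."""
    return sum(p(x) * math.log2(p(x) / q(x)) for x in range(m, N) if p(x) > 0)

p = lambda x: 0.5 * 0.5 ** x                # geometric(1/2) on x = 0, 1, 2, ...
q = lambda x: 1.0 / ((x + 1) * (x + 2))     # heavy-tailed; sums to 1 over x >= 0

for m in (1, 5, 10, 20):
    print(m, tail_term(p, q, m))
```

For light-tailed sources the tail term decays rapidly; classes whose worst-case tail term does not vanish are exactly those that fail to be strongly compressible.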

4. Minimax Redundancy and Regret in Parametric Models

In smooth parametric families (e.g., exponential families), finite-sample minimax redundancy and regret are determined by the Shtarkov and Jeffreys integrals (0903.5399, Beirami et al., 2011). For a d-parameter family, the worst-case redundancy exhibits the canonical scaling

R_n = \frac{d}{2}\log n + \log J + o(1),

where \log J is the Jeffreys correction term. Sufficient conditions for finite redundancy include restriction to compact parameter sets and tail decay of the base measure (density q(x) = O(1/x^{1+\alpha}) for some \alpha > 0). For universal codes (including two-stage codes), the asymptotic average minimax redundancy serves as an accurate benchmark, while the additional penalty terms for two-stage coding become negligible for large d. In nonstandard settings (e.g., mixtures with heavy tails), the Jeffreys integral may diverge, limiting the applicability of classic finite-sample redundancy laws.
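
The (d/2) log n scaling can be checked directly for the Bernoulli family (d = 1) using the Krichevsky–Trofimov mixture, a standard Jeffreys-prior universal code; the sketch below computes its exact worst-case pointwise regret by enumerating count types:

```python
import math

def log2_kt(a, b):
    """log2 of the Krichevsky-Trofimov (Jeffreys-mixture) probability of a
    binary sequence containing a ones and b zeros."""
    lg = math.lgamma
    return (lg(a + 0.5) + lg(b + 0.5) - math.log(math.pi) - lg(a + b + 1)) / math.log(2)

def worst_case_regret(n):
    """Worst-case regret over length-n sequences: the regret depends only on
    the counts (a, b), so maximize log2(ML prob) - log2(KT prob) over types."""
    best = 0.0
    for a in range(n + 1):
        b = n - a
        ml = (a * math.log2(a / n) if a else 0.0) + (b * math.log2(b / n) if b else 0.0)
        best = max(best, ml - log2_kt(a, b))
    return best

for n in (10, 100, 1000):
    print(n, worst_case_regret(n), 0.5 * math.log2(n))  # regret ~ (1/2) log2 n + const
```

The gap between the exact regret and (1/2) log2 n settles to a constant, which is the finite-sample correction term that the Jeffreys integral captures in general d-parameter families.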

5. Pseudocodeword Redundancy in Linear Codes

Pseudocodeword redundancy measures the minimum number of parity-check rows in a matrix H so that all nonzero pseudocodewords have weight at least d, the code's minimum Hamming distance (Zumbragel et al., 2010, Zumbrägel et al., 2011). For iterative or LP decoding, this represents the finite-sample constraint needed to eliminate low-weight pseudocodewords and match ML decoding performance. Most random codes exhibit infinite pseudocodeword redundancy, but for codes based on designs (e.g., BIBDs) and cyclic codes meeting the Vontobel–Koetter eigenvalue bound,

w_{\min,\text{AWGNC}} \geq n\,\frac{2 w_c - \mu_2}{w_c^2 - \mu_2},

finite redundancy is attainable. This trade-off connects structural code properties and practical decoder design in finite regimes.

6. Redundancy Allocation Laws in Partitioned Codes

For finite-length nested (partitioned) codes in nonvolatile memory applications, redundancy must be allocated between defect masking (l bits) and error correction (r bits) under the constraint l + r = n - k (Kim et al., 2018). The recovery failure probability is bounded as

P(\hat{\mathbf{m}} \neq \mathbf{m}) \leq 2^{-l}(1+\beta)^n + 2^{-r}(1+\alpha)^n,

where \beta is the defect probability and \alpha is the erasure or crossover probability. The optimal allocation is estimated analytically (via KKT conditions) and matches simulation optima, underscoring the non-triviality of finite-sample performance compared to asymptotic theory.
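
The allocation problem is small enough to brute-force, which makes the analytical optimum easy to sanity-check. This sketch (with illustrative parameters n = 256, k = 128, not from the paper) minimizes the failure bound over integer splits:

```python
def failure_bound(l, r, n, alpha, beta):
    """Upper bound 2^{-l} (1+beta)^n + 2^{-r} (1+alpha)^n on recovery failure."""
    return 2.0 ** (-l) * (1 + beta) ** n + 2.0 ** (-r) * (1 + alpha) ** n

def best_split(n, k, alpha, beta):
    """Brute-force the integer allocation l + r = n - k minimizing the bound."""
    return min((failure_bound(l, n - k - l, n, alpha, beta), l)
               for l in range(n - k + 1))

bound, l = best_split(256, 128, alpha=0.01, beta=0.02)
print(l, 256 - 128 - l, bound)
```

Consistent with the KKT analysis, the optimum tilts extra redundancy toward whichever failure mode (defects at rate beta, or errors at rate alpha) is more likely.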

7. Redundancy Laws in Structural Optimization and Function Approximation

Structural redundancy, formalized in robust optimization via information-gap theory, quantifies the maximal degradation \alpha sustainable without exceeding performance thresholds, with worst-case performance h^{\text{worst}}(x; \alpha) (Kanno, 2016). Multiple damage scenarios yield non-differentiable optimization landscapes; algorithmic approaches such as derivative-free SQP leverage finite-difference gradients to navigate these constraints efficiently. In linear function approximation with numerically redundant bases (e.g., frames or overcomplete dictionaries), numerical regularization (e.g., \ell^2 penalties or TSVD) reduces the required sample size, replacing the nominal dimensionality n with an effective dimension n^\varepsilon such that

m \geq C\, n^\varepsilon \log n^\varepsilon

for accurate recovery (Herremans et al., 13 Jan 2025).
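
A minimal sketch of the effective-dimension idea, assuming TSVD-style truncation at threshold eps and an illustrative constant C (neither is a specific value from the cited paper):

```python
import math

def effective_dimension(singular_values, eps):
    """n^eps: the number of singular values surviving TSVD truncation at eps."""
    return sum(1 for s in singular_values if s > eps)

def sample_bound(n_eps, C=2.0):
    """Sufficient sample count m >= C n^eps log(n^eps), C illustrative."""
    return math.ceil(C * n_eps * math.log(n_eps))

# A numerically redundant basis: exponentially decaying singular values.
svals = [2.0 ** (-i) for i in range(50)]
n_eps = effective_dimension(svals, eps=1e-6)
print(n_eps, sample_bound(n_eps))  # far below what nominal n = 50 would suggest
```

Because the frame's spectrum decays, only a fraction of the nominal 50 directions are numerically resolvable, and the sample requirement scales with that smaller effective dimension.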

8. Redundancy Laws in Function-Correcting Codes and Feature Learning

Function-correcting codes over finite fields require redundancy r_f(k,t) \geq 2t (Ly et al., 19 Apr 2025). In large fields (q \geq k + 2t), optimal systematic MDS codes achieve r = 2t, while in binary and moderate-sized fields,

2t \leq r_f(k,t) < \frac{t \log(2k)}{1 - t \log e},

demonstrating logarithmic overhead with dimension. These explicit finite-length laws guide practical code constructions.

In deep learning, finite-sample scaling laws are shown to be redundancy laws (Bi et al., 25 Sep 2025). Kernel regression under a covariance spectrum with \lambda_i \propto i^{-1/\beta} yields excess risk decaying as n^{-\alpha}, where

\alpha = \frac{2s}{2s + 1/\beta},

with s the source condition and 1/\beta the redundancy parameter. Universality is established across invertible transforms, mixture domains, finite-width models, and Transformers, demonstrating that the scaling exponent is not universal but dictated by data redundancy.
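
The exponent formula makes the redundancy dependence concrete; this short sketch evaluates alpha across spectral-decay regimes (parameter values chosen for illustration):

```python
def scaling_exponent(s, beta):
    """Excess-risk exponent alpha = 2s / (2s + 1/beta) under spectrum i^{-1/beta}."""
    return 2 * s / (2 * s + 1 / beta)

# Heavier spectral tails (smaller beta, i.e., more redundancy) slow the n^{-alpha} decay.
for beta in (0.5, 1.0, 2.0):
    print(beta, scaling_exponent(s=1.0, beta=beta))
```

At fixed smoothness s, doubling beta (a faster-decaying, less redundant spectrum) moves alpha toward its ceiling of 1, matching the claim that data redundancy, not architecture, sets the scaling exponent.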

Summary Table: Key Redundancy Laws and Scaling

| Context | Scaling Law / Bound | Governing Parameters |
| --- | --- | --- |
| Precision–redundancy (source coding) | W \lesssim \eta \log_2(m/R) | \eta: code-dependent constant; m; R |
| Delay–redundancy (sequential codes) | R(P,d) \lesssim 2^{-d H_2(P)} | H_2(P): Rényi entropy of order 2; d |
| Universal coding (infinite alphabet) | R(\mathcal{P}^\infty) = T(\mathcal{P}) | tail redundancy T(\mathcal{P}) |
| Minimax redundancy (parametric models) | R_n = \frac{d}{2}\log n + \log J | d: parameter dimension; J: Jeffreys integral |
| Partitioned/nested codes | P(\hat{m} \neq m) \leq 2^{-l}(1+\beta)^n + 2^{-r}(1+\alpha)^n | l, r, \alpha, \beta |
| Function approximation (frames) | m \geq C n^\varepsilon \log n^\varepsilon | n^\varepsilon: effective dimension via regularization |
| Function-correcting codes | 2t \leq r < t\log(2k)/(1 - t\log e) | k: dimension; t: error level |
| Deep learning scaling (redundancy law) | \alpha = 2s/(2s + 1/\beta) | s: smoothness; \beta: spectral tail |

Finite-sample redundancy laws reveal the precise mechanisms by which resource constraints and discrete, non-asymptotic phenomena induce excess risk, inefficiency, or code length, and provide critical guidance for algorithm and system design across multiple disciplines. These laws unify previously disparate observations on scaling, robustness, and regularization, making explicit the fundamental role of redundancy in practical applications.
