Bayes@N: Linking NML and Bayesian Inference
- Bayes@N is a principle where the normalized maximum likelihood (NML) distribution exactly matches Bayesian predictive and marginal distributions using a sample-size-specific prior.
- It guarantees minimax coding regret and finite-sample Bayesian coherence, enabling optimal sequential prediction and efficient computation.
- By leveraging sufficient statistics and grid-based weights, Bayes@N unites MDL principles with Bayesian inference, offering genuine Bayes mixtures when the prior can be chosen nonnegative and operationally valid signed priors otherwise.
Bayes@N denotes the principle that, for every fixed sample size $n$, the normalized maximum likelihood (NML) distribution associated with a statistical model coincides exactly with the Bayesian predictive and marginal distributions for that sample size, under a specific (possibly signed) prior $w_n$. This property was established for arbitrary parametric families of i.i.d. distributions, providing a conceptual link between Bayesian inference and the minimum description length (MDL) paradigm in universal coding, gambling, and prediction. Bayes@N guarantees minimax regret while delivering finite-sample Bayesian coherence (such as conditionalization and exchangeability of updates), regardless of sample size. Notably, for certain models, the corresponding prior can be chosen nonnegative, yielding a genuine Bayes mixture (Barron et al., 2014).
1. Core Definitions and Mathematical Foundations
Consider a statistical model specified by i.i.d. densities or mass functions $p_\theta(x)$, with parameter $\theta$ belonging to a parameter space $\Theta$ and sample space $\mathcal{X}$. The normalized maximum likelihood distribution is defined as

$$p_{\mathrm{NML}}(x^n) \;=\; \frac{\max_{\theta \in \Theta} p_\theta(x^n)}{C_n},$$

where $p_\theta(x^n) = \prod_{i=1}^n p_\theta(x_i)$ and $C_n = \sum_{y^n \in \mathcal{X}^n} \max_{\theta \in \Theta} p_\theta(y^n)$ (or the analogous integral for continuous spaces). Shtarkov’s theorem ensures that $p_{\mathrm{NML}}$ uniquely achieves the minimax pointwise regret

$$\min_{q} \max_{x^n} \log \frac{\max_{\theta} p_\theta(x^n)}{q(x^n)} \;=\; \log C_n$$

for coding and prediction.
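These definitions can be checked numerically on a tiny Bernoulli model. The following sketch (illustrative names; brute-force enumeration, feasible only for small $n$) computes the Shtarkov normalizer $C_n$ and confirms that the pointwise regret is the same constant $\log C_n$ for every sequence:

```python
# Sketch: NML for the Bernoulli model by direct enumeration (small n only).
from itertools import product
from math import log

def max_likelihood(seq):
    """Maximized likelihood max_theta p_theta(seq) for Bernoulli."""
    n, k = len(seq), sum(seq)
    th = k / n  # maximum-likelihood estimate
    return (th ** k) * ((1 - th) ** (n - k))

n = 3
seqs = list(product([0, 1], repeat=n))
C_n = sum(max_likelihood(s) for s in seqs)          # Shtarkov normalizer
p_nml = {s: max_likelihood(s) / C_n for s in seqs}  # NML distribution

# Pointwise regret log(max-lik / p_NML) is the same constant log C_n everywhere:
regrets = {round(log(max_likelihood(s) / p_nml[s]), 10) for s in seqs}
```

For $n=3$ the normalizer is $C_3 = 26/9$, and the regret set collapses to the single value $\log C_3$, illustrating the equalizer property behind Shtarkov's theorem.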
The principle Bayes@N asserts that for each $n$, there exists a prior measure $w_n$ (potentially signed) such that

$$p_{\mathrm{NML}}(x^n) \;=\; \int p_\theta(x^n)\, w_n(d\theta),$$

and for any data prefix $x^t$ with $t < n$,

$$p_{\mathrm{NML}}(x_{t+1} \mid x^t) \;=\; \frac{\int p_\theta(x^{t+1})\, w_n(d\theta)}{\int p_\theta(x^t)\, w_n(d\theta)},$$

where the weights of $w_n$ on a finite grid of parameter values are explicit, derived via a linear system involving sufficient statistics (Barron et al., 2014).
2. Mixture Representation of NML and the “Bayes@N Prior”
The Bayes@N mixture representation exploits a sufficient statistic $T(x^n)$ taking $K$ distinct values. For an appropriately chosen $K$-point grid $\theta_1, \dots, \theta_K$ with linearly independent model vectors $\big(p_{\theta_j}(x^n)\big)_{x^n}$, the maximized-likelihood distribution is decomposed by solving

$$p_{\mathrm{NML}}(x^n) \;=\; \sum_{j=1}^{K} w_j\, p_{\theta_j}(x^n) \quad \text{for all } x^n,$$

yielding a discrete prior $w = (w_1, \dots, w_K)$. The condition for a genuine Bayesian mixture ($w_j \ge 0$ for all $j$) is that $p_{\mathrm{NML}}$ lies in the convex hull of the model vectors $\{p_{\theta_j}\}$; if not, $w$ exhibits negative parts, but the predictive distribution remains valid and nonnegative.
Illustrative examples: For Bernoulli($\theta$), the sufficient statistic is the number of ones, taking $K = n + 1$ distinct values, and an $(n+1)$-point grid with uniform or arcsine spacing recovers $w$ by matrix inversion. In the Gaussian location model, weights are determined via deconvolution over suitable grids of the mean parameter.
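For the Bernoulli case, the weight recovery reduces to solving an $(n+1)\times(n+1)$ linear system indexed by the count of ones. A minimal sketch, assuming a uniform grid and illustrative variable names (the paper's preferred grids may differ):

```python
# Sketch: Bayes@N prior weights for Bernoulli via a linear solve.
import numpy as np
from math import comb

n = 5
ks = np.arange(n + 1)                                # sufficient statistic: count of ones
ml = (ks / n) ** ks * (1 - ks / n) ** (n - ks)       # maximized likelihood per count
counts = np.array([comb(n, k) for k in ks])          # sequences sharing each count
C_n = (counts * ml).sum()                            # Shtarkov normalizer
p_nml = counts * ml / C_n                            # NML prob of each count value

grid = (ks + 0.5) / (n + 1)                          # illustrative uniform K-point grid
# A[k, j] = P(count = k | theta_j) under Binomial(n, theta_j)
A = np.array([[comb(n, k) * t**k * (1 - t)**(n - k) for t in grid] for k in ks])
w = np.linalg.solve(A, p_nml)                        # (possibly signed) prior weights
```

The solved weights reproduce the NML distribution exactly and sum to one (each column of `A` is a probability vector); whether every $w_j$ is nonnegative depends on the grid choice.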
3. Bayesian Predictives, Minimax Regret, and Computational Efficiency
By construction,

$$p_{\mathrm{NML}}(x_{t+1} \mid x^t) \;=\; \frac{\sum_{j} w_j\, p_{\theta_j}(x^{t+1})}{\sum_{j} w_j\, p_{\theta_j}(x^t)}$$

coincides with the Bayesian predictive under $w_n$. This duality ensures that at sample size $n$, the predictive updates are exactly Bayesian, conditionalization holds, and the sequence of updates is exchangeable.
Computationally, marginals and predictives require only $O(K)$ time per forward step once weights and model evaluations are cached, and the full weight solution costs $O(K^3)$ by a general linear solve (e.g., $K = n+1$ for Bernoulli). Fast algebraic techniques (e.g., exploiting Vandermonde or Hankel structure for binomial models) further reduce this cost to $O(K^2)$ or $O(K \log^2 K)$. Direct enumeration over all $|\mathcal{X}|^n$ continuations is infeasible beyond moderate $n$ (Barron et al., 2014).
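The $O(K)$-per-step predictive computation can be sketched as follows for Bernoulli, with weights obtained by solving the NML mixture system on a uniform grid (an illustrative choice; helper names are hypothetical). Each one-step predictive is two $K$-term dot products, and chaining the predictives recovers the NML probability of the full sequence:

```python
# Sketch: O(K)-per-step Bayes@N predictives for Bernoulli.
import numpy as np
from math import comb

def bayes_at_n_weights(n, grid):
    """Solve the K x K system matching the NML distribution of the count."""
    ks = np.arange(n + 1)
    ml = (ks / n) ** ks * (1 - ks / n) ** (n - ks)
    counts = np.array([comb(n, k) for k in ks])
    p_nml = counts * ml / (counts * ml).sum()
    A = np.array([[comb(n, k) * t**k * (1 - t)**(n - k) for t in grid] for k in ks])
    return np.linalg.solve(A, p_nml)

n = 4
grid = (np.arange(n + 1) + 0.5) / (n + 1)
w = bayes_at_n_weights(n, grid)

def predictive(prefix):
    """p_NML(x_{t+1} = 1 | prefix): two K-term dot products per step."""
    t, k = len(prefix), sum(prefix)
    lik = grid**k * (1 - grid)**(t - k)   # p_theta(prefix) on the grid
    return np.dot(w, lik * grid) / np.dot(w, lik)

# Chain rule: the product of predictives recovers p_NML of the full sequence.
x = [1, 0, 1, 1]
p = 1.0
for t in range(n):
    q1 = predictive(x[:t])
    p *= q1 if x[t] == 1 else (1 - q1)
```

The chained product `p` agrees with the direct NML probability $\max_\theta p_\theta(x^n)/C_n$ of the sequence, without ever enumerating the $2^n$ continuations.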
4. Guarantees: Minimax Coding and Exact Bayesian Coherence
The mixture representation provides two fundamental guarantees:
- The codelength $-\log p_{\mathrm{NML}}(x^n)$ achieves the minimax worst-case regret $\log C_n$; i.e., it never exceeds the code length of the best parameter, $-\log \max_\theta p_\theta(x^n)$, by more than $\log C_n$ for any data sequence.
- The one-step predictives (for each horizon $n$) coincide precisely with Bayesian inference under $w_n$, with all conditionalization and finite-sample Bayesian properties intact.
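The coherence guarantee can be checked directly on a small Bernoulli instance: the NML conditional obtained by brute-force marginalization over all continuations agrees with the mixture predictive under the recovered prior. A sketch, assuming an illustrative uniform grid and hypothetical names:

```python
# Sketch: conditionalization check for Bernoulli, n = 4.
from itertools import product
from math import comb
import numpy as np

n = 4
ks = np.arange(n + 1)
ml = (ks / n) ** ks * (1 - ks / n) ** (n - ks)       # maximized likelihood per count
binom = np.array([comb(n, k) for k in ks])
p_count = binom * ml / (binom * ml).sum()            # NML prob of each count value
grid = (ks + 0.5) / (n + 1)                          # illustrative uniform grid
A = np.array([[comb(n, k) * t**k * (1 - t)**(n - k) for t in grid] for k in ks])
w = np.linalg.solve(A, p_count)                      # Bayes@N prior weights

def p_nml_seq(s):
    """NML probability of one full length-n sequence."""
    k = sum(s)
    return p_count[k] / comb(n, k)

def marginal(prefix):
    """NML marginal of a prefix by brute-force summation over continuations."""
    m = n - len(prefix)
    return sum(p_nml_seq(list(prefix) + list(c)) for c in product([0, 1], repeat=m))

prefix = [1, 0]
t, k = len(prefix), sum(prefix)
brute = marginal(prefix + [1]) / marginal(prefix)    # conditionalization on NML itself
lik = grid**k * (1 - grid)**(t - k)                  # p_theta(prefix) on the grid
mix = float(np.dot(w, lik * grid) / np.dot(w, lik))  # mixture (Bayesian) predictive
```

`brute` and `mix` coincide: conditioning NML directly and predicting under the Bayes@N prior give the same answer at this horizon.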
A plausible implication is that statistical inference, coding, and sequential prediction can be interpreted simultaneously as both minimax-optimal and Bayesian (horizon-dependent) for arbitrary finite samples.
5. Relationship to MDL and Bayesian Modeling
Bayes@N resolves a long-standing question about the connection between the minimum description length (MDL) principle and Bayesian prediction. The existence of the finite-sample prior $w_n$ (even if signed) shows that NML’s minimax regret procedure is, at every $n$, a fully Bayesian predictive with a particular, sample-size-specific prior. When $w_n$ is positive, all consequences of Bayesian modeling (coherence, valid posterior probabilities, etc.) follow identically; when $w_n$ is signed, the mixture remains algebraically and algorithmically valid for predictive purposes but loses the interpretation of uncertainty over parameters.
6. Extensions: Examples and Practical Implications
For Bernoulli and multinomial models, the number of distinct sufficient-statistic values $K$ is polynomial in $n$; moreover, arcsine grids allow nonnegative weights $w_j$ and thus true Bayes mixtures. In Gaussian location families, symmetric grids may sometimes yield nonnegative weights when grid intervals are sufficiently broad.
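Whether the solved weights come out nonnegative depends on the grid. The sketch below tries an arcsine-quantile grid for Bernoulli (an assumed spacing; the paper's exact grid construction may differ) and verifies only that the mixture reproduces NML exactly, leaving the sign pattern as an inspection point:

```python
# Sketch: arcsine-spaced grid for the Bernoulli Bayes@N weights (assumed spacing).
import numpy as np
from math import comb, sin, pi

n = 6
K = n + 1
ks = np.arange(K)
ml = (ks / n) ** ks * (1 - ks / n) ** (n - ks)
binom = np.array([comb(n, k) for k in ks])
p_count = binom * ml / (binom * ml).sum()            # NML prob of each count

# Arcsine-distribution quantiles as grid points (illustrative assumption):
grid = np.array([sin(pi * (j + 0.5) / (2 * K)) ** 2 for j in range(K)])
A = np.array([[comb(n, k) * t**k * (1 - t)**(n - k) for t in grid] for k in ks])
w = np.linalg.solve(A, p_count)

nonnegative = bool((w >= -1e-12).all())              # True iff a genuine Bayes mixture
```

The reconstruction $A w = p_{\mathrm{NML}}$ holds for any grid of distinct points; only the nonnegativity of `w`, and hence the genuine-mixture interpretation, is grid-dependent.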
The property Bayes@N enables significantly faster computations of marginals and conditionals by reducing high-dimensional sums to algebraic evaluations using the precomputed prior, providing scalable algorithms for universal coding and prediction consistent both with MDL and finite-sample Bayesian reasoning.
7. Limitations, Signed Priors, and Generalizations
If the mixture weights cannot be chosen nonnegative, the solution defaults to signed priors, which lack interpretation as probability measures over parameters but suffice for algebraic and operational validity. The necessary condition for a positive mixture is model-dependent and relates to the geometry of sufficient statistics and model vectors.
A plausible implication is that signed priors may be unavoidable in highly regularized or under-determined model classes, but for many practical models (one-parameter exponential families, large-grid multinomials, suitably chosen grids), honest positive priors are feasible, permitting Bayes@N to blend Bayesian inference and minimax regret seamlessly.
The central result is that, for every $n$, NML distributions are Bayesian under a specifically constructed sample-size-dependent prior $w_n$, yielding both rigorous minimax coding and Bayesian predictives, and enabling efficient computation of posterior marginals and conditionals with exact guarantees (Barron et al., 2014).