Bayes@N: Linking NML and Bayesian Inference

Updated 3 January 2026
  • Bayes@N is the principle that, for each fixed sample size, the normalized maximum likelihood (NML) distribution exactly matches the Bayesian predictive and marginal distributions under a sample-size-specific prior.
  • It simultaneously guarantees minimax coding regret and finite-sample Bayesian coherence, enabling optimal sequential prediction and efficient computation.
  • By leveraging sufficient statistics and grid-based weights, Bayes@N unites MDL principles with Bayesian inference, yielding either genuine Bayes mixtures or, when necessary, valid signed priors.

Bayes@N denotes the principle that, for every fixed sample size $N$, the normalized maximum likelihood (NML) distribution associated with a statistical model coincides exactly with the Bayesian predictive and marginal distributions for that sample size, under a specific (possibly signed) prior $w_N$. This property was established for arbitrary parametric families of i.i.d. distributions, providing a conceptual link between Bayesian inference and the minimum description length (MDL) paradigm in universal coding, gambling, and prediction. Bayes@N guarantees minimax regret while delivering finite-sample Bayesian coherence (such as conditionalization and exchangeability of updates), regardless of sample size. Notably, for certain models the corresponding prior can be chosen nonnegative, yielding a genuine Bayes mixture (Barron et al., 2014).

1. Core Definitions and Mathematical Foundations

Consider a statistical model specified by i.i.d. densities or mass functions $f(x^n|\theta) = \prod_{i=1}^n f(x_i|\theta)$, with parameter $\theta$ belonging to a parameter space $\Theta$ and sample space $\mathcal{X}$. The normalized maximum likelihood distribution is defined as

$$P_\text{NML}(x^n) = \frac{m(x^n)}{C_n},$$

where $m(x^n) = \max_\theta f(x^n|\theta)$ and $C_n = \sum_{y^n \in \mathcal{X}^n} m(y^n)$ (or the analogous integral for continuous spaces). Shtarkov’s theorem ensures that $P_\text{NML}$ uniquely achieves the minimax pointwise regret

$$\min_q \max_{x^n} \left[-\log q(x^n) + \log m(x^n)\right]$$

for coding and prediction.
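The definitions above can be checked numerically by brute force for a tiny model. The following sketch (my own toy illustration, not code from the paper) enumerates all sequences of a Bernoulli model with $n=3$, builds $P_\text{NML}$, and confirms both normalization and the constancy of the pointwise regret:

```python
# Brute-force check of the NML definition for a small Bernoulli model.
# Toy illustration: n = 3 keeps the sums small enough to enumerate.
from itertools import product
from math import log

n = 3

def max_lik(xs):
    """m(x^n) = max_theta prod f(x_i|theta); for Bernoulli the max is
    attained at theta-hat = t/n, where t is the number of ones."""
    t = sum(xs)
    th = t / n
    return (th ** t) * ((1 - th) ** (n - t))  # 0**0 == 1 in Python

seqs = list(product([0, 1], repeat=n))
C_n = sum(max_lik(xs) for xs in seqs)            # Shtarkov normalizer
p_nml = {xs: max_lik(xs) / C_n for xs in seqs}

# P_NML is a genuine distribution ...
assert abs(sum(p_nml.values()) - 1.0) < 1e-12
# ... and its pointwise regret -log q(x^n) + log m(x^n) is the SAME
# constant log C_n for every sequence (the minimax property).
regrets = [-log(p_nml[xs]) + log(max_lik(xs)) for xs in seqs]
assert max(regrets) - min(regrets) < 1e-12
print(f"C_n = {C_n:.6f}, regret = {regrets[0]:.6f}")
```

The regret being identical across all $2^n$ sequences is exactly the equalizer property that makes $P_\text{NML}$ minimax optimal.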

The principle Bayes@N asserts that for each $n$, there exists a prior measure $w_n$ (potentially signed) such that

$$P_\text{NML}(x^n) = \int_\Theta w_n(\theta)\, f(x^n|\theta)\, d\theta,$$

and for any data prefix $x^i$,

$$P_\text{NML}(x^i) = \sum_{k=1}^M W_{k,n}\, f(x^i|\theta_k)/C_n,$$

where $W_{k,n}$ are explicit weights derived via a linear system involving sufficient statistics (Barron et al., 2014).

2. Mixture Representation of NML and the “Bayes@N Prior”

The Bayes@N mixture representation exploits a sufficient statistic $T(x^n)$ taking $M$ distinct values. For an appropriately chosen $M$-point grid $\theta_1,\dots,\theta_M$ with linearly independent model vectors $p_T(\cdot|\theta_k)$, the maximized distribution $m_T$ is decomposed by solving

$$m_T(t) = \sum_{k=1}^M p_T(t|\theta_k)\, W_{k,n},$$

yielding a discrete prior $w_n(\theta) = \sum_{k=1}^M W_{k,n}\,\delta_{\theta_k}(\theta)$. The condition for a genuine Bayesian mixture ($w_n \ge 0$) is that $m_T$ lies in the convex hull of $\{p_T(\cdot|\theta_k)\}$; if not, $w_n$ has negative parts, but the predictive distribution itself remains valid and nonnegative.

Illustrative examples: for Bernoulli($\theta$), $M = n+1$, and a uniform grid $\theta_k = k/n$ (or an arcsine-spaced grid) recovers $W_{k,n}$ by matrix inversion. In the Gaussian location model, the weights $W_{k,n}$ are determined via deconvolution over suitable grids of the mean parameter.
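For the Bernoulli case this matrix inversion is small enough to carry out directly. A minimal numerical sketch (my own construction following the text, with $n=3$ and the grid $\theta_k = k/n$):

```python
# Recovering the Bayes@N weights W_{k,n} for Bernoulli(theta) by solving
# the linear system m_T = A @ W on the grid theta_k = k/n (toy n = 3).
from math import comb
import numpy as np

n = 3
M = n + 1
grid = np.array([k / n for k in range(M)])       # theta_k = k/n

# A[t, k] = p_T(t | theta_k): binomial pmf of the sufficient
# statistic t = number of ones
A = np.array([[comb(n, t) * th**t * (1 - th)**(n - t) for th in grid]
              for t in range(M)])

# m_T(t): maximized likelihood of t, attained at theta-hat = t/n
m_T = np.array([comb(n, t) * (t / n)**t * (1 - t / n)**(n - t)
                for t in range(M)])

W = np.linalg.solve(A, m_T)                      # the Bayes@N weights

# The discrete prior reproduces the maximized distribution exactly
assert np.allclose(A @ W, m_T)
print("W =", W)
```

For $n=3$ this yields $W = (7/9,\ 2/3,\ 2/3,\ 7/9)$, all nonnegative, so here the NML distribution is a genuine Bayes mixture.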

3. Bayesian Predictives, Minimax Regret, and Computational Efficiency

By construction,

$$P_\text{NML}(x_i|x^{i-1}) = P_\text{NML}(x^i)/P_\text{NML}(x^{i-1})$$

coincides with the Bayesian predictive under $w_n$. This duality ensures that at sample size $N$, the predictive updates are exactly Bayesian, conditionalization holds, and the sequence of updates is exchangeable.

Computationally, marginals and predictives require only $O(M)$ time per forward step once the weights $W_{k,n}$ and model evaluations are cached, and solving for the full weight vector costs $O(M^3)$ (e.g., $O(n^3)$ for Bernoulli). Fast algebraic techniques (e.g., exploiting Vandermonde or Hankel structure in binomial models) reduce this to $O(n^2)$ or $O(n \log^2 n)$. By contrast, direct enumeration of the $|\mathcal{X}|^{n-i}$ continuations is infeasible beyond moderate $n$ (Barron et al., 2014).
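The gap between the two routes can be seen concretely: in the toy Bernoulli sketch below (my own illustration, $n=4$, prefix $x^2=(1,0)$), the prefix marginal computed by the $O(M)$ mixture formula agrees exactly with the sum over all $|\mathcal{X}|^{n-i}$ continuations:

```python
# O(M) mixture formula vs. exponential enumeration for a prefix
# marginal of Bernoulli NML (toy sketch, n = 4, grid theta_k = k/n).
from itertools import product
from math import comb
import numpy as np

n = 4
grid = np.array([k / n for k in range(n + 1)])
A = np.array([[comb(n, t) * th**t * (1 - th)**(n - t) for th in grid]
              for t in range(n + 1)])
m_T = np.array([comb(n, t) * (t / n)**t * (1 - t / n)**(n - t)
                for t in range(n + 1)])
W = np.linalg.solve(A, m_T)       # Bayes@N weights
C_n = m_T.sum()                   # Shtarkov normalizer

def max_lik(xs):
    t, ln = sum(xs), len(xs)
    th = t / ln
    return th**t * (1 - th)**(ln - t)

prefix = (1, 0)
# Exponential route: sum P_NML(x^n) over all |X|^(n-i) continuations
brute = sum(max_lik(prefix + rest) / C_n
            for rest in product([0, 1], repeat=n - len(prefix)))
# O(M) route: P_NML(x^i) = sum_k W_k f(x^i|theta_k) / C_n
i, t_i = len(prefix), sum(prefix)
fast = sum(W[k] * grid[k]**t_i * (1 - grid[k])**(i - t_i)
           for k in range(n + 1)) / C_n
assert abs(brute - fast) < 1e-12
print(f"P_NML(prefix) = {fast:.6f}")
```

The brute-force sum touches $2^{n-i}$ terms, while the mixture formula touches only $M = n+1$ cached grid points, which is the source of the speedup described above.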

4. Guarantees: Minimax Coding and Exact Bayesian Coherence

The mixture representation provides two fundamental guarantees:

  • The codelength $-\log P_\text{NML}(x^n)$ achieves the minimax worst-case regret: for any data sequence, it never exceeds the code length of the best parameter $\theta^*$ by more than $\log C_n$.
  • The one-step predictives (for each horizon $N$) coincide precisely with Bayesian inference under $w_N$, with conditionalization and all finite-sample Bayesian properties intact.
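The second guarantee can be verified numerically: in the toy Bernoulli sketch below (my own illustration, $n=4$, grid $\theta_k=k/n$), the NML conditional obtained by conditionalizing the enumerated marginals matches the Bayesian posterior predictive under the weights $W_{k,n}$:

```python
# Checking that the NML one-step predictive coincides with the Bayesian
# predictive under the Bayes@N weights (toy Bernoulli sketch, n = 4).
from itertools import product
from math import comb
import numpy as np

n = 4
grid = np.array([k / n for k in range(n + 1)])
A = np.array([[comb(n, t) * th**t * (1 - th)**(n - t) for th in grid]
              for t in range(n + 1)])
m_T = np.array([comb(n, t) * (t / n)**t * (1 - t / n)**(n - t)
                for t in range(n + 1)])
W = np.linalg.solve(A, m_T)       # Bayes@N weights
C_n = m_T.sum()

def nml_marginal(prefix):
    """P_NML(x^i) by explicit summation over all continuations."""
    total = 0.0
    for rest in product([0, 1], repeat=n - len(prefix)):
        t = sum(prefix) + sum(rest)
        total += (t / n)**t * (1 - t / n)**(n - t)
    return total / C_n

def bayes_lik(prefix):
    """sum_k W_k f(x^i|theta_k): mixture likelihood of the prefix."""
    t, i = sum(prefix), len(prefix)
    return sum(W[k] * grid[k]**t * (1 - grid[k])**(i - t)
               for k in range(n + 1))

prefix = (1, 1, 1)
# NML conditional via conditionalization of the NML marginals ...
nml_pred = nml_marginal(prefix) / nml_marginal(prefix[:-1])
# ... equals the Bayesian posterior predictive under W.
bayes_pred = bayes_lik(prefix) / bayes_lik(prefix[:-1])
assert abs(nml_pred - bayes_pred) < 1e-12
print(f"P(x_3 = 1 | x^2 = (1,1)) = {nml_pred:.6f}")
```

Note the predictive exceeds $1/2$ after observing two ones, reflecting the Bayesian update under the sample-size-specific prior.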

A plausible implication is that statistical inference, coding, and sequential prediction can be interpreted simultaneously as both minimax-optimal and Bayesian (horizon-dependent) for arbitrary finite samples.

5. Relationship to MDL and Bayesian Modeling

Bayes@N resolves a long-standing question about the connection between the minimum description length (MDL) principle and Bayesian prediction. The existence of the finite-sample prior $w_N$ (even if signed) shows that NML’s minimax-regret procedure is, at every $N$, a fully Bayesian predictive with a particular, sample-size-specific prior. When $w_N$ is nonnegative, all consequences of Bayesian modeling (coherence, valid posterior probabilities, etc.) follow identically; when $w_N$ is signed, the mixture remains algebraically and algorithmically valid for predictive purposes but loses the interpretation of uncertainty over parameters.

6. Extensions: Examples and Practical Implications

For Bernoulli and multinomial models, $M$ is polynomial in $n$; moreover, arcsine grids allow nonnegative $w_n$ and thus true Bayes mixtures. In Gaussian location families, symmetric grids may sometimes yield nonnegative weights when the parameter intervals are sufficiently broad.

The property Bayes@N enables significantly faster computation of marginals and conditionals by reducing high-dimensional sums to algebraic evaluations with the precomputed prior. This yields scalable algorithms for universal coding and prediction that are consistent with both MDL and finite-sample Bayesian reasoning.

7. Limitations, Signed Priors, and Generalizations

If the mixture weights $w_n$ cannot be chosen nonnegative, the solution defaults to a signed prior, which lacks an interpretation as a probability measure over parameters but suffices for algebraic and operational validity. The condition for a nonnegative mixture is model-dependent and relates to the geometry of the sufficient statistics and model vectors.

A plausible implication is that signed priors may be unavoidable in highly regularized or under-determined model classes, but for many practical models (one-parameter exponential families, large-grid multinomials, suitably chosen grids), genuine nonnegative priors are feasible, permitting Bayes@N to blend Bayesian inference and minimax regret seamlessly.

The central result is that, for every $N$, NML distributions are Bayesian under a specifically constructed, sample-size-dependent prior $w_N$, yielding both rigorous minimax coding and Bayesian predictives, and enabling efficient computation of posterior marginals and conditionals with exact guarantees (Barron et al., 2014).
