
Power/Tempered Posteriors

Updated 27 January 2026
  • Power/tempered posteriors are a Bayesian generalization that incorporate a temperature parameter to control the influence of the likelihood relative to the prior.
  • They underpin methods like thermodynamic integration, annealed MCMC, and PAC-Bayesian analysis, providing robust finite-sample guarantees and improved computational tractability.
  • Applications span model selection, deep Bayesian learning, and Gaussian process regression, where tuning the temperature can enhance predictive performance and uncertainty calibration.

Power posteriors, also known as tempered posteriors, are a generalization of the standard Bayesian posterior that introduces a power or temperature parameter to control the relative influence of the likelihood and prior. Formally, given a prior $p(\theta)$ and likelihood $p(D|\theta)$ for data $D$, the power posterior at power $\beta > 0$ or temperature $T = 1/\beta$ is defined as $p_\beta(\theta|D) \propto p(D|\theta)^\beta p(\theta)$, or equivalently $p_T(\theta|D) \propto \exp(-U(\theta)/T)$, where $U$ is the negative joint log probability. Power/tempered posteriors have become central in Bayesian computation, model selection, PAC-Bayesian theory, and deep learning, both as a practical tool for controlling model fit and as a theoretical device for finite-sample guarantees, computational robustness, and algorithmic efficiency.

1. Definition and Mathematical Formulation

A power or tempered posterior generalizes the traditional Bayesian update rule by introducing a scalar exponent $\gamma > 0$ (or its inverse, the "temperature" $T = 1/\gamma$) applied to the likelihood:

$$p_\gamma(\theta|D) \propto p(D|\theta)^\gamma p(\theta), \qquad \text{or equivalently} \qquad p_T(\theta|D) \propto \exp\left( -U(\theta)/T \right)$$

  • When $\gamma = 1$ ($T = 1$), this coincides with the standard Bayesian posterior.
  • For $\gamma > 1$ ($T < 1$, "cold" posterior), the likelihood is up-weighted relative to the prior, producing a more concentrated posterior.
  • For $\gamma < 1$ ($T > 1$, "hot" posterior), the posterior is more diffuse, increasingly dominated by the prior (Adlam et al., 2020).

This definition is universal and applies in finite-dimensional parametric models, infinite-dimensional function spaces, and discrete or continuous parameter settings.
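This behaviour is easy to see in a conjugate model. The sketch below is an illustrative Beta-Bernoulli setup (not taken from any cited paper): raising the Bernoulli likelihood to the power $\gamma$ preserves conjugacy, and the posterior visibly concentrates as $\gamma$ grows.

```python
from scipy import stats

def tempered_beta_posterior(k, n, gamma, a=1.0, b=1.0):
    """Tempered posterior for k successes in n Bernoulli trials under a
    Beta(a, b) prior.  Raising the likelihood p^k (1-p)^(n-k) to the power
    gamma preserves conjugacy: the result is Beta(a + gamma*k, b + gamma*(n-k))."""
    return stats.beta(a + gamma * k, b + gamma * (n - k))

k, n = 30, 100  # illustrative data: 30 successes in 100 trials
post_hot = tempered_beta_posterior(k, n, gamma=0.5)   # T = 2: more diffuse
post_std = tempered_beta_posterior(k, n, gamma=1.0)   # ordinary Bayes
post_cold = tempered_beta_posterior(k, n, gamma=2.0)  # T = 0.5: more concentrated

# Posterior standard deviations shrink monotonically as gamma grows:
print(post_hot.std(), post_std.std(), post_cold.std())
```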

2. Theoretical Properties and Frequentist Implications

Power posteriors possess several theoretically significant properties:

  • Frequentist Generalization and PAC-Bayes: In PAC-Bayesian theory, the tempered/Gibbs/power posterior arises naturally as the minimizer of a trade-off between empirical loss and a Kullback–Leibler divergence penalty relative to a prior, typically with a scaling factor (temperature) balancing empirical risk against complexity (Zhang, 2024, Pitas et al., 2023, Perrotta, 2020, Banerjee et al., 2021).

The PAC-Bayes bound for the risk $L(\rho)$ of a randomized estimator $\rho$ often takes the form:

$$\mathbb{E}_{p_\lambda}[l] \leq \mathbb{E}_{p_\lambda}[\hat{l}] + \frac{1}{n}\left( \mathrm{KL}(p_\lambda \,\|\, \pi) + \log\frac{2\sqrt{n}}{\delta} \right)$$

where $\lambda$ is the inverse temperature (Zhang, 2024).

  • Robustness and Concentration: Fractional (power) posteriors with $\gamma < 1$ generally enjoy more favorable robustness properties under model misspecification and require weaker conditions for posterior concentration. Many classical theorems (e.g., Bernstein–von Mises, finite-sample risk bounds, nonasymptotic concentration) extend under milder assumptions to $\gamma < 1$ (Alquier et al., 2017, Jaiswal et al., 2023).
  • Equivalence in Gaussian Models: In linear Gaussian models, changing the temperature is analytically equivalent to scaling the prior variance or noise variance, making power posterior tuning mathematically identical to empirical Bayes (type-II ML) hyperparameter selection (Adlam et al., 2020).

Specifically, the predictive mean stays unchanged, and the variance is simply scaled by $T$ (Adlam et al., 2020).
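This equivalence can be checked directly in a toy conjugate normal-mean model (the helper `conj_gaussian_posterior` below is an illustrative sketch, not code from the cited work): tempering the Gaussian joint as $\exp(-U(\theta)/T)$ rescales its precision by $1/T$, which is the same as inflating both the prior and noise variances by $T$, so the posterior mean is untouched and the variance is multiplied by $T$.

```python
import numpy as np

def conj_gaussian_posterior(y, mu0, tau2, sigma2):
    """Exact posterior for theta ~ N(mu0, tau2), y_i | theta ~ N(theta, sigma2)."""
    prec = 1.0 / tau2 + len(y) / sigma2
    mean = (mu0 / tau2 + y.sum() / sigma2) / prec
    return mean, 1.0 / prec

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=50)
T = 0.25  # a "cold" temperature

# Standard posterior vs. posterior with both variances scaled by T:
m1, v1 = conj_gaussian_posterior(y, 0.0, 1.0, 4.0)
mT, vT = conj_gaussian_posterior(y, 0.0, T * 1.0, T * 4.0)

# Tempering exp(-U/T) rescales the joint's precision by 1/T, identical to
# scaling both prior and noise variances by T: same mean, variance times T.
print(np.isclose(m1, mT), np.isclose(vT, T * v1))
```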

3. Computational Uses: Power Posteriors in Marginal Likelihood and MCMC

Power posteriors are foundational in several computational algorithms:

  • Thermodynamic Integration for Marginal Likelihood: The marginal likelihood (evidence) can be expressed as a path integral over powers of the posterior:

$$\log m(y) = \int_{0}^{1} \mathbb{E}_{\pi_t}\left[ \log p(y|\theta) \right] dt$$

where $\pi_t(\theta|y) \propto p(y|\theta)^t \pi(\theta)$ (Calvo et al., 2022). This provides the basis for thermodynamic integration and annealed importance sampling, which are algorithmically robust for estimating evidence and Bayes factors.

  • Tempered and Annealed MCMC: Raising the likelihood to a fractional power flattens the posterior energy landscape, allowing tempered and parallel-tempering MCMC and SMC algorithms to better traverse multimodal posteriors, accelerate mixing, and improve variance reduction. In particular, MINT-MCMC connects mini-batch approximate MCMC acceptance ratios directly to tempered posteriors of analytically known temperature (Li et al., 2017).
  • Variance Reduction in Sequential Monte Carlo: Knowledge of the sequence of tempered expectations enables extrapolation to posterior expectations at the target distribution, reducing Monte Carlo variance in SMC with little additional cost (Xi et al., 15 Sep 2025).
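The thermodynamic-integration identity above can be sketched in a toy conjugate Gaussian model (all model choices here are illustrative). Each power posterior $\pi_t$ is itself Gaussian, so the inner expectation is available in closed form, only the one-dimensional integral over $t$ is done numerically, and the result can be checked against the exact evidence.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence_ti(y, tau2, sigma2, n_temps=400):
    """Thermodynamic integration:  log m(y) = integral over t in [0,1] of
    E_{pi_t}[log p(y|theta)],  where  pi_t(theta|y) prop. to p(y|theta)^t pi(theta).

    Illustrative model: theta ~ N(0, tau2), y_i | theta ~ N(theta, sigma2),
    so every pi_t is Gaussian and the integrand is exact."""
    n = len(y)
    # Power schedule clustered near t = 0, where the integrand changes fastest.
    ts = np.linspace(0.0, 1.0, n_temps) ** 3
    vals = []
    for t in ts:
        prec = 1.0 / tau2 + t * n / sigma2        # precision of pi_t
        m_t = (t * y.sum() / sigma2) / prec       # mean of pi_t
        v_t = 1.0 / prec                          # variance of pi_t
        # E[log p(y|theta)] under theta ~ N(m_t, v_t):
        sq = np.sum((y - m_t) ** 2) + n * v_t
        vals.append(-0.5 * n * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2))
    vals = np.array(vals)
    return np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(ts))  # trapezoid rule

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=20)
est = log_evidence_ti(y, tau2=1.0, sigma2=1.0)

# Exact log evidence: marginally y ~ N(0, sigma2*I + tau2*J), J all-ones.
exact = multivariate_normal(np.zeros(20), np.eye(20) + np.ones((20, 20))).logpdf(y)
print(est, exact)  # the two values agree closely
```

The cubed temperature schedule is a common practical choice that concentrates grid points near $t = 0$, where $\mathbb{E}_{\pi_t}[\log p(y|\theta)]$ varies most rapidly.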
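The flattening effect of tempering is what parallel tempering exploits. The sketch below is a generic toy scheme (a random-walk Metropolis chain per temperature on a double-well target, not any specific published algorithm): hot chains cross the energy barrier freely and feed mode jumps down to the $T = 1$ chain through swap moves.

```python
import numpy as np

def U(x):
    # Double-well energy: modes near x = +/-2, barrier of height 8 at x = 0.
    return 0.5 * (x**2 - 4.0) ** 2

def parallel_tempering(n_iter=20000, temps=(1.0, 3.0, 9.0), seed=0):
    """Toy parallel tempering: one random-walk Metropolis chain per
    temperature targeting exp(-U(x)/T), plus swap attempts between
    neighbouring temperatures."""
    rng = np.random.default_rng(seed)
    x = np.full(len(temps), 2.0)          # start every chain in the right mode
    out = []
    for _ in range(n_iter):
        for i, T in enumerate(temps):     # within-chain Metropolis updates
            prop = x[i] + rng.normal(0.0, 1.0)
            if np.log(rng.random()) < (U(x[i]) - U(prop)) / T:
                x[i] = prop
        i = rng.integers(len(temps) - 1)  # attempt one neighbour swap
        dlog = (U(x[i]) - U(x[i + 1])) * (1.0 / temps[i] - 1.0 / temps[i + 1])
        if np.log(rng.random()) < dlog:
            x[i], x[i + 1] = x[i + 1], x[i]
        out.append(x[0])
    return np.array(out)

samples = parallel_tempering()
# The cold chain should occupy both modes, not just the starting one.
print((samples > 0).mean(), (samples < 0).mean())
```

A single $T = 1$ chain would essentially never cross the barrier (acceptance $\approx e^{-8}$ per attempt); the hot chain at $T = 9$ sees a barrier of only $8/9$ and mixes freely.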

4. Tempered Posteriors in Deep Learning and Statistical Modeling

Numerous empirical observations in Bayesian deep learning and applied statistics motivate the explicit adjustment of posterior temperature:

  • Cold Posterior Effect: Empirical results in Bayesian neural networks indicate that predictive performance (test likelihood, test accuracy) is maximized at values $\gamma^* \gg 1$ ($T^* \ll 1$), i.e., posteriors more concentrated than the standard Bayesian update. This phenomenon is pronounced on "curated" datasets with low aleatoric uncertainty (e.g., CIFAR-10) (Adlam et al., 2020, Aitchison, 2020, Zhang, 2024).
    • In probabilistic neural networks and deep Bayesian learning, tempering the posterior guards against overestimated aleatoric uncertainty from standard priors, making the inferred predictive probability distributions tighter and leading to improved calibration and accuracy on low-noise datasets (Adlam et al., 2020).
  • Curation, Misspecification, and Tempering: Modeling the data curation process (e.g., rejecting ambiguous labels) leads naturally—on marginalizing over unobserved latent inclusion indicators—to posterior updates of tempered (cold) form $p_T(\theta|D) \propto p(D|\theta)^{1/T} p(\theta)$ with $T < 1$ (Aitchison, 2020). Thus, the observed cold-posterior effect can be viewed as a reflection of the actual data-generating mechanism.
  • Bayesian Optimization and Robustness: In Bayesian optimization using Gaussian Process surrogates, local model misspecification and overconcentration can be mitigated by down-weighting the likelihood (i.e., tempering). The effect of the tempering parameter can be analytically understood as an inflation of the noise variance, leading to improved exploration and sharper cumulative regret bounds (Li et al., 11 Jan 2026).
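The noise-inflation view can be illustrated with a minimal GP regression sketch on synthetic data (the RBF kernel and all hyperparameters are assumptions for illustration): since $N(y \mid f, \sigma^2)^\gamma$ is proportional in $f$ to $N(y \mid f, \sigma^2/\gamma)$, tempering the Gaussian likelihood by $\gamma$ is exactly replacing the noise variance $\sigma^2$ with $\sigma^2/\gamma$, so $\gamma < 1$ widens the predictive variance.

```python
import numpy as np

def gp_posterior(X, y, Xs, sigma2, gamma=1.0, ell=1.0):
    """GP regression with an RBF kernel and a Gaussian likelihood tempered
    by gamma.  Tempering enters purely as inflated noise sigma2/gamma."""
    k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)
    Kn = k(X, X) + (sigma2 / gamma) * np.eye(len(X))
    Ks = k(X, Xs)
    sol = np.linalg.solve(Kn, Ks)              # Kn^{-1} K_s
    mean = sol.T @ y                           # predictive mean at Xs
    var = np.diag(k(Xs, Xs) - Ks.T @ sol)      # predictive variance at Xs
    return mean, var

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 25)
y = np.sin(X) + rng.normal(0.0, 0.3, size=25)
Xs = np.array([0.0, 1.5])

_, var_std = gp_posterior(X, y, Xs, sigma2=0.09, gamma=1.0)
_, var_hot = gp_posterior(X, y, Xs, sigma2=0.09, gamma=0.25)
print(var_hot > var_std)  # down-weighting the likelihood widens the posterior
```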

5. Variational Inference, Calibration, and Practical Model Selection

The temperature parameter also arises in variational inference, PAC-Bayesian learning, and practical Bayesian procedures:

  • Variational Tempered Posteriors: Variational Bayes (VB) inference of the tempered posterior optimizes a modified evidence lower bound (ELBO) with the log-likelihood scaled by $\gamma$. This modification is critical to avoid overconfidence in misspecified models and can be extended analytically to a wide range of VB solvers, including mean-field and Gaussian variational families (Alquier et al., 2017, Perrotta, 2020, Banerjee et al., 2021).
  • Temperature ($\alpha$) Calibration: Several empirical schemes, notably sample-splitting (hold-out) and bootstrap calibration, have been developed to select the optimal temperature $\alpha$ by minimizing predictive loss on validation data. These methods are fast, robust, and applicable in both exact and variational settings (Perrotta, 2020). In practice, correctly tuning $\alpha$ systematically improves predictive accuracy and uncertainty calibration, especially under model misspecification.
  • Connections to Empirical Bayes: In Gaussian settings (e.g., GP regression, linear models), selecting the temperature is formally equivalent to empirical Bayes maximization over the marginal likelihood with respect to prior hyperparameters or noise variance (Adlam et al., 2020).
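The hold-out calibration idea can be sketched in a toy Bernoulli model (the function name and the candidate grid below are illustrative assumptions, not from the cited scheme): fit the tempered posterior on one data split for each candidate $\alpha$ and keep the $\alpha$ with the lowest predictive loss on the other split.

```python
import numpy as np

def heldout_calibrate(train, val, alphas, a=1.0, b=1.0):
    """Sample-splitting temperature calibration for a Bernoulli model with a
    Beta(a, b) prior: for each candidate alpha, fit the tempered posterior on
    the training split and score it by validation negative log-likelihood."""
    k, n = train.sum(), len(train)
    losses = []
    for alpha in alphas:
        # Tempered conjugate posterior: Beta(a + alpha*k, b + alpha*(n - k)).
        pa, pb = a + alpha * k, b + alpha * (n - k)
        p = pa / (pa + pb)                 # posterior-predictive P(y = 1)
        losses.append(-np.mean(val * np.log(p) + (1 - val) * np.log(1 - p)))
    return alphas[int(np.argmin(losses))], np.array(losses)

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=200)
alphas = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
alpha_star, losses = heldout_calibrate(data[:100], data[100:], alphas)
print(alpha_star)  # the grid value with the lowest held-out loss
```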
| Method or Setting | Role of Temperature ($\gamma$, $T$, or $\alpha$) | Key Effect |
|---|---|---|
| Marginal likelihood | Annealing pathway in thermodynamic integration | Smooths path from prior to posterior |
| MCMC/SMC/SGLD | Controls posterior sharpness for mixing/variance reduction | Accelerates exploration |
| GP regression | Identical to scaling noise or kernel variance | Matches empirical Bayes |
| BNNs/classification | Compensates for overconfident uncertainties | Reduces aleatoric overfit |
| PAC-Bayes/VI | Rescales trade-off between fit and complexity | Leads to tighter risk bounds |

6. Predictive Performance, Asymptotics, and Limitations

  • Asymptotic Neutrality: In moderate-to-large samples, the choice of temperature ceases to matter for one-step-ahead predictive distributions. Specifically, under standard regularity and posterior concentration ($n\alpha \to \infty$), the predictive $p_n^{(\tau)}(y_{n+1} \mid y)$ is uniformly (in $\tau$) close to the plug-in predictive based on the MLE, with errors $O(1/\sqrt{n})$ (McLatchie et al., 2024). For moderate $n$, tempering may affect predictive risk, especially with highly informative priors or under model misspecification, but asymptotically all "reasonable" $\tau$ yield the same predictions.
  • Trade-offs and Tuning: The optimal temperature depends on the statistical goal and loss. For instance, cold posteriors may favor test accuracy but degrade calibration. Improvements in calibration or negative log-likelihood typically require sacrificing pure accuracy (Pitas et al., 2023). There is not a universal optimal temperature; it is an empirical and task-dependent hyperparameter.
  • Not Just Likelihood Rescaling: PAC-Bayesian analysis shows the temperature parameter cannot in general be simply interpreted as a correction for misspecified likelihood variance or prior scale; it contributes additional flexibility in balancing curvature, mode, and estimator randomness (Pitas et al., 2023).

7. Open Problems and Extensions

Future challenges and directions include:

  • Theory for Non-Tempered Posteriors: Extension of finite-sample risk analyses and Bernstein–von Mises theorems from fractional posteriors ($\alpha < 1$) to classical Bayes ($\alpha = 1$) in full generality, especially outside sub-Gaussian or parametric models (Jaiswal et al., 2023, Alquier et al., 2017).
  • Adaptive and Automatic Tempering: Development of online or data-driven temperature schedules (e.g., prequential, maximum-likelihood-based in inverse problems) for sequential and high-dimensional Bayesian computation (Martino et al., 2021, Li et al., 11 Jan 2026).
  • Variance Reduction Techniques: Utilizing analyticity of the map $t \mapsto \mathbb{E}_{p_t}[f(X)]$ for post-hoc variance reduction in sequential Monte Carlo via extrapolation of tempered expectations (Xi et al., 15 Sep 2025).
  • Robustness to Misspecification and Heavy Tails: Generalization of tempered/Bayesian methods to models with nonstandard loss, heavy-tailed or structured data—an open area for both theory and application (Jaiswal et al., 2023).

Power (tempered) posteriors thus serve as a core interface between Bayesian regularization, calibration, computational tractability, and frequentist risk, with ongoing theoretical and practical developments across modern machine learning, statistics, and computational science.
