
Bayesian Predictive Coding

Updated 1 February 2026
  • Bayesian Predictive Coding is a framework integrating hierarchical probabilistic models, variational Bayesian inference, and local Hebbian learning rules.
  • It employs iterative, amortized, and hybrid inference schemes to minimize precision-weighted prediction errors across multiple neural layers.
  • The approach advances biological plausibility by leveraging local synaptic updates, uncertainty quantification, and efficient convergence in varied learning tasks.

Bayesian Predictive Coding (BPC) is a normative and algorithmic framework uniting hierarchical probabilistic modeling, variational Bayesian inference, and local learning rules into a biologically plausible paradigm of computation and learning in brains and artificial neural circuits. At its core, BPC posits that perception and cognition arise from the recursive minimization of prediction errors within a multilayer generative model, performing implicit or explicit inference of latent causes and updating synaptic weights by local, Hebbian-style rules. This approach grounds a wide variety of computational algorithms and neural architectures, ranging from classical Rao-Ballard predictive coding circuits to modern hybrid inference schemes, and encompasses both the estimation of point beliefs and full uncertainty quantification through variational posteriors.

1. Hierarchical Generative Models and Variational Free Energy

Bayesian Predictive Coding builds upon hierarchical generative models of the form

$$p(\mathbf{x}^{L}, \dots, \mathbf{x}^{0}) = p(\mathbf{x}^{L}) \prod_{\ell=0}^{L-1} p(\mathbf{x}^{\ell} \mid \mathbf{x}^{\ell+1}),$$

where $\mathbf{x}^0$ denotes observed data (sensory inputs) and $\mathbf{x}^1, \dots, \mathbf{x}^L$ denote latent variables at progressively higher levels of abstraction. The conditionals $p(\mathbf{x}^{\ell} \mid \mathbf{x}^{\ell+1})$ are frequently taken to be Gaussian, with means parameterized by nonlinear functions of $\mathbf{x}^{\ell+1}$ and weights $W^{\ell+1}$; however, generalizations to arbitrary exponential families or more complex distributions are possible (Pinchetti et al., 2022, Tschantz et al., 31 Mar 2025, Salvatori et al., 2023).

Given observations, the goal is to infer the posterior distribution over hidden causes, which is tractable neither analytically nor computationally for deep models. BPC frames this as minimizing the variational free energy functional (the negative evidence lower bound, ELBO)

$$\mathcal{F}[q] = \mathrm{KL}\left[q(\mathbf{x}^{1:L}) \,\|\, p(\mathbf{x}^{1:L} \mid \mathbf{x}^0)\right] - \mathbb{E}_{q}\left[\ln p(\mathbf{x}^0 \mid \mathbf{x}^{1})\right],$$

where $q$ is a variational distribution, often factorized and parameterized by layerwise means. Under the Laplace (point-mass) approximation, $\mathcal{F}$ reduces to a sum of precision-weighted squared prediction errors across layers:

$$\mathcal{F}(\{\mu^{(\ell)}\}) = \sum_{\ell=0}^{L} \frac{1}{2} \varepsilon^{(\ell)\top} \Sigma^{(\ell)-1} \varepsilon^{(\ell)} + \mathrm{const},$$

with prediction errors $\varepsilon^{(\ell)} = \mu^{(\ell)} - f^{(\ell)}(\mu^{(\ell+1)})$ and layerwise precisions $\Sigma^{(\ell)-1}$ (Millidge et al., 2021, Hosseini et al., 2020, Zwol et al., 2024).
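As a concrete illustration, the Laplace-approximated free energy is just a sum of precision-weighted squared prediction errors, which can be evaluated directly. The sketch below uses hypothetical layer sizes, a tanh nonlinearity with per-layer weight matrices, identity precisions, and a zero-mean prior on the top layer; these are illustrative assumptions, not an implementation from the cited papers.

```python
import numpy as np

# Toy hierarchical model with hypothetical sizes: x0 (data) <- x1 <- x2.
# Predictions are W @ tanh(mu); precisions are identity matrices; the top
# layer has a zero-mean Gaussian prior. All choices are illustrative.
rng = np.random.default_rng(0)

def free_energy(mus, weights, precisions, f=np.tanh):
    """Laplace free energy: sum of precision-weighted squared errors."""
    F = 0.0
    for l in range(len(mus) - 1):
        eps = mus[l] - weights[l] @ f(mus[l + 1])   # eps^(l) = mu^(l) - f(mu^(l+1))
        F += 0.5 * eps @ precisions[l] @ eps
    eps_top = mus[-1]                               # top-layer error vs. zero-mean prior
    F += 0.5 * eps_top @ precisions[-1] @ eps_top
    return F

dims = [4, 3, 2]                                    # layer widths, bottom to top
mus = [rng.normal(size=d) for d in dims]
weights = [rng.normal(size=(dims[l], dims[l + 1])) for l in range(len(dims) - 1)]
precisions = [np.eye(d) for d in dims]

F = free_energy(mus, weights, precisions)
assert F >= 0.0   # quadratic forms with positive semidefinite precisions
```

Clamping a layer's mean to its top-down prediction zeroes that layer's error term and therefore can only lower the total free energy, which is the quantity inference descends on.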

2. Inference Dynamics: Iterative, Amortized, and Hybrid Schemes

The minimization of $\mathcal{F}$ with respect to states and/or parameters constitutes approximate Bayesian inference. Two central approaches are:

  • Iterative (per-sample) inference: Layerwise means $\mu^{(\ell)}$ are updated by gradient descent on $\mathcal{F}$ for each input, via dynamics such as

$$\dot\mu^{(\ell)} = -\Sigma^{(\ell)-1} \varepsilon^{(\ell)} + J^{(\ell-1)\top} \Sigma^{(\ell-1)-1} \varepsilon^{(\ell-1)},$$

where $J^{(\ell-1)}$ is the Jacobian of $f^{(\ell-1)}$ (Millidge et al., 2021).

  • Amortized inference: A parameterized encoder $f_\phi(x)$ is trained to approximate posterior statistics in a single feedforward pass, minimizing the average free energy over the data distribution (Tschantz et al., 2022). This yields rapid, one-pass approximations, at the cost of reduced adaptation to novel or out-of-distribution inputs.

Hybrid predictive coding combines these approaches, interpreting the feedforward "sweep" as amortized inference and the subsequent recurrent dynamics as iterative refinement; computation is adaptively allocated based on uncertainty (e.g., via the stopping criterion $\mathcal{F}(\mu, x) < \tau$) (Tschantz et al., 2022).
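A minimal sketch of the iterative scheme, assuming a hypothetical two-layer linear-Gaussian model with identity precisions (so the Jacobian of the linear map is simply $W$); the model, rates, and step count are illustrative, not taken from the cited work.

```python
import numpy as np

# Iterative (per-sample) inference: gradient descent on the free energy
# with respect to the latent mean mu1, for an assumed two-layer model
# x0 ~ N(W mu1, I) with a standard-normal prior on mu1.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))          # generative weights (hypothetical)
x0 = rng.normal(size=4)              # clamped observation
mu1 = np.zeros(3)                    # latent mean, initialized at the prior

def F(mu1):
    eps0 = x0 - W @ mu1              # bottom-layer prediction error
    eps1 = mu1                       # top-layer error against the zero-mean prior
    return 0.5 * eps0 @ eps0 + 0.5 * eps1 @ eps1

lr = 0.05
history = [F(mu1)]
for _ in range(200):
    eps0 = x0 - W @ mu1
    # -dF/dmu1 = W.T eps0 - mu1 (the Jacobian of the linear map is W)
    mu1 = mu1 + lr * (W.T @ eps0 - mu1)
    history.append(F(mu1))

assert history[-1] < history[0]      # inference reduces free energy
```

For this linear model the fixed point of the dynamics coincides with the exact Gaussian posterior mean $(I + W^\top W)^{-1} W^\top x_0$, which is one way to sanity-check an implementation.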

3. Learning Rules, Locality, and Biological Plausibility

Parameter learning is achieved via local Hebbian-style rules based on prediction errors:

$$\Delta W^{(\ell)} \propto \Sigma^{(\ell)-1} \varepsilon^{(\ell)} \left(\mu^{(\ell+1)}\right)^\top,$$

where each synaptic update depends only on pre- and post-synaptic activities and local error signals, requiring no global error backpropagation (Millidge et al., 2021, Zwol et al., 2024, Tschantz et al., 31 Mar 2025).
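With identity precisions, the rule reduces to an outer product of the layer's error and the higher layer's activity. A single-layer sketch under those assumptions (all values hypothetical):

```python
import numpy as np

# Local Hebbian-style weight update: outer product of the post-synaptic
# error and the pre-synaptic (higher-layer) activity. Identity precisions
# and a linear prediction are assumed for simplicity.
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
mu_hi = rng.normal(size=3)           # higher-layer activity mu^(l+1)
x0 = rng.normal(size=4)              # clamped data at layer l

def layer_loss(W):
    eps = x0 - W @ mu_hi             # prediction error eps^(l)
    return 0.5 * eps @ eps

lr = 0.05
before = layer_loss(W)
eps = x0 - W @ mu_hi
W = W + lr * np.outer(eps, mu_hi)    # Hebbian update: local error x local activity
after = layer_loss(W)

assert after < before                # one local step lowers this layer's loss
```

Note that the update uses only quantities available at the synapse (the local error and the pre-synaptic activity), which is what the locality claim in the text amounts to.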

Advanced formulations, such as BPC with conjugate priors, admit closed-form updates for parameter posteriors (e.g., Matrix-Normal–Wishart), preserving locality while quantifying uncertainty. The resulting precisions (inverse posterior variances) can be interpreted as dynamic gain signals, consistent with neuromodulatory hypotheses about cortical circuits (Tschantz et al., 31 Mar 2025, Jiang et al., 2021).

Canonical architectural motifs, shared across theoretical and experimental frameworks, include:

  • Paired populations per layer representing value (state) units and error units
  • Bidirectional connectivity: feedback conveys predictions, feedforward conveys prediction errors (Rao-Ballard protocol) (Hosseini et al., 2020, Jiang et al., 2021)
  • Strictly local update rules for both inference and synaptic plasticity

4. Uncertainty Quantification and Adaptive Computation

Full Bayesian variants of predictive coding maintain variational posteriors over synaptic weights and, optionally, over hidden states. For example, a posterior

$$q(\mathbf{W}_\ell, \boldsymbol{\Sigma}_\ell) = \mathcal{MN}\!\left(\mathbf{W}_\ell \mid \mathbf{M}_\ell, \boldsymbol{\Sigma}_\ell^{-1}, \mathbf{V}_\ell\right) \mathcal{W}\!\left(\boldsymbol{\Sigma}_\ell^{-1} \mid \boldsymbol{\Psi}_\ell, \nu_\ell\right)$$

yields direct expressions for epistemic uncertainty. Predictive mean and variance can be computed either analytically or by posterior sampling; these uncertainty estimates are essential for out-of-distribution detection, continual learning, and robust control (Tschantz et al., 31 Mar 2025, Salvatori et al., 2023).
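As an illustration of sampling-based uncertainty propagation, the sketch below draws weights from a Matrix-Normal posterior $\mathcal{MN}(M, U, V)$ and compares the Monte-Carlo predictive variance of $Wx$ against the analytic value $(x^\top V x)\,U$. All parameter values are hypothetical, and the Wishart factor over precisions is held fixed for simplicity.

```python
import numpy as np

# Monte-Carlo epistemic uncertainty under a Matrix-Normal weight posterior
# MN(M, U, V): sample W, propagate an input, and read off the output spread.
# Shapes and covariance values are illustrative assumptions.
rng = np.random.default_rng(3)
out_dim, in_dim = 4, 3
M = rng.normal(size=(out_dim, in_dim))       # posterior mean of the weights
U = 0.1 * np.eye(out_dim)                    # row (output) covariance
V = 0.5 * np.eye(in_dim)                     # column (input) covariance
A = np.linalg.cholesky(U)
B = np.linalg.cholesky(V)

def sample_W():
    Z = rng.normal(size=(out_dim, in_dim))
    return M + A @ Z @ B.T                   # W ~ MN(M, U, V)

x = rng.normal(size=in_dim)
samples = np.stack([sample_W() @ x for _ in range(5000)])
mc_var = samples.var(axis=0)

# For a Matrix-Normal posterior, Cov(W x) = (x^T V x) U, so the
# per-component predictive variance is (x^T V x) * diag(U).
analytic_var = (x @ V @ x) * np.diag(U)
assert np.allclose(mc_var, analytic_var, rtol=0.2)
```

The same sampling loop gives out-of-distribution scores for free: inputs whose predictive variance is large relative to the training distribution are flagged as uncertain.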

Precision-weighted prediction errors ensure that uncertainty enters both inference and learning updates. In hybrid architectures, the system can allocate more computation (i.e., additional recurrent-inference steps) when amortized predictions are uncertain or miscalibrated, as signaled by a high free energy (Tschantz et al., 2022).

5. Extensions Beyond Gaussian Models

Recent research generalizes BPC beyond Gaussian assumptions by allowing arbitrary exponential-family or tractable distributions at each layer. The variational free energy is formulated as a sum of layerwise divergences

$$\mathcal{F}_{\mathrm{KL}}(\phi, \theta) = \sum_{\ell=0}^{L} D_{\mathrm{KL}}\!\left[\mathcal{X}_\ell(\phi_\ell) \,\|\, \widehat{\mathcal{X}}_\ell(\mu_\ell)\right],$$

where $\mathcal{X}_\ell$ and $\widehat{\mathcal{X}}_\ell$ are the actual and predicted distributions at layer $\ell$ (e.g., softmax-categorical for attention, Beta, or other tractable outputs). This enables the training of architectures, such as transformers and VAEs, where standard Gaussian PC is unsuitable (Pinchetti et al., 2022). Updates remain strictly local, relying only on layerwise divergences.
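For a categorical (softmax) layer, the layerwise term is simply a KL divergence between the layer's actual distribution and the one predicted from above, computed entirely from quantities local to that layer. A minimal sketch with hypothetical logits:

```python
import numpy as np

# Layerwise KL "error" for a categorical layer, as in generalized PC:
# the divergence between the layer's own distribution and the one
# predicted by the layer above. Logits here are illustrative.
def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def kl_categorical(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(4)
logits_actual = rng.normal(size=5)   # layer's own (variational) logits
logits_pred = rng.normal(size=5)     # logits predicted by the layer above

p = softmax(logits_actual)
q = softmax(logits_pred)

assert kl_categorical(p, q) >= 0.0   # KL is non-negative
assert abs(kl_categorical(p, p)) < 1e-12   # and zero when the distributions match
```

Replacing the Gaussian squared-error term with this divergence is what lets the same local update machinery drive non-Gaussian layers such as attention heads.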

Key experimental findings demonstrate that generalized-PC closes the gap with backpropagation on tasks (e.g., classification, language modeling) previously inaccessible to Gaussian PC, while retaining BPC’s locality and biological plausibility (Pinchetti et al., 2022).

6. Empirical Performance and Algorithmic Efficiency

Empirical studies confirm that BPC and its hybrids achieve competitive or superior convergence and generalization on canonical ML benchmarks (MNIST, CIFAR-10/100, SVHN), matching or exceeding backpropagation in calibration, sample efficiency, and robustness—especially under small or corrupted datasets (Salvatori et al., 2022, Salvatori et al., 2023, Zwol et al., 2024).

Incremental Predictive Coding (iPC) removes the need for multi-phase inference-learning cycles, updating both states and weights in parallel at every step and attaining favorable complexity: in deep architectures, iPC executes weight updates with constant parallel complexity, while traditional backprop or PC scale with depth (Salvatori et al., 2022).
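The iPC idea can be sketched as interleaving state and weight updates within the same loop iteration, rather than running inference to convergence before each weight update. The two-layer linear model, learning rates, and step count below are illustrative assumptions:

```python
import numpy as np

# Schematic incremental PC: states and weights are updated in the same
# step from the current prediction errors, with no separate inference
# and learning phases. Model and rates are hypothetical.
rng = np.random.default_rng(5)
W = rng.normal(size=(4, 3)) * 0.1
x0 = rng.normal(size=4)
mu1 = np.zeros(3)

def F(W, mu1):
    eps0 = x0 - W @ mu1
    return 0.5 * eps0 @ eps0 + 0.5 * mu1 @ mu1

lr_mu, lr_W = 0.05, 0.01
before = F(W, mu1)
for _ in range(500):
    eps0 = x0 - W @ mu1
    dmu = W.T @ eps0 - mu1                      # state update (inference)
    dW = np.outer(eps0, mu1)                    # weight update (learning), same errors
    mu1, W = mu1 + lr_mu * dmu, W + lr_W * dW   # applied in parallel
after = F(W, mu1)

assert after < before
```

Because both updates read the same locally available errors, every step has the same parallel cost regardless of depth, which is the complexity advantage the text describes.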

Applications include:

  • Supervised and unsupervised learning (classification, generative modeling, denoising, associative memory)
  • Meta-RL: BPC modules induce RNNs to learn Bayes-optimal belief representations, outperforming RL² in partial observability (Kuo et al., 24 Oct 2025)
  • Active inference and model-based control, with evidence for sample-efficient adaptation in robotics and continual learning (Salvatori et al., 2023)

7. Theoretical and Neuroscientific Significance

Bayesian Predictive Coding bridges theoretical neuroscience and machine learning, providing a unifying mathematical formulation and a set of locally implementable schemes for approximate Bayesian inference. The correspondence between BPC gradient descent and fixed-point iterative Bayesian updating is direct; at equilibrium, variational means attain locally consistent posterior estimates (Millidge et al., 2021).

Microcircuit models map “prediction” and “error” units onto cortical deep and superficial layers, with bidirectional synaptic pathways and local modulation by precision signals, in plausible agreement with observed neocortical organization (Hosseini et al., 2020, Jiang et al., 2021).

Ongoing research addresses extensions to arbitrary graphical architectures (PC graphs), spiking implementations, integration of attention/gating mechanisms, and hybrid schemes combining amortized, iterative, and full Bayesian inference (Zwol et al., 2024, Pinchetti et al., 2022).

8. Summary

Bayesian Predictive Coding thus stands as a core paradigm for biologically plausible, uncertainty-aware, and highly interpretable learning in hierarchical models, uniting neural computation and principled statistical inference.
