Gradient-Based Meta-Learning via Hierarchical Bayes
- The framework recasts gradient-based meta-learning as empirical Bayes inference in a hierarchical probabilistic model, with inner-loop adaptation acting as MAP approximation.
- It leverages techniques such as Laplace approximations, K-FAC, and Gradient-EM to efficiently estimate curvature and improve scalability.
- The framework motivates extensions like Dirichlet process mixtures to handle non-stationary task distributions and enhance model flexibility.
Gradient-based meta-learning as hierarchical Bayes formalizes fast adaptation to new tasks in terms of empirical Bayes inference within multi-level probabilistic models. This probabilistic lens illuminates both the theoretical underpinnings and practical extensions of algorithms such as Model-Agnostic Meta-Learning (MAML), and motivates extensions based on approximate inference, mixture models, and curvature-aware optimization (Grant et al., 2018, Jerfel et al., 2018, Zou et al., 2020).
1. Hierarchical Bayesian Formulation of Meta-Learning
Meta-learning considers data collected from a distribution over tasks. The hierarchical Bayesian model posits (i) a global "meta"-parameter $\theta$, and (ii) for each task $j$, task-specific parameters $\phi_j$, generating data via a likelihood model. Denote:
- $\theta$: global meta-parameter,
- $\phi_j$: task-specific parameter for task $j$,
- $\mathcal{D}_j = (\mathcal{D}_j^{\text{train}}, \mathcal{D}_j^{\text{val}})$: per-task train and validation datasets.
The generative process is:
- $p(\theta)$: prior over $\theta$,
- $p(\phi_j \mid \theta)$: prior over $\phi_j$ induced by $\theta$,
- $p(\mathcal{D}_j \mid \phi_j)$: likelihood of task data given $\phi_j$.
The joint distribution is
$$p(\theta, \{\phi_j\}, \{\mathcal{D}_j\}) = p(\theta) \prod_j p(\phi_j \mid \theta)\, p(\mathcal{D}_j \mid \phi_j).$$
Empirical Bayes seeks $\theta^\ast$ maximizing the marginal likelihood of the observed data:
$$\theta^\ast = \arg\max_\theta \prod_j \int p(\mathcal{D}_j \mid \phi_j)\, p(\phi_j \mid \theta)\, d\phi_j,$$
which is generally intractable for high-dimensional $\phi_j$ (e.g., neural networks) (Grant et al., 2018).
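The generative process above can be sketched with a toy Gaussian-Gaussian instantiation; the model choice, dimensions, and variances here are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian-Gaussian instance of the hierarchical model:
#   theta          ~ N(0, tau^2 I)          (prior over the meta-parameter)
#   phi_j | theta  ~ N(theta, sigma^2 I)    (task-specific parameters)
#   y | x, phi_j   ~ N(phi_j . x, noise^2)  (per-task linear-regression likelihood)
d, n_tasks, n_points = 3, 5, 10
tau, sigma, noise = 1.0, 0.3, 0.1

theta = rng.normal(0.0, tau, size=d)                     # global meta-parameter
tasks = []
for _ in range(n_tasks):
    phi = theta + rng.normal(0.0, sigma, size=d)         # draw from p(phi_j | theta)
    X = rng.normal(size=(n_points, d))
    y = X @ phi + rng.normal(0.0, noise, size=n_points)  # draw from p(D_j | phi_j)
    tasks.append((X, y, phi))

# Each task's phi_j stays close to theta, with spread controlled by sigma
devs = [np.linalg.norm(phi - theta) for _, _, phi in tasks]
print(len(tasks), max(devs))
```

The key structural point is that tasks are related only through the shared $\theta$: integrating out the $\phi_j$ couples all tasks in the marginal likelihood.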
2. Gradient-Based Meta-Learning as Marginal Likelihood Approximation
MAML approximates the intractable integrals in the hierarchical model by using maximum a posteriori (MAP) or Laplace approximations. Concretely:
- For each task, the integral $\int p(\mathcal{D}_j \mid \phi_j)\, p(\phi_j \mid \theta)\, d\phi_j$ is replaced by evaluating the integrand at the MAP estimate $\hat{\phi}_j$:
$$\hat{\phi}_j = \arg\max_{\phi} \log p(\phi \mid \mathcal{D}_j^{\text{train}}, \theta).$$
- With a Gaussian prior $p(\phi \mid \theta) = \mathcal{N}(\phi; \theta, \sigma^2 I)$, this becomes ridge-regularized minimization:
$$\hat{\phi}_j = \arg\min_{\phi} \left[ -\log p(\mathcal{D}_j^{\text{train}} \mid \phi) + \frac{\lambda}{2}\, \|\phi - \theta\|^2 \right],$$
where $\lambda = 1/\sigma^2$.
Initializing at $\theta$ and taking $K$ gradient steps yields $\phi_j^{(K)}$ approximating $\hat{\phi}_j$. The meta-objective is then
$$\mathcal{L}(\theta) = \sum_j -\log p(\mathcal{D}_j^{\text{val}} \mid \phi_j^{(K)}(\theta)),$$
which matches the negative log marginal likelihood under the point estimate $p(\mathcal{D}_j \mid \theta) \approx p(\mathcal{D}_j \mid \phi_j^{(K)})$ (Grant et al., 2018, Zou et al., 2020).
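This inner/outer structure can be sketched on scalar linear regression. Everything below is an illustrative simplification, not the papers' implementation: the inner loop takes a single truncated gradient step (early stopping playing the role of the prior term), and the meta-gradient is computed by finite differences rather than by unrolled backpropagation:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1  # inner-loop step size (assumed hyperparameter)

def make_task(slope):
    # A 1-D regression task: y = slope * x + noise, split into train/val
    X = rng.normal(size=20)
    y = slope * X + rng.normal(0.0, 0.05, size=20)
    return (X[:10], y[:10]), (X[10:], y[10:])

def inner_adapt(theta, Xtr, ytr):
    # One gradient step from theta on the train loss: a truncated MAP estimate
    grad = 2 * np.mean((theta * Xtr - ytr) * Xtr)
    return theta - alpha * grad

def meta_loss(theta, tasks):
    # Sum of validation NLLs (up to constants) at the adapted parameters
    total = 0.0
    for (Xtr, ytr), (Xva, yva) in tasks:
        phi = inner_adapt(theta, Xtr, ytr)
        total += np.mean((phi * Xva - yva) ** 2)
    return total / len(tasks)

tasks = [make_task(s) for s in (0.8, 1.0, 1.2)]
theta = 0.0
for _ in range(200):
    # Outer loop: finite-difference meta-gradient, for clarity only
    eps = 1e-4
    g = (meta_loss(theta + eps, tasks) - meta_loss(theta - eps, tasks)) / (2 * eps)
    theta -= 0.05 * g

print(round(theta, 2))  # should settle near the average task slope
```

The learned initialization sits where one inner step reaches every task well, which for symmetric slopes is close to their mean.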
3. MAP and Laplace Approximations in the Inner Loop
The inner gradient-based update acts as a MAP or truncated Laplace approximation to the task posterior
$$p(\phi_j \mid \mathcal{D}_j^{\text{train}}, \theta) \propto p(\mathcal{D}_j^{\text{train}} \mid \phi_j)\, p(\phi_j \mid \theta).$$
A full Laplace approximation expands around the mode $\hat{\phi}_j$, with local curvature $H_j$ (the Hessian of the negative log posterior at $\hat{\phi}_j$), yielding a Gaussian posterior
$$p(\phi_j \mid \mathcal{D}_j^{\text{train}}, \theta) \approx \mathcal{N}(\phi_j;\, \hat{\phi}_j,\, H_j^{-1}).$$
This leads to the curvature-corrected meta-objective (up to additive constants):
$$\mathcal{L}(\theta) = \sum_j \left[ -\log p(\mathcal{D}_j^{\text{val}} \mid \hat{\phi}_j(\theta)) + \tfrac{1}{2} \log \det H_j \right],$$
where the $\tfrac{1}{2}\log\det H_j$ term penalizes overly sharp, high-curvature posterior modes (Grant et al., 2018).
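In a conjugate Gaussian model the Laplace approximation is exact, which makes the construction easy to sanity-check against the closed-form marginal likelihood. The model and numbers below are purely illustrative:

```python
import numpy as np

# Model (assumed for illustration): phi ~ N(theta, s2), y_i | phi ~ N(phi, n2)
theta, s2, n2 = 0.5, 0.4, 0.1
y = np.array([0.9, 1.1, 1.0, 0.8])
n = len(y)

def neg_log_joint(phi):
    # -log p(y, phi | theta) = -log p(y | phi) - log p(phi | theta)
    return (0.5 * np.sum((y - phi) ** 2) / n2 + 0.5 * n * np.log(2 * np.pi * n2)
            + 0.5 * (phi - theta) ** 2 / s2 + 0.5 * np.log(2 * np.pi * s2))

H = n / n2 + 1.0 / s2                        # curvature of neg log joint (constant here)
phi_map = (np.sum(y) / n2 + theta / s2) / H  # posterior mode (MAP estimate)

# Laplace: log p(y|theta) ~ -neg_log_joint(phi_map) + (1/2)log(2*pi) - (1/2)log H
laplace = -neg_log_joint(phi_map) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(H)

# Exact marginal: y is jointly Gaussian with mean theta*1 and cov n2*I + s2*11^T
cov = n2 * np.eye(n) + s2 * np.ones((n, n))
resid = y - theta
exact = (-0.5 * resid @ np.linalg.solve(cov, resid)
         - 0.5 * np.linalg.slogdet(2 * np.pi * cov)[1])

print(np.isclose(laplace, exact))  # True: Laplace is exact for Gaussian joints
```

For non-Gaussian likelihoods the two quantities differ, and the $\tfrac{1}{2}\log\det H$ term is exactly what the point-estimate meta-objective of the previous section discards.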
4. Scalable Inference: K-FAC and Gradient-EM
Computing the full Hessian $H_j$ is computationally prohibitive for neural networks; it is replaced by a block-diagonal K-FAC Fisher approximation $\hat{F}_j$, where each block is a Kronecker factorization of the layerwise Fisher matrix:
$$\hat{F}_l \approx A_{l-1} \otimes G_l,$$
with $A_{l-1}$ the second-moment matrix of layer inputs and $G_l$ that of output gradients. The log-determinant $\log\det\hat{F}_j$ then decomposes over layers and Kronecker factors, so it can be computed efficiently, and the overall meta-algorithm optimizes the curvature-corrected loss via autodiff (Grant et al., 2018).
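The efficiency rests on the Kronecker determinant identity: for $A$ ($p \times p$) and $G$ ($q \times q$), $\log\det(A \otimes G) = q \log\det A + p \log\det G$, so the $pq \times pq$ block never has to be formed. A small numerical check (factor sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_spd(k):
    # Random symmetric positive-definite matrix, standing in for a Kronecker factor
    M = rng.normal(size=(k, k))
    return M @ M.T + k * np.eye(k)

A, G = random_spd(3), random_spd(4)  # p = 3, q = 4
full = np.kron(A, G)                 # explicit 12 x 12 block, built only to verify

lhs = np.linalg.slogdet(full)[1]                                   # direct log-det
rhs = 4 * np.linalg.slogdet(A)[1] + 3 * np.linalg.slogdet(G)[1]    # factorized form
print(np.isclose(lhs, rhs))  # True
```

Per layer, this replaces an $O((pq)^3)$ determinant with two small ones, which is what makes the curvature penalty affordable at neural-network scale.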
Gradient-EM (GEM) further improves computational and memory efficiency by decoupling the inner updates (E-step) from the meta-update (M-step). Instead of unrolling gradients through the inner update trajectory, the meta-gradient is computed with the gradient-EM (Fisher's) identity
$$\nabla_\theta \log p(\mathcal{D}_j \mid \theta) = \mathbb{E}_{p(\phi_j \mid \mathcal{D}_j, \theta)}\!\left[ \nabla_\theta \log p(\phi_j \mid \theta) \right],$$
and posterior expectations are approximated via a small number of inner gradient steps within variational families (Zou et al., 2020). This sidesteps the expensive backpropagation through inner updates required in standard MAML.
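The identity behind this decoupling can be verified in a conjugate Gaussian model, where both sides are available in closed form (model and numbers are illustrative):

```python
import numpy as np

# Assumed model: phi ~ N(theta, s2), y_i | phi ~ N(phi, n2)
theta, s2, n2 = 0.3, 0.5, 0.2
y = np.array([1.2, 0.9, 1.1])
n = len(y)

# Left side: gradient of the exact log marginal, y ~ N(theta*1, n2*I + s2*11^T)
cov = n2 * np.eye(n) + s2 * np.ones((n, n))
lhs = np.ones(n) @ np.linalg.solve(cov, y - theta)

# Right side: posterior expectation of grad_theta log p(phi|theta) = (phi - theta)/s2,
# which only needs the posterior mean of phi
post_prec = n / n2 + 1.0 / s2
post_mean = (np.sum(y) / n2 + theta / s2) / post_prec
rhs = (post_mean - theta) / s2

print(np.isclose(lhs, rhs))  # True: Fisher's identity holds exactly
```

The practical consequence is that the meta-gradient needs only (approximate) posterior expectations, not differentiation through the optimization path that produced them.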
5. Mixture Extensions: Dirichlet Process Priors for Latent Task Clusters
A limitation of a single global prior $p(\phi \mid \theta)$ is that it can be suboptimal in the presence of heterogeneous or non-stationary task distributions. Using a Dirichlet process (DP) mixture within the hierarchical Bayesian framework allows flexible, nonparametric modeling of task structure (Jerfel et al., 2018). Each cluster parameter $\theta_k$ corresponds to a "good" initialization; tasks select a cluster via a Chinese restaurant process (CRP) prior and adapt within that cluster:
- For each task $j$: $z_j \sim \mathrm{CRP}(\alpha)$, $\phi_j \sim p(\phi \mid \theta_{z_j})$, $\mathcal{D}_j \sim p(\mathcal{D} \mid \phi_j)$.
Stochastic MAP-EM approximations are used, recapitulating the inner–outer loop pattern. When the DP reduces to a single component (e.g., as the concentration $\alpha \to 0$), this setup coincides with standard MAML. Otherwise, the meta-learner can track existing clusters and instantiate new ones, adapting to task-distribution shifts and mitigating negative transfer.
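The cluster-assignment step of the generative process can be sketched with plain CRP sampling; this is illustrative only, since the cited work uses stochastic MAP-EM rather than sampling, and the hyperparameters here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)

def crp_assign(n_tasks, alpha):
    """Sample cluster labels z_1..z_n from a CRP with concentration alpha."""
    counts = []   # number of tasks seated at each existing cluster
    labels = []
    for _ in range(n_tasks):
        # Join existing cluster k with prob proportional to counts[k];
        # open a new cluster with prob proportional to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)  # new cluster instantiated
        else:
            counts[k] += 1
        labels.append(k)
    return labels

labels = crp_assign(50, alpha=1.0)
print(len(set(labels)))  # typically grows like alpha * log(n_tasks)
```

As `alpha` shrinks toward zero, every task joins the first cluster and the single-initialization (MAML) setting is recovered; larger `alpha` lets the meta-learner spawn new initializations for dissimilar tasks.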
6. Algorithmic Summaries
An overview of the implementation, highlighting differences between key algorithms:
| Approach | Inner Update | Meta-Update | Posterior Approximation |
|---|---|---|---|
| MAML | GD steps | Unrolled gradient through inner | Point (mode/MAP) |
| HMAML | GD steps + K-FAC | Unrolled curvature penalty | Laplace around MAP |
| GEM-Bayes | S variational steps | Gradient-EM identity | Variational (Gaussian) |
| DP-MAML | K GD/EM per cluster | EM over clusters | DP mixture MAP |
- In HMAML, K-FAC is used for scalable curvature estimation (Grant et al., 2018).
- GEM-Bayes applies the EM identity for meta-gradients, decoupling inner optimization and improving memory/computational efficiency (Zou et al., 2020).
- DP-MAML enables online mixture adaptation for non-stationary tasks, robust to latent distribution shifts (Jerfel et al., 2018).
7. Implications and Theoretical Significance
Viewing gradient-based meta-learning as hierarchical Bayes unifies disparate algorithms through a probabilistic perspective. This connection justifies inner gradient steps as MAP or approximate Bayesian inference, grounds meta-objectives as marginal likelihood maximization, and directly motivates extensions:
- Curvature corrections via Laplace/K-FAC penalize overly sharp posterior modes and sharpen uncertainty quantification (Grant et al., 2018).
- Mixture/DPM models allow for flexible, task-adaptive priors and effective continual learning (Jerfel et al., 2018).
- Gradient-EM techniques improve scalability for deep neural applications and enable richer variational inference with controlled overfitting (Zou et al., 2020).
This framework suggests further directions including mixture-of-modes, improved curvature approximations, and variational extensions, all operating within a coherent probabilistic meta-learning paradigm.