
Last-Layer Laplace Approximation

Updated 2 January 2026
  • Last-layer Laplace Approximation is a Bayesian inference technique that treats only the final neural network layer stochastically, yielding a tractable Gaussian posterior over its weights.
  • It enables efficient uncertainty quantification, fast incremental learning, and improved active and conformal prediction through closed-form updates.
  • The method leverages a second-order Taylor expansion around the MAP estimate and efficient Hessian inversion to bypass full-network retraining, trading off some fidelity for scalability.

The last-layer Laplace approximation is a methodology in approximate Bayesian inference where the Bayesian treatment is restricted solely to the final (linear) layer of a deep neural network, with all earlier layers held fixed at their maximum a posteriori (MAP) estimates. The technique yields a tractable Gaussian approximation for the posterior over the last-layer weights, enabling efficient uncertainty quantification and fast Bayesian updates—especially advantageous in scenarios where full network retraining is computationally prohibitive. It has become a key tool for scalable Bayesian deep learning, efficient incremental learning, active learning query strategies, and conformal prediction with calibrated uncertainty (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).

1. Bayesian Setup and Motivation

Given network parameters $\omega \in \mathbb{R}^P$ and dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, the Bayesian posterior is $p(\omega \mid \mathcal{D}) \propto p(\omega) \prod_{i=1}^N p(y_i \mid x_i, \omega)$. For modern networks, $P$ is typically large, rendering posterior inference computationally intractable. The last-layer Laplace approach freezes the representation parameters $\theta$ (all layers except the last) and focuses on the posterior over the final weights $w \in \mathbb{R}^D$, which act on extracted features $\varphi(x) \in \mathbb{R}^D$. The conditional posterior then simplifies to $p(w \mid \mathcal{D}, \theta) \propto p(w) \prod_i p(y_i \mid \varphi(x_i), w)$, reducing the parameter dimensionality from $P$ to $D$ (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).

2. Derivation of the Gaussian Posterior (Laplace Approximation)

The last-layer Laplace approximation proceeds by expanding the log posterior density $\log p(w \mid \mathcal{D})$ to second order around its mode (the MAP estimate) $\mu$:

  • MAP estimation: $\mu = \arg\max_w \left[\log p(w) + \sum_{i=1}^N \log p(y_i \mid \varphi(x_i), w)\right]$, with a Gaussian prior $p(w) = \mathcal{N}(w; 0, \lambda^{-1}I)$ and a suitable likelihood (e.g., Gaussian for regression; logistic for binary classification).
  • Hessian computation: The negative log-posterior Hessian at $w = \mu$ is $H := -\nabla^2_w \log p(w \mid \mathcal{D})\big|_{w=\mu}$. For a regression model with homoscedastic noise $\sigma^2$, $H = (1/\sigma^2)\Phi^\top \Phi + \lambda I$, where $\Phi$ is the design matrix (Kim et al., 1 Dec 2025). For logistic regression, $H = \lambda I + \sum_i g_i \varphi(x_i)\varphi(x_i)^\top$, with $g_i = \sigma(\varphi_i^\top \mu)\,(1 - \sigma(\varphi_i^\top \mu))$ and $\varphi_i = \varphi(x_i)$.
  • Gaussian approximation: The posterior is approximated as $q(w) = \mathcal{N}(w; \mu, \Sigma)$ with $\Sigma = H^{-1}$ (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).
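For the regression case, the steps above reduce to a few lines of linear algebra; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def laplace_posterior_regression(Phi, y, noise_var, prior_prec):
    """Last-layer Laplace posterior for a Gaussian-likelihood (regression) head.

    Phi        : (N, D) design matrix of frozen features varphi(x_i)
    y          : (N,) regression targets
    noise_var  : homoscedastic noise variance sigma^2
    prior_prec : Gaussian prior precision lambda
    """
    D = Phi.shape[1]
    # Negative log-posterior Hessian: H = (1/sigma^2) Phi^T Phi + lambda I
    H = (Phi.T @ Phi) / noise_var + prior_prec * np.eye(D)
    Sigma = np.linalg.inv(H)
    # With a Gaussian likelihood the MAP coincides with the ridge solution,
    # so no iterative optimization is needed: mu = Sigma Phi^T y / sigma^2
    mu = Sigma @ (Phi.T @ y) / noise_var
    return mu, Sigma
```

For a logistic head the MAP has no closed form, so $\mu$ would come from a (regularized) optimizer before the same Hessian-and-invert step.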

3. Efficient Hessian Inversion and Fast Bayesian Updates

  • Low-rank updates: Direct inversion of $H \in \mathbb{R}^{D \times D}$ costs $O(D^3)$. However, $H$ is a low-rank modification of the prior precision, so the Sherman–Morrison–Woodbury identity or sequential rank-one updates permit inversion in $O(N D^2)$ total, i.e., $O(D^2)$ per update:
    • Initialize $\Sigma_0 = (\lambda I)^{-1}$.
    • Sequentially fold in each rank-one term: $\Sigma_i = \Sigma_{i-1} - g_i (\Sigma_{i-1}\varphi_i)(\Sigma_{i-1}\varphi_i)^\top / \beta_i$ with $\beta_i = 1 + g_i \varphi_i^\top \Sigma_{i-1} \varphi_i$.
    • For batch updates with new data, apply the analogous closed-form update formulas for the mean and covariance (Huseljic et al., 2022).
  • Incremental updates: Given the current posterior $q(w \mid \mathcal{D}) = \mathcal{N}(\mu, \Sigma)$, treated as the prior, and a new data batch $\mathcal{D}^+$, perform a single Laplace step to yield the new mean and covariance:
    • $\Delta = \sum_{(x, y) \in \mathcal{D}^+} \big(\sigma(\varphi(x)^\top\mu) - y\big)\,\varphi(x)$
    • Update $\Sigma^+$ and $\mu^+$ using the rank-one formulae; iterate as needed for numerical stability.
    • Prediction at $x$ involves computing $\varphi(x)^\top \Sigma\, \varphi(x)$ (Huseljic et al., 2022).
  • Computational advantage: Last-layer Laplace avoids retraining the full model, or even the last layer, entirely. Updates can be 10–100× faster than gradient-based retraining for moderate batch size $M$ and feature dimension $D$; prediction costs $O(D^2)$ versus multiple forward passes for Monte Carlo methods (Huseljic et al., 2022).
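The sequential rank-one recursion above translates directly into code; a minimal NumPy sketch for the logistic case (names are illustrative):

```python
import numpy as np

def rank_one_update(Sigma, phi, g):
    """One Sherman–Morrison step: fold a single example into the covariance.

    Sigma : (D, D) current posterior covariance Sigma_{i-1}
    phi   : (D,) feature vector varphi(x_i)
    g     : curvature weight g_i = s(1 - s) with s = sigmoid(phi^T mu)
    Cost is O(D^2), versus O(D^3) for re-inverting the Hessian from scratch.
    """
    v = Sigma @ phi                    # Sigma_{i-1} varphi_i
    beta = 1.0 + g * (phi @ v)         # beta_i = 1 + g_i phi_i^T Sigma_{i-1} phi_i
    return Sigma - g * np.outer(v, v) / beta

# Starting from Sigma_0 = (lambda I)^{-1} and applying one update per example
# reproduces the batch inverse H^{-1} up to floating-point error.
```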

4. Predictive Distributions and Uncertainty Decomposition

  • Posterior predictive: For a new input $x$ with feature map $\varphi(x)$, the predictive distribution for regression is

$p(y \mid x, \mathcal{D}) = \mathcal{N}\big(\varphi(x)^\top \mu,\; \sigma^2 + \varphi(x)^\top \Sigma\, \varphi(x)\big)$

(Kim et al., 1 Dec 2025). The mean $\varphi(x)^\top \mu$ is the standard point prediction; the variance decomposes into
    • $\sigma^2$: aleatoric (data) uncertainty,
    • $\varphi(x)^\top \Sigma\, \varphi(x)$: epistemic (model) uncertainty.

  • Diagnostics: The fractional epistemic share $r(x) = \varphi(x)^\top \Sigma\, \varphi(x) / \big(\sigma^2 + \varphi(x)^\top \Sigma\, \varphi(x)\big)$ and the overall posterior spread $\mathrm{tr}(\Sigma)$ are useful diagnostics for model uncertainty and interval tightness (Kim et al., 1 Dec 2025).
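The predictive decomposition and the diagnostic $r(x)$ can be sketched in a few lines (names are illustrative):

```python
import numpy as np

def predictive(phi_x, mu, Sigma, noise_var):
    """Posterior predictive mean/variance and the fractional epistemic share r(x)."""
    mean = phi_x @ mu                   # point prediction phi(x)^T mu
    epistemic = phi_x @ Sigma @ phi_x   # model uncertainty phi(x)^T Sigma phi(x)
    total_var = noise_var + epistemic   # aleatoric + epistemic
    r = epistemic / total_var           # fractional epistemic share
    return mean, total_var, r
```

Monitoring $r(x)$ on held-out inputs distinguishes regions where intervals are wide because of data noise from regions where the model itself is uncertain.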

5. Applications: Incremental Learning, Active Learning, and Conformal Prediction

Application areas and the role of the last-layer Laplace approximation in each:
  • Incremental/Online Learning: enables closed-form Bayesian updates on arrival of new data batches, faster than retraining (Huseljic et al., 2022).
  • Active Learning: permits fully sequential acquisition strategies (e.g., uncertainty sampling, BALD, query-by-committee) with per-point model updates and uncertainty estimation (Huseljic et al., 2022).
  • Conformal Prediction: used in CLAPS to yield efficiency-improved conformal intervals with split-conformal calibration; intervals adapt to epistemic uncertainty (Kim et al., 1 Dec 2025).
  • Uncertainty Quantification: provides a fast, scalable Gaussian posterior over outputs for downstream decision-making (McInerney et al., 2024).

Active learning context: The method enables the use of computationally intensive acquisition functions (e.g., BALD) after each new label is acquired, rather than in large pre-selected batches. Empirically, SNGP-LA yields 2–5% better learning curves than traditional top-batch-based batch selection (Huseljic et al., 2022).

Conformal prediction context: CLAPS combines the Gaussian posterior from last-layer Laplace with a two-sided posterior CDF conformity score, producing tighter prediction intervals with finite-sample coverage, particularly effective when epistemic variance is nontrivial (Kim et al., 1 Dec 2025).
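The exact CLAPS score is not reproduced here, but the general shape of a split-conformal procedure built on the Gaussian posterior predictive can be sketched as follows; the score form $|2F(y)-1|$ and all names are assumptions for illustration, not the CLAPS implementation:

```python
import math
from statistics import NormalDist

def posterior_cdf_score(y, mean, var):
    """Two-sided posterior-CDF conformity score: |2 F(y) - 1|.

    F is the Gaussian posterior predictive CDF; values in either tail
    score close to 1, values near the predictive median close to 0.
    """
    F = NormalDist(mean, math.sqrt(var)).cdf(y)
    return abs(2.0 * F - 1.0)

def conformal_interval(mean, var, cal_scores, alpha=0.1):
    """Split-conformal interval: invert the score at the calibration quantile."""
    n = len(cal_scores)
    k = min(n, math.ceil((n + 1) * (1.0 - alpha)))  # conformal quantile index
    q = sorted(cal_scores)[k - 1]
    # |2F(y) - 1| <= q  <=>  F(y) in [(1-q)/2, (1+q)/2]
    z = NormalDist().inv_cdf((1.0 + q) / 2.0)
    half = z * math.sqrt(var)
    return mean - half, mean + half
```

Because the predictive variance $\sigma^2 + \varphi(x)^\top \Sigma\, \varphi(x)$ enters the score, the resulting intervals widen where epistemic uncertainty is large, which is the adaptivity the text describes.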

6. Implementation Notes and Practical Considerations

  • Cholesky and Hessian-vector products: For a final layer of width $d$, Cholesky-based inversion is feasible unless $d$ is extremely large. For large $d$, conjugate-gradient (CG) methods can be used for Hessian solves (McInerney et al., 2024, Kim et al., 1 Dec 2025).
  • Curvature approximations: In practice, the Gauss–Newton approximation is often used for the Hessian (McInerney et al., 2024). For further efficiency, diagonal, Kronecker-factored, or low-rank approximations can be applied.
  • Numerical stability: Additional damping is applied to ensure positive-definiteness of the Hessian; fallback strategies are employed when the Hessian is ill-conditioned (McInerney et al., 2024).
  • Scalability: The technique is computationally viable in settings where inverting the full-network Hessian is infeasible—applicable for last-layer dimensions up to several thousand.
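The Cholesky route mentioned above avoids ever forming $\Sigma = H^{-1}$ explicitly; a minimal sketch, with a damping argument as a stand-in for the fallback strategies cited:

```python
import numpy as np

def epistemic_variance_chol(H, phi_x, damping=0.0):
    """Compute phi(x)^T H^{-1} phi(x) via a Cholesky solve.

    Factoring H = L L^T and solving L w = phi(x) gives
    phi^T H^{-1} phi = ||L^{-1} phi||^2, which is cheaper and more
    numerically stable than forming the explicit inverse. `damping`
    adds a multiple of the identity when H is ill-conditioned.
    """
    D = H.shape[0]
    L = np.linalg.cholesky(H + damping * np.eye(D))
    w = np.linalg.solve(L, phi_x)   # forward substitution: L w = phi
    return float(w @ w)
```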

7. Limitations and Theoretical Considerations

  • Representation uncertainty ignored: Only the last-layer weights receive a Bayesian treatment. Uncertainty in the backbone is not captured; the method assumes the feature extractor parameters are well-optimized and can be treated as fixed (Huseljic et al., 2022, Kim et al., 1 Dec 2025).
  • Posterior contraction: As the dataset size $n$ increases, epistemic uncertainty diminishes ($\varphi(x)^\top \Sigma\, \varphi(x) \to 0$), so benefits over mean-based or residual-based methods vanish (Kim et al., 1 Dec 2025).
  • Homogeneity assumptions: Many approaches assume homoscedastic Gaussian noise; strongly heteroscedastic settings require explicit modeling via scale or quantile heads (Kim et al., 1 Dec 2025).
  • Fidelity vs. Scalability trade-off: High modeling fidelity would require full Laplace approximations over all parameters, which is computationally prohibitive. Last-layer Laplace trades off some fidelity for tractability (McInerney et al., 2024).
  • Accuracy depends on feature representation: The posterior variance is only meaningful to the extent that the frozen representation $\varphi(x)$ is expressive and robust (Kim et al., 1 Dec 2025).
