
Last-Layer Laplace Approximation

Updated 2 January 2026
  • Last-layer Laplace Approximation is a Bayesian inference technique that treats only the final neural network layer stochastically, yielding a tractable Gaussian posterior over its weights.
  • It enables efficient uncertainty quantification, fast incremental learning, and improved active and conformal prediction through closed-form updates.
  • The method leverages a second-order Taylor expansion around the MAP estimate and efficient Hessian inversion to bypass full-network retraining, trading off some fidelity for scalability.

The last-layer Laplace approximation is a methodology in approximate Bayesian inference where the Bayesian treatment is restricted solely to the final (linear) layer of a deep neural network, with all earlier layers held fixed at their maximum a posteriori (MAP) estimates. The technique yields a tractable Gaussian approximation for the posterior over the last-layer weights, enabling efficient uncertainty quantification and fast Bayesian updates—especially advantageous in scenarios where full network retraining is computationally prohibitive. It has become a key tool for scalable Bayesian deep learning, efficient incremental learning, active learning query strategies, and conformal prediction with calibrated uncertainty (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).

1. Bayesian Setup and Motivation

Given network parameters $\omega \in \mathbb{R}^P$ and dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, the Bayesian posterior is $p(\omega \mid \mathcal{D}) \propto p(\omega) \prod_{i=1}^N p(y_i \mid x_i, \omega)$. For modern networks, $P$ is typically large, rendering posterior inference computationally intractable. The last-layer Laplace approach freezes the representation parameters $\theta$ (all layers except the last) and focuses on the posterior over the final weights $w \in \mathbb{R}^D$, which act on extracted features $\varphi(x) \in \mathbb{R}^D$. The conditional posterior then simplifies to $p(w \mid \mathcal{D}, \theta) \propto p(w) \prod_i p(y_i \mid \varphi(x_i), w)$, reducing the parameter dimensionality from $P$ to $D$ (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).

2. Derivation of the Gaussian Posterior (Laplace Approximation)

The last-layer Laplace approximation proceeds by expanding the log posterior density $\log p(w \mid \mathcal{D})$ to second order around its mode (the MAP estimate) $\mu$:

  • MAP estimation: $\mu = \arg\max_w \left[\log p(w) + \sum_{i=1}^N \log p(y_i \mid \varphi(x_i), w)\right]$, with a Gaussian prior $p(w) = \mathcal{N}(w; 0, \lambda^{-1}I)$ and a suitable likelihood (e.g., Gaussian for regression; logistic for binary classification).
  • Hessian computation: The negative log-posterior Hessian at $w = \mu$ is $H := -\nabla^2_w \log p(w \mid \mathcal{D})\big|_{w=\mu}$. For a regression model with homoscedastic noise $\sigma^2$, $H = (1/\sigma^2)\Phi^\top \Phi + \lambda I$, where $\Phi$ is the design matrix (Kim et al., 1 Dec 2025). For logistic regression, $H = \lambda I + \sum_i g_i \varphi(x_i)\varphi(x_i)^\top$, with $g_i = \sigma(\varphi_i^\top \mu)\,(1 - \sigma(\varphi_i^\top \mu))$ and $\varphi_i = \varphi(x_i)$.
  • Gaussian approximation: The posterior is approximated as $q(w) = \mathcal{N}(w; \mu, \Sigma)$ with $\Sigma = H^{-1}$ (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).
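For the regression case, the steps above reduce to a few lines of linear algebra; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def laplace_posterior_regression(Phi, y, noise_var, prior_prec):
    """Last-layer Laplace posterior for a Gaussian-likelihood (regression) head.

    Phi        : (N, D) design matrix of frozen features varphi(x_i)
    y          : (N,) regression targets
    noise_var  : homoscedastic noise variance sigma^2
    prior_prec : Gaussian prior precision lambda
    """
    D = Phi.shape[1]
    # Negative log-posterior Hessian: H = (1/sigma^2) Phi^T Phi + lambda I
    H = (Phi.T @ Phi) / noise_var + prior_prec * np.eye(D)
    Sigma = np.linalg.inv(H)
    # With a Gaussian likelihood the MAP coincides with the ridge solution,
    # so no iterative optimization is needed: mu = Sigma Phi^T y / sigma^2
    mu = Sigma @ (Phi.T @ y) / noise_var
    return mu, Sigma
```

For a logistic head the MAP has no closed form, so $\mu$ would come from a (regularized) optimizer before the same Hessian-and-invert step.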

3. Efficient Hessian Inversion and Fast Bayesian Updates

  • Low-rank updates: Direct inversion of $H \in \mathbb{R}^{D \times D}$ costs $O(D^3)$. However, $H$ is a low-rank modification of the prior precision, so the Sherman–Morrison–Woodbury identity or sequential rank-one updates permit inversion in $O(N D^2)$ total, i.e., $O(D^2)$ per update:
    • Initialize $\Sigma_0 = (\lambda I)^{-1}$.
    • Sequentially fold in each rank-one term: $\Sigma_i = \Sigma_{i-1} - g_i (\Sigma_{i-1}\varphi_i)(\Sigma_{i-1}\varphi_i)^\top / \beta_i$ with $\beta_i = 1 + g_i \varphi_i^\top \Sigma_{i-1} \varphi_i$.
    • For batch updates with new data, apply the analogous closed-form update formulas for the mean and covariance (Huseljic et al., 2022).
  • Incremental updates: Given the current posterior $q(w \mid \mathcal{D}) = \mathcal{N}(\mu, \Sigma)$, treated as the prior, and a new data batch $\mathcal{D}^+$, perform a single Laplace step to yield the new mean and covariance:
    • $\Delta = \sum_{(x, y) \in \mathcal{D}^+} \big(\sigma(\varphi(x)^\top\mu) - y\big)\,\varphi(x)$
    • Update $\Sigma^+$ and $\mu^+$ using the rank-one formulae; iterate as needed for numerical stability.
    • Prediction at $x$ involves computing $\varphi(x)^\top \Sigma\, \varphi(x)$ (Huseljic et al., 2022).
  • Computational advantage: Last-layer Laplace avoids retraining the full model, or even the last layer, entirely. Updates can be 10–100× faster than gradient-based retraining for moderate batch size $M$ and feature dimension $D$; prediction costs $O(D^2)$ versus multiple forward passes for Monte Carlo methods (Huseljic et al., 2022).
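The sequential rank-one recursion above translates directly into code; a minimal NumPy sketch for the logistic case (names are illustrative):

```python
import numpy as np

def rank_one_update(Sigma, phi, g):
    """One Sherman–Morrison step: fold a single example into the covariance.

    Sigma : (D, D) current posterior covariance Sigma_{i-1}
    phi   : (D,) feature vector varphi(x_i)
    g     : curvature weight g_i = s(1 - s) with s = sigmoid(phi^T mu)
    Cost is O(D^2), versus O(D^3) for re-inverting the Hessian from scratch.
    """
    v = Sigma @ phi                    # Sigma_{i-1} varphi_i
    beta = 1.0 + g * (phi @ v)         # beta_i = 1 + g_i phi_i^T Sigma_{i-1} phi_i
    return Sigma - g * np.outer(v, v) / beta

# Starting from Sigma_0 = (lambda I)^{-1} and applying one update per example
# reproduces the batch inverse H^{-1} up to floating-point error.
```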

4. Predictive Distributions and Uncertainty Decomposition

  • Posterior predictive: For a new input $x$ with feature map $\varphi(x)$, the predictive distribution for regression is

$p(y \mid x, \mathcal{D}) = \mathcal{N}\big(\varphi(x)^\top \mu,\; \sigma^2 + \varphi(x)^\top \Sigma\, \varphi(x)\big)$

(Kim et al., 1 Dec 2025). The mean $\varphi(x)^\top \mu$ is the standard point prediction; the variance decomposes into
    • $\sigma^2$: aleatoric (data) uncertainty,
    • $\varphi(x)^\top \Sigma\, \varphi(x)$: epistemic (model) uncertainty.

  • Diagnostics: The fractional epistemic share $r(x) = \varphi(x)^\top \Sigma\, \varphi(x) / \big(\sigma^2 + \varphi(x)^\top \Sigma\, \varphi(x)\big)$ and the overall posterior spread $\mathrm{tr}(\Sigma)$ are useful diagnostics for model uncertainty and interval tightness (Kim et al., 1 Dec 2025).
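The predictive decomposition and the diagnostic $r(x)$ can be sketched in a few lines (names are illustrative):

```python
import numpy as np

def predictive(phi_x, mu, Sigma, noise_var):
    """Posterior predictive mean/variance and the fractional epistemic share r(x)."""
    mean = phi_x @ mu                   # point prediction phi(x)^T mu
    epistemic = phi_x @ Sigma @ phi_x   # model uncertainty phi(x)^T Sigma phi(x)
    total_var = noise_var + epistemic   # aleatoric + epistemic
    r = epistemic / total_var           # fractional epistemic share
    return mean, total_var, r
```

Monitoring $r(x)$ on held-out inputs distinguishes regions where intervals are wide because of data noise from regions where the model itself is uncertain.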

5. Applications: Incremental Learning, Active Learning, and Conformal Prediction

Application areas and the role of the last-layer Laplace approximation in each:
  • Incremental/Online Learning: enables closed-form Bayesian updates on arrival of new data batches, faster than retraining (Huseljic et al., 2022).
  • Active Learning: permits fully sequential acquisition strategies (e.g., uncertainty sampling, BALD, query-by-committee) with per-point model updates and uncertainty estimation (Huseljic et al., 2022).
  • Conformal Prediction: used in CLAPS to yield efficiency-improved conformal intervals with split-conformal calibration; intervals adapt to epistemic uncertainty (Kim et al., 1 Dec 2025).
  • Uncertainty Quantification: provides a fast, scalable Gaussian posterior over outputs for downstream decision-making (McInerney et al., 2024).

Active learning context: The method enables the use of computationally intensive acquisition functions (e.g., BALD) after each new label is acquired, rather than in large pre-selected batches. Empirically, SNGP-LA yields 2–5% better learning curves than traditional top-batch-based batch selection (Huseljic et al., 2022).

Conformal prediction context: CLAPS combines the Gaussian posterior from last-layer Laplace with a two-sided posterior CDF conformity score, producing tighter prediction intervals with finite-sample coverage, particularly effective when epistemic variance is nontrivial (Kim et al., 1 Dec 2025).
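The exact CLAPS score is not reproduced here, but the general shape of a split-conformal procedure built on the Gaussian posterior predictive can be sketched as follows; the score form $|2F(y)-1|$ and all names are assumptions for illustration, not the CLAPS implementation:

```python
import math
from statistics import NormalDist

def posterior_cdf_score(y, mean, var):
    """Two-sided posterior-CDF conformity score: |2 F(y) - 1|.

    F is the Gaussian posterior predictive CDF; values in either tail
    score close to 1, values near the predictive median close to 0.
    """
    F = NormalDist(mean, math.sqrt(var)).cdf(y)
    return abs(2.0 * F - 1.0)

def conformal_interval(mean, var, cal_scores, alpha=0.1):
    """Split-conformal interval: invert the score at the calibration quantile."""
    n = len(cal_scores)
    k = min(n, math.ceil((n + 1) * (1.0 - alpha)))  # conformal quantile index
    q = sorted(cal_scores)[k - 1]
    # |2F(y) - 1| <= q  <=>  F(y) in [(1-q)/2, (1+q)/2]
    z = NormalDist().inv_cdf((1.0 + q) / 2.0)
    half = z * math.sqrt(var)
    return mean - half, mean + half
```

Because the predictive variance $\sigma^2 + \varphi(x)^\top \Sigma\, \varphi(x)$ enters the score, the resulting intervals widen where epistemic uncertainty is large, which is the adaptivity the text describes.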

6. Implementation Notes and Practical Considerations

  • Cholesky and Hessian-vector products: For a final layer of width $d$, Cholesky-based inversion is feasible unless $d$ is extremely large. For large $d$, conjugate-gradient (CG) methods can be used for Hessian solves (McInerney et al., 2024, Kim et al., 1 Dec 2025).
  • Curvature approximations: In practice, the Gauss–Newton approximation is often used for the Hessian (McInerney et al., 2024). For further efficiency, diagonal, Kronecker-factored, or low-rank approximations can be applied.
  • Numerical stability: Additional damping is applied to ensure positive-definiteness of the Hessian; fallback strategies are employed when the Hessian is ill-conditioned (McInerney et al., 2024).
  • Scalability: The technique is computationally viable in settings where inverting the full-network Hessian is infeasible—applicable for last-layer dimensions up to several thousand.
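The Cholesky route mentioned above avoids ever forming $\Sigma = H^{-1}$ explicitly; a minimal sketch, with a damping argument as a stand-in for the fallback strategies cited:

```python
import numpy as np

def epistemic_variance_chol(H, phi_x, damping=0.0):
    """Compute phi(x)^T H^{-1} phi(x) via a Cholesky solve.

    Factoring H = L L^T and solving L w = phi(x) gives
    phi^T H^{-1} phi = ||L^{-1} phi||^2, which is cheaper and more
    numerically stable than forming the explicit inverse. `damping`
    adds a multiple of the identity when H is ill-conditioned.
    """
    D = H.shape[0]
    L = np.linalg.cholesky(H + damping * np.eye(D))
    w = np.linalg.solve(L, phi_x)   # forward substitution: L w = phi
    return float(w @ w)
```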

7. Limitations and Theoretical Considerations

  • Representation uncertainty ignored: Only the last-layer weights receive a Bayesian treatment. Uncertainty in the backbone is not captured; the method assumes the feature extractor parameters are well-optimized and can be treated as fixed (Huseljic et al., 2022, Kim et al., 1 Dec 2025).
  • Posterior contraction: As the dataset size $n$ increases, epistemic uncertainty diminishes ($\varphi(x)^\top \Sigma\, \varphi(x) \to 0$), so benefits over mean-based or residual-based methods vanish (Kim et al., 1 Dec 2025).
  • Homogeneity assumptions: Many approaches assume homoscedastic Gaussian noise; strongly heteroscedastic settings require explicit modeling via scale or quantile heads (Kim et al., 1 Dec 2025).
  • Fidelity vs. Scalability trade-off: High modeling fidelity would require full Laplace approximations over all parameters, which is computationally prohibitive. Last-layer Laplace trades off some fidelity for tractability (McInerney et al., 2024).
  • Accuracy depends on feature representation: The posterior variance is only meaningful to the extent that the frozen representation $\varphi(x)$ is expressive and robust (Kim et al., 1 Dec 2025).
