Last-Layer Laplace Approximation
- Last-layer Laplace Approximation is a Bayesian inference technique that treats only the final neural network layer stochastically, yielding a tractable Gaussian posterior over its weights.
- It enables efficient uncertainty quantification, fast incremental learning, and improved active and conformal prediction through closed-form updates.
- The method leverages a second-order Taylor expansion around the MAP estimate and efficient Hessian inversion to bypass full-network retraining, trading off some fidelity for scalability.
The last-layer Laplace approximation is a methodology in approximate Bayesian inference where the Bayesian treatment is restricted solely to the final (linear) layer of a deep neural network, with all earlier layers held fixed at their maximum a posteriori (MAP) estimates. The technique yields a tractable Gaussian approximation for the posterior over the last-layer weights, enabling efficient uncertainty quantification and fast Bayesian updates—especially advantageous in scenarios where full network retraining is computationally prohibitive. It has become a key tool for scalable Bayesian deep learning, efficient incremental learning, active learning query strategies, and conformal prediction with calibrated uncertainty (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).
1. Bayesian Setup and Motivation
Given network parameters $\theta$ and dataset $\mathcal{D}$, the Bayesian posterior is $p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\,p(\theta)$. For modern networks, $\dim(\theta)$ is typically large, rendering posterior inference computationally intractable. The last-layer Laplace approach freezes the representation parameters $\theta_{\text{rep}}$ (all layers except the last) and focuses on the posterior over the final weights $w$, which act on extracted features $\phi(x) = f_{\theta_{\text{rep}}}(x)$. The marginal posterior then simplifies to $p(w \mid \mathcal{D}, \theta_{\text{rep}})$, reducing the parameter dimensionality from $\dim(\theta)$ to $\dim(w)$ (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).
2. Derivation of the Gaussian Posterior (Laplace Approximation)
The last-layer Laplace approximation proceeds by expanding the log posterior density to second order around its mode (MAP estimate) $w_{\text{MAP}}$, where the first-order term vanishes:
$$\log p(w \mid \mathcal{D}) \approx \log p(w_{\text{MAP}} \mid \mathcal{D}) - \tfrac{1}{2}(w - w_{\text{MAP}})^\top H\,(w - w_{\text{MAP}}).$$
- MAP estimation: With a Gaussian prior $p(w) = \mathcal{N}(0, \sigma_0^2 I)$ and a suitable likelihood (e.g., Gaussian for regression; logistic for binary classification), compute $w_{\text{MAP}} = \arg\max_w \big[\log p(\mathcal{D} \mid w) + \log p(w)\big]$.
- Hessian computation: The negative log-posterior Hessian at the mode is $H = -\nabla_w^2 \log p(w \mid \mathcal{D})\big|_{w_{\text{MAP}}}$. For a regression model with homoscedastic noise variance $\sigma^2$, $H = \sigma^{-2}\Phi^\top \Phi + \sigma_0^{-2} I$, where $\Phi$ is the design matrix of last-layer features (Kim et al., 1 Dec 2025). For logistic regression, $H = \Phi^\top S \Phi + \sigma_0^{-2} I$, with $S = \mathrm{diag}\big(\pi_i(1 - \pi_i)\big)$ and $\pi_i$ the predicted class probabilities.
- Gaussian approximation: The posterior is approximated as $q(w) = \mathcal{N}(w_{\text{MAP}}, \Sigma)$, with $\Sigma = H^{-1}$ (Huseljic et al., 2022, Kim et al., 1 Dec 2025, McInerney et al., 2024).
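The steps above can be sketched in a few lines of NumPy for the regression case, where the MAP estimate is available in closed form. This is a minimal toy example, not an implementation from the cited papers; the names `sigma2` and `sigma02` (noise and prior variances) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: Phi is the N x D matrix of last-layer features phi(x_i).
N, D = 200, 8
Phi = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
sigma2 = 0.25                  # observation noise variance (assumed known)
sigma02 = 10.0                 # prior variance of the last-layer weights
y = Phi @ w_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Negative log-posterior Hessian: H = Phi^T Phi / sigma2 + I / sigma02
H = Phi.T @ Phi / sigma2 + np.eye(D) / sigma02

# Gaussian approximation: q(w) = N(w_map, Sigma) with Sigma = H^{-1}.
Sigma = np.linalg.inv(H)
# For a Gaussian likelihood the MAP (= posterior mean) is closed form:
w_map = Sigma @ (Phi.T @ y) / sigma2
```

With a non-Gaussian likelihood (e.g., logistic), `w_map` would instead come from a convex optimization and `H` from the weighted form $\Phi^\top S \Phi + \sigma_0^{-2} I$.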
3. Efficient Hessian Inversion and Fast Bayesian Updates
- Low-rank updates: Direct inversion of $H$ costs $\mathcal{O}(D^3)$. However, $H$ is a low-rank modification of the prior precision, so the Sherman–Morrison–Woodbury identity or sequential rank-one updates permit inversion in $\mathcal{O}(ND^2)$ overall, or $\mathcal{O}(D^2)$ per update:
- Initialize $\Sigma_0 = \sigma_0^2 I$.
- Sequentially update with each rank-one term: $\Sigma_n = \Sigma_{n-1} - \dfrac{\Sigma_{n-1}\phi_n \phi_n^\top \Sigma_{n-1}}{\sigma^2 + \phi_n^\top \Sigma_{n-1} \phi_n}$, with $\phi_n = \phi(x_n)$ the feature vector of the $n$-th observation.
- For batch updates with new data, the corresponding Woodbury formulas give closed-form updates for the mean and covariance (Huseljic et al., 2022).
- Incremental updates: Given the current posterior $\mathcal{N}(w_t, \Sigma_t)$ as prior and a new data batch $\mathcal{D}_{t+1}$, perform a single Laplace step to yield the new mean $w_{t+1}$ and covariance $\Sigma_{t+1}$:
- Update $\Sigma$ and $w$ using the rank-one formulae; iterate as needed for numerical stability.
- Prediction at $x_*$ involves computing the predictive mean $\phi_*^\top w_{\text{MAP}}$ and variance $\phi_*^\top \Sigma\, \phi_*$, where $\phi_* = \phi(x_*)$ (Huseljic et al., 2022).
- Computational advantage: Last-layer Laplace avoids retraining the full model, or even the last layer, entirely. Updates can be at least $10\times$ faster than gradient-based retraining for moderate $N$ and $D$; prediction costs $\mathcal{O}(D^2)$ per point versus multiple forward passes for Monte Carlo methods (Huseljic et al., 2022).
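The sequential rank-one route and the one-shot batch inversion agree exactly, which makes the Sherman–Morrison update easy to verify. A minimal NumPy sketch on toy data (variable names illustrative, not from the cited papers):

```python
import numpy as np

def rank_one_update(Sigma, phi, sigma2):
    """Sherman-Morrison update of the posterior covariance after
    observing one regression data point with feature vector phi."""
    Sv = Sigma @ phi
    return Sigma - np.outer(Sv, Sv) / (sigma2 + phi @ Sv)  # O(D^2)

rng = np.random.default_rng(1)
D, sigma2, sigma02 = 5, 0.5, 4.0
Phi = rng.normal(size=(50, D))

# Sequential route: start from the prior covariance, fold in one point at a time.
Sigma = sigma02 * np.eye(D)
for phi in Phi:
    Sigma = rank_one_update(Sigma, phi, sigma2)

# Batch route: invert the full Hessian once (O(D^3)).
Sigma_batch = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(D) / sigma02)
assert np.allclose(Sigma, Sigma_batch)
```

The sequential form is what makes fast incremental updates possible: each new observation costs $\mathcal{O}(D^2)$ rather than a fresh $\mathcal{O}(D^3)$ inversion.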
4. Predictive Distributions and Uncertainty Decomposition
- Posterior predictive: For a new input $x_*$ with feature map $\phi_* = \phi(x_*)$, the predictive distribution for regression is
$$p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}\big(\phi_*^\top w_{\text{MAP}},\ \sigma^2 + \phi_*^\top \Sigma\, \phi_*\big)$$
(Kim et al., 1 Dec 2025). The mean is the standard point prediction; the variance decomposes into
- $\sigma^2$: aleatoric (data) uncertainty,
- $\phi_*^\top \Sigma\, \phi_*$: epistemic (model) uncertainty.
- Diagnostics: The fractional epistemic share $\phi_*^\top \Sigma\, \phi_* \,/\, (\sigma^2 + \phi_*^\top \Sigma\, \phi_*)$ and the overall posterior spread are useful diagnostics for model uncertainty and interval tightness (Kim et al., 1 Dec 2025).
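The variance decomposition and the epistemic-share diagnostic are direct to compute once $\Sigma$ is in hand. A hedged sketch on toy data (the fitted mean `w_map` is a random stand-in here; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D, sigma2, sigma02 = 6, 0.2, 5.0
Phi = rng.normal(size=(100, D))                 # toy training features
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(D) / sigma02)
w_map = rng.normal(size=D)                      # stand-in for the fitted MAP weights

phi_star = rng.normal(size=D)                   # features of a new input x_*
mean = phi_star @ w_map                         # point prediction
epistemic = phi_star @ Sigma @ phi_star         # model uncertainty, shrinks with data
aleatoric = sigma2                              # irreducible noise floor
total_var = aleatoric + epistemic               # predictive variance
epistemic_share = epistemic / total_var         # fractional epistemic share diagnostic
```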
5. Applications: Incremental Learning, Active Learning, and Conformal Prediction
| Application Area | Role of Last-Layer Laplace Approximation | Cited Paper |
|---|---|---|
| Incremental/Online Learning | Enables closed-form Bayesian updates on arrival of new data batches, faster than retraining | (Huseljic et al., 2022) |
| Active Learning | Permits fully sequential acquisition strategies (e.g., Uncertainty Sampling, BALD, Query-by-Committee) with per-point model update and uncertainty estimation | (Huseljic et al., 2022) |
| Conformal Prediction | Used in CLAPS to yield efficiency-improved conformal intervals with split-conformal calibration; intervals adapt to epistemic uncertainty | (Kim et al., 1 Dec 2025) |
| Uncertainty Quantification | Provides fast, scalable Gaussian posterior over outputs for downstream decision-making | (McInerney et al., 2024) |
Active learning context: The method enables the use of computationally intensive acquisition functions (e.g., BALD) after each new label is acquired, rather than in large pre-selected batches. Empirically, SNGP-LA yields 2–5% better learning curves than traditional top-batch-based batch selection (Huseljic et al., 2022).

Conformal prediction context: CLAPS combines the Gaussian posterior from the last-layer Laplace approximation with a two-sided posterior-CDF conformity score, producing tighter prediction intervals with finite-sample coverage; it is particularly effective when the epistemic variance is nontrivial (Kim et al., 1 Dec 2025).
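The cheap per-point update is what makes fully sequential acquisition practical. Below is a minimal sketch of one simple strategy, uncertainty sampling by epistemic variance, with a Sherman–Morrison update after each pick; this is a generic illustration on toy data, not the SNGP-LA or BALD procedure from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(3)
D, sigma2, sigma02 = 4, 0.3, 2.0
Sigma = sigma02 * np.eye(D)              # start from the prior covariance
pool = rng.normal(size=(30, D))          # unlabeled candidate features phi(x)

chosen, remaining = [], list(range(len(pool)))
for _ in range(5):
    # Score every unlabeled candidate by its epistemic variance phi^T Sigma phi.
    scores = [pool[i] @ Sigma @ pool[i] for i in remaining]
    i = remaining.pop(int(np.argmax(scores)))
    chosen.append(i)
    # Fold the newly labeled point into the posterior (Sherman-Morrison).
    phi = pool[i]
    Sv = Sigma @ phi
    Sigma -= np.outer(Sv, Sv) / (sigma2 + phi @ Sv)
```

Each acquisition step costs $\mathcal{O}(|pool| \cdot D^2)$, with no retraining between queries.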
6. Implementation Notes and Practical Considerations
- Cholesky and Hessian-vector products: For a final layer of width $D$, Cholesky-based inversion is feasible unless $D$ is extremely large. For large $D$, conjugate-gradient (CG) methods can be used for Hessian solves (McInerney et al., 2024, Kim et al., 1 Dec 2025).
- Curvature approximations: In practice, the Gauss–Newton approximation is often used for the Hessian (McInerney et al., 2024). For further efficiency, diagonal, Kronecker-factored, or low-rank approximations can be applied.
- Numerical stability: Additional damping is applied to ensure positive-definiteness of the Hessian; fallback strategies are employed when the Hessian is ill-conditioned (McInerney et al., 2024).
- Scalability: The technique is computationally viable in settings where inverting the full-network Hessian is infeasible—applicable for last-layer dimensions up to several thousand.
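In practice one rarely forms $\Sigma = H^{-1}$ explicitly: a single Cholesky factorization of $H$ suffices, after which each predictive variance $\phi_*^\top H^{-1} \phi_*$ is a triangular solve. A minimal NumPy sketch of this implementation pattern (toy data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(4)
D, sigma2, sigma02 = 64, 0.1, 1.0
Phi = rng.normal(size=(500, D))
H = Phi.T @ Phi / sigma2 + np.eye(D) / sigma02   # positive-definite Hessian

# One O(D^3) factorization H = L L^T; every query is then an O(D^2) solve.
L = np.linalg.cholesky(H)

def epistemic_var(phi):
    # phi^T H^{-1} phi = ||L^{-1} phi||^2, via the forward solve L v = phi.
    v = np.linalg.solve(L, phi)
    return float(v @ v)

phi_star = rng.normal(size=D)
var_chol = epistemic_var(phi_star)
var_inv = float(phi_star @ np.linalg.inv(H) @ phi_star)  # explicit inverse, for checking
assert np.isclose(var_chol, var_inv)
```

Adding damping (a larger multiple of the identity to `H`) before factorizing is the usual guard against ill-conditioning mentioned above.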
7. Limitations and Theoretical Considerations
- Representation uncertainty ignored: Only the last-layer weights are treated as random variables. Uncertainty in the backbone is not captured; the method assumes the feature-extractor parameters are well-optimized and treated as certain (Huseljic et al., 2022, Kim et al., 1 Dec 2025).
- Posterior contraction: As the dataset size $N$ increases, epistemic uncertainty diminishes ($\phi_*^\top \Sigma\, \phi_* \to 0$), so the benefits over mean-based or residual-based methods vanish (Kim et al., 1 Dec 2025).
- Homogeneity assumptions: Many approaches assume homoscedastic Gaussian noise; strongly heteroscedastic settings require explicit modeling via scale or quantile heads (Kim et al., 1 Dec 2025).
- Fidelity vs. Scalability trade-off: High modeling fidelity would require full Laplace approximations over all parameters, which is computationally prohibitive. Last-layer Laplace trades off some fidelity for tractability (McInerney et al., 2024).
- Accuracy depends on feature representation: The posterior variance is only meaningful to the extent that the frozen representation is expressive and robust (Kim et al., 1 Dec 2025).
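The posterior-contraction limitation is easy to observe numerically: on nested (growing) data, the Hessian only accumulates terms, so the epistemic variance at any fixed query point strictly shrinks. A small toy demonstration (illustrative names, synthetic features):

```python
import numpy as np

rng = np.random.default_rng(5)
D, sigma2, sigma02 = 6, 0.5, 2.0
phi_star = rng.normal(size=D)           # fixed query point
Phi_all = rng.normal(size=(2000, D))    # a growing stream of feature vectors

epistemic = []
for N in (20, 200, 2000):
    Phi = Phi_all[:N]                   # first N observations (nested prefixes)
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(D) / sigma02)
    epistemic.append(float(phi_star @ Sigma @ phi_star))

# The Hessian grows in the Loewner order, so epistemic variance contracts toward 0.
assert epistemic[0] > epistemic[1] > epistemic[2]
```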
References
- “Efficient Bayesian Updates for Deep Learning via Laplace Approximations” (Huseljic et al., 2022)
- “CLAPS: Posterior-Aware Conformal Intervals via Last-Layer Laplace” (Kim et al., 1 Dec 2025)
- “Variation Due to Regularization Tractably Recovers Bayesian Deep Learning” (McInerney et al., 2024)
- “An Extended Simplified Laplace strategy for Approximate Bayesian inference of Latent Gaussian Models using R-INLA” (Chiuchiolo et al., 2022)