Sparse Spectrum GP Regression
- Sparse Spectrum Gaussian Process Regression (SSGPR) is a scalable approach that approximates stationary kernels using a finite set of random Fourier features.
- It employs Bochner’s theorem and Bayesian variational inference to optimize spectral parameters, reducing overfitting and quantifying uncertainty effectively.
- SSGPR supports both full-batch and stochastic gradient optimization, with extensions for nonstationary data via warped kernel models and local adaptive predictions.
Sparse Spectrum Gaussian Process Regression (SSGPR) encompasses a family of scalable Gaussian process (GP) regression models that approximate the GP kernel by a finite expansion in random Fourier features. This approach leverages Bochner’s theorem to represent stationary covariance functions, reducing training and prediction to linear algebra in an $m$-dimensional feature space, with cost linear in the number of data points. The method has evolved from point-estimate models, through closed-form variational Bayesian schemes, to distributed, mini-batch, and fully stochastic frameworks, as well as extensions to nonstationary kernels via warped inputs. This article describes the theoretical foundations, optimization strategies, algorithmic mechanisms, and empirical properties of SSGPR and its principal Bayesian enhancements.
1. Fourier-Based Sparse Spectrum Approximation
SSGPR employs a finite-rank approximation of stationary kernels using Monte-Carlo Fourier features. For a stationary kernel $k(x, x') = k(x - x')$ on $\mathbb{R}^d$, Bochner's theorem yields

$$k(x - x') = \int_{\mathbb{R}^d} S(\omega)\, e^{i \omega^\top (x - x')}\, d\omega,$$

where $S(\omega)$ is the spectral density. Approximating the integral with $m$ sampled frequencies $\omega_j \sim S(\omega)$ and phases $b_j \sim \mathcal{U}[0, 2\pi)$ leads to

$$k(x, x') \approx \phi(x)^\top \phi(x'),$$

with

$$\phi_j(x) = \sqrt{\tfrac{2\sigma_f^2}{m}}\, \cos\!\big(\omega_j^\top x + b_j\big), \qquad j = 1, \dots, m.$$
This generates an explicit, low-dimensional random feature embedding suitable for large-scale regression, with Bayesian linear regression in feature space providing mean and covariance predictions (Tompkins et al., 2020, Gal et al., 2015).
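As a concrete sketch, the embedding and the resulting Bayesian linear regression can be implemented in a few lines. The toy data, hyperparameter values, and unit Gaussian weight prior below are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative).
N, m, sigma_f, lengthscale, noise = 50, 100, 1.0, 0.5, 0.1
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + noise * rng.standard_normal(N)

# Sample frequencies from the RBF kernel's spectral density (a Gaussian with
# scale 1/lengthscale) and phases uniformly on [0, 2*pi).
omega = rng.standard_normal((m, 1)) / lengthscale
b = rng.uniform(0, 2 * np.pi, size=m)

def phi(X):
    """Random Fourier feature map: Phi[n, j] = sqrt(2 sigma_f^2 / m) cos(omega_j . x_n + b_j)."""
    return np.sqrt(2 * sigma_f**2 / m) * np.cos(X @ omega.T + b)

# Bayesian linear regression in feature space with a unit Gaussian weight prior.
Phi = phi(X)
A = Phi.T @ Phi / noise**2 + np.eye(m)   # posterior precision of the weights
w_mean = np.linalg.solve(A, Phi.T @ y) / noise**2

def predict(Xs):
    """Closed-form predictive mean and variance at test inputs."""
    Ps = phi(Xs)
    mu = Ps @ w_mean
    var = noise**2 + np.einsum('ij,ji->i', Ps, np.linalg.solve(A, Ps.T))
    return mu, var

Xs = np.linspace(-3, 3, 7)[:, None]
mu, var = predict(Xs)
```

With enough features the feature-space posterior closely tracks the exact GP posterior for the corresponding stationary kernel.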
2. Bayesian and Variational Models for SSGPR
Classical SSGP either fixes the spectrum randomly or optimizes it by marginal likelihood, potentially leading to overconfidence or overfitting. Bayesian SSGPR remedies this by placing priors over frequencies and weights. In particular, the Variational Sparse Spectrum GP (VSSGP) (Gal et al., 2015) introduces variational posteriors for spectral inputs and weights:
$$q(\Omega, B, W) = q(\Omega)\, q(B)\, q(W),$$

where $q(\Omega)$ is typically Gaussian, $q(B)$ uniform, and $q(W)$ is Gaussian for each output. Learning optimizes the evidence lower bound (ELBO)

$$\mathcal{L} = \mathbb{E}_{q(\Omega, B, W)}\big[\log p(Y \mid X, \Omega, B, W)\big] - \mathrm{KL}\big(q(\Omega, B, W)\,\|\,p(\Omega, B, W)\big),$$

with all expectations (notably the mean and covariance of $\phi(x)$ under $q$) handled analytically via Gaussian–cosine identities. The optimal weight posterior $q(W)$ admits a closed-form solution given $q(\Omega)$ and $q(B)$, yielding efficient updates and guaranteed reduction of the objective (Gal et al., 2015, Tan et al., 2013, Hoang et al., 2016).
Variational inference may use nonconjugate variational message passing, as in Tan et al. (Tan et al., 2013), or a stochastic reparameterization technique for large-scale regression (Hoang et al., 2016).
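A minimal sketch of the closed-form weight-posterior update: given the first and second moments of the random features under $q$, the optimal Gaussian $q(w)$ follows from standard Bayesian linear regression. For brevity the moments are estimated here by Monte Carlo rather than the analytic Gaussian–cosine identities used in the papers; the data, the illustrative $q(\omega)$, and all constants are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, noise = 40, 30, 0.1
X = rng.uniform(-2, 2, size=(N, 1))
y = np.cos(2 * X[:, 0])

# Illustrative variational posterior over frequencies, q(omega_j) = N(mu_j, s_j^2),
# and fixed phases; these values are assumptions for the sketch.
mu_om = 2.0 * rng.standard_normal((m, 1))
s_om = 0.1 * np.ones((m, 1))
b = rng.uniform(0, 2 * np.pi, size=m)
scale = np.sqrt(2.0 / m)

def feature_moments(X, S=200):
    """Monte-Carlo estimates of E_q[Phi] and E_q[Phi^T Phi]; the cited papers
    obtain these in closed form via Gaussian-cosine identities."""
    Psi1 = np.zeros((X.shape[0], m))
    Psi2 = np.zeros((m, m))
    for _ in range(S):
        om = mu_om + s_om * rng.standard_normal((m, 1))
        Phi = scale * np.cos(X @ om.T + b)
        Psi1 += Phi / S
        Psi2 += Phi.T @ Phi / S
    return Psi1, Psi2

# Optimal Gaussian weight posterior q(w) = N(w_mean, w_cov), given the moments
# and a unit Gaussian prior on the weights.
Psi1, Psi2 = feature_moments(X)
w_cov = np.linalg.inv(Psi2 / noise**2 + np.eye(m))
w_mean = w_cov @ (Psi1.T @ y) / noise**2
```

Because this update is available in closed form, coordinate ascent can alternate between $q(W)$ and the spectral factors without inner optimization loops.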
3. Optimization Frameworks and Scalability
Training SSGPR models is feasible via full-batch or stochastic gradient techniques. For variational models, the expected log-likelihood term of the ELBO factorizes across datapoints and outputs:

$$\mathcal{L} = \sum_{n=1}^{N} \mathbb{E}_{q}\big[\log p(y_n \mid x_n, \Omega, B, W)\big] - \mathrm{KL}\big(q \,\|\, p\big),$$

enabling distributed computation, parallelization, and stochastic mini-batches with per-step cost $\mathcal{O}(Bm^2)$, where $B$ is the mini-batch size and $m$ the number of features. Optimization employs standard algorithms (L-BFGS, RMSProp), while the reparameterization trick in the sVBSSGP framework allows unbiased, constant-time stochastic gradients irrespective of data size (Gal et al., 2015, Hoang et al., 2016).
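The mini-batch scheme can be sketched as follows: reparameterize the frequencies, score a random mini-batch, and rescale by $N/B$ to obtain an unbiased estimate of the ELBO's data-fit term. The point-estimate weights and all constants here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, B, noise = 1000, 20, 32, 0.1
X = rng.uniform(-2, 2, size=(N, 1))
y = np.sin(3 * X[:, 0])

mu_om = rng.standard_normal((m, 1))   # variational means of the frequencies
log_s = np.full((m, 1), -2.0)         # log standard deviations
b = rng.uniform(0, 2 * np.pi, size=m)
w = 0.1 * rng.standard_normal(m)      # point weights, for the sketch only

def minibatch_fit_term(idx):
    """Unbiased estimate of the ELBO's expected data-fit term: draw
    omega = mu + s*eps (reparameterization trick), score the mini-batch,
    and rescale by N/B."""
    eps = rng.standard_normal((m, 1))
    om = mu_om + np.exp(log_s) * eps
    Phi = np.sqrt(2.0 / m) * np.cos(X[idx] @ om.T + b)
    resid = y[idx] - Phi @ w
    return -(N / len(idx)) * 0.5 * np.sum(resid**2) / noise**2

idx = rng.choice(N, size=B, replace=False)
est = minibatch_fit_term(idx)
```

In an automatic-differentiation framework, gradients of this estimator with respect to `mu_om` and `log_s` flow through the reparameterized draw, which is what makes the per-step cost independent of $N$.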
For nonconjugate settings (kernel hyperparameters, variances), natural-gradient updates and step-size adaptation further accelerate convergence (Tan et al., 2013). Empirical use of adaptive neighbourhoods for nonstationary data shrinks local prediction regions according to inferred lengthscales, providing automatic variable selection.
4. Predictions and Inference
Predictive mean and variance for SSGPR at a test point $x_*$ are computed in closed form. With $\Phi$ the $N \times m$ matrix of training features and $A = \Phi^\top \Phi / \sigma_n^2 + I$:
- Predictive mean: $\mu(x_*) = \tfrac{1}{\sigma_n^2}\, \phi(x_*)^\top A^{-1} \Phi^\top y$
- Predictive variance: $\sigma^2(x_*) = \sigma_n^2 + \phi(x_*)^\top A^{-1} \phi(x_*)$
with all terms available via closed-form Gaussian integrals over the variational posteriors of the spectrum (Gal et al., 2015). Bayesian model averaging via multiple samples further enhances robustness, especially in high-dimensional, locally correlated settings (Hoang et al., 2016).
Local adaptive prediction leverages reweighted neighbourhood selection using Mahalanobis-type distances based on inferred lengthscales, improving prediction under nonstationarity and irrelevant inputs (Tan et al., 2013).
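A brief sketch of such a neighbourhood rule, with made-up lengthscales in which one input dimension is effectively irrelevant (the data and the value of `k` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 3
X = rng.standard_normal((N, d))

# Inferred ARD lengthscales: dimension 2 has a very large lengthscale,
# so it barely affects the distance and is effectively pruned.
lengthscales = np.array([0.5, 0.5, 50.0])

def local_neighbourhood(x_star, k=10):
    """Select the k nearest training points under a Mahalanobis-type
    distance in which each coordinate is scaled by its lengthscale."""
    diff = (X - x_star) / lengthscales
    d2 = np.sum(diff**2, axis=1)
    return np.argsort(d2)[:k]

nbrs = local_neighbourhood(np.zeros(d))
```

Fitting or predicting only on such a neighbourhood is what lets the model adapt to nonstationarity while ignoring irrelevant inputs.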
5. Extensions to Nonstationary and Warped Kernel Models
SSGPR has been extended to nonstationary kernels via input-dependent Gaussian warping. The SSWIM framework (Tompkins et al., 2020) introduces a “warped” kernel

$$\tilde{k}(x, x') = \mathbb{E}_{z \sim \mathcal{N}(\mu(x), \Sigma(x)),\; z' \sim \mathcal{N}(\mu(x'), \Sigma(x'))}\big[k(z, z')\big],$$

where each input $x$ induces a Gaussian measure with mean $\mu(x)$ and covariance $\Sigma(x)$ provided by lower-level GPs. The finite-feature expansion for warped SSGPR involves the damped features

$$\tilde{\phi}_j(x) = \sqrt{\tfrac{2\sigma_f^2}{m}}\, \exp\!\big(-\tfrac{1}{2}\, \omega_j^\top \Sigma(x)\, \omega_j\big) \cos\!\big(\omega_j^\top \mu(x) + b_j\big),$$
retaining closed-form training and inference. Empirical evaluation shows that single or stacked warping layers recover highly nonstationary functions and outperform vanilla SSGPR, SVGP, Deep Kernel Learning, and doubly-stochastic deep GPs, with minimal parameter overhead (Tompkins et al., 2020).
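The damped-feature construction can be sketched using the Gaussian–cosine identity $\mathbb{E}[\cos(\omega^\top z + b)] = e^{-\frac{1}{2}\omega^\top \Sigma \omega} \cos(\omega^\top \mu + b)$ for $z \sim \mathcal{N}(\mu, \Sigma)$. The warp functions below are simple stand-ins for the lower-level GPs, not the SSWIM parameterization itself:

```python
import numpy as np

rng = np.random.default_rng(4)
m_feat, d = 50, 1
omega = rng.standard_normal((m_feat, d))
b = rng.uniform(0, 2 * np.pi, size=m_feat)

# Hypothetical lower-level warps: each input x is mapped to a Gaussian
# measure N(warp_mean(x), warp_var(x)); these stand in for learned GPs.
def warp_mean(X):
    return X + 0.3 * np.sin(2 * X)     # nonstationary deformation of the input

def warp_var(X):
    return 0.05 * np.ones_like(X)      # (here constant) input-dependent variance

def warped_features(X):
    """Expected Fourier features under each input's Gaussian measure,
    in closed form via the Gaussian-cosine identity."""
    mu, var = warp_mean(X), warp_var(X)
    damp = np.exp(-0.5 * var @ (omega**2).T)     # exp(-0.5 w^T Sigma w) for diagonal Sigma
    return np.sqrt(2.0 / m_feat) * damp * np.cos(mu @ omega.T + b)

Phi = warped_features(np.linspace(-2, 2, 5)[:, None])
```

Because the damping and deformation enter the features in closed form, the warped model reuses the same Bayesian linear regression machinery as vanilla SSGPR.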
6. Comparative Properties and Empirical Evaluation
The following qualitative distinctions among sparse GP models are established (Gal et al., 2015):
- SPGP (sparse pseudo-inputs) captures local uncertainty but fails on global spectral structure.
- SSGP with point spectrum over-fits, leading to degenerate frequency selection.
- RP (random projections) exhibits systematic under-fitting, as random frequencies do not adapt to data.
- VSSGP and sVBSSGP overcome these issues via full Bayesian, variational posteriors, preventing overconfident spectral assignments and regularizing uncertainty.
Empirical studies demonstrate robustness and scalability: VSSGP and sVBSSGP match or surpass inducing-point and fixed-spectrum methods in RMSE and log likelihood, particularly on high-dimensional datasets and large-scale benchmarks (Hoang et al., 2016, Tan et al., 2013). Computational cost scales as $\mathcal{O}(Nm^2)$ per iteration for standard models, and is independent of the data size $N$ per iteration for constant-cost, partitioned stochastic gradient variants (Hoang et al., 2016).
7. Practical Considerations and Model Selection
SSGPR variants offer parameter efficiency, requiring only a moderate number of random features even in high-dimensional regimes. The warping models demand only the additional lower-level GP parameters, a small overhead relative to the feature expansion. Initialization strategies favor identity warps and small input covariances. Practical setups (mini-batch training, convergence acceleration, local Mahalanobis prediction) enable principled uncertainty quantification, feature selection, and adaptation to complex, nonstationary data.
This suggests SSGPR and its Bayesian extensions provide a principled, scalable regression modeling framework that efficiently resolves spectral uncertainty, regularizes overfitting, and adapts to nonstationary structure via input warpings and local inference (Gal et al., 2015, Tompkins et al., 2020, Tan et al., 2013, Hoang et al., 2016).