
Regularized Random Fourier Features (RRFF)

Updated 26 December 2025
  • RRFF are advanced kernel approximations that integrate explicit regularization with data-dependent feature selection for efficient, statistically sound learning.
  • The method balances kernel regularization and feature truncation via plain or leverage-weighted sampling to reduce sample complexity and achieve minimax rates.
  • RRFF extends to operator learning and high-dimensional regimes, revealing phenomena such as implicit regularization, double descent, and stable reconstructions.

Regularized Random Fourier Features (RRFF) are a class of random feature approximations for kernel methods that incorporate explicit regularization and data-dependent feature selection to achieve sharp statistical guarantees and significant reductions in computational cost relative to classical kernel learning. RRFF systematically balances kernel regularization with random feature truncation—using the number and distribution of features, frequency-weighted penalties, and potentially empirical (leverage-score) weighting—yielding algorithms that attain minimax learning rates with dramatically fewer features, provide robust operator generalization in the presence of noise, and reveal new phenomena such as implicit regularization and double descent in high-dimensional regimes.

1. Mathematical Foundations and Problem Setup

RRFF begins with a shift-invariant positive definite kernel $k(x, y)$ on $\mathcal{X} \times \mathcal{X}$, possessing a Bochner spectral representation:

$$k(x, y) = \int_{\mathbb{V}} z(v, x)\,z(v, y)\,p(v)\,dv,$$

where $p$ is a probability density over $\mathbb{V}$, and $z(v, x)$ is typically trigonometric (e.g., $z(v, x) = \sqrt{2} \cos(v^T x + b)$ for $b \sim \mathrm{Uniform}[0, 2\pi]$). Observing $n$ i.i.d. data pairs $(x_i, y_i)$ drawn from $P$ and denoting by $K \in \mathbb{R}^{n \times n}$ the Gram matrix, RRFF replaces $k$ by a Monte Carlo average:

$$\tilde{k}(x, y) = \frac{1}{m} \sum_{j=1}^m z(v_j, x)\, z(v_j, y),$$

where the $v_j$ are drawn from $p$ (“plain RFF”) or from a data-dependent density $q$ (“weighted RFF”).
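
The Monte Carlo approximation can be checked numerically for the Gaussian kernel $k(x, y) = e^{-\gamma \|x - y\|^2}$, whose Bochner density $p(v)$ is $\mathcal{N}(0, 2\gamma I)$. A minimal NumPy sketch (the bandwidth $\gamma$ and feature count $m$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, gamma = 5, 2000, 0.5  # input dimension, feature count, kernel bandwidth

# Bochner density for k(x, y) = exp(-gamma * ||x - y||^2) is N(0, 2*gamma*I)
V = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))  # frequencies v_j ~ p(v)
b = rng.uniform(0, 2 * np.pi, size=m)                  # phases b_j ~ U[0, 2*pi]

def features(X):
    # z(v, x) = sqrt(2) cos(v^T x + b), scaled by 1/sqrt(m) so that the
    # inner product of feature vectors is the Monte Carlo average tilde-k
    return np.sqrt(2.0 / m) * np.cos(X @ V.T + b)

X = rng.normal(size=(10, d))
K_true = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
K_rff = features(X) @ features(X).T                    # tilde-k(x_i, x_j)
max_err = np.abs(K_true - K_rff).max()                 # shrinks like O(1/sqrt(m))
```

The entrywise error decays at the Monte Carlo rate, so doubling $m$ reduces it by roughly $\sqrt{2}$.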

In the context of empirical risk minimization with convex loss $l(y, f(x))$ and Tikhonov regularization parameter $\lambda$, RRFF reduces infinite-dimensional kernel regression/classification to finite-dimensional linear problems:

$$\min_{\beta \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n l\big(y_i, \tilde{k}(x_i, \cdot)^T \beta\big) + \lambda \|\beta\|_2^2.$$
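
For squared error loss, this finite-dimensional problem has the closed-form normal-equations solution $\beta = (Z^T Z + n\lambda I)^{-1} Z^T y$, where $Z$ is the random feature matrix. A toy NumPy sketch (Gaussian kernel; data, $\lambda$, and $m$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, lam, gamma = 200, 3, 300, 1e-2, 0.5

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)         # noisy smooth target

V = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))  # v_j ~ p(v), Gaussian kernel
b = rng.uniform(0, 2 * np.pi, size=m)
Z = np.sqrt(2.0 / m) * np.cos(X @ V.T + b)             # feature matrix, shape (n, m)

# ridge solution of (1/n)||y - Z beta||^2 + lam ||beta||^2
beta = np.linalg.solve(Z.T @ Z + n * lam * np.eye(m), Z.T @ y)
train_mse = np.mean((Z @ beta - y) ** 2)
```

Only an $m \times m$ system is solved, instead of the $n \times n$ system of full kernel ridge regression.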

2. Risk Guarantees and Sample Complexity

Rigorous non-asymptotic risk bounds under both squared error and general Lipschitz loss have been established for RRFF (Li et al., 2018, Li, 2021). The key complexity control parameter is the effective degrees of freedom of the kernel,

$$d_K^\lambda = \mathrm{Tr}\left[K(K + n\lambda I)^{-1}\right] = \sum_{i=1}^n \frac{\lambda_i}{\lambda_i + n\lambda},$$

where $\{\lambda_i\}$ are the eigenvalues of $K$.
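
The trace and eigenvalue forms of $d_K^\lambda$ agree, which can be verified numerically (a small sketch with an illustrative Gaussian Gram matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma, lam = 100, 0.5, 1e-2
X = rng.normal(size=(n, 2))
K = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # Gram matrix

# eigenvalue form: sum_i lambda_i / (lambda_i + n*lam)
eigs = np.linalg.eigvalsh(K)
d_eff = np.sum(eigs / (eigs + n * lam))

# trace form: Tr[K (K + n*lam*I)^{-1}]
d_eff_trace = np.trace(K @ np.linalg.inv(K + n * lam * np.eye(n)))
```

Since each summand lies in $(0, 1)$, the quantity $d_K^\lambda$ is always between $0$ and $n$, shrinking as $\lambda$ grows.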

For kernel ridge regression (squared error loss), the excess risk of RRFF with $m$ features is, under mild moment assumptions, upper bounded by:

$$\mathbb{E}[(y - \tilde{f}^\lambda_m(x))^2] - \mathbb{E}[(y - f_{\mathcal{H}}(x))^2] \leq 4\lambda + O(n^{-1/2}) + \left(\mathbb{E}[(y - \hat{f}^\lambda(x))^2] - \mathbb{E}[(y - f_{\mathcal{H}}(x))^2]\right)$$

as soon as $m \gtrsim d_{\tilde{\ell}}\,\log(d_K^\lambda/\delta)$, where $d_{\tilde{\ell}}$ depends on the sampling density ($d_{\tilde{\ell}} = z_0^2/\lambda$ for plain RFF, $d_{\tilde{\ell}} = d_K^\lambda$ for leverage-weighted RFF).

Refined rates: fast spectral decay of $K$ allows the $O(n^{-1/2})$ term to be replaced by $O(n^{-1})$, $O((\log n)/n)$, or better (Li et al., 2018).

For general convex Lipschitz losses (e.g., SVM or logistic regression), the same structure holds, but the bias term $4\lambda$ is replaced by $O(\sqrt{\lambda})$, necessitating a smaller $\lambda$ to attain minimax rates (Li, 2021).

3. Feature Selection Schemes and Leverage Weighting

Two principal feature sampling regimes are studied (Li et al., 2018, Li, 2021):

  • Plain RFF: frequencies $v_j$ drawn i.i.d. from $p(v)$; the sample complexity is $m = O\big((z_0^2/\lambda)\log(d_K^\lambda/\delta)\big)$, which (with $\lambda \asymp n^{-1/2}$) is typically $O(\sqrt{n}\log n)$.
  • Leverage-Weighted RFF: Sampling frequencies proportional to empirical ridge leverage scores

$$l_\lambda(v) = p(v)\, z(v, x)^T (K + n\lambda I)^{-1} z(v, x),$$

normalized to $q^*(v) = l_\lambda(v)/d_K^\lambda$. This reduces the required number of features to $m = O(d_K^\lambda \log(d_K^\lambda/\delta))$, often $O(1)$ or $\operatorname{polylog}(n)$ in benign regimes.

Since computing the leverage scores $l_\lambda(v)$ exactly is computationally prohibitive, a fast two-stage approximation is used: a large pool of candidate features is sampled from $p(v)$, the $K$-based expressions are replaced by their empirical RFF proxies, and a smaller subset is then sampled according to these approximate leverage weights, retaining the same statistical guarantees (Li et al., 2018).

Pseudocode: Approximate Leverage-Weighted RFF

Input: Data {(x_i, y_i)}_{i=1}^n, kernel k, regularization λ,
       pool-size s, target m ≪ s.
1. Draw v_1,…,v_s ∼ p(v), build Z_s ∈ ℝ^{n×s}, Z_s[i,j] = z(v_j, x_i)
2. Compute A = Z_s^T Z_s, G = (A/s + nλ I_s)^{-1}
3. For j=1,…,s, set w_j = [A G]_{jj}
4. Normalize p_j = w_j / (∑_j w_j)
5. Sample m indices according to p_j, output features and weights
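
A runnable version of the pseudocode might look like the following NumPy sketch. The helper names `p_sampler` and `z_fn` are hypothetical placeholders for the Bochner density sampler and the feature map; the toy Gaussian-kernel usage is illustrative:

```python
import numpy as np

def approx_leverage_rff(X, z_fn, p_sampler, lam, s, m, rng):
    """Two-stage approximate leverage-weighted RFF sampling (steps 1-5)."""
    n = X.shape[0]
    V = p_sampler(s)                    # 1. candidate pool v_1..v_s ~ p(v)
    Z_s = z_fn(V, X)                    # Z_s[i, j] = z(v_j, x_i), shape (n, s)
    A = Z_s.T @ Z_s
    G = np.linalg.inv(A / s + n * lam * np.eye(s))  # 2. (A/s + n*lam*I)^{-1}
    w = np.diag(A @ G)                  # 3. approximate leverage scores
    p = w / w.sum()                     # 4. normalize to a distribution
    idx = rng.choice(s, size=m, p=p)    # 5. resample m features
    return V[idx], p[idx]

# toy usage with a Gaussian kernel (gamma is an illustrative bandwidth)
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
gamma = 0.5
p_sampler = lambda s: rng.normal(scale=np.sqrt(2 * gamma), size=(s, 2))
z_fn = lambda V, X: np.sqrt(2.0) * np.cos(
    X @ V.T + rng.uniform(0, 2 * np.pi, size=len(V)))
V_m, p_m = approx_leverage_rff(X, z_fn, p_sampler, lam=1e-2, s=200, m=20, rng=rng)
```

The returned sampling probabilities are needed downstream to importance-weight the selected features.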

4. Extensions: Frequency-Weighted Regularization and Operator Learning

Beyond ridge regularization, RRFF incorporates frequency-weighted penalties to suppress high-frequency noise, which is particularly beneficial for operator learning in Sobolev or Matérn spaces (Yu et al., 19 Dec 2025). Features can be drawn from heavy-tailed distributions (e.g., Student's $t$), and regularization weights may grow with frequency, e.g., $w(\omega) = \|\Sigma^{-1/2} \omega\|^p$ or $w(\omega) = 1 + \|\omega\|^p$.

This methodology extends to RRFF-FEM for operator learning, coupling finite-dimensional RRFF regression with finite element reconstruction to produce stable $H^s$-regular output functions. High-probability bounds on the singular values of the random feature matrix guarantee well-conditioning and generalization once $N = O(m\log m)$, with robustness to noise in practical PDE benchmarks (Yu et al., 19 Dec 2025).

5. Implicit Regularization and High-Dimensional Phenomena

RRFF with a finite number of features introduces implicit regularization beyond the explicit ridge penalty. In the Gaussian RFF model, with $P$ features and regularization $\lambda$, the average RF predictor matches a KRR predictor with effective ridge $\tilde{\lambda} > \lambda$, determined as the solution to

$$\tilde{\lambda} = \lambda + \frac{\tilde{\lambda}}{\gamma} \cdot \frac{1}{N} \sum_{i=1}^N \frac{d_i}{\tilde{\lambda} + d_i}, \qquad \gamma = P/N.$$

The gap $\tilde{\lambda} - \lambda$ vanishes as $P \to \infty$ but accounts for most of the finite-sample bias/variance trade-off (Jacot et al., 2020).
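
The effective ridge $\tilde{\lambda}$ has no closed form, but since the right-hand side of the self-consistent equation is increasing and bounded in $\tilde{\lambda}$, fixed-point iteration from $\tilde{\lambda} = \lambda$ converges monotonically. A sketch, where the $d_i$ are taken to be kernel eigenvalues and all values are illustrative:

```python
import numpy as np

def effective_ridge(lam, d, P, iters=500):
    """Iterate lam_t <- lam + (lam_t/gamma) * (1/N) * sum_i d_i/(lam_t + d_i),
    with gamma = P/N, starting from lam_t = lam (monotone convergence)."""
    d = np.asarray(d, dtype=float)
    gamma = P / len(d)
    lam_t = lam
    for _ in range(iters):
        lam_t = lam + (lam_t / gamma) * np.mean(d / (lam_t + d))
    return lam_t

d = np.linspace(0.1, 2.0, 50)         # illustrative kernel eigenvalues d_i
lam, P = 0.1, 100                     # explicit ridge and feature count
lam_eff = effective_ridge(lam, d, P)  # effective ridge, lam_eff > lam
```

As $P$ (hence $\gamma$) grows, the correction term shrinks and `lam_eff` approaches the explicit `lam`, matching the vanishing gap noted above.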

In the double-asymptotic regime (large $n$, $p$, $N$), phase transitions and double descent emerge. At the interpolation threshold ($2N = n$), the test error exhibits a singularity in the ridgeless limit ($\lambda \to 0$). For $2N < n$ (under-parameterized), train and test errors increase as $N$ decreases; for $2N > n$ (over-parameterized), errors descend again, with the best results in the mildly over-parameterized regime (Liao et al., 2020).

6. Computational Complexity and Practical Performance

RRFF reduces computational bottlenecks from $O(n^3)$ time and $O(n^2)$ memory (full kernel methods) to $O(ns^2 + s^3)$ time and $O(ns)$ memory, where $s$ is typically $O(\sqrt{n}\log n)$ or smaller with leverage weighting (Li et al., 2018, Li, 2021). Operator-learning benchmarks confirm reductions in both error and run time relative to unregularized RFF, with RRFF-FEM producing smoother, more stable reconstructions and lower error in noisy settings (Yu et al., 19 Dec 2025).

In classification settings with Lipschitz loss, plain RRFF achieves the minimax $O(1/\sqrt{n})$ risk with $O(\sqrt{n}\log n)$ features, with fast $O(1/n)$ rates under low-noise conditions, while leverage-weighted RRFF can achieve near-linear scaling in benign regimes (Li, 2021).
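
As an illustration of the Lipschitz-loss setting, random Fourier features can be plugged into $\ell_2$-regularized logistic regression and trained by plain gradient descent. A toy NumPy sketch; the data, bandwidth, step size, and iteration count are arbitrary choices, not taken from the cited analyses:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m, lam, gamma = 400, 2, 100, 1e-3, 0.5
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)     # toy +/-1 labels

V = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
b = rng.uniform(0, 2 * np.pi, size=m)
Z = np.sqrt(2.0 / m) * np.cos(X @ V.T + b)         # random feature matrix

# gradient descent on (1/n) sum_i log(1 + exp(-y_i z_i^T beta)) + lam ||beta||^2
beta = np.zeros(m)
for _ in range(1000):
    s = 1.0 / (1.0 + np.exp(y * (Z @ beta)))       # sigmoid of negative margin
    grad = -(Z.T @ (y * s)) / n + 2 * lam * beta
    beta -= 10.0 * grad
accuracy = np.mean(np.sign(Z @ beta) == y)
```

The logistic loss is 1-Lipschitz, so this falls under the general convex Lipschitz-loss guarantees discussed above.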

7. Summary and Typical Parameter Choices

RRFF unites kernel regularization and random feature approximation. Selecting

$$m \approx d_K^\lambda \log d_K^\lambda$$

features suffices to match the statistical performance of full-kernel methods, and leverage weighting further reduces mm by directly targeting the effective degrees of freedom. Frequency weighting and operator reconstruction via finite elements enable robust, scalable learning of function-to-function maps with stability in high-noise and high-dimensional regimes.

| Feature Selection | Sample Complexity | Computational Cost |
|---|---|---|
| Plain RFF | $O(\sqrt{n}\log n)$ | $O(ns^2 + s^3)$ |
| Leverage-weighted RFF | $O(d_K^\lambda \log d_K^\lambda)$ | $O(ns^2 + s^3)$ |
| RRFF-FEM (operator learning) | $O(m\log m)$ | $O(m^3)$ (FE solve) |

RRFF methods have established theoretical risk guarantees, efficient algorithms for approximating leverage scores, phase-aware regularization effects, and demonstrated empirical superiority in regression, classification, and operator learning across a broad range of settings (Li et al., 2018, Li, 2021, Jacot et al., 2020, Yu et al., 19 Dec 2025, Liao et al., 2020).
