
Regularized Random Fourier Features (RRFF)

Updated 26 December 2025
  • RRFF are advanced kernel approximations that integrate explicit regularization with data-dependent feature selection for efficient, statistically sound learning.
  • The method balances kernel regularization and feature truncation via plain or leverage-weighted sampling to reduce sample complexity and achieve minimax rates.
  • RRFF extends to operator learning and high-dimensional regimes, revealing phenomena such as implicit regularization, double descent, and stable reconstructions.

Regularized Random Fourier Features (RRFF) are a class of random feature approximations for kernel methods that incorporate explicit regularization and data-dependent feature selection to achieve sharp statistical guarantees and significant reductions in computational cost relative to classical kernel learning. RRFF systematically balances kernel regularization with random feature truncation—using the number and distribution of features, frequency-weighted penalties, and potentially empirical (leverage-score) weighting—yielding algorithms that attain minimax learning rates with dramatically fewer features, provide robust operator generalization in the presence of noise, and reveal new phenomena such as implicit regularization and double descent in high-dimensional regimes.

1. Mathematical Foundations and Problem Setup

RRFF begins with a shift-invariant positive definite kernel $k(x, y)$ on $\mathcal{X} \times \mathcal{X}$, possessing a Bochner spectral representation:

$$k(x, y) = \int_{\mathbb{V}} z(v, x)\,z(v, y)\,p(v)\,dv,$$

where $p$ is a probability density over $\mathbb{V}$, and $z(v, x)$ is typically trigonometric (e.g., $z(v, x) = \sqrt{2} \cos(v^T x + b)$ for $b \sim \mathrm{Uniform}[0, 2\pi]$). Observing $n$ i.i.d. data pairs $(x_i, y_i)$ drawn from $P$ and denoting by $K \in \mathbb{R}^{n \times n}$ the Gram matrix, RRFF replaces $k$ by a Monte Carlo average:

$$\tilde{k}(x, y) = \frac{1}{m} \sum_{j=1}^m z(v_j, x)\, z(v_j, y),$$

where the $v_j$ are drawn from $p$ (“plain RFF”) or from a data-dependent density $q$ (“weighted RFF”).
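
The Monte Carlo approximation can be checked numerically for the Gaussian kernel $k(x, y) = e^{-\gamma \|x - y\|^2}$, whose Bochner density $p(v)$ is $\mathcal{N}(0, 2\gamma I)$. A minimal NumPy sketch (the bandwidth $\gamma$ and feature count $m$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, gamma = 5, 2000, 0.5  # input dimension, feature count, kernel bandwidth

# Bochner density for k(x, y) = exp(-gamma * ||x - y||^2) is N(0, 2*gamma*I)
V = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))  # frequencies v_j ~ p(v)
b = rng.uniform(0, 2 * np.pi, size=m)                  # phases b_j ~ U[0, 2*pi]

def features(X):
    # z(v, x) = sqrt(2) cos(v^T x + b), scaled by 1/sqrt(m) so that the
    # inner product of feature vectors is the Monte Carlo average tilde-k
    return np.sqrt(2.0 / m) * np.cos(X @ V.T + b)

X = rng.normal(size=(10, d))
K_true = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))
K_rff = features(X) @ features(X).T                    # tilde-k(x_i, x_j)
max_err = np.abs(K_true - K_rff).max()                 # shrinks like O(1/sqrt(m))
```

The entrywise error decays at the Monte Carlo rate, so doubling $m$ reduces it by roughly $\sqrt{2}$.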

In the context of empirical risk minimization with convex loss $l(y, f(x))$ and Tikhonov regularization parameter $\lambda$, RRFF reduces infinite-dimensional kernel regression/classification to finite-dimensional linear problems:

$$\min_{\beta \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n l\big(y_i, \tilde{k}(x_i, \cdot)^T \beta\big) + \lambda \|\beta\|_2^2.$$
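
For squared error loss, this finite-dimensional problem has the closed-form normal-equations solution $\beta = (Z^T Z + n\lambda I)^{-1} Z^T y$, where $Z$ is the random feature matrix. A toy NumPy sketch (Gaussian kernel; data, $\lambda$, and $m$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, lam, gamma = 200, 3, 300, 1e-2, 0.5

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)         # noisy smooth target

V = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))  # v_j ~ p(v), Gaussian kernel
b = rng.uniform(0, 2 * np.pi, size=m)
Z = np.sqrt(2.0 / m) * np.cos(X @ V.T + b)             # feature matrix, shape (n, m)

# ridge solution of (1/n)||y - Z beta||^2 + lam ||beta||^2
beta = np.linalg.solve(Z.T @ Z + n * lam * np.eye(m), Z.T @ y)
train_mse = np.mean((Z @ beta - y) ** 2)
```

Only an $m \times m$ system is solved, instead of the $n \times n$ system of full kernel ridge regression.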

2. Risk Guarantees and Sample Complexity

Rigorous non-asymptotic risk bounds under both squared error and general Lipschitz loss have been established for RRFF (Li et al., 2018, Li, 2021). The key complexity control parameter is the effective degrees of freedom of the kernel,

$$d_K^\lambda = \mathrm{Tr}\left[K(K + n\lambda I)^{-1}\right] = \sum_{i=1}^n \frac{\lambda_i}{\lambda_i + n\lambda},$$

where $\{\lambda_i\}$ are the eigenvalues of $K$.
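
The trace and eigenvalue forms of $d_K^\lambda$ agree, which can be verified numerically (a small sketch with an illustrative Gaussian Gram matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma, lam = 100, 0.5, 1e-2
X = rng.normal(size=(n, 2))
K = np.exp(-gamma * np.sum((X[:, None] - X[None]) ** 2, axis=-1))  # Gram matrix

# eigenvalue form: sum_i lambda_i / (lambda_i + n*lam)
eigs = np.linalg.eigvalsh(K)
d_eff = np.sum(eigs / (eigs + n * lam))

# trace form: Tr[K (K + n*lam*I)^{-1}]
d_eff_trace = np.trace(K @ np.linalg.inv(K + n * lam * np.eye(n)))
```

Since each summand lies in $(0, 1)$, the quantity $d_K^\lambda$ is always between $0$ and $n$, shrinking as $\lambda$ grows.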

For kernel ridge regression (squared error loss), the excess risk of RRFF with $m$ features is, under mild moment assumptions, upper bounded by:

$$\mathbb{E}[(y - \tilde{f}^\lambda_m(x))^2] - \mathbb{E}[(y - f_{\mathcal{H}}(x))^2] \leq 4\lambda + O(n^{-1/2}) + \left(\mathbb{E}[(y - \hat{f}^\lambda(x))^2] - \mathbb{E}[(y - f_{\mathcal{H}}(x))^2]\right)$$

as soon as $m \gtrsim d_{\tilde{\ell}}\,\log(d_K^\lambda/\delta)$, where $d_{\tilde{\ell}}$ depends on the sampling density ($d_{\tilde{\ell}} = z_0^2/\lambda$ for plain RFF, $d_{\tilde{\ell}} = d_K^\lambda$ for leverage-weighted RFF).

Refined rates: fast spectral decay of $K$ allows the $O(n^{-1/2})$ term to be replaced by $O(n^{-1})$, $O((\log n)/n)$, or better (Li et al., 2018).

For general convex Lipschitz losses (e.g., SVM or logistic regression), the same structure holds, but the bias term $4\lambda$ is replaced by $O(\sqrt{\lambda})$, necessitating a smaller $\lambda$ to attain minimax rates (Li, 2021).

3. Feature Selection Schemes and Leverage Weighting

Two principal feature sampling regimes are studied (Li et al., 2018, Li, 2021):

  • Plain RFF: frequencies $v_j$ drawn i.i.d. from $p(v)$; the sample complexity is $m = O\big((z_0^2/\lambda)\log(d_K^\lambda/\delta)\big)$, which (with $\lambda \asymp n^{-1/2}$) is typically $O(\sqrt{n}\log n)$.
  • Leverage-Weighted RFF: Sampling frequencies proportional to empirical ridge leverage scores

$$l_\lambda(v) = p(v)\, z(v, x)^T (K + n\lambda I)^{-1} z(v, x),$$

normalized to $q^*(v) = l_\lambda(v)/d_K^\lambda$. This reduces the required number of features to $m = O(d_K^\lambda \log(d_K^\lambda/\delta))$, often $O(1)$ or $\operatorname{polylog}(n)$ in benign regimes.

Since computing the leverage scores $l_\lambda(v)$ exactly is computationally prohibitive, a fast two-stage approximation is used: a large pool of candidate features is sampled from $p(v)$, the $K$-based expressions are replaced by their empirical RFF proxies, and a smaller subset is then sampled according to these approximate leverage weights, retaining the same statistical guarantees (Li et al., 2018).

Pseudocode: Approximate Leverage-Weighted RFF

Input: Data {(x_i, y_i)}_{i=1}^n, kernel k, regularization λ,
       pool-size s, target m ≪ s.
1. Draw v_1,…,v_s ∼ p(v), build Z_s ∈ ℝ^{n×s}, Z_s[i,j] = z(v_j, x_i)
2. Compute A = Z_s^T Z_s, G = (A/s + nλ I_s)^{-1}
3. For j=1,…,s, set w_j = [A G]_{jj}
4. Normalize p_j = w_j / (∑_j w_j)
5. Sample m indices according to p_j, output features and weights
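
A runnable version of the pseudocode might look like the following NumPy sketch. The helper names `p_sampler` and `z_fn` are hypothetical placeholders for the Bochner density sampler and the feature map; the toy Gaussian-kernel usage is illustrative:

```python
import numpy as np

def approx_leverage_rff(X, z_fn, p_sampler, lam, s, m, rng):
    """Two-stage approximate leverage-weighted RFF sampling (steps 1-5)."""
    n = X.shape[0]
    V = p_sampler(s)                    # 1. candidate pool v_1..v_s ~ p(v)
    Z_s = z_fn(V, X)                    # Z_s[i, j] = z(v_j, x_i), shape (n, s)
    A = Z_s.T @ Z_s
    G = np.linalg.inv(A / s + n * lam * np.eye(s))  # 2. (A/s + n*lam*I)^{-1}
    w = np.diag(A @ G)                  # 3. approximate leverage scores
    p = w / w.sum()                     # 4. normalize to a distribution
    idx = rng.choice(s, size=m, p=p)    # 5. resample m features
    return V[idx], p[idx]

# toy usage with a Gaussian kernel (gamma is an illustrative bandwidth)
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
gamma = 0.5
p_sampler = lambda s: rng.normal(scale=np.sqrt(2 * gamma), size=(s, 2))
z_fn = lambda V, X: np.sqrt(2.0) * np.cos(
    X @ V.T + rng.uniform(0, 2 * np.pi, size=len(V)))
V_m, p_m = approx_leverage_rff(X, z_fn, p_sampler, lam=1e-2, s=200, m=20, rng=rng)
```

The returned sampling probabilities are needed downstream to importance-weight the selected features.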

4. Extensions: Frequency-Weighted Regularization and Operator Learning

Beyond ridge regularization, RRFF incorporates frequency-weighted penalties to suppress high-frequency noise, which is particularly beneficial for operator learning in Sobolev or Matérn spaces (Yu et al., 19 Dec 2025). Features can be drawn from heavy-tailed distributions (e.g., Student's $t$), and regularization weights may grow with frequency, e.g., $w(\omega) = \|\Sigma^{-1/2} \omega\|^p$ or $w(\omega) = 1 + \|\omega\|^p$.

This methodology extends to RRFF-FEM for operator learning, coupling finite-dimensional RRFF regression with finite element reconstruction to produce stable $H^s$-regular output functions. High-probability bounds on the singular values of the random feature matrix guarantee well-conditioning and generalization once $N = O(m\log m)$, with robustness to noise in practical PDE benchmarks (Yu et al., 19 Dec 2025).

5. Implicit Regularization and High-Dimensional Phenomena

RRFF with a finite number of features introduces implicit regularization beyond the explicit ridge penalty. In the Gaussian RFF model, with $P$ features and regularization $\lambda$, the average RF predictor matches a KRR predictor with effective ridge $\tilde{\lambda} > \lambda$, determined as the solution to

$$\tilde{\lambda} = \lambda + \frac{\tilde{\lambda}}{\gamma} \cdot \frac{1}{N} \sum_{i=1}^N \frac{d_i}{\tilde{\lambda} + d_i}, \qquad \gamma = P/N.$$

The gap $\tilde{\lambda} - \lambda$ vanishes as $P \to \infty$ but accounts for most of the finite-sample bias/variance trade-off (Jacot et al., 2020).
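
The effective ridge $\tilde{\lambda}$ has no closed form, but since the right-hand side of the self-consistent equation is increasing and bounded in $\tilde{\lambda}$, fixed-point iteration from $\tilde{\lambda} = \lambda$ converges monotonically. A sketch, where the $d_i$ are taken to be kernel eigenvalues and all values are illustrative:

```python
import numpy as np

def effective_ridge(lam, d, P, iters=500):
    """Iterate lam_t <- lam + (lam_t/gamma) * (1/N) * sum_i d_i/(lam_t + d_i),
    with gamma = P/N, starting from lam_t = lam (monotone convergence)."""
    d = np.asarray(d, dtype=float)
    gamma = P / len(d)
    lam_t = lam
    for _ in range(iters):
        lam_t = lam + (lam_t / gamma) * np.mean(d / (lam_t + d))
    return lam_t

d = np.linspace(0.1, 2.0, 50)         # illustrative kernel eigenvalues d_i
lam, P = 0.1, 100                     # explicit ridge and feature count
lam_eff = effective_ridge(lam, d, P)  # effective ridge, lam_eff > lam
```

As $P$ (hence $\gamma$) grows, the correction term shrinks and `lam_eff` approaches the explicit `lam`, matching the vanishing gap noted above.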

In the double-asymptotic regime (large $n$, $p$, $N$), phase transitions and double descent emerge. At the interpolation threshold ($2N = n$), the test error exhibits a singularity in the ridgeless limit ($\lambda \to 0$). For $2N < n$ (under-parameterized), train and test errors increase as $N$ decreases; for $2N > n$ (over-parameterized), errors descend again, with the best results in the mildly over-parameterized regime (Liao et al., 2020).

6. Computational Complexity and Practical Performance

RRFF reduces computational bottlenecks from $O(n^3)$ time and $O(n^2)$ memory (full kernel methods) to $O(ns^2 + s^3)$ time and $O(ns)$ memory, where $s$ is typically $O(\sqrt{n}\log n)$ or smaller with leverage weighting (Li et al., 2018, Li, 2021). Operator-learning benchmarks confirm reductions in both error and run time relative to unregularized RFF, with RRFF-FEM producing smoother, more stable reconstructions and lower error in noisy settings (Yu et al., 19 Dec 2025).

In classification settings with Lipschitz loss, plain RRFF achieves the minimax $O(1/\sqrt{n})$ risk with $O(\sqrt{n}\log n)$ features, with fast $O(1/n)$ rates under low-noise conditions, while leverage-weighted RRFF can achieve near-linear scaling in benign regimes (Li, 2021).
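
As an illustration of the Lipschitz-loss setting, random Fourier features can be plugged into $\ell_2$-regularized logistic regression and trained by plain gradient descent. A toy NumPy sketch; the data, bandwidth, step size, and iteration count are arbitrary choices, not taken from the cited analyses:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m, lam, gamma = 400, 2, 100, 1e-3, 0.5
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)     # toy +/-1 labels

V = rng.normal(scale=np.sqrt(2 * gamma), size=(m, d))
b = rng.uniform(0, 2 * np.pi, size=m)
Z = np.sqrt(2.0 / m) * np.cos(X @ V.T + b)         # random feature matrix

# gradient descent on (1/n) sum_i log(1 + exp(-y_i z_i^T beta)) + lam ||beta||^2
beta = np.zeros(m)
for _ in range(1000):
    s = 1.0 / (1.0 + np.exp(y * (Z @ beta)))       # sigmoid of negative margin
    grad = -(Z.T @ (y * s)) / n + 2 * lam * beta
    beta -= 10.0 * grad
accuracy = np.mean(np.sign(Z @ beta) == y)
```

The logistic loss is 1-Lipschitz, so this falls under the general convex Lipschitz-loss guarantees discussed above.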

7. Summary and Typical Parameter Choices

RRFF unites kernel regularization and random feature approximation. Selecting

$$m \approx d_K^\lambda \log d_K^\lambda$$

features suffices to match the statistical performance of full-kernel methods, and leverage weighting further reduces mm by directly targeting the effective degrees of freedom. Frequency weighting and operator reconstruction via finite elements enable robust, scalable learning of function-to-function maps with stability in high-noise and high-dimensional regimes.

| Feature Selection | Sample Complexity | Computational Cost |
|---|---|---|
| Plain RFF | $O(\sqrt{n}\log n)$ | $O(ns^2 + s^3)$ |
| Leverage-weighted RFF | $O(d_K^\lambda \log d_K^\lambda)$ | $O(ns^2 + s^3)$ |
| RRFF-FEM (operator learning) | $O(m\log m)$ | $O(m^3)$ (FE solve) |

RRFF methods have established theoretical risk guarantees, efficient algorithms for approximating leverage scores, phase-aware regularization effects, and demonstrated empirical superiority in regression, classification, and operator learning across a broad range of settings (Li et al., 2018, Li, 2021, Jacot et al., 2020, Yu et al., 19 Dec 2025, Liao et al., 2020).
