Regularized Random Fourier Features (RRFF)
- RRFF are advanced kernel approximations that integrate explicit regularization with data-dependent feature selection for efficient, statistically sound learning.
- The method balances kernel regularization and feature truncation via plain or leverage-weighted sampling to reduce sample complexity and achieve minimax rates.
- RRFF extends to operator learning and high-dimensional regimes, revealing phenomena such as implicit regularization, double descent, and stable reconstructions.
Regularized Random Fourier Features (RRFF) are a class of random feature approximations for kernel methods that incorporate explicit regularization and data-dependent feature selection to achieve sharp statistical guarantees and significant reductions in computational cost relative to classical kernel learning. RRFF systematically balances kernel regularization with random feature truncation—using the number and distribution of features, frequency-weighted penalties, and potentially empirical (leverage-score) weighting—yielding algorithms that attain minimax learning rates with dramatically fewer features, provide robust operator generalization in the presence of noise, and reveal new phenomena such as implicit regularization and double descent in high-dimensional regimes.
1. Mathematical Foundations and Problem Setup
RRFF begins with a shift-invariant positive definite kernel $k(x, x') = \kappa(x - x')$ on $\mathbb{R}^d$, possessing a Bochner spectral representation:

$$k(x, x') = \int_{\mathbb{R}^d} p(v)\, z(v, x)\, \overline{z(v, x')}\, \mathrm{d}v,$$
where $p(v)$ is a probability density over $\mathbb{R}^d$, and $z$ is typically trigonometric (e.g., $z(v, x) = e^{\mathrm{i} v^\top x}$, or $z(v, x) = \sqrt{2}\cos(v^\top x + b)$ with a uniform random phase $b$ for real-valued features). Observing i.i.d. data pairs $\{(x_i, y_i)\}_{i=1}^n$ drawn from a distribution $\rho$ and denoting by $K \in \mathbb{R}^{n \times n}$ the Gram matrix $K_{ij} = k(x_i, x_j)$, RRFF replaces $K$ by a Monte Carlo average:

$$\tilde{K} = \frac{1}{m} Z Z^\top, \qquad Z_{ij} = z(v_j, x_i),$$
where $v_1, \dots, v_m$ are drawn i.i.d. from $p(v)$ (“plain RFF”) or from a data-dependent density $q(v)$ (“weighted RFF”).
In the context of empirical risk minimization with a convex loss $\ell$ and Tikhonov regularization parameter $\lambda > 0$, RRFF reduces infinite-dimensional kernel regression/classification to a finite-dimensional linear problem:

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^m} \frac{1}{n} \sum_{i=1}^{n} \ell\big(\beta^\top z(x_i),\, y_i\big) + \lambda \|\beta\|_2^2, \qquad z(x_i) = \frac{1}{\sqrt{m}}\big(z(v_1, x_i), \dots, z(v_m, x_i)\big)^\top.$$
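As a concrete illustration, here is a minimal NumPy sketch of the plain-RFF pipeline for the Gaussian kernel $k(x, x') = e^{-\gamma \|x - x'\|^2}$; the kernel choice, bandwidth $\gamma$, and problem sizes are illustrative assumptions, not choices taken from the cited papers.

```python
import numpy as np

def rff_features(X, m, gamma, rng):
    """Map X (n x d) to m random Fourier features for the Gaussian
    kernel k(x, x') = exp(-gamma * ||x - x'||^2).

    By Bochner's theorem, the spectral density p(v) of this kernel
    is Gaussian with covariance 2 * gamma * I.
    """
    n, d = X.shape
    V = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))  # frequencies v_j ~ p(v)
    b = rng.uniform(0.0, 2 * np.pi, size=m)                # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ V + b)            # z(x) in R^m

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Z = rff_features(X, m=2000, gamma=0.5, rng=rng)

# Monte Carlo approximation of the Gram matrix: K ~ Z Z^T
K_approx = Z @ Z.T
K_exact = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
err = np.abs(K_approx - K_exact).max()

# Finite-dimensional ridge problem: beta = (Z^T Z + n*lam*I)^{-1} Z^T y
y = np.sin(X[:, 0])
lam = 1e-3
n = X.shape[0]
beta = np.linalg.solve(Z.T @ Z + n * lam * np.eye(Z.shape[1]), Z.T @ y)
```

With $m = 2000$ features the entrywise Monte Carlo error of $\tilde{K}$ is on the order of $m^{-1/2}$, and the ridge system is solved in the $m$-dimensional feature space rather than the $n$-dimensional kernel space.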
2. Risk Guarantees and Sample Complexity
Rigorous non-asymptotic risk bounds under both squared error and general Lipschitz losses have been established for RRFF (Li et al., 2018, Li, 2021). The key complexity control parameter is the effective degrees of freedom of the kernel,

$$N(\lambda) = \operatorname{Tr}\big(K (K + n\lambda I_n)^{-1}\big) = \sum_{i=1}^{n} \frac{\mu_i}{\mu_i + n\lambda},$$

where $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n \ge 0$ are the eigenvalues of $K$.
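For intuition, $N(\lambda)$ is cheap to evaluate once the spectrum of $K$ is known. A short sketch, assuming the eigenvalue form $N(\lambda) = \sum_i \mu_i / (\mu_i + n\lambda)$ and an illustrative Gaussian kernel:

```python
import numpy as np

def effective_dof(K, lam):
    """Effective degrees of freedom N(lam) = Tr(K (K + n*lam*I)^{-1})."""
    n = K.shape[0]
    mu = np.linalg.eigvalsh(K)          # eigenvalues of the Gram matrix
    return float(np.sum(mu / (mu + n * lam)))

# N(lam) interpolates between n (lam -> 0) and 0 (lam -> infinity),
# decreasing monotonically in lam.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
dof = [effective_dof(K, lam) for lam in (1e-6, 1e-3, 1e-1)]
```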
For kernel ridge regression (squared error loss), the excess risk of RRFF with $m$ features is, under mild moment assumptions, upper bounded by:

$$\mathbb{E}\big[R(\hat{f}_{m,\lambda})\big] - R(f_{\mathcal{H}}) \;\lesssim\; \lambda + \frac{N(\lambda)}{n},$$

as soon as $m \gtrsim c_\lambda \log N(\lambda)$, where $c_\lambda$ depends on the sampling density ($c_\lambda \asymp \lambda^{-1}$ for plain RFF, $c_\lambda \asymp N(\lambda)$ for leverage-weighted RFF).
Refined rates: fast spectral decay of the eigenvalues of $K$ yields smaller $N(\lambda)$, permitting larger $\lambda$ and hence fewer features while retaining optimal rates, or even faster-than-$O(n^{-1/2})$ excess risk (Li et al., 2018).
For general convex Lipschitz losses (e.g., the SVM hinge loss or logistic loss), the same structure holds, but the bias term $\lambda$ is replaced by $\sqrt{\lambda}$, necessitating a smaller $\lambda$ to attain minimax rates (Li, 2021).
3. Feature Selection Schemes and Leverage Weighting
Two principal feature sampling regimes are studied (Li et al., 2018, Li, 2021):
- Plain RFF: Frequencies drawn i.i.d. from the spectral density $p(v)$; the required number of features is $m = \Omega(\lambda^{-1} \log N(\lambda))$, which (with $\lambda \asymp n^{-1/2}$) is typically $O(\sqrt{n} \log n)$.
- Leverage-Weighted RFF: Sampling frequencies proportional to empirical ridge leverage scores

  $$l_\lambda(v) = p(v)\, z_v^\top (K + n\lambda I_n)^{-1} z_v, \qquad z_v = \big(z(v, x_1), \dots, z(v, x_n)\big)^\top,$$

  normalized to a probability density $q^*(v) = l_\lambda(v)/N(\lambda)$ (the scores integrate to $N(\lambda)$). This reduces the required features to $m = \Omega(N(\lambda) \log N(\lambda))$, which is often $o(\sqrt{n})$, or even polylogarithmic in $n$, in benign regimes.
Since computing leverage scores exactly is computationally prohibitive, a fast two-stage approximation is used: a large pool of $s$ candidate features is sampled from $p(v)$, the kernel-matrix expressions above are replaced by their empirical RFF proxies, and a smaller subset of $m$ features is then sampled using these approximate leverage weights, retaining the same statistical guarantees (Li et al., 2018).
Pseudocode: Approximate Leverage-Weighted RFF
```text
Input: data {(x_i, y_i)}_{i=1}^n, kernel k, regularization λ,
       pool size s, target m ≪ s

1. Draw v_1, …, v_s ∼ p(v); build Z_s ∈ ℝ^{n×s} with Z_s[i, j] = z(v_j, x_i)
2. Compute A = Z_s^T Z_s and G = (A/s + nλ I_s)^{-1}
3. For j = 1, …, s, set w_j = [A G]_{jj}
4. Normalize: p_j = w_j / (∑_{j'} w_{j'})
5. Sample m indices according to {p_j}; output the corresponding
   features and sampling weights
```
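A minimal NumPy sketch of this two-stage procedure for the Gaussian kernel; the kernel choice, bandwidth `gamma`, and all sizes below are illustrative assumptions.

```python
import numpy as np

def approx_leverage_rff(X, m, s, lam, gamma, rng):
    """Two-stage approximate leverage-weighted RFF for the Gaussian
    kernel exp(-gamma * ||x - x'||^2)."""
    n, d = X.shape
    # Stage 1: pool of s plain-RFF frequencies from the spectral density p(v)
    V = rng.normal(scale=np.sqrt(2 * gamma), size=(d, s))
    b = rng.uniform(0.0, 2 * np.pi, size=s)
    Zs = np.sqrt(2.0) * np.cos(X @ V + b)              # n x s feature pool
    # Stage 2: approximate ridge leverage scores computed from the pool
    A = Zs.T @ Zs
    G = np.linalg.inv(A / s + n * lam * np.eye(s))
    w = np.clip(np.diag(A @ G), 0.0, None)             # w_j = [A G]_{jj}
    p = w / w.sum()
    # Sample m of the s candidates with the leverage-based weights
    idx = rng.choice(s, size=m, replace=True, p=p)
    return V[:, idx], b[idx], p[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
V_sel, b_sel, p_sel = approx_leverage_rff(X, m=20, s=200, lam=1e-3,
                                          gamma=0.5, rng=rng)
```

The $s \times s$ inverse in stage 2 costs $O(s^3 + n s^2)$, which is affordable because it is paid once on the pool rather than on the full kernel matrix.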
4. Extensions: Frequency-Weighted Regularization and Operator Learning
Beyond ridge regularization, RRFF incorporates frequency-weighted penalties to suppress high-frequency noise, particularly beneficial for operator learning in Sobolev or Matérn spaces (Yu et al., 19 Dec 2025). Features can be drawn from heavy-tailed distributions (e.g., Student's $t$), and regularization weights may grow with frequency, either polynomially or exponentially in the frequency magnitude $\|v\|$.
This methodology extends to RRFF-FEM for operator learning, coupling the finite-dimensional RRFF regression with finite element reconstruction to produce stable, Sobolev-regular output functions. High-probability bounds on the singular values of the random feature matrix guarantee well-conditioning and generalization once the number of features scales suitably with the number of samples, with robustness to noise in practical PDE benchmarks (Yu et al., 19 Dec 2025).
5. Implicit Regularization and High-Dimensional Phenomena
RRFF with a finite number of features introduces implicit regularization beyond the explicit ridge penalty. In the Gaussian RFF model, for $m$ features and regularization $\lambda$, the average RF predictor matches a KRR predictor with effective ridge $\tilde{\lambda} > \lambda$, determined as the unique positive solution to

$$\tilde{\lambda} = \lambda + \frac{\tilde{\lambda}}{m} \sum_{i=1}^{n} \frac{\mu_i}{\mu_i + \tilde{\lambda}},$$

where $\mu_i$ are the eigenvalues of the kernel Gram matrix.
The gap $\tilde{\lambda} - \lambda$ vanishes as $m \to \infty$ but accounts for most of the finite-sample bias/variance trade-off (Jacot et al., 2020).
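The effective ridge is easy to compute numerically by fixed-point iteration. The sketch below assumes the self-consistent form $\tilde{\lambda} = \lambda + (\tilde{\lambda}/m)\sum_i \mu_i/(\mu_i + \tilde{\lambda})$ and an illustrative spectrum:

```python
import numpy as np

def effective_ridge(mu, lam, m, iters=200):
    """Fixed-point iteration for the effective ridge lt satisfying
    lt = lam + (lt / m) * sum_i mu_i / (mu_i + lt)."""
    lt = lam + 1.0                        # any positive initialization works
    for _ in range(iters):
        lt = lam + (lt / m) * np.sum(mu / (mu + lt))
    return float(lt)

mu = np.array([1.0, 0.5, 0.1, 0.01])      # illustrative kernel spectrum
lam = 0.05
# The effective ridge exceeds lam for finite m and shrinks toward lam
# as the number of features m grows.
lt_vals = [effective_ridge(mu, lam, m) for m in (4, 40, 400)]
```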
In the double-asymptotic regime (with the sample size $n$, feature count $N$, and dimension $d$ all large and comparable), phase transitions and double descent emerge. At the interpolation threshold ($2N = n$), the test error exhibits a singularity in the ridgeless regime ($\lambda \to 0$). For $2N < n$ (under-parameterized), test error grows as $2N$ approaches $n$; for $2N > n$ (over-parameterized), errors descend again, with the best results in the mildly over-parameterized regime (Liao et al., 2020).
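A small simulation reproduces the qualitative picture: ridgeless (minimum-norm) RFF regression spikes near $2N = n$ and improves again in the over-parameterized regime. Everything below (target function, noise level, sizes) is an illustrative assumption:

```python
import numpy as np

def ridgeless_rff_mse(N, n=100, n_test=200, d=5, trials=5, noise=0.5):
    """Mean test MSE of min-norm RFF regression with N frequency draws
    (2N features: cos and sin columns), averaged over random trials."""
    errs = []
    for t in range(trials):
        rng = np.random.default_rng(t)
        Xtr = rng.normal(size=(n, d))
        Xte = rng.normal(size=(n_test, d))
        ytr = np.sin(Xtr[:, 0]) + noise * rng.normal(size=n)
        yte = np.sin(Xte[:, 0])
        V = rng.normal(size=(d, N))
        Ztr = np.hstack([np.cos(Xtr @ V), np.sin(Xtr @ V)]) / np.sqrt(N)
        Zte = np.hstack([np.cos(Xte @ V), np.sin(Xte @ V)]) / np.sqrt(N)
        beta = np.linalg.lstsq(Ztr, ytr, rcond=None)[0]  # min-norm solution
        errs.append(np.mean((Zte @ beta - yte) ** 2))
    return float(np.mean(errs))

err_under = ridgeless_rff_mse(N=12)    # 2N = 24  << n: under-parameterized
err_interp = ridgeless_rff_mse(N=50)   # 2N = 100 = n: interpolation threshold
err_over = ridgeless_rff_mse(N=400)    # 2N = 800 >> n: over-parameterized
```

The spike at the threshold is driven by the near-singularity of the square feature matrix, which inflates the variance of the min-norm interpolant.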
6. Computational Complexity and Practical Performance
RRFF reduces computational bottlenecks from $O(n^3)$ time and $O(n^2)$ memory (full kernel methods) to $O(n m^2)$ time and $O(n m)$ memory, where $m$ is typically $O(\sqrt{n} \log n)$ or smaller with leverage weighting (Li et al., 2018, Li, 2021). Operator learning benchmarks confirm reductions in both error and run time relative to unregularized RFF, with RRFF-FEM producing smoother, more stable reconstructions with lower error in noisy settings (Yu et al., 19 Dec 2025).
In classification settings with Lipschitz losses, plain RRFF achieves minimax risk with $O(\sqrt{n} \log n)$ features, with fast rates under low-noise conditions, while leverage-weighted RRFF can achieve near-linear scaling in benign regimes (Li, 2021).
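As an illustration of the Lipschitz-loss setting, the sketch below fits $\ell_2$-regularized logistic regression on plain RFF features by gradient descent; the toy data, kernel bandwidth, and step size are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 400, 2, 100
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)    # toy labels in {-1, +1}

# Plain RFF features for the Gaussian kernel exp(-0.5 * ||x - x'||^2)
V = rng.normal(size=(d, m))
b = rng.uniform(0.0, 2 * np.pi, size=m)
Z = np.sqrt(2.0 / m) * np.cos(X @ V + b)

# Logistic loss (1-Lipschitz in the margin) plus ridge penalty,
# minimized by plain gradient descent.
lam, lr = 1e-4, 4.0
beta = np.zeros(m)
for _ in range(500):
    margins = y * (Z @ beta)
    sigma = 1.0 / (1.0 + np.exp(margins))          # loss-derivative weights
    grad = -(Z * (y * sigma)[:, None]).mean(axis=0) + 2 * lam * beta
    beta -= lr * grad
acc = np.mean(np.sign(Z @ beta) == y)              # training accuracy
```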
7. Summary and Typical Parameter Choices
RRFF unites kernel regularization and random feature approximation. Selecting

$$m \gtrsim \lambda^{-1} \log N(\lambda) \ \ \text{(plain)} \qquad \text{or} \qquad m \gtrsim N(\lambda) \log N(\lambda) \ \ \text{(leverage-weighted)}$$

features suffices to match the statistical performance of full-kernel methods, and leverage weighting further reduces $m$ by directly targeting the effective degrees of freedom. Frequency weighting and operator reconstruction via finite elements enable robust, scalable learning of function-to-function maps with stability in high-noise and high-dimensional regimes.
| Feature Selection | Sample Complexity | Computational Cost |
|---|---|---|
| Plain RFF | $m = \Omega(\lambda^{-1} \log N(\lambda))$ | $O(nm^2)$ time, $O(nm)$ memory |
| Leverage-Weighted RFF | $m = \Omega(N(\lambda) \log N(\lambda))$ | $O(ns^2)$ pool stage, then $O(nm^2)$ |
| RRFF-FEM (operator learn.) | $m$ scaling suitably with $n$ | $O(nm^2)$ + FE solve |
RRFF methods come with established theoretical risk guarantees, efficient algorithms for approximating leverage scores, and a precise account of implicit-regularization and phase-transition effects, and they have demonstrated strong empirical performance in regression, classification, and operator learning across a broad range of settings (Li et al., 2018, Li, 2021, Jacot et al., 2020, Yu et al., 19 Dec 2025, Liao et al., 2020).