
Powerlaw Random Feature Model

Updated 6 February 2026
  • The model is a framework for high-dimensional random feature regression that employs power-law decay in feature spectra and target weights.
  • It provides non-asymptotic, dimension-free risk formulas that delineate the trade-offs among sample complexity, model size, and regularization.
  • The framework prescribes optimal learning rate schedules and training protocols for both ridge regression and SGD, validated by empirical studies.

The powerlaw random feature model is a framework for analyzing high-dimensional random feature regression schemes in which the spectrum of the feature covariance operator and the target function weights both exhibit power-law decay. This model enables a rigorous characterization of generalization rates, test errors, and optimal training protocols for both ridge regression and stochastic gradient descent (SGD), yielding non-asymptotic, dimension-free, closed-form scaling laws. These analyses reveal precise phase diagrams for generalization, optimal trade-offs between sample complexity, model size, and regularization, as well as compute-optimal and training-optimal schedules under resource constraints (Defilippis et al., 2024, Bordelon et al., 4 Feb 2026).

1. Structure and Assumptions of the Powerlaw Random Feature Model

The model considers regression in a Hilbert feature space $\mathcal{H} = L_2(\mu_x)$, either finite- or infinite-dimensional, with a feature-integral operator $K$ possessing eigenpairs $(\xi_k, \psi_k)$. The eigenvalues $\lambda_k = \xi_k^2$ of $K$ are assumed to decay as a power law,

$$\lambda_k \propto k^{-\alpha}, \quad \alpha > 1$$

(the parameter $b$ is used in alternate notation). The regression target $f_\star$ decomposes as

$$f_\star = \sum_{k \geq 1} \beta_k \psi_k(\cdot)$$

with coefficient decay $\beta_k \propto k^{-(\beta+1/2)}$ and a source exponent $\beta > 0$.

For random feature regression, $n$ i.i.d. samples $(x_i, y_i)$ are drawn, with $y_i = f_\star(x_i) + \epsilon_i$, and the random feature map is

$$z_j(x) = p^{-1/2}\,\varphi(x, w_j), \quad j = 1, \dots, p, \quad w_j \sim \mu_w.$$

The model is analyzed both for ridge regression with finite $p$ and in the context of SGD, covering both learning rate schedules and batch-size protocols (Defilippis et al., 2024, Bordelon et al., 4 Feb 2026).
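As a concrete instantiation of this setup, the sketch below draws a synthetic powerlaw problem using a Gaussian surrogate for the eigenfunctions $\psi_k$; the truncation level `M`, the i.i.d. Gaussian feature weights, and all constants are illustrative assumptions rather than the papers' exact construction.

```python
import numpy as np

def sample_prf_dataset(n, p, alpha=2.0, beta=1.0, sigma=0.1, M=1000, seed=0):
    """Draw a synthetic powerlaw random feature regression problem.

    Gaussian surrogate: psi_k(x_i) ~ N(0, 1) i.i.d., spectrum xi_k^2 = k^{-alpha},
    target coefficients beta_k = k^{-(beta + 1/2)}.  (Illustrative, truncated at M.)
    """
    rng = np.random.default_rng(seed)
    k = np.arange(1, M + 1)
    xi = k ** (-alpha / 2.0)          # xi_k, so lambda_k = xi_k^2 = k^{-alpha}
    b = k ** (-(beta + 0.5))          # target weights beta_k

    psi = rng.standard_normal((n, M))              # psi_k(x_i) for each sample
    y = psi @ b + sigma * rng.standard_normal(n)   # y_i = f_star(x_i) + eps_i

    # Random feature map z_j(x) = p^{-1/2} sum_k xi_k psi_k(x) W_{kj}
    W = rng.standard_normal((M, p))
    Z = (psi * xi) @ W / np.sqrt(p)   # n x p random feature design
    return Z, y

Z, y = sample_prf_dataset(n=200, p=50)
print(Z.shape, y.shape)   # (200, 50) (200,)
```

A ridge estimate on this design is then `np.linalg.solve(Z.T @ Z + lam * np.eye(50), Z.T @ y)`.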

2. Deterministic Equivalent Test Error: Non-Asymptotic, Dimension-Free Risk Formulas

The excess risk for random feature ridge regression (RFRR) is given by

$$R(n, p, \lambda) = \mathbb{E}_x\big[(f_\star(x) - \hat{f}_\lambda(x))^2\big] = \text{Bias} + \text{Variance}$$

Under a concentration condition on the random features (Assumption 3.1), the risk admits a dimension-free deterministic equivalent:

$$R(n, p, \lambda) = \widehat{R}_{n,p} + O\big((n^{-1/2} + p^{-1/2})\,\widehat{R}_{n,p}\big)$$

where $\widehat{R}_{n,p}$ depends only on the feature spectrum $\{\xi_k^2\}$, the target weights $\{\beta_k\}$, the regularization parameter $\lambda$, $n$, and $p$. The closed form is:

  • Solve for $\nu_2 > 0$ via

$$1 + \frac{n}{p} - \sqrt{\Big(1 - \frac{n}{p}\Big)^2 + \frac{4\lambda}{p\nu_2}} = \frac{2}{p} \sum_{k=1}^\infty \frac{\xi_k^2}{\xi_k^2 + \nu_2}$$

  • Set

$$\nu_1 = \frac{\nu_2}{2} \left[1 - \frac{n}{p} + \sqrt{\Big(1 - \frac{n}{p}\Big)^2 + \frac{4\lambda}{p\nu_2}}\,\right]$$

  • Define

$$U = \frac{p}{n}\left[\Big(1 - \frac{\nu_1}{\nu_2}\Big)^2 + \Big(\frac{\nu_1}{\nu_2}\Big)^2 \chi(\nu_2)\right], \qquad \chi(\nu_2) = \frac{\sum_k \xi_k^4/(\xi_k^2 + \nu_2)^2}{p - \sum_k \xi_k^4/(\xi_k^2 + \nu_2)^2}$$

  • Bias and variance contributions:

\begin{align*}
\widehat{\text{Bias}} &= \frac{\nu_2^2}{1-U} \left[\sum_k \frac{\beta_k^2}{(\xi_k^2+\nu_2)^2} + \chi(\nu_2) \sum_k \frac{\beta_k^2}{(\xi_k^2+\nu_2)^2}\right] \\
\widehat{\text{Var}} &= \sigma^2\, \frac{U}{1-U}
\end{align*}

So $\widehat{R}_{n,p} = \widehat{\text{Bias}} + \widehat{\text{Var}}$. This deterministic equivalent is non-asymptotic (no large-sample limit required), multiplicative (the relative error is controlled), and dimension-free (applicable regardless of the ambient or effective feature dimension) (Defilippis et al., 2024).
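The four-step recipe above can be evaluated numerically. The sketch below truncates the spectrum at a finite `M` and solves the $\nu_2$ fixed point by bisection in log scale; the bracket, iteration count, and truncation are illustrative assumptions.

```python
import numpy as np

def deterministic_risk(n, p, lam, xi2, beta2, sigma2):
    """Evaluate the deterministic equivalent R_hat = Bias_hat + Var_hat.

    xi2: truncated spectrum {xi_k^2}; beta2: squared target weights {beta_k^2}.
    """
    def gap(nu2):
        # LHS - RHS of the nu_2 fixed-point equation (increasing in nu2)
        lhs = 1 + n / p - np.sqrt((1 - n / p) ** 2 + 4 * lam / (p * nu2))
        rhs = (2 / p) * np.sum(xi2 / (xi2 + nu2))
        return lhs - rhs

    lo, hi = 1e-12, 1e12            # bracket with gap(lo) < 0 < gap(hi)
    for _ in range(200):
        mid = np.sqrt(lo * hi)      # geometric bisection over many scales
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    nu2 = np.sqrt(lo * hi)

    nu1 = 0.5 * nu2 * (1 - n / p + np.sqrt((1 - n / p) ** 2 + 4 * lam / (p * nu2)))
    s4 = np.sum(xi2 ** 2 / (xi2 + nu2) ** 2)
    chi = s4 / (p - s4)
    U = (p / n) * ((1 - nu1 / nu2) ** 2 + (nu1 / nu2) ** 2 * chi)

    S = np.sum(beta2 / (xi2 + nu2) ** 2)
    bias = nu2 ** 2 / (1 - U) * (S + chi * S)
    var = sigma2 * U / (1 - U)
    return bias + var, U

# Powerlaw instance: alpha = 2, beta = 1 (so beta_k^2 = k^{-3}), noise 0.01
k = np.arange(1, 2001)
risk, U = deterministic_risk(n=1000, p=100, lam=1e-2,
                             xi2=k ** -2.0, beta2=k ** -3.0, sigma2=0.01)
```

In the underparameterized regime shown ($n \gg p$, $\lambda > 0$) the self-consistency parameter satisfies $0 < U < 1$, so the formula is well defined.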

3. Sharp Scaling Laws and Minimax Rates Under Powerlaw Decay

When the power-law assumptions are imposed on both spectrum and target coefficients,

$$\xi_k^2 = C k^{-\alpha}, \quad \beta_k = k^{-(\beta+1/2)}$$

and setting $p = n^q$, $\lambda = n^{-(\ell-1)}$ for $q, \ell \geq 0$, explicit scaling exponents for the risk are derived,

$$\widehat{R}_{n,p} = \Theta\left(n^{-\gamma_B(\ell, q)} + \sigma^2 n^{-\gamma_V(\ell, q)}\right),$$

where

$$\gamma_V(\ell, q) = 1 - \min\{q, \ell/\alpha, 1\}$$

and

$$\gamma_B(\ell, q) = \min\left\{2\alpha \cdot \min(q, \ell/\alpha, 1) \cdot \min(\beta, 1),\; \big[2\alpha \cdot \min(\beta, 1/2) - 1\big] \cdot \min(q, \ell/\alpha, 1) + q\right\}.$$

The overall risk exponent is $\gamma(\ell, q) = \min\{\gamma_B, \gamma_V\}$.

The minimax-optimal (fastest) rate

$$\gamma^* = \frac{2\alpha \min(\beta, 1)}{2\alpha \min(\beta, 1) + 1}$$

is achieved by

$$\ell^* = \frac{\alpha}{2\alpha \min(\beta, 1) + 1}, \qquad q^* = 1 - (2\beta \wedge 1)\,\ell^*$$

implying that the minimal number of random features needed to attain the minimax rate is $p^* = n^{q^*}$, with regularization $\lambda^* = n^{-(\ell^*-1)}$ (Defilippis et al., 2024).
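The exponent formulas reduce to elementary arithmetic, so the prescribed $(\ell^*, q^*)$ can be checked against $\gamma^*$ directly; the helper below is a minimal sketch of that bookkeeping.

```python
def gamma_V(ell, q, alpha):
    """Variance exponent: 1 - min{q, ell/alpha, 1}."""
    return 1 - min(q, ell / alpha, 1)

def gamma_B(ell, q, alpha, beta):
    """Bias exponent from the two competing terms."""
    m = min(q, ell / alpha, 1)
    return min(2 * alpha * m * min(beta, 1),
               (2 * alpha * min(beta, 0.5) - 1) * m + q)

def gamma(ell, q, alpha, beta):
    """Overall risk exponent gamma(ell, q) = min{gamma_B, gamma_V}."""
    return min(gamma_B(ell, q, alpha, beta), gamma_V(ell, q, alpha))

def minimax(alpha, beta):
    """Minimax rate gamma* and the optimal (ell*, q*) attaining it."""
    g_star = 2 * alpha * min(beta, 1) / (2 * alpha * min(beta, 1) + 1)
    ell_star = alpha / (2 * alpha * min(beta, 1) + 1)
    q_star = 1 - min(2 * beta, 1) * ell_star
    return g_star, ell_star, q_star

g, l, q = minimax(alpha=2.0, beta=1.0)
print(g, gamma(l, q, 2.0, 1.0))   # the two values coincide: (l*, q*) attains gamma*
```

For $\alpha = 2$, $\beta = 1$ this gives $\gamma^* = 4/5$ with $\ell^* = 0.4$ and $q^* = 0.6$, i.e. $p^* = n^{0.6}$ features suffice.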

4. SGD Dynamics, Optimal Learning Rate Scheduling, and Training Phases

In SGD-based training of powerlaw random feature regression, the evolution of the mean-square error in each spectral coordinate is tracked, leading to a continuous-time optimal control formulation for both the learning rate $\eta(t)$ and the batch size $m(t)$. Two distinct regimes (phases) emerge:

  • Easy phase ($b < a$): The optimal learning rate schedule is a polynomial decay,

$$\eta_T^*(t) = T^{-\xi} (1 - t/T)^{\delta}, \quad \xi = 1 - \frac{b}{a}, \quad \delta = 2b - 1$$

and the excess loss decays as $L_T - \sigma^2 \sim T^{-(a-1)/a}$.

  • Hard phase ($b > a$): The optimal schedule exhibits a warmup–stable–decay shape,

$$\eta_T^*(t) = \begin{cases} \eta_{\max}, & 0 \leq t < t_s \\ \eta_{\max} \left( \dfrac{1 - t/T}{1 - t_s/T} \right)^{2b-1}, & t_s \leq t \leq T \end{cases}$$

where $1 - t_s/T \sim T^{-(b-a)/(2b-1)}$, allocating most of the training budget to a fixed learning rate and a vanishing fraction to annealing. Here, $L_T - \sigma^2 \sim T^{-(a-1)/b}$. The optimal batch size follows a schedule driven by the same variational principle (Bordelon et al., 4 Feb 2026).

These schedules outperform constant or simple power-law learning rate protocols, and the optimal exponents are not attainable by “anytime” policies that ignore the training horizon (Bordelon et al., 4 Feb 2026).
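A minimal sketch of the two schedules, assuming an illustrative $\eta_{\max}$ and fixing the undetermined constant in $t_s$ to one:

```python
def eta_easy(t, T, a, b):
    """Easy phase (b < a): eta*(t) = T^{-xi} (1 - t/T)^{2b-1}, xi = 1 - b/a."""
    xi = 1 - b / a
    return T ** (-xi) * (1 - t / T) ** (2 * b - 1)

def eta_hard(t, T, a, b, eta_max=0.5):
    """Hard phase (b > a): warmup-stable-decay with 1 - t_s/T ~ T^{-(b-a)/(2b-1)}.

    eta_max and the unit constant in front of t_s are illustrative choices;
    the theory only fixes how they scale with T.
    """
    frac = T ** (-(b - a) / (2 * b - 1))   # vanishing annealing fraction
    t_s = T * (1 - frac)
    if t < t_s:
        return eta_max                     # stable plateau
    return eta_max * ((1 - t / T) / frac) ** (2 * b - 1)

T = 10_000
plateau = eta_hard(0, T, a=1.2, b=2.0)         # constant eta_max during warmup/stable
final = eta_hard(T, T, a=1.2, b=2.0)           # annealed to zero at the horizon
```

Note that both schedules depend explicitly on the horizon $T$, which is exactly why "anytime" policies cannot match them.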

5. Special Cases, Regularization, and Phase Transitions

The model encompasses several noteworthy limits and phase phenomena:

  • Kernel regime ($p \to \infty$): The theory reduces to kernel ridge regression, with a univariate fixed point for the variance parameter.
  • Approximation limit ($n \to \infty$): The risk is determined purely by the bias incurred by model truncation.
  • Interpolation cusp: At the critical point $n = p$ with $\lambda \to 0$, the risk diverges as $U \to 1$, manifesting the “double-descent” phenomenon.
  • Regularization trade-off: The parameter $\lambda$ tunes the bias–variance balance precisely, with its optimal scaling $\ell^*$ explicitly characterized.
  • Minimax optimality: The model quantifies the minimal number of features $p^*$ necessary for minimax generalization rates, often implying a significant reduction in model size relative to $n$.

This phase diagram, accessible through explicit formulas, extends classical results on kernel learning rates to the more general random feature context (Defilippis et al., 2024).

6. Compute-Optimal Scaling, Mini-batch Protocols, and Momentum Extensions

When the model size $N$ and training horizon $T$ are optimized jointly for a fixed compute budget $C = N \times T$, the theory predicts:

  • For $b < a$: $N \sim C^{1/(a+1)}$, $T \sim C^{a/(a+1)}$, $L_C - \sigma_0^2 \sim C^{-(a-1)/(a+1)}$
  • For $b > a$: $N \sim C^{1/(b+1)}$, $T \sim C^{b/(b+1)}$, $L_C - \sigma_0^2 \sim C^{-(a-1)/(b+1)}$

For a fixed sample budget, the generalization error scales as $B_{\text{tot}}^{-(a-1)/a}$ (easy phase) or $B_{\text{tot}}^{-(a-1)/b}$ (hard phase) in the total number of samples processed.

Including a time-varying momentum $\beta(t)$ in the optimization enables further improvements. In the easy phase, the optimal $\beta(t)$ only slightly affects constants, but in the hard phase, joint optimization yields strictly faster decay exponents than baseline SGD (Bordelon et al., 4 Feb 2026).
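Assuming the budget split $C = N \times T$ above, the compute-optimal allocation amounts to:

```python
def compute_optimal(C, a, b):
    """Split compute C = N * T per phase; return (N, T, loss exponent).

    Easy phase (b < a) uses exponent a; hard phase (b > a) uses b,
    matching N ~ C^{1/(e+1)}, T ~ C^{e/(e+1)}, L_C - sigma0^2 ~ C^{-(a-1)/(e+1)}.
    """
    e = a if b < a else b                  # exponent governing the split
    N = C ** (1 / (e + 1))
    T = C ** (e / (e + 1))
    loss_exp = (a - 1) / (e + 1)
    return N, T, loss_exp

N, T, g = compute_optimal(C=1e12, a=2.0, b=1.0)   # easy phase: b < a
```

For $a = 2$ in the easy phase, this allocates $N \sim C^{1/3}$ and $T \sim C^{2/3}$, with loss exponent $(a-1)/(a+1) = 1/3$.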

7. Practical Implications and Empirical Validation

The deterministic equivalents and resulting scaling laws directly inform the optimal choice of the regularization parameter $\lambda$ and the random feature count $p$, and prescribe precise learning rate and batch-size schedules for SGD training. This dimension-free theory is empirically validated on a wide range of real and synthetic tasks, capturing the phase transitions, risk minima, and interpolation artifacts observed in practice.

The analysis provides rigorous guarantees even in infinite-dimensional feature spaces, extending classical kernel learning results to model classes where random feature methods are employed. The theory reveals that with appropriate tuning—guided by the powerlaw decay exponents and explicit closed-form solutions—optimal generalization often requires far fewer random features than samples, and that sophisticated learning rate schedules and joint optimization of minibatch size and momentum can further enhance learning efficiency (Defilippis et al., 2024, Bordelon et al., 4 Feb 2026).
