
Powerlaw Random Feature Model

Updated 6 February 2026
  • The model is a framework for high-dimensional random feature regression that employs power-law decay in feature spectra and target weights.
  • It provides non-asymptotic, dimension-free risk formulas that delineate the trade-offs among sample complexity, model size, and regularization.
  • The framework prescribes optimal learning rate schedules and training protocols for both ridge regression and SGD, validated by empirical studies.

The powerlaw random feature model is a framework for analyzing high-dimensional random feature regression schemes in which the spectrum of the feature covariance operator and the target function weights both exhibit power-law decay. This model enables a rigorous characterization of generalization rates, test errors, and optimal training protocols for both ridge regression and stochastic gradient descent (SGD), yielding non-asymptotic, dimension-free, closed-form scaling laws. These analyses reveal precise phase diagrams for generalization, optimal trade-offs between sample complexity, model size, and regularization, as well as compute-optimal and training-optimal schedules under resource constraints (Defilippis et al., 2024, Bordelon et al., 4 Feb 2026).

1. Structure and Assumptions of the Powerlaw Random Feature Model

The model considers regression in a Hilbert feature space $\mathcal{H} = L_2(\mu_x)$, either finite- or infinite-dimensional, with a feature-integral operator $K$ possessing eigenpairs $(\xi_k, \psi_k)$. The eigenvalues $\lambda_k = \xi_k^2$ of $K$ are assumed to decay as a power law,

$$\lambda_k \propto k^{-\alpha}, \quad \alpha > 1$$

(the parameter $b$ is used in alternate notation). The regression target $f_\star$ decomposes as

$$f_\star = \sum_{k \geq 1} \beta_k \psi_k(\cdot)$$

with coefficient decay $\beta_k \propto k^{-(\beta+1/2)}$ and a source exponent $\beta > 0$.

For random feature regression, $n$ i.i.d. samples $(x_i, y_i)$ are drawn, with $y_i = f_\star(x_i) + \epsilon_i$, and the random feature map is

$$z_j(x) = p^{-1/2}\,\varphi(x, w_j), \quad j = 1, \dots, p, \quad w_j \sim \mu_w.$$

The model is analyzed both for ridge regression with finite $p$ and in the context of SGD, covering both learning rate schedules and batch-size protocols (Defilippis et al., 2024, Bordelon et al., 4 Feb 2026).
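As a concrete instantiation of this setup, the sketch below draws a synthetic powerlaw problem using a Gaussian surrogate for the eigenfunctions $\psi_k$; the truncation level `M`, the i.i.d. Gaussian feature weights, and all constants are illustrative assumptions rather than the papers' exact construction.

```python
import numpy as np

def sample_prf_dataset(n, p, alpha=2.0, beta=1.0, sigma=0.1, M=1000, seed=0):
    """Draw a synthetic powerlaw random feature regression problem.

    Gaussian surrogate: psi_k(x_i) ~ N(0, 1) i.i.d., spectrum xi_k^2 = k^{-alpha},
    target coefficients beta_k = k^{-(beta + 1/2)}.  (Illustrative, truncated at M.)
    """
    rng = np.random.default_rng(seed)
    k = np.arange(1, M + 1)
    xi = k ** (-alpha / 2.0)          # xi_k, so lambda_k = xi_k^2 = k^{-alpha}
    b = k ** (-(beta + 0.5))          # target weights beta_k

    psi = rng.standard_normal((n, M))              # psi_k(x_i) for each sample
    y = psi @ b + sigma * rng.standard_normal(n)   # y_i = f_star(x_i) + eps_i

    # Random feature map z_j(x) = p^{-1/2} sum_k xi_k psi_k(x) W_{kj}
    W = rng.standard_normal((M, p))
    Z = (psi * xi) @ W / np.sqrt(p)   # n x p random feature design
    return Z, y

Z, y = sample_prf_dataset(n=200, p=50)
print(Z.shape, y.shape)   # (200, 50) (200,)
```

A ridge estimate on this design is then `np.linalg.solve(Z.T @ Z + lam * np.eye(50), Z.T @ y)`.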

2. Deterministic Equivalent Test Error: Non-Asymptotic, Dimension-Free Risk Formulas

The excess risk for random feature ridge regression (RFRR) is given by

$$R(n, p, \lambda) = \mathbb{E}_x\big[(f_\star(x) - \hat{f}_\lambda(x))^2\big] = \text{Bias} + \text{Variance}$$

Under a concentration condition on the random features (Assumption 3.1), the risk admits a dimension-free deterministic equivalent:

$$R(n, p, \lambda) = \widehat{R}_{n,p} + O\big((n^{-1/2} + p^{-1/2})\,\widehat{R}_{n,p}\big)$$

where $\widehat{R}_{n,p}$ depends only on the feature spectrum $\{\xi_k^2\}$, the target weights $\{\beta_k\}$, the regularization parameter $\lambda$, $n$, and $p$. The closed form is:

  • Solve for $\nu_2 > 0$ via

$$1 + \frac{n}{p} - \sqrt{\Big(1 - \frac{n}{p}\Big)^2 + \frac{4\lambda}{p\nu_2}} = \frac{2}{p} \sum_{k=1}^\infty \frac{\xi_k^2}{\xi_k^2 + \nu_2}$$

  • Set

$$\nu_1 = \frac{\nu_2}{2} \left[1 - \frac{n}{p} + \sqrt{\Big(1 - \frac{n}{p}\Big)^2 + \frac{4\lambda}{p\nu_2}}\,\right]$$

  • Define

$$U = \frac{p}{n}\left[\Big(1 - \frac{\nu_1}{\nu_2}\Big)^2 + \Big(\frac{\nu_1}{\nu_2}\Big)^2 \chi(\nu_2)\right], \qquad \chi(\nu_2) = \frac{\sum_k \xi_k^4/(\xi_k^2 + \nu_2)^2}{p - \sum_k \xi_k^4/(\xi_k^2 + \nu_2)^2}$$

  • Bias and variance contributions:

\begin{align*}
\widehat{\text{Bias}} &= \frac{\nu_2^2}{1-U} \left[\sum_k \frac{\beta_k^2}{(\xi_k^2+\nu_2)^2} + \chi(\nu_2) \sum_k \frac{\beta_k^2}{(\xi_k^2+\nu_2)^2}\right] \\
\widehat{\text{Var}} &= \sigma^2\, \frac{U}{1-U}
\end{align*}

So $\widehat{R}_{n,p} = \widehat{\text{Bias}} + \widehat{\text{Var}}$. This deterministic equivalent is non-asymptotic (no large-sample limit required), multiplicative (the relative error is controlled), and dimension-free (applicable regardless of the ambient or effective feature dimension) (Defilippis et al., 2024).
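The four-step recipe above can be evaluated numerically. The sketch below truncates the spectrum at a finite `M` and solves the $\nu_2$ fixed point by bisection in log scale; the bracket, iteration count, and truncation are illustrative assumptions.

```python
import numpy as np

def deterministic_risk(n, p, lam, xi2, beta2, sigma2):
    """Evaluate the deterministic equivalent R_hat = Bias_hat + Var_hat.

    xi2: truncated spectrum {xi_k^2}; beta2: squared target weights {beta_k^2}.
    """
    def gap(nu2):
        # LHS - RHS of the nu_2 fixed-point equation (increasing in nu2)
        lhs = 1 + n / p - np.sqrt((1 - n / p) ** 2 + 4 * lam / (p * nu2))
        rhs = (2 / p) * np.sum(xi2 / (xi2 + nu2))
        return lhs - rhs

    lo, hi = 1e-12, 1e12            # bracket with gap(lo) < 0 < gap(hi)
    for _ in range(200):
        mid = np.sqrt(lo * hi)      # geometric bisection over many scales
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    nu2 = np.sqrt(lo * hi)

    nu1 = 0.5 * nu2 * (1 - n / p + np.sqrt((1 - n / p) ** 2 + 4 * lam / (p * nu2)))
    s4 = np.sum(xi2 ** 2 / (xi2 + nu2) ** 2)
    chi = s4 / (p - s4)
    U = (p / n) * ((1 - nu1 / nu2) ** 2 + (nu1 / nu2) ** 2 * chi)

    S = np.sum(beta2 / (xi2 + nu2) ** 2)
    bias = nu2 ** 2 / (1 - U) * (S + chi * S)
    var = sigma2 * U / (1 - U)
    return bias + var, U

# Powerlaw instance: alpha = 2, beta = 1 (so beta_k^2 = k^{-3}), noise 0.01
k = np.arange(1, 2001)
risk, U = deterministic_risk(n=1000, p=100, lam=1e-2,
                             xi2=k ** -2.0, beta2=k ** -3.0, sigma2=0.01)
```

In the underparameterized regime shown ($n \gg p$, $\lambda > 0$) the self-consistency parameter satisfies $0 < U < 1$, so the formula is well defined.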

3. Sharp Scaling Laws and Minimax Rates Under Powerlaw Decay

When the power-law assumptions are imposed on both spectrum and target coefficients,

$$\xi_k^2 = C k^{-\alpha}, \quad \beta_k = k^{-(\beta+1/2)}$$

and setting $p = n^q$, $\lambda = n^{-(\ell-1)}$ for $q, \ell \geq 0$, explicit scaling exponents for the risk are derived,

$$\widehat{R}_{n,p} = \Theta\left(n^{-\gamma_B(\ell, q)} + \sigma^2 n^{-\gamma_V(\ell, q)}\right),$$

where

$$\gamma_V(\ell, q) = 1 - \min\{q, \ell/\alpha, 1\}$$

and

$$\gamma_B(\ell, q) = \min\left\{2\alpha \cdot \min(q, \ell/\alpha, 1) \cdot \min(\beta, 1),\; \big[2\alpha \cdot \min(\beta, 1/2) - 1\big] \cdot \min(q, \ell/\alpha, 1) + q\right\}.$$

The overall risk exponent is $\gamma(\ell, q) = \min\{\gamma_B, \gamma_V\}$.

The minimax-optimal (fastest) rate

$$\gamma^* = \frac{2\alpha \min(\beta, 1)}{2\alpha \min(\beta, 1) + 1}$$

is achieved by

$$\ell^* = \frac{\alpha}{2\alpha \min(\beta, 1) + 1}, \qquad q^* = 1 - (2\beta \wedge 1)\,\ell^*$$

implying that the minimal number of random features needed to attain the minimax rate is $p^* = n^{q^*}$, with regularization $\lambda^* = n^{-(\ell^*-1)}$ (Defilippis et al., 2024).
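The exponent formulas reduce to elementary arithmetic, so the prescribed $(\ell^*, q^*)$ can be checked against $\gamma^*$ directly; the helper below is a minimal sketch of that bookkeeping.

```python
def gamma_V(ell, q, alpha):
    """Variance exponent: 1 - min{q, ell/alpha, 1}."""
    return 1 - min(q, ell / alpha, 1)

def gamma_B(ell, q, alpha, beta):
    """Bias exponent from the two competing terms."""
    m = min(q, ell / alpha, 1)
    return min(2 * alpha * m * min(beta, 1),
               (2 * alpha * min(beta, 0.5) - 1) * m + q)

def gamma(ell, q, alpha, beta):
    """Overall risk exponent gamma(ell, q) = min{gamma_B, gamma_V}."""
    return min(gamma_B(ell, q, alpha, beta), gamma_V(ell, q, alpha))

def minimax(alpha, beta):
    """Minimax rate gamma* and the optimal (ell*, q*) attaining it."""
    g_star = 2 * alpha * min(beta, 1) / (2 * alpha * min(beta, 1) + 1)
    ell_star = alpha / (2 * alpha * min(beta, 1) + 1)
    q_star = 1 - min(2 * beta, 1) * ell_star
    return g_star, ell_star, q_star

g, l, q = minimax(alpha=2.0, beta=1.0)
print(g, gamma(l, q, 2.0, 1.0))   # the two values coincide: (l*, q*) attains gamma*
```

For $\alpha = 2$, $\beta = 1$ this gives $\gamma^* = 4/5$ with $\ell^* = 0.4$ and $q^* = 0.6$, i.e. $p^* = n^{0.6}$ features suffice.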

4. SGD Dynamics, Optimal Learning Rate Scheduling, and Training Phases

In SGD-based training of powerlaw random feature regression, the evolution of the mean-square error in each spectral coordinate is tracked, leading to a continuous-time optimal control formulation for both the learning rate $\eta(t)$ and the batch size $m(t)$. Two distinct regimes (phases) emerge:

  • Easy phase ($b < a$): The optimal learning rate schedule is a polynomial decay,

$$\eta_T^*(t) = T^{-\xi} (1 - t/T)^{\delta}, \quad \xi = 1 - \frac{b}{a}, \quad \delta = 2b - 1$$

and the excess loss decays as $L_T - \sigma^2 \sim T^{-(a-1)/a}$.

  • Hard phase ($b > a$): The optimal schedule exhibits a warmup–stable–decay shape,

$$\eta_T^*(t) = \begin{cases} \eta_{\max}, & 0 \leq t < t_s \\ \eta_{\max} \left( \dfrac{1 - t/T}{1 - t_s/T} \right)^{2b-1}, & t_s \leq t \leq T \end{cases}$$

where $1 - t_s/T \sim T^{-(b-a)/(2b-1)}$, allocating most of the training budget to a fixed learning rate and a vanishing fraction to annealing. Here, $L_T - \sigma^2 \sim T^{-(a-1)/b}$. The optimal batch size follows a schedule driven by the same variational principle (Bordelon et al., 4 Feb 2026).

These schedules outperform constant or simple power-law learning rate protocols, and the optimal exponents are not attainable by “anytime” policies that ignore the training horizon (Bordelon et al., 4 Feb 2026).
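A minimal sketch of the two schedules, assuming an illustrative $\eta_{\max}$ and fixing the undetermined constant in $t_s$ to one:

```python
def eta_easy(t, T, a, b):
    """Easy phase (b < a): eta*(t) = T^{-xi} (1 - t/T)^{2b-1}, xi = 1 - b/a."""
    xi = 1 - b / a
    return T ** (-xi) * (1 - t / T) ** (2 * b - 1)

def eta_hard(t, T, a, b, eta_max=0.5):
    """Hard phase (b > a): warmup-stable-decay with 1 - t_s/T ~ T^{-(b-a)/(2b-1)}.

    eta_max and the unit constant in front of t_s are illustrative choices;
    the theory only fixes how they scale with T.
    """
    frac = T ** (-(b - a) / (2 * b - 1))   # vanishing annealing fraction
    t_s = T * (1 - frac)
    if t < t_s:
        return eta_max                     # stable plateau
    return eta_max * ((1 - t / T) / frac) ** (2 * b - 1)

T = 10_000
plateau = eta_hard(0, T, a=1.2, b=2.0)         # constant eta_max during warmup/stable
final = eta_hard(T, T, a=1.2, b=2.0)           # annealed to zero at the horizon
```

Note that both schedules depend explicitly on the horizon $T$, which is exactly why "anytime" policies cannot match them.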

5. Special Cases, Regularization, and Phase Transitions

The model encompasses several noteworthy limits and phase phenomena:

  • Kernel regime ($p \to \infty$): The theory reduces to kernel ridge regression, with a univariate fixed point for the variance parameter.
  • Approximation limit ($n \to \infty$): The risk is determined purely by the bias incurred by model truncation.
  • Interpolation cusp: At the critical point $n = p$ with $\lambda \to 0$, the risk diverges as $U \to 1$, manifesting the “double-descent” phenomenon.
  • Regularization trade-off: The parameter $\lambda$ tunes the bias–variance balance precisely, with its optimal scaling $\ell^*$ explicitly characterized.
  • Minimax optimality: The model quantifies the minimal number of features $p^*$ necessary for minimax generalization rates, often implying a significant reduction in model size relative to $n$.

This phase diagram, accessible through explicit formulas, extends classical results on kernel learning rates to the more general random feature context (Defilippis et al., 2024).

6. Compute-Optimal Scaling, Mini-batch Protocols, and Momentum Extensions

When the model size $N$ and training horizon $T$ are optimized jointly for a fixed compute budget $C = N \times T$, the theory predicts:

  • For $b < a$: $N \sim C^{1/(a+1)}$, $T \sim C^{a/(a+1)}$, $L_C - \sigma_0^2 \sim C^{-(a-1)/(a+1)}$
  • For $b > a$: $N \sim C^{1/(b+1)}$, $T \sim C^{b/(b+1)}$, $L_C - \sigma_0^2 \sim C^{-(a-1)/(b+1)}$

For a fixed sample budget, the generalization error scales as $B_{\text{tot}}^{-(a-1)/a}$ (easy phase) or $B_{\text{tot}}^{-(a-1)/b}$ (hard phase) in the total number of samples processed.

Including a time-varying momentum $\beta(t)$ in the optimization enables further improvements. In the easy phase, the optimal $\beta(t)$ only slightly affects constants, but in the hard phase, joint optimization yields strictly faster decay exponents than baseline SGD (Bordelon et al., 4 Feb 2026).
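Assuming the budget split $C = N \times T$ above, the compute-optimal allocation amounts to:

```python
def compute_optimal(C, a, b):
    """Split compute C = N * T per phase; return (N, T, loss exponent).

    Easy phase (b < a) uses exponent a; hard phase (b > a) uses b,
    matching N ~ C^{1/(e+1)}, T ~ C^{e/(e+1)}, L_C - sigma0^2 ~ C^{-(a-1)/(e+1)}.
    """
    e = a if b < a else b                  # exponent governing the split
    N = C ** (1 / (e + 1))
    T = C ** (e / (e + 1))
    loss_exp = (a - 1) / (e + 1)
    return N, T, loss_exp

N, T, g = compute_optimal(C=1e12, a=2.0, b=1.0)   # easy phase: b < a
```

For $a = 2$ in the easy phase, this allocates $N \sim C^{1/3}$ and $T \sim C^{2/3}$, with loss exponent $(a-1)/(a+1) = 1/3$.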

7. Practical Implications and Empirical Validation

The deterministic equivalents and resulting scaling laws directly inform the optimal choice of the regularization parameter $\lambda$ and the random feature count $p$, and prescribe precise learning rate and batch-size schedules for SGD training. This dimension-free theory is empirically validated on a wide range of real and synthetic tasks, capturing the phase transitions, risk minima, and interpolation artifacts observed in practice.

The analysis provides rigorous guarantees even in infinite-dimensional feature spaces, extending classical kernel learning results to model classes where random feature methods are employed. The theory reveals that with appropriate tuning—guided by the powerlaw decay exponents and explicit closed-form solutions—optimal generalization often requires far fewer random features than samples, and that sophisticated learning rate schedules and joint optimization of minibatch size and momentum can further enhance learning efficiency (Defilippis et al., 2024, Bordelon et al., 4 Feb 2026).
