Deep Recursive Feature Machine
- Deep Recursive Feature Machine is a recursive framework that employs AGOP and random feature expansion to unify kernel methods with hierarchical deep learning for interpretable, backpropagation-free modeling.
- It leverages adaptive kernel regression and spectral analysis to provide theoretical guarantees such as universality and precise bias-variance trade-offs while demonstrating deep neural collapse.
- Practical implementations span diverse domains—including tabular data, vision, and generative music—outperforming traditional models with robust accuracy and explainability.
The Deep Recursive Feature Machine (Deep RFM) is a general framework for data-driven, hierarchical feature learning that bridges kernel methods and modern deep networks. It employs recursive application of the average gradient outer product (AGOP) and random feature expansion, yielding an interpretable, backpropagation-free analogue to deep neural networks. Deep RFM unifies kernel ridge regression, spectral analysis, and geometric denoising, and has been studied through rigorous asymptotic analysis, theoretical characterization of neural collapse, interpretability applications, and steering in generative architectures (Beaglehole et al., 2024, Radhakrishnan et al., 2022, Bosch et al., 2023, Shen et al., 2024, Zhao et al., 21 Oct 2025).
1. Mathematical Formulation and Construction
The canonical Deep RFM is defined by a recursive sequence alternating between adaptive kernel regression (often Kernel Ridge Regression, KRR) and representation updates via the AGOP. The paradigm extends naturally to both random feature maps and parametric neural architectures.
Let $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$ be data with labels $y \in \mathbb{R}^n$. The deep random feature variant of RFM constructs features as follows (Bosch et al., 2023):
- Initialize $\Phi^{(0)} = X$.
- For layers $\ell = 1, \dots, L$,
$$\Phi^{(\ell)} = \sigma\!\left(\Phi^{(\ell-1)} W_\ell\right), \qquad W_\ell \in \mathbb{R}^{d_{\ell-1} \times d_\ell} \text{ with i.i.d. Gaussian entries},$$
where $\sigma$ is an entrywise activation.
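The random feature recursion can be sketched directly; the layer widths, the $1/\sqrt{d}$ scaling, and the choice of `tanh` below are illustrative assumptions rather than choices fixed by the framework:

```python
import numpy as np

def deep_random_features(X, widths, activation=np.tanh, seed=0):
    """Sketch of the deep random feature recursion: Phi_0 = X,
    Phi_l = activation(Phi_{l-1} @ W_l / sqrt(d_{l-1})) with Gaussian W_l.
    The widths and 1/sqrt(d) normalization are illustrative choices."""
    rng = np.random.default_rng(seed)
    Phi = X
    for width in widths:
        d_prev = Phi.shape[1]
        W = rng.standard_normal((d_prev, width))  # i.i.d. Gaussian weights
        Phi = activation(Phi @ W / np.sqrt(d_prev))  # entrywise activation
    return Phi

X = np.random.default_rng(1).standard_normal((200, 10))
Phi = deep_random_features(X, widths=[64, 64])  # final features, shape (200, 64)
```

A linear readout (e.g., ridge regression) is then fit on the final features `Phi`.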
In the AGOP-based kernel setting (Radhakrishnan et al., 2022, Shen et al., 2024):
- Start with $M_0 = I_d$.
- At each iteration $t$:
  - Solve KRR with kernel $K_{M_t}$ (e.g., Gaussian, Laplace, Matérn kernels with Mahalanobis metric $\|x - z\|_{M_t}^2 = (x - z)^\top M_t (x - z)$).
  - Predict $f_t(x) = \sum_{i=1}^n \alpha_i K_{M_t}(x, x_i)$.
  - Compute the AGOP update
$$M_{t+1} = \frac{1}{n} \sum_{i=1}^n \nabla f_t(x_i)\, \nabla f_t(x_i)^\top.$$
The AGOP step projects features onto directions that most influence the output, directly mirroring the evolution of weight Gram matrices in SGD-trained deep networks ("Neural Feature Ansatz") (Radhakrishnan et al., 2022).
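The kernel-fit/AGOP alternation can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it assumes a Gaussian kernel with Mahalanobis metric (for a closed-form gradient), and the hyperparameters `T`, `reg`, and `sigma` are arbitrary:

```python
import numpy as np

def gauss_kernel(X, Z, M, sigma=2.0):
    """Gaussian kernel with Mahalanobis metric ||x-z||_M^2 = (x-z)^T M (x-z)."""
    XM = X @ M
    d2 = (XM * X).sum(1)[:, None] + (Z @ M * Z).sum(1)[None, :] - 2 * XM @ Z.T
    return np.exp(-np.clip(d2, 0, None) / (2 * sigma**2))

def rfm(X, y, T=3, reg=1e-3, sigma=2.0):
    """Minimal RFM sketch: alternate KRR with kernel K_M and the AGOP update
    M <- (1/n) sum_i grad f(x_i) grad f(x_i)^T. Hyperparameters illustrative."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(T):
        K = gauss_kernel(X, X, M, sigma)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # KRR coefficients
        # grad f(x_j) = -(1/sigma^2) sum_i alpha_i K(x_j, x_i) M (x_j - x_i)
        diffs = X[:, None, :] - X[None, :, :]            # (n, n, d)
        G = -(K[:, :, None] * (diffs @ M)) * alpha[None, :, None] / sigma**2
        grads = G.sum(axis=1)                            # (n, d)
        M = grads.T @ grads / n                          # AGOP update
    return alpha, M
```

After fitting a target that depends mostly on one coordinate, the diagonal of the learned `M` concentrates on that coordinate, illustrating the feature-selection behavior described above.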
The Deep RFM iteration—kernel fit, gradient-based operator update, feature expansion—can be implemented layerwise within neural or kernel-based machinery, and is extensible to autoregressive and diffusion models for conditional generation (Zhao et al., 21 Oct 2025).
2. Theoretical Foundations: Universality and Asymptotics
A central theoretical result is the universality and moment-matching equivalence between deep nonlinear random feature models and their corresponding deep linear Gaussian surrogates (Bosch et al., 2023). Under proportional-growth scaling and technical conditions (convexity, regularity, Gaussian weights, smooth/odd activations), the Deep RFM with random features is asymptotically equivalent both in training error and output functionals to a Gaussian model matching the first and second moments at each layer: each nonlinear map $\sigma(\Phi W)$ is replaced by a linear-plus-noise surrogate $\mu\, \Phi W + \nu\, Z$ with independent Gaussian $Z$, where $\mu$ captures the linear component of the activation, $\nu^2$ is computed to match the variance, and the input variance at each layer is defined recursively.
Applying the Convex Gaussian Min-Max Theorem (CGMT) yields closed-form scalar equations for the asymptotic train and test MSE, allowing precise quantification of the bias-variance tradeoffs imposed by architecture depth, width, activation statistics, and regularization (Bosch et al., 2023).
3. AGOP and Deep Neural Collapse
Deep RFM models provide an explicit, constructive demonstration of Deep Neural Collapse (DNC), a regime in which within-class variability vanishes and class means form a simplex equiangular tight frame in the representation space (Beaglehole et al., 2024). The key mechanisms:
- Each AGOP projection denoises and contracts within-class scatter ("NC1") rapidly (exponentially in depth).
- Random feature maps (Gaussian+ReLU) alone do not contract within-class variance and cannot induce collapse.
- The spectral geometry of AGOP aligns with the singular structure in trained DNNs' weight matrices.
Theoretical results rigorously establish that the Deep RFM Gram matrix converges exponentially in depth towards the perfect block-ETF configuration characterizing optimal DNC, and that AGOP-based kernel learning drives the kernel towards the optimal NC configuration in both asymptotic and non-asymptotic settings.
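The within-class collapse (NC1) can be measured directly. The snippet below uses a simplified trace-ratio proxy, $\mathrm{tr}(\Sigma_W)/\mathrm{tr}(\Sigma_B)$, rather than the pseudoinverse-based NC1 metric used in some of the neural-collapse literature; the projection step only mimics the denoising effect of an AGOP projection on synthetic clusters:

```python
import numpy as np

def nc1(H, labels):
    """Simplified NC1 proxy: tr(Sigma_W) / tr(Sigma_B), where Sigma_W is the
    within-class and Sigma_B the between-class covariance of rows of H."""
    mu = H.mean(0)
    d = H.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Hc = H[labels == c]
        mc = Hc.mean(0)
        Sw += (Hc - mc).T @ (Hc - mc) / len(H)          # within-class scatter
        Sb += len(Hc) / len(H) * np.outer(mc - mu, mc - mu)  # between-class
    return np.trace(Sw) / np.trace(Sb)

# Two Gaussian clusters; projecting onto the discriminative axis
# (a stand-in for an AGOP projection) shrinks within-class variability.
rng = np.random.default_rng(0)
e0 = np.eye(4)[0]
H = np.vstack([rng.standard_normal((50, 4)) + 5 * e0,
               rng.standard_normal((50, 4)) - 5 * e0])
labels = np.array([0] * 50 + [1] * 50)
P = np.outer(e0, e0)  # rank-one projection onto the class-separating direction
```

Here `nc1(H @ P, labels)` is smaller than `nc1(H, labels)`, reflecting the contraction of within-class scatter described above.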
4. Spectral and Feature Dynamics Across Layers
The spectral evolution of RFM features is central to its regularization and generalization properties. In the random feature realization, the covariance recursion
$$\Sigma_\ell = \mu_\ell^2\, W_\ell^\top \Sigma_{\ell-1} W_\ell + \nu_\ell^2 I$$
induces a Lyapunov-type update, whose eigenvalue distribution at each layer can be tracked via recursion on the Stieltjes transform (Theorem 5.1) (Bosch et al., 2023). Increased hidden layer width concentrates spectral mass near zero, raising bias and shrinkage, while narrower layers flatten spectra. The effect is that depth and width jointly determine the effective regularization of the subsequent ridge regression or classification layer.
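The empirical spectrum can be tracked layer by layer with a short sketch; the widths and `tanh` activation below are illustrative assumptions:

```python
import numpy as np

def layer_spectra(X, widths, activation=np.tanh, seed=0):
    """Track the eigenvalue spectrum of the empirical feature covariance
    Phi_l^T Phi_l / n across deep random feature layers (illustrative)."""
    rng = np.random.default_rng(seed)
    Phi, spectra = X, []
    for w in widths:
        W = rng.standard_normal((Phi.shape[1], w)) / np.sqrt(Phi.shape[1])
        Phi = activation(Phi @ W)
        cov = Phi.T @ Phi / Phi.shape[0]           # empirical covariance
        spectra.append(np.linalg.eigvalsh(cov)[::-1])  # descending eigenvalues
    return spectra
```

Plotting each entry of `spectra` shows how the spectral mass shifts with depth and width, which in turn sets the effective regularization seen by the final ridge layer.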
Empirically, spectral decay and rank-selection induced by AGOP recursions directly filter spurious or redundant features, enabling interpretable and robust data-driven dimensionality reduction (Shen et al., 2024).
5. Practical Implementation and Domain Extensions
The general Deep RFM algorithm is instantiated in multiple modalities and domains:
- Standard tabular and QSPR: KRR with AGOP updates using Laplace, Matern, Gaussian, or Rational Quadratic kernels on domain-specific features (e.g., molecular fingerprints) (Shen et al., 2024).
- Vision and small-image tasks: AGOP-based feature learning matches the principal axes discovered by deep neural networks (Radhakrishnan et al., 2022).
- Autoregressive music generation: RFM probes train on hidden state activations, extract axis-aligned concept directions (AGOP eigenvectors), and inject them at inference for real-time, interpretable control (Zhao et al., 21 Oct 2025). Time-varying schedules (linear, exponential, logistic) and multi-directional steering are supported with quantitative improvements in both concept accuracy and prompt adherence.
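The steering mechanism amounts to adding a scheduled multiple of a concept direction to hidden states at inference. The sketch below is a minimal stand-in: the function `steer`, the schedule coefficients, and the unit-normalization are illustrative assumptions, not the MusicRFM implementation:

```python
import numpy as np

def steer(hidden, direction, schedule):
    """Inject a concept direction into a sequence of hidden states:
    h_t <- h_t + alpha(t/T) * v, where v is a (unit-normalized) AGOP
    eigenvector and alpha is a time-varying strength schedule."""
    T = len(hidden)
    v = direction / np.linalg.norm(direction)
    return np.stack([h + schedule(t / T) * v for t, h in enumerate(hidden)])

# Example schedules mirroring the linear/logistic families described above
# (the logistic steepness 10 and midpoint 0.5 are arbitrary choices).
linear = lambda s: s
logistic = lambda s: 1.0 / (1.0 + np.exp(-10.0 * (s - 0.5)))
```

Multi-directional steering composes by applying `steer` once per concept direction.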
General implementation relies on closed-form solutions for KRR, eigenvector decomposition of AGOP matrices, and sparse or diagonal approximations for large-scale problems. Recursion depth is typically $2$–$5$ (tabular/QSPR) or up to $20$ for neural collapse experiments. Empirical evaluation demonstrates state-of-the-art performance in tabular settings, robust redundancy filtering, and clear interpretability via local and global feature importance scoring (Shen et al., 2024).
6. Interpretability and Feature Importance
A distinguishing property of Deep RFM is the direct interpretability inherited from the AGOP update. The learned feature matrix $M$ quantifies sensitivity directions:
- Per-sample/local importance: $s_j(x_i) = \left(\partial f(x_i)/\partial x_j\right)^2$, the squared gradient entries at a single point.
- Global/dataset-level importance: $I_j = M_{jj} = \frac{1}{n}\sum_{i=1}^n \left(\partial f(x_i)/\partial x_j\right)^2$, the diagonal of the AGOP.
Feature rankings derived from $M$ closely match permutation importance and SHAP rankings, establishing RFM as a competitive, natively interpretable learner in scientific applications (Shen et al., 2024). By adaptively downweighting redundant or irrelevant features, RFM also exhibits resilience to overparameterization.
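Both importance scores follow directly from the AGOP; a minimal sketch (the normalization to a probability-like ranking is an illustrative convention):

```python
import numpy as np

def global_importance(M):
    """Global importance from the learned AGOP matrix M: the diagonal
    M_jj aggregates squared sensitivities (1/n) sum_i (df(x_i)/dx_j)^2."""
    imp = np.diag(M)
    return imp / imp.sum()  # normalized ranking scores

def local_importance(grad):
    """Per-sample importance: squared entries of the gradient at one point."""
    return grad**2 / (grad**2).sum()
```

For example, `global_importance(np.diag([4.0, 1.0]))` assigns 80% of the importance to the first feature.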
7. Comparative Performance and Applications
Deep RFM attains or surpasses the accuracy of advanced tree ensembles (e.g., XGBoost), transformer variants, and graph neural networks on diverse tasks. For molecular property prediction, RFM-HF (with a multi-scale hybrid fingerprint) achieves a median RMSE on ESOL that significantly outperforms GNN baselines (Shen et al., 2024). In classification and regression benchmarks, RFM demonstrates top mean accuracy, while requiring a fraction of the computational resources (Radhakrishnan et al., 2022).
In generative modeling, MusicRFM enables steerable musical attribute generation with interpretable axis-aligned controls, raising the probe accuracy for targeted features from $0.23$ to $0.82$ while maintaining high prompt adherence (Zhao et al., 21 Oct 2025).
Deep Recursive Feature Machines thus unify adaptive feature learning, spectral denoising, and scalable, interpretable nonparametric prediction. Their theoretical guarantees, empirical results, and broad applicability establish Deep RFM as a general-purpose bridge between kernel methods and hierarchical deep learning architectures (Bosch et al., 2023, Radhakrishnan et al., 2022, Beaglehole et al., 2024, Shen et al., 2024, Zhao et al., 21 Oct 2025).