Recursive Feature Machines (RFM)
- Recursive Feature Machines (RFM) are kernel-based algorithms that iteratively reweight input features using the average gradient outer product (AGOP) to build adaptive, interpretable models.
- The methodology leverages convex optimization and a fixed-point recursion to adapt the kernel’s Mahalanobis metric, effectively selecting low-dimensional representations.
- Variants like diagonal, deep, and tree-structured RFMs extend applicability across high-dimensional data and numerous tasks, from tabular regression to molecular modeling.
Recursive Feature Machines (RFM) are a class of kernel-based learning algorithms that enable data-driven, adaptive feature learning by iteratively reweighting input features via the average gradient outer product (AGOP) of the fitted model. Originally motivated by a theoretical analysis of feature learning in deep neural networks, RFMs are designed to transfer the principal mechanism underlying neural feature selection to convex kernel methods, resulting in scalable, interpretable models with competitive performance across a range of domains. The iterative AGOP update protocol allows RFMs to discover relevant low-dimensional representations, mimic neural phenomena such as a bias toward simple features and grokking, and provide intrinsic measures of feature importance.
1. Feature Learning Mechanism and Theoretical Foundations
The central theoretical premise of RFMs is the Deep Neural Feature Ansatz (NFA), which postulates that feature learning in fully connected neural networks is governed by the AGOP matrix $M = \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^{\top}$, where $f$ is the fitted predictor and the gradients measure functional sensitivity to input perturbations. In a trained neural network, the empirical neural feature matrix for each layer $\ell$, $W_\ell^{\top} W_\ell$, is empirically observed to satisfy $W_\ell^{\top} W_\ell \propto \mathrm{AGOP}$ (computed with respect to the inputs of that layer), indicating that dominant eigenvectors of the AGOP correspond to upweighted, learned features. This translates into heightened model sensitivity along specific, task-relevant directions in input space, directly motivating RFM's AGOP-driven updates (Radhakrishnan et al., 2022).
The iterative RFM procedure—at each step computing $M_t = \frac{1}{n} \sum_{i=1}^{n} \nabla f_t(x_i)\, \nabla f_t(x_i)^{\top}$, then reweighting or transforming inputs via a feature matrix derived from $M_t$ (e.g., $M_t^{1/2}$)—encodes feature selection as a convex operation, bypassing the need for backpropagation or high-dimensional parameterization as in neural networks. This fixed-point recursion can be interpreted as adapting the kernel's Mahalanobis metric to reflect the influence of each input dimension, thereby aligning the Reproducing Kernel Hilbert Space (RKHS) with the learned feature geometry.
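As a model-agnostic illustration of this sensitivity measure, the AGOP of any black-box predictor can be estimated with finite-difference gradients. This is a minimal sketch, not from the source; the predictor `f` and step size `eps` are illustrative:

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Estimate the average gradient outer product of predictor f over rows of X."""
    n, d = X.shape
    M = np.zeros((d, d))
    for x in X:
        # Central finite-difference gradient of f at x, one coordinate at a time.
        g = np.array([
            (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
            for e in np.eye(d)
        ])
        M += np.outer(g, g)   # accumulate the gradient outer product
    return M / n              # average over samples
```

For a linear predictor the estimate is exact up to floating-point error, and the AGOP reduces to the outer product of the coefficient vector with itself.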
2. Algorithmic Architecture and Variants
The canonical RFM algorithm alternates between kernel fitting and AGOP-based feature adaptation. At each iteration, the following workflow is used:
- Compute the kernel matrix $K_{ij} = k_M(x_i, x_j)$ using the current Mahalanobis metric $M$; e.g., the Laplacian kernel $k_M(x, z) = \exp(-\lVert x - z \rVert_M / \gamma)$ with $\lVert x - z \rVert_M = \sqrt{(x - z)^{\top} M (x - z)}$.
- Solve kernel ridge regression for coefficients $\alpha$: $\alpha = (K + \lambda I)^{-1} y$.
- Define the predictor $f(x) = \sum_{i=1}^{n} \alpha_i\, k_M(x, x_i)$.
- Estimate the AGOP: $M \leftarrow \frac{1}{n} \sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^{\top}$.
- Iterate for $T$ rounds; output the final predictor $f$ and metric $M$ (Radhakrishnan et al., 2022).
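The loop above can be sketched in NumPy for the Laplacian kernel. This is a minimal illustration under stated assumptions, not a reference implementation: the hyperparameters (`gamma`, `reg`, `T`) and the choice to set the metric directly to the AGOP (some variants apply a matrix power instead) are assumptions:

```python
import numpy as np

def laplacian_kernel(X, Z, M, gamma=1.0):
    """Laplacian kernel exp(-||x - z||_M / gamma) with Mahalanobis distance."""
    XM = X @ M
    sq = (XM * X).sum(1)[:, None] + ((Z @ M) * Z).sum(1)[None, :] - 2 * XM @ Z.T
    return np.exp(-np.sqrt(np.clip(sq, 0.0, None)) / gamma)

def rfm(X, y, T=5, reg=1e-3, gamma=1.0):
    n, d = X.shape
    M = np.eye(d)                                # start from the Euclidean metric
    for _ in range(T):
        K = laplacian_kernel(X, X, M, gamma)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # kernel ridge solve
        # AGOP update: grad f(x) = sum_i alpha_i * grad_x k_M(x, x_i), where for
        # the Laplacian kernel grad_x k = -(k / (gamma * ||x - x_i||_M)) * M (x - x_i).
        M_new = np.zeros((d, d))
        for j in range(n):
            diff = X[j] - X                                        # rows x_j - x_i
            dist = np.sqrt(np.clip(((diff @ M) * diff).sum(1), 1e-12, None))
            w = -alpha * K[j] / (gamma * dist)                     # gradient weights
            g = (w[:, None] * (diff @ M)).sum(0)                   # grad f(x_j)
            M_new += np.outer(g, g)
        M = M_new / n                                              # empirical AGOP
    return M, alpha
```

On a target that depends on a single coordinate, the diagonal of the learned metric concentrates on that coordinate, illustrating the feature-selection behavior described above.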
RFM admits several notable variants:
- Diagonal/coordinate-wise reweighting: Iteratively Reweighted Kernel Machines (IRKM) restrict the metric to diagonal form, efficiently discovering sparse structure in high-dimensional problems (Zhu et al., 13 May 2025).
- Deep RFM: Recursively alternates AGOP-based linear projection with random nonlinear feature maps at each layer, mirroring DNN layerwise denoising and yielding deep collapse phenomena (Beaglehole et al., 2024).
- FACT-RFM: Replaces the NFA-derived AGOP update with the self-consistent weight-variance relation at convergence (the "Features At Convergence Theorem"), providing a more principled update tied to stationarity conditions of neural network training (Boix-Adsera et al., 8 Jul 2025).
- Tree-structured algorithms: xRFM hybridizes RFMs with recursive partitioning, building trees where each leaf runs a locally-tuned RFM, enabling scalability to arbitrarily large datasets (Beaglehole et al., 12 Aug 2025).
This core protocol allows adaptation to a broad class of kernel functions (Laplacian, Gaussian, Matérn, Rational Quadratic), Mahalanobis-weighted distances, and diverse input domains (Shen et al., 2024).
3. Statistical Properties, Complexity, and Scaling
Fundamentally, RFM transforms classical kernel methods—originally fixed-feature—into adaptive learners that mirror sparse and low-dimensional feature learning of neural nets, but with the benefit of convex optimization at every step. The empirical AGOP converges to the true functional sensitivity in high-dimensional or overparametrized regimes, allowing accurate recovery of sparse representations and low-rank subspaces under appropriate regularity conditions (Radhakrishnan et al., 2024, Zhu et al., 13 May 2025).
Scaling RFM to large datasets is tractable:
- Each AGOP computation is quadratic in the number of samples $n$ (one predictor gradient per sample, each a sum over all training points); exact kernel solves are $O(n^3)$ but can be reduced via stochastic approximations (EigenPro, Falkon), batching, or limiting leaf size in tree-variants.
- For very high-dimensional tasks, axis-aligned (diagonal) AGOPs provide massive computational savings and retain strong empirical performance (Beaglehole et al., 12 Aug 2025).
- Deep RFM and FACT-RFM maintain convexity by design, supporting rapid convergence and interpretability.
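The axis-aligned savings mentioned above amount to keeping only the per-coordinate mean squared gradients instead of accumulating full $d \times d$ outer products. A minimal sketch (`grads` is assumed to already hold predictor gradients at the sample points):

```python
import numpy as np

def diagonal_agop(grads):
    """Axis-aligned AGOP: per-coordinate mean squared gradients.

    grads: (n, d) array of predictor gradients at n sample points.
    Costs O(n d) time and O(d) memory, versus O(n d^2) / O(d^2) for the full AGOP.
    """
    return np.mean(grads ** 2, axis=0)   # length-d vector, the AGOP diagonal
```

By construction this equals the diagonal of the full AGOP matrix, so it preserves the per-feature importance signal while discarding cross-feature structure.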
Empirical studies confirm that shallow iterations (3–10) suffice for most real-world tasks; further iterations offer diminishing returns once feature matrices stabilize (Radhakrishnan et al., 2022, Shen et al., 2024).
4. Interpretability and Feature Importance
A defining property of RFM is intrinsic interpretability derived from the AGOP matrix. The diagonal entries of the AGOP (or the learned metric $M$) quantify the model's functional dependence on individual features, yielding direct feature importance measures. Eigenvectors of $M$ characterize jointly influential subspaces and can reveal principal components analogous to neural network first-layer filters.
Feature attributions can be computed both globally (averaged over the data) and locally (per-sample), supporting analyses in fields requiring transparency (e.g., genomics, quantitative structure–property relationships, drug discovery) (Shen et al., 2024). The stability of AGOP-based rankings compares favorably to post-hoc methods such as permutation importance or SHAP, and AGOP structure can diagnose spurious correlations and enable principled pruning.
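A minimal sketch of how such attributions might be read off a learned metric; the matrix values here are hypothetical stand-ins for an AGOP produced by RFM:

```python
import numpy as np

# Hypothetical learned metric: a PSD matrix as RFM's AGOP update would produce.
M = np.array([[4.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.1]])

# Global per-feature importance: normalized diagonal of the metric.
importance = np.diag(M) / np.diag(M).sum()

# Jointly influential subspaces: leading eigenvectors of M.
evals, evecs = np.linalg.eigh(M)    # eigenvalues in ascending order
top_direction = evecs[:, -1]        # most influential direction in input space
```

Here feature 0 receives the largest importance score, and the leading eigenvector lies in the span of the two coupled features, with no weight on the decoupled third coordinate.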
5. Empirical Benchmarks and Phenomena
RFMs are empirically competitive with, or superior to, state-of-the-art methods in diverse tabular, regression, and molecular modeling benchmarks:
- On >120 tabular classification and regression tasks, RFM achieves top mean test accuracy, PMA, and minimal Friedman rank, outperforming random forest, GBDT, neural networks, fixed-kernel SVMs, and even foundation models at a fraction of the compute (Radhakrishnan et al., 2022, Beaglehole et al., 12 Aug 2025).
- In QSPR (solubility prediction), RFM using hybrid fingerprints and AGOP feature selection surpasses both classical ML and advanced graph neural networks (Shen et al., 2024).
- RFM matches or exceeds neural nets in supporting grokking and phase transitions in modular arithmetic and sparse parity regimes, sharply manifesting the double-descent curve in the test MSE as input dimensionality increases (Mallinar et al., 2024, Gupta et al., 2023).
- Tree-augmented xRFM achieves optimal or near-optimal RMSE across hundreds of large regression datasets, combining AGOP-driven adaptation with GBDT scalability (Beaglehole et al., 12 Aug 2025).
RFMs are robust to redundancy: eliminating highly correlated input features does not significantly degrade performance due to the AGOP's redundancy-filtering properties (Shen et al., 2024).
6. Extensions and Broader Impact
RFM’s core AGOP mechanism is broadly applicable: it generalizes naturally to
- Low-rank recovery and matrix completion: Lin-RFM provably generalizes IRLS-type algorithms, supporting exact sparse and low-rank recovery via convex updates, matching or outperforming deep linear networks in both accuracy and runtime (Radhakrishnan et al., 2024).
- Model interpretability and control: In compositional generative models (e.g., autoregressive music), RFM probes extract concept-aligned latent directions via AGOP eigendecomposition. These directions can be injected during inference to steer generative dynamics towards interpretable goals, such as producing specific notes or chords, with explicit trade-offs between fidelity and control (Zhao et al., 21 Oct 2025).
- Feature elimination and selection: Recursive elimination protocols, originally designed for kernel machines, supply uniformly consistent guarantees on identifying the true feature supports under general loss and kernel complexity (Dasgupta et al., 2013).
- Deep representations and collapse: Deep RFM formalizes kernel-based, AGOP-driven layerwise feature collapse, replicating observed neural collapse phenomena and demystifying the emergent rigidity in final-layer neural embeddings (Beaglehole et al., 2024).
7. Discussion and Limitations
While RFM represents a principled, interpretable avenue bridging kernel and neural approaches, several aspects require careful tuning:
- The selection of kernel, initialization, step exponent (for spectral AGOP transformation), and regularization all impact convergence and generalization.
- Scalability to extreme dataset sizes can require approximate kernel solves or recourse to diagonal (axis-aligned) AGOPs.
- Unlike end-to-end neural networks or decision tree ensembles, current RFM variants lack joint global optimization when partitioned hierarchically (such as in xRFM).
RFM provides a tractable mathematical model for analyzing emergence and generalization in feature learning, and continues to anchor advances in interpretable machine learning, phase transition analysis, and data-driven scientific discovery (Radhakrishnan et al., 2022, Zhu et al., 13 May 2025, Beaglehole et al., 12 Aug 2025, Shen et al., 2024, Mallinar et al., 2024, Zhao et al., 21 Oct 2025, Radhakrishnan et al., 2024, Boix-Adsera et al., 8 Jul 2025, Gupta et al., 2023, Dasgupta et al., 2013).