Learning Multi-Index Models with Hyper-Kernel Ridge Regression

Published 2 Oct 2025 in stat.ML and cs.LG | (2510.02532v1)

Abstract: Deep neural networks excel in high-dimensional problems, outperforming models such as kernel methods, which suffer from the curse of dimensionality. However, the theoretical foundations of this success remain poorly understood. We follow the idea that the compositional structure of the learning task is the key factor determining when deep networks outperform other approaches. Taking a step towards formalizing this idea, we consider a simple compositional model, namely the multi-index model (MIM). In this context, we introduce and study hyper-kernel ridge regression (HKRR), an approach blending neural networks and kernel methods. Our main contribution is a sample complexity result demonstrating that HKRR can adaptively learn MIM, overcoming the curse of dimensionality. Further, we exploit the kernel nature of the estimator to develop ad hoc optimization approaches. Indeed, we contrast alternating minimization and alternating gradient methods both theoretically and numerically. These numerical results complement and reinforce our theoretical findings.


Summary

The paper introduces Hyper-Kernel Ridge Regression (HKRR), which integrates classical kernel methods with neural network components to learn Multi-Index Models.
It provides a theoretical analysis of sample complexity, showing that the required sample size depends exponentially on the transformation dimension but only polynomially on the input dimension.
Experimental results show that HKRR, optimized via VarPro and AGD, outperforms traditional kernel methods by mitigating the curse of dimensionality.

Introduction

The paper "Learning Multi-Index Models with Hyper-Kernel Ridge Regression" (2510.02532) introduces a method that combines classical kernel methods with ideas from neural networks in order to learn Multi-Index Models (MIMs). This approach, termed Hyper-Kernel Ridge Regression (HKRR), exploits the compositional structure of a target in which a linear transformation of the input is followed by a nonlinear function. Learning that structure adaptively lets the method circumvent the curse of dimensionality typically associated with high-dimensional data analysis.
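To make the compositional structure concrete, the following is a minimal sketch of synthetic data drawn from a toy multi-index model, where the response depends on the input only through a low-dimensional linear projection. The function name `sample_mim`, the link function, the noise level, and the dimensions are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def sample_mim(n_samples, D, d, noise=0.1, seed=0):
    """Draw samples from a toy multi-index model y = g(B x) + noise.

    D is the input dimension and d the (much smaller) transformation
    dimension. All choices below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical index matrix B (d x D) with orthonormal rows.
    Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
    B = Q.T
    X = rng.standard_normal((n_samples, D))
    Z = X @ B.T  # d-dimensional projections B x
    # Hypothetical smooth link function g: R^d -> R.
    g = np.sin(Z[:, 0]) + 0.5 * Z[:, 1:].sum(axis=1) ** 2
    y = g + noise * rng.standard_normal(n_samples)
    return X, y, B

X, y, B = sample_mim(n_samples=200, D=50, d=2)
```

Here the inputs live in 50 dimensions, yet the response varies only along the 2 directions spanned by the rows of `B`. That low-dimensional dependence is the structure an MIM learner is meant to exploit.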

Theoretical Framework

Sample Complexity Analysis:

HKRR is designed to perform well in high-dimensional spaces by learning MIMs whose structure involves a linear data transformation followed by a smooth nonlinearity. The paper provides theoretical insights into the sample complexity of HKRR, showing that it adapts to the transformation dimension rather than the input dimension. Specifically, the required sample size grows exponentially in the ratio of the transformation dimension to the smoothness, but only polynomially in the input dimension, which mitigates the curse of dimensionality.

Excess Risk Bound:

Under the assumption that the target function lies in a specified smooth subspace, the paper derives an excess risk bound. The bound shows that HKRR attains an error rate proportional to $m^{-\theta \zeta}$, where $\zeta$ depends on the ratio between smoothness and dimensionality and $\theta$ is a smoothness parameter.

Figure 1: R2 score on test sets for B and alpha learned by VarPro and AGD, illustrating performance differences across various model sizes.

Optimization Algorithms

Variable Projection (VarPro) vs. Alternating Gradient Descent (AGD):

The optimization problem in HKRR is inherently non-convex due to the adjustable linear transformation matrix. Two algorithms are proposed to solve it: VarPro and AGD.
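To see where the non-convexity comes from, it may help to write down one plausible finite-dimensional form of the objective. This is a sketch under assumptions: it uses a representer-style expansion with coefficients $\alpha$ and a base kernel $K$ evaluated on transformed inputs. The symbols $n$ (number of samples), $D$ (input dimension), $\lambda$ (regularization parameter), and the exact normalization are illustrative notation and may differ from the paper's formulation.

```latex
\min_{B \in \mathbb{R}^{d \times D},\ \alpha \in \mathbb{R}^{n}}
\ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} \alpha_j \, K(B x_i, B x_j) - y_i \Big)^{2}
\;+\; \lambda \sum_{i,j=1}^{n} \alpha_i \alpha_j \, K(B x_i, B x_j)
```

In this form, fixing $B$ leaves an ordinary kernel ridge regression problem that is convex in $\alpha$ with a closed-form solution. The non-convexity enters only through $B$, which appears inside the kernel.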

VarPro uses closed-form updates for some steps and gradient descent for others, exploiting the problem structure, but it may converge to local minima. Thanks to the closed-form solutions, it can be more computationally efficient for small sample sizes.
AGD applies gradient updates alternately to each block of variables and is more robust against local minima, exploring the optimization landscape more effectively.
Figure 2: Convergence map indicating stability differences between VarPro and AGD methods, depicting global and local convergence zones.
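The difference between the two schemes can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: it assumes a Gaussian base kernel on the projected inputs, the objective `(1/n)||K a - y||^2 + lam * a^T K a`, and plain fixed-step gradient descent. All function names, the bandwidth `sigma`, and the step sizes are hypothetical.

```python
import numpy as np

def kernel(B, X, sigma=1.0):
    """Gaussian kernel on projected inputs: exp(-||B(x_i - x_j)||^2 / (2 sigma^2))."""
    Z = X @ B.T
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def objective(B, alpha, X, y, lam, sigma=1.0):
    """Assumed objective: mean squared residual plus lam * alpha^T K alpha."""
    K = kernel(B, X, sigma)
    r = K @ alpha - y
    return r @ r / len(y) + lam * alpha @ K @ alpha

def grad_alpha(B, alpha, X, y, lam, sigma=1.0):
    """Gradient of the objective with respect to the coefficients alpha."""
    K = kernel(B, X, sigma)
    return 2.0 * K @ ((K @ alpha - y) / len(y) + lam * alpha)

def grad_B(B, alpha, X, y, lam, sigma=1.0):
    """Gradient of the objective with respect to B (chain rule through K)."""
    K = kernel(B, X, sigma)
    r = K @ alpha - y
    # G is the derivative of the objective with respect to the entries of K.
    G = (2.0 / len(y)) * np.outer(r, alpha) + lam * np.outer(alpha, alpha)
    W = G * K
    S = W + W.T
    L = np.diag(S.sum(axis=1)) - S
    return -(B @ X.T @ L @ X) / sigma**2

def solve_alpha(B, X, y, lam, sigma=1.0):
    """Closed-form kernel ridge solution for fixed B."""
    n = len(y)
    return np.linalg.solve(kernel(B, X, sigma) + n * lam * np.eye(n), y)

def varpro(B, X, y, lam, step=0.1, iters=50):
    """Variable projection: eliminate alpha in closed form, descend on B only."""
    for _ in range(iters):
        alpha = solve_alpha(B, X, y, lam)
        # At the optimal alpha, the partial gradient in B is the gradient
        # of the reduced objective (alpha's own gradient vanishes there).
        B = B - step * grad_B(B, alpha, X, y, lam)
    return B, solve_alpha(B, X, y, lam)

def agd(B, alpha, X, y, lam, step_B=0.1, step_alpha=1e-3, iters=50):
    """Alternating gradient descent: one gradient step per block, in turn."""
    for _ in range(iters):
        alpha = alpha - step_alpha * grad_alpha(B, alpha, X, y, lam)
        B = B - step_B * grad_B(B, alpha, X, y, lam)
    return B, alpha
```

The step sizes above are placeholders and would need tuning in practice. In particular, the curvature of this objective in `alpha` scales with the largest eigenvalue of the kernel matrix, which can be as large as the number of samples, so `step_alpha` generally has to shrink as the sample size grows. VarPro avoids that tuning by solving for `alpha` exactly, at the cost of one linear solve per iteration.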

Numerical Experiments

In the numerical section, HKRR surpasses traditional kernel methods by exploiting compositional structure and reducing the impact of dimensionality. The results highlight:

Robustness to the choice of transformation dimension $d$, with performance benefits persisting even when the dimensionality is overestimated.
Importance of initialization and parameter tuning in ensuring convergence to effective solutions, with AGD showing more consistent results across varied scenarios.
Figure 3: Convergence map of VarPro and AGD on a benchmark function, illustrating how the two methods differ in convergence behavior.

Conclusion

HKRR provides a compelling framework that not only blends kernels and neural networks but also adapts these techniques to data structures that are compositional in nature. By doing so, it highlights a pathway to mitigating dimensionality-related challenges in machine learning, offering both theoretical understanding and practical tools to enhance model learning efficiency. Future work could focus on extending these principles to broader compositional structures beyond MIMs and refining algorithmic performance even further.
