
Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Learnable Channel Attention

Published 23 Dec 2025 in stat.ML, cs.LG, and math.OC | (2512.20562v1)

Abstract: We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\mathbb{R}^d$ by training an over-parameterized two-layer neural network (NN) with channel attention. Our main result is a significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\varepsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width $m \ge \Theta(n^4 \log(2n/\delta)/d^{2\ell_0})$ trained by vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp \Theta(d^{\ell_0}/\varepsilon)$ with probability $1-\delta$ for every $\delta \in (0,1)$, in contrast with the representative sample complexity $\Theta(d^{\ell_0} \max\{\varepsilon^{-2}, \log d\})$, where $n$ is the training data size. Moreover, such sample complexity is not improvable, since the trained network attains a sharp nonparametric regression risk of order $\Theta(d^{\ell_0}/n)$ with probability at least $1-\delta$. On the other hand, the minimax-optimal rate for the regression risk with a kernel of rank $\Theta(d^{\ell_0})$ is $\Theta(d^{\ell_0}/n)$, so the nonparametric regression risk of the network trained by GD is minimax optimal. The training of the two-layer NN with channel attention consists of two stages. In Stage 1, a provably learnable channel selection algorithm identifies the ground-truth channel number $\ell_0$ from the initial $L \ge \ell_0$ channels in the first-layer activation, with high probability. This learnable selection is achieved by an efficient one-step GD update on both layers, enabling feature learning for low-degree polynomial targets. In Stage 2, the second layer is trained by standard GD using the activation function with the selected channels.

Summary

  • The paper establishes that a two-layer, overparameterized shallow neural network with trainable channel attention attains minimax-optimal regression rates for low-degree spherical polynomials.
  • It introduces a two-stage training protocol where one-step gradient descent and thresholding precisely select informative spherical harmonic channels.
  • The approach bridges kernel regression theory and feature learning, enabling finite-width networks to outperform traditional NTK methods with improved sample complexity.

Minimax-Optimal Learning of Low-Degree Spherical Polynomials by Shallow Neural Networks with Learnable Channel Attention

Overview

The paper "Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Learnable Channel Attention" (2512.20562) advances the theoretical understanding of nonparametric regression on spheres by overparameterized shallow neural networks with explicit channel attention. It rigorously demonstrates that a two-layer network with a provable, learnable attention mechanism can efficiently perform feature selection corresponding to the harmonic degree of the target polynomial, and—critically—attain minimax-optimal generalization rates for arbitrary low-degree spherical polynomials.

Problem Formulation

The central task is regression on the unit sphere $\mathbb{S}^{d-1}$, where the ground-truth function $f^*$ is a spherical polynomial of degree $\ell_0 = \Theta(1)$. Training data $\{(x_i, y_i)\}_{i=1}^n$ comprise i.i.d. features $x_i$ (sampled uniformly from the sphere) with noisy labels $y_i = f^*(x_i) + w_i$, where $w_i$ is sub-Gaussian additive noise. The objective is to construct an estimator $\widehat{f}$ whose population mean squared error $R(\widehat{f}) = \mathbb{E}_{x}[(\widehat{f}(x) - f^*(x))^2]$ decays at the sharp minimax-optimal rate $\Theta(d^{\ell_0}/n)$.
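As a concrete illustration, this data model can be sketched as follows; the zonal (single-direction) degree-2 target and the noise level are illustrative choices for this sketch, not the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, ell0 = 10, 500, 2            # ambient dimension, sample size, target degree

# Uniform samples on the unit sphere S^{d-1}: normalized Gaussian vectors.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# An illustrative degree-2 zonal spherical polynomial target:
# f*(x) = C_2^{lam}(<x, u>) with lam = (d-2)/2 and a fixed direction u.
u = np.zeros(d)
u[0] = 1.0
lam = (d - 2) / 2.0
t = X @ u
f_star = 2.0 * lam * (lam + 1.0) * t**2 - lam    # closed form of C_2^{lam}

# Noisy labels y_i = f*(x_i) + w_i with sub-Gaussian (here Gaussian) noise.
y = f_star + 0.1 * rng.standard_normal(n)
```

Any spherical polynomial of degree $\ell_0$ would serve as $f^*$; the zonal form above is used only because it has a simple closed form.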

Network Architecture and Channel Attention Mechanisms

The method uses a two-layer neural network parameterized by a fixed set of random weights in the first layer and learnable channel attention $\tau = (\tau_0, \ldots, \tau_L)$. Each "channel" corresponds to the activation associated with degree-$\ell$ spherical harmonics, implemented via Gegenbauer polynomials $P_\ell^{(d)}(\langle x, x' \rangle)$ with a learnable attention weight per degree. Explicit requirements on the finite width $m$ are given.
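A minimal forward-pass sketch of this architecture, under the assumption that the channel activations are Gegenbauer polynomials of the first-layer inner products weighted by the attention vector (an illustrative reading, not the paper's exact parameterization):

```python
import numpy as np

def channel_attention_forward(X, W, a, tau, lam):
    """Sketch of a two-layer NN with channel attention:
    f(x) = (1/m) sum_j a_j * sum_{l=0}^{L} tau_l * C_l^{lam}(<x, w_j>),
    where the first-layer weights w_j are fixed random directions and
    tau_l is a learnable attention weight for harmonic degree l."""
    S = X @ W.T                          # inner products <x_i, w_j>, shape (n, m)
    L = len(tau) - 1
    # Gegenbauer values C_l(S) for all degrees via the three-term recurrence.
    C_prev = np.ones_like(S)             # C_0 = 1
    act = tau[0] * C_prev
    if L >= 1:
        C_curr = 2.0 * lam * S           # C_1 = 2*lam*x
        act = act + tau[1] * C_curr
        for l in range(2, L + 1):
            C_prev, C_curr = C_curr, (
                2.0 * (l + lam - 1.0) * S * C_curr
                - (l + 2.0 * lam - 2.0) * C_prev
            ) / l
            act = act + tau[l] * C_curr
    return act @ a / W.shape[0]          # second-layer combination, shape (n,)
```

Setting an attention weight $\tau_\ell$ to zero removes the degree-$\ell$ channel entirely, which is what makes thresholded attention act as channel selection.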

Channel attention is selected adaptively through a two-stage training protocol:

  1. Stage 1: One-step GD is applied to all channel attention weights, followed by a thresholding rule that provably selects exactly the $\ell_0$ informative channels (those present in $f^*$) with high probability, provided a mild minimum-signal condition on the coefficients of $f^*$.
  2. Stage 2: With these identified channels fixed, the remaining second-layer weights are trained by standard gradient descent on the squared error.
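The Stage 1 selection step can be sketched as follows, assuming the per-channel gradient has already been computed; the zero initialization, step size, and threshold are illustrative hyperparameters, not the paper's prescribed values:

```python
import numpy as np

def select_channels(tau_init, tau_grad, step, threshold):
    """One vanilla GD step on the attention weights, then hard
    thresholding: channels whose updated attention magnitude clears
    the threshold are kept as the informative ones."""
    tau_updated = tau_init - step * tau_grad
    keep = np.flatnonzero(np.abs(tau_updated) > threshold)
    return keep, tau_updated

# Toy usage: channels with large gradient signal survive thresholding.
grads = np.array([-2.0, -1.5, -0.01, 0.005])
keep, tau = select_channels(np.zeros(4), grads, step=1.0, threshold=0.5)
# keep == array([0, 1])
```

The minimum-signal condition on the coefficients of $f^*$ is what guarantees that informative channels receive gradients large enough to clear the threshold while uninformative ones do not.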

This procedure adaptively restricts the effective kernel (the network's NTK) to a low-rank space of degree-$\ell_0$ polynomials.

Main Theoretical Results

The paper establishes several strong claims under clear, verifiable conditions:

  • For any regression risk $\varepsilon > 0$, only $n = \Theta(d^{\ell_0}/\varepsilon)$ samples are needed for error at most $\varepsilon$, matching the information-theoretic minimax lower bound for the corresponding RKHS class.
  • The risk rate $\Theta(d^{\ell_0}/n)$ is unimprovable (minimax optimal) for functions in this spherical-harmonic RKHS, and is achieved at finite (not infinite) width $m$, where $m$ scales polynomially in $n$ and $d$.
  • The sample complexity is strictly better, by polynomial factors, than that of existing techniques for learning low-degree polynomials (e.g., NTK, QuadNTK), both in risk sharpness and in not requiring infinite width or pointwise bounds.
  • The two-stage algorithm provably identifies the correct number of informative channels ($\ell_0$) with high probability when the minimum-absolute-coefficient assumption holds.
  • The analysis combines precise kernel complexity bounds (empirical and population), local Rademacher complexity techniques, and new uniform convergence bounds on the empirical NTK.

Algorithmic Innovations

A central contribution is the explicit, learnable channel selection algorithm (provable "degree selection"), which identifies all and only those degrees present in the ground truth $f^*$. The design combines:

  • Efficient computation of Gegenbauer polynomials via dynamic programming,
  • A one-step gradient update and thresholding rule for attention-weight selection, with robust guarantees,
  • Restriction of the activation to the selected channels, placing post-selection training in the NTK regime.
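The dynamic-programming evaluation presumably exploits the standard three-term recurrence for Gegenbauer polynomials, which computes all degrees up to $L$ in $O(L)$ per input by reusing the two previous degrees; a minimal sketch:

```python
def gegenbauer_all(x, L, lam):
    """Evaluate Gegenbauer polynomials C_0^{lam}(x), ..., C_L^{lam}(x)
    via the three-term recurrence
        l * C_l = 2*(l + lam - 1)*x*C_{l-1} - (l + 2*lam - 2)*C_{l-2},
    reusing the two previous degrees (a dynamic program over l)."""
    vals = [1.0]                     # C_0 = 1
    if L >= 1:
        vals.append(2.0 * lam * x)   # C_1 = 2*lam*x
    for l in range(2, L + 1):
        vals.append((2.0 * (l + lam - 1.0) * x * vals[l - 1]
                     - (l + 2.0 * lam - 2.0) * vals[l - 2]) / l)
    return vals

# With lam = 1 these reduce to Chebyshev polynomials of the second kind:
# gegenbauer_all(0.5, 3, 1.0) -> [1.0, 1.0, 0.0, -1.0]
```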

This process ensures that, after selection, the effective optimization and generalization analysis reduces to kernel regression with an NTK of rank $r_0 = \Theta(d^{\ell_0})$.

Novel Analytical Techniques

Key technical advances include:

  • Sharp finite-sample local Rademacher complexity bounds adapted to the low-rank spherical polynomial regime. These yield tight risk bounds for the composed network class—functions represented as a sum of a bounded-norm kernel function and small error.
  • Uniform convergence analysis of the random feature NTK kernel in both sup-norm and operator norm, removing dependence on Hölder continuity and enabling direct high-probability control over the network class.
  • Explicit separation of the optimization trajectory into useful subspace (informative channels) and negligible error, controlling each with matching rates.

Implications and Broader Impact

Practical Implications

  • Provides a practical pipeline for regression on spheres where the ground truth is governed by a (potentially unknown) harmonic degree, and the network must adaptively discover the relevant subspace among exponentially many possibilities.
  • The algorithm is computationally efficient: it avoids a combinatorial search over all degree subspaces and exhaustive model selection, thanks to the differentiable, thresholdable attention mechanism.
  • The optimal sample complexity result can guide practitioners in settings with high-dimensional covariates and low-complexity ground truth.

Theoretical Implications

  • Demonstrates—contrary to the oft-claimed necessity of infinite width or purely kernel/NTK-based linearization arguments—that finite-width, overparameterized shallow networks with explicit feature learning and attention are sufficient for minimax-optimal nonparametric estimation in this regime.
  • Bridges the gap between kernel regression theory (minimax rates tied to kernel rank) and adaptive neural network learning, making explicit the statistical utility of learnable attention components as feature selectors.
  • Provides an analytical template for generalization and risk analysis in shallow NNs with structured attention, extensible to more general harmonic analysis or other orthogonal bases.

Potential for Future Work

  • The approach suggests extension to higher degrees or more general manifold domains (notably zonal harmonics or other group-invariant polynomials).
  • Opens the way for deeper investigation into learnable attention as an explicit mechanism for adaptive model selection and dimension reduction within overparameterized NNs.
  • Provides concrete support for leveraging trainable attention in structured data domains beyond the sphere, potentially in graph-structured or manifold-based learning scenarios.
  • It would be meaningful to investigate the empirical robustness of such two-stage attention-based procedures under various noise distributions or sampling schemes.

Conclusion

This paper delivers a rigorous and technically nuanced analysis establishing that overparameterized shallow neural networks with adaptive channel attention—coupled with an explicit trainable selection algorithm—can realize the minimax lower bound for learning low-degree spherical polynomials. The architecture achieves precise, optimal sample complexity, surpasses the linear NTK regime, and, most notably, does so with finite width and a practical training algorithm. These insights tightly connect the expressivity and adaptivity of neural networks with classical risk minimization theory, highlighting the substantial theoretical benefits of explicit, learnable attention mechanisms in the context of regression on manifolds (2512.20562).
