- The paper establishes a framework linking shallow polynomial networks with symmetric tensors to rigorously analyze their optimization landscapes.
- It identifies three distinct functional regimes for the network width and shows that bad local minima can exist even in regimes where spurious valleys are known to be absent.
- By examining teacher-student models under various norms, the study provides explicit gradient expressions and critical point characterizations to guide optimization.
This paper, "Geometry and Optimization of Shallow Polynomial Networks" (arXiv:2501.06074), investigates shallow neural networks using polynomial activation functions from the perspective of algebraic geometry and tensor theory. The core idea is that the function space of such networks can be identified with sets of symmetric tensors with bounded rank. This connection provides a powerful framework to analyze the optimization landscape, particularly in teacher-student settings, by relating the problem to low-rank tensor approximation under various norms induced by data distributions.
The paper begins by formalizing the connection between shallow polynomial networks and symmetric tensors. A network with degree-$d$ polynomial activations and width $r$, $f_W(x) = \sum_{i=1}^{r} \alpha_i (w_i \cdot x)^d$, corresponds directly to a symmetric tensor $T = \sum_{i=1}^{r} \alpha_i w_i^{\otimes d}$ of rank at most $r$. The functional space $\mathcal{F}_r$ is precisely the set of symmetric tensors of rank at most $r$.
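This correspondence is easy to check numerically. A minimal sketch (our own illustration, not the paper's code), with $d=3$ so the tensor is an order-3 array:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 3, 3, 2                       # input dim, activation degree, width

alpha = rng.standard_normal(r)          # output weights alpha_1, ..., alpha_r
W = rng.standard_normal((r, n))         # hidden weights w_1, ..., w_r (rows)

# Symmetric tensor T = sum_i alpha_i * w_i^{(x) d}  (here d = 3)
T = sum(a * np.einsum('i,j,k->ijk', w, w, w) for a, w in zip(alpha, W))

x = rng.standard_normal(n)
f_net = float(np.sum(alpha * (W @ x) ** d))             # network output
f_tensor = float(np.einsum('ijk,i,j,k->', T, x, x, x))  # <T, x^{(x) d}>

print(f_net, f_tensor)  # the two values agree
```

The network and the tensor contraction compute the same polynomial, which is exactly the identification $\mathcal{F}_r \cong \{\text{symmetric tensors of rank} \le r\}$.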
A key observation arising from this tensor perspective is the existence of three distinct regimes for the network width r:
- Low-dimensional: For small $r$, the set $\mathcal{F}_r$ is a low-dimensional subset of the space of symmetric tensors.
- Thick: For intermediate $r$, $\mathcal{F}_r$ is full-dimensional but not the entire space.
- Filling: For large $r$, $\mathcal{F}_r$ covers the entire space of symmetric tensors.
The transitions between these regimes are sharp and are determined by thresholds $r_{\mathrm{thick}}(d,n)$ and $r_{\mathrm{fill}}(d,n)$, bounds known from tensor theory (related to the Alexander–Hirschowitz theorem). These different functional space properties imply different optimization behaviors.
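The filling threshold can be sketched by a naive parameter count (our own illustration, not the paper's notation): each rank-1 term $w^{\otimes d}$ contributes $n$ parameters, and by Alexander–Hirschowitz this count gives the true generic rank except for a short list of exceptional cases (e.g. $d=2$):

```python
from math import comb, ceil

def sym_dim(n, d):
    """Dimension of the space of symmetric order-d tensors on R^n."""
    return comb(n + d - 1, d)

def expected_generic_rank(n, d):
    """Naive count: rank-1 terms have n parameters each.  By the
    Alexander-Hirschowitz theorem this equals the true generic (filling)
    rank except for a known short list of exceptional (d, n) pairs."""
    return ceil(sym_dim(n, d) / n)

# Plane cubics (n = 3, d = 3): ambient dimension 10, generic rank 4.
print(sym_dim(3, 3), expected_generic_rank(3, 3))
```

For $n=3$, $d=3$ this prints 10 and 4, a non-exceptional case where the naive count is exact.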
The paper also analyzes the parameterization map $\tau_r$ from parameter space $(\alpha, w_1, \ldots, w_r)$ to the functional space. The critical locus of this map in parameter space and its image, the branch locus in function space, are important because, for smooth convex loss functions on the functional space, non-global local minima in parameter space must lie in the critical parameter set. For quadratic networks ($d=2$), the branch locus is shown to be the set of symmetric matrices with rank less than $n$. For 2D input ($n=2$), the branch locus corresponds to tensors of rank at most $\lfloor d/2 \rfloor$. This provides a structural property of parameters that can lead to non-global minima.
The paper then discusses the nature of optimization landscapes, defining concepts such as "no bad local minima" and "no spurious valleys". It refines previous results on when spurious valleys are absent, showing that for quadratic networks ($d=2$, $r \geq n$) there are no spurious valleys, and that for 2D input ($n=2$, $r \geq \lceil (d+1)/2 \rceil$) all points are "escapable" (a weaker property). These bounds improve upon general results for the filling regime.
However, the paper presents a significant counterexample: for polynomial networks of even degree $d$, bad local minima can exist for arbitrarily large width $r$ (even in the filling regime) when optimizing a quadratic loss towards a positive polynomial target function. These minima correspond to parameters where all $\alpha_i$ are negative and all $w_i$ are zero. Importantly, these bad minima can have basins of attraction with positive Lebesgue measure, implying that gradient-based methods initialized in certain regions will not reach the global minimum. This highlights that the absence of spurious valleys is distinct from the absence of bad local minima, and that a landscape favorable in one sense need not be favorable in the other.
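The mechanism can be illustrated with a minimal one-dimensional reduction (our own toy instance, not the paper's exact construction): with $d=2$, a negative $\alpha$ and zero weight form a critical point whose entire neighborhood has loss at least as large, even though the global minimum value is 0.

```python
import numpy as np

# Toy reduction: degree d = 2, width r = 1, data x ~ N(0,1), positive
# target t(x) = x^2.  The student f(x) = alpha * (w*x)^2 gives population
# loss E[(alpha*w^2*x^2 - x^2)^2] = 3*(alpha*w^2 - 1)^2, using E[x^4] = 3.

def loss(alpha, w):
    return 3.0 * (alpha * w**2 - 1.0) ** 2

a0, w0 = -1.0, 0.0       # negative alpha, zero weight
base = loss(a0, w0)      # loss value 3, while the global minimum value is 0

# Every nearby point is at least as bad: alpha stays negative under small
# perturbations, so alpha*w^2 - 1 <= -1 and the loss stays >= 3.
rng = np.random.default_rng(0)
perturbed = [loss(a0 + 1e-3 * da, w0 + 1e-3 * dw)
             for da, dw in rng.standard_normal((1000, 2))]
print(base, min(perturbed) >= base)   # 3.0 True
```

The point $(\alpha, w) = (-1, 0)$ is a bad local minimum with loss 3, while $(\alpha, w) = (1, 1)$ attains loss 0; any initialization with $\alpha < 0$ and small $w$ sits in its basin.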
Teacher-student problems are formulated as minimizing the distance between the student tensor $\tau_r(W)$ and a fixed teacher tensor $\tau_s(V)$ in function space, where the distance is measured by a norm. Two types of norms are considered: the Frobenius norm and norms induced by data distributions $D$. The latter are shown to be quadratic forms defined by the $2d$-th moments of $D$. Explicit formulas for these moments are provided for different distributions (rotationally invariant/Gaussian, i.i.d., colored Gaussian, mixtures). A key finding is that, in general, distribution-induced norms are distinct from the Frobenius norm.
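For $d=2$ and standard Gaussian data, the induced squared norm of a symmetric matrix $M$ has the well-known closed form $(\operatorname{tr} M)^2 + 2\|M\|_F^2$, already a different quadratic form from the plain Frobenius norm. A Monte Carlo sanity check (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
M = (A + A.T) / 2                     # symmetric "tensor" for d = 2

# Distribution-induced squared norm ||M||_D^2 = E_{x~D}[(x^T M x)^2],
# a quadratic form in the entries of M fixed by the 4th moments of D.
X = rng.standard_normal((200_000, n))           # D = standard Gaussian
mc = np.mean(np.einsum('bi,ij,bj->b', X, M, X) ** 2)

# Standard closed form for centered standard Gaussian data, symmetric M:
closed = np.trace(M) ** 2 + 2 * np.sum(M * M)
frob2 = np.sum(M * M)                           # squared Frobenius norm

print(f"MC {mc:.3f}  closed form {closed:.3f}  Frobenius^2 {frob2:.3f}")
```

The Monte Carlo estimate matches the closed form, and both differ from $\|M\|_F^2$, illustrating that even Gaussian data induces a non-Frobenius geometry on function space.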
To understand how the optimization landscape changes with the teacher model and the data distribution, the paper introduces the teacher-metric and teacher-data discriminants. Inspired by the focal locus/ED discriminant from algebraic geometry, these are algebraic varieties in the space of teacher tensors and metrics (or teacher tensors and data moments) where the number or nature of critical points of the distance function changes qualitatively. Crossing a discriminant can change the number of local minima.
A detailed case study is presented for networks with quadratic activations (d=2). For quadratic networks, the functional space is the space of symmetric matrices. The problem of finding the best rank-r approximation of a symmetric matrix T is considered for different norms.
- Frobenius Norm: A geometric proof using the focal locus is provided for the classical Eckart-Young theorem, characterizing the $\binom{n}{r}$ critical points (obtained by singular value thresholding in the teacher's eigenbasis) and their indices.
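The Eckart-Young critical-point structure can be enumerated directly: each choice of $r$ of the teacher's $n$ eigenvalue/eigenvector pairs gives a critical point, and the global minimum keeps the $r$ largest in absolute value. A sketch (our own code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, r = 4, 2
A = rng.standard_normal((n, n))
T = (A + A.T) / 2                                # symmetric teacher

evals, V = np.linalg.eigh(T)

# One critical point per choice of r eigenvalue/eigenvector pairs to keep:
dists = {}
for S in combinations(range(n), r):
    X = sum(evals[i] * np.outer(V[:, i], V[:, i]) for i in S)
    dists[S] = np.linalg.norm(X - T)             # Frobenius distance

best = min(dists, key=dists.get)
top_abs = tuple(sorted(np.argsort(-np.abs(evals))[:r]))
print(len(dists), best, top_abs)                 # binom(4,2) = 6 critical points
```

The subset minimizing the distance coincides with the top-$|\lambda|$ choice, recovering the Eckart-Young optimum; the remaining subsets are the saddle points whose indices the paper characterizes.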
- Gaussian Norm: For norms induced by centered Gaussian data, the critical points are also shown to be $\binom{n}{r}$ in number and aligned with the teacher's eigenbasis, but the eigenvalues of the critical points are modified compared to the Frobenius case. The indices are also characterized.
- General i.i.d. Norms: In stark contrast, for non-Gaussian i.i.d. distributions, the number of critical points for rank-1 approximation can be exponentially large ($(3^n - 1)/2$ in some cases), demonstrating that findings based solely on Gaussian data distributions might not generalize and that the optimization landscape can be significantly more complex.
For practical implementation, the paper suggests that understanding the critical locus and branch locus can help identify parameter regions likely to contain non-global minima. The explicit gradient expressions derived for quadratic networks are directly applicable for implementing gradient-based optimization methods. The analysis of critical points for quadratic networks under Frobenius and Gaussian norms provides explicit target tensors and the qualitative nature of their corresponding critical points on the low-rank manifold, which could inform initialization or algorithm design for low-rank approximation tasks under these specific norms. The existence of discriminants suggests that the training data distribution is not just a matter of statistical generalization but also fundamentally impacts the computational difficulty of the optimization problem itself. The exponential number of critical points for non-Gaussian data distributions highlights a practical challenge for optimization methods that rely on landscapes resembling the Gaussian case.
Overall, the paper leverages algebraic and differential geometry to provide rigorous insights into the structure of the function space and optimization landscape of shallow polynomial networks, highlighting the critical role of network width and data distribution in determining the complexity of the optimization problem, with quadratic networks as the most fully worked-out case.