- The paper establishes a framework linking shallow polynomial networks with symmetric tensors to rigorously analyze their optimization landscapes.
- It identifies three distinct functional regimes for the network width and shows that bad local minima can exist even in regimes where spurious valleys are known to be absent.
- By examining teacher-student models under various norms, the study provides explicit gradient expressions and critical point characterizations to guide optimization.
This paper, "Geometry and Optimization of Shallow Polynomial Networks" (arXiv:2501.06074), investigates shallow neural networks using polynomial activation functions from the perspective of algebraic geometry and tensor theory. The core idea is that the function space of such networks can be identified with sets of symmetric tensors with bounded rank. This connection provides a powerful framework to analyze the optimization landscape, particularly in teacher-student settings, by relating the problem to low-rank tensor approximation under various norms induced by data distributions.
The paper begins by formalizing the connection between shallow polynomial networks and symmetric tensors. A network with degree-$d$ polynomial activations and width $r$, $f_W(x) = \sum_{i=1}^{r} \alpha_i (w_i \cdot x)^d$, corresponds directly to a symmetric tensor $T = \sum_{i=1}^{r} \alpha_i w_i^{\otimes d}$ of rank at most $r$. The functional space $\mathcal{F}_r$ is precisely the set of symmetric tensors of rank at most $r$.
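This correspondence is easy to check numerically. A minimal sketch (our own illustration, not the paper's code), with $d=3$ so the tensor is an order-3 array:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 3, 3, 2                       # input dim, activation degree, width

alpha = rng.standard_normal(r)          # output weights alpha_1, ..., alpha_r
W = rng.standard_normal((r, n))         # hidden weights w_1, ..., w_r (rows)

# Symmetric tensor T = sum_i alpha_i * w_i^{(x) d}  (here d = 3)
T = sum(a * np.einsum('i,j,k->ijk', w, w, w) for a, w in zip(alpha, W))

x = rng.standard_normal(n)
f_net = float(np.sum(alpha * (W @ x) ** d))             # network output
f_tensor = float(np.einsum('ijk,i,j,k->', T, x, x, x))  # <T, x^{(x) d}>

print(f_net, f_tensor)  # the two values agree
```

The network and the tensor contraction compute the same polynomial, which is exactly the identification $\mathcal{F}_r \cong \{\text{symmetric tensors of rank} \le r\}$.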
A key observation arising from this tensor perspective is the existence of three distinct regimes for the network width r:
- Low-dimensional: For small $r$, the set $\mathcal{F}_r$ is a low-dimensional subset of the space of symmetric tensors.
- Thick: For intermediate $r$, $\mathcal{F}_r$ is full-dimensional but not the entire space.
- Filling: For large $r$, $\mathcal{F}_r$ covers the entire space of symmetric tensors.
The transitions between these regimes are sharp and are determined by thresholds $r_{\mathrm{thick}}(d,n)$ and $r_{\mathrm{fill}}(d,n)$, bounds known from tensor theory (related to the Alexander–Hirschowitz theorem). These different functional space properties imply different optimization behaviors.
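The filling threshold can be sketched by a naive parameter count (our own illustration, not the paper's notation): each rank-1 term $w^{\otimes d}$ contributes $n$ parameters, and by Alexander–Hirschowitz this count gives the true generic rank except for a short list of exceptional cases (e.g. $d=2$):

```python
from math import comb, ceil

def sym_dim(n, d):
    """Dimension of the space of symmetric order-d tensors on R^n."""
    return comb(n + d - 1, d)

def expected_generic_rank(n, d):
    """Naive count: rank-1 terms have n parameters each.  By the
    Alexander-Hirschowitz theorem this equals the true generic (filling)
    rank except for a known short list of exceptional (d, n) pairs."""
    return ceil(sym_dim(n, d) / n)

# Plane cubics (n = 3, d = 3): ambient dimension 10, generic rank 4.
print(sym_dim(3, 3), expected_generic_rank(3, 3))
```

For $n=3$, $d=3$ this prints 10 and 4, a non-exceptional case where the naive count is exact.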
The paper also analyzes the parameterization map $\tau_r$ from parameter space $(\alpha, w_1, \ldots, w_r)$ to the functional space. The critical locus of this map in parameter space and its image, the branch locus in function space, are important because, for smooth convex loss functions on the functional space, non-global local minima in parameter space must lie in the critical parameter set. For quadratic networks ($d=2$), the branch locus is shown to be the set of symmetric matrices with rank less than $n$. For 2D input ($n=2$), the branch locus corresponds to tensors of rank at most $\lfloor d/2 \rfloor$. This provides a structural property of parameters that can lead to non-global minima.
The paper then discusses the nature of optimization landscapes, defining concepts such as "no bad local minima" and "no spurious valleys". It refines previous results on when spurious valleys are absent, showing that for quadratic networks ($d=2$, $r \geq n$) there are no spurious valleys, and that for 2D input ($n=2$, $r \geq \lceil (d+1)/2 \rceil$) all points are "escapable" (a weaker property). These bounds improve upon general results for the filling regime.
However, the paper presents a significant counterexample: for polynomial networks of even degree $d$, bad local minima can exist for arbitrarily large width $r$ (even in the filling regime) when optimizing a quadratic loss towards a positive polynomial target function. These minima correspond to parameters where all $\alpha_i$ are negative and all $w_i$ are zero. Importantly, these bad minima can have basins of attraction with positive Lebesgue measure, implying that gradient-based methods initialized in certain regions will not reach the global minimum. This highlights that the absence of spurious valleys is distinct from the absence of bad local minima, and that a landscape favorable in one sense need not be favorable in the other.
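The mechanism can be illustrated with a minimal one-dimensional reduction (our own toy instance, not the paper's exact construction): with $d=2$, a negative $\alpha$ and zero weight form a critical point whose entire neighborhood has loss at least as large, even though the global minimum value is 0.

```python
import numpy as np

# Toy reduction: degree d = 2, width r = 1, data x ~ N(0,1), positive
# target t(x) = x^2.  The student f(x) = alpha * (w*x)^2 gives population
# loss E[(alpha*w^2*x^2 - x^2)^2] = 3*(alpha*w^2 - 1)^2, using E[x^4] = 3.

def loss(alpha, w):
    return 3.0 * (alpha * w**2 - 1.0) ** 2

a0, w0 = -1.0, 0.0       # negative alpha, zero weight
base = loss(a0, w0)      # loss value 3, while the global minimum value is 0

# Every nearby point is at least as bad: alpha stays negative under small
# perturbations, so alpha*w^2 - 1 <= -1 and the loss stays >= 3.
rng = np.random.default_rng(0)
perturbed = [loss(a0 + 1e-3 * da, w0 + 1e-3 * dw)
             for da, dw in rng.standard_normal((1000, 2))]
print(base, min(perturbed) >= base)   # 3.0 True
```

The point $(\alpha, w) = (-1, 0)$ is a bad local minimum with loss 3, while $(\alpha, w) = (1, 1)$ attains loss 0; any initialization with $\alpha < 0$ and small $w$ sits in its basin.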
Teacher-student problems are formulated as minimizing the distance between the student tensor $\tau_r(W)$ and a fixed teacher tensor $\tau_s(V)$ in function space, where the distance is measured by a norm. Two types of norms are considered: the Frobenius norm and norms induced by data distributions $D$. The latter are shown to be quadratic forms defined by the $2d$-th moments of $D$. Explicit formulas for these moments are provided for different distributions (rotationally invariant/Gaussian, i.i.d., colored Gaussian, mixtures). A key finding is that, in general, distribution-induced norms are distinct from the Frobenius norm.
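For $d=2$ and standard Gaussian data, the induced squared norm of a symmetric matrix $M$ has the well-known closed form $(\operatorname{tr} M)^2 + 2\|M\|_F^2$, already a different quadratic form from the plain Frobenius norm. A Monte Carlo sanity check (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
M = (A + A.T) / 2                     # symmetric "tensor" for d = 2

# Distribution-induced squared norm ||M||_D^2 = E_{x~D}[(x^T M x)^2],
# a quadratic form in the entries of M fixed by the 4th moments of D.
X = rng.standard_normal((200_000, n))           # D = standard Gaussian
mc = np.mean(np.einsum('bi,ij,bj->b', X, M, X) ** 2)

# Standard closed form for centered standard Gaussian data, symmetric M:
closed = np.trace(M) ** 2 + 2 * np.sum(M * M)
frob2 = np.sum(M * M)                           # squared Frobenius norm

print(f"MC {mc:.3f}  closed form {closed:.3f}  Frobenius^2 {frob2:.3f}")
```

The Monte Carlo estimate matches the closed form, and both differ from $\|M\|_F^2$, illustrating that even Gaussian data induces a non-Frobenius geometry on function space.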
To understand how the optimization landscape changes with the teacher model and the data distribution, the paper introduces the teacher-metric and teacher-data discriminants. Inspired by the focal locus/ED discriminant from algebraic geometry, these are algebraic varieties in the space of teacher tensors and metrics (or teacher tensors and data moments) where the number or nature of critical points of the distance function changes qualitatively. Crossing a discriminant can change the number of local minima.
A detailed case study is presented for networks with quadratic activations (d=2). For quadratic networks, the functional space is the space of symmetric matrices. The problem of finding the best rank-r approximation of a symmetric matrix T is considered for different norms.
- Frobenius Norm: A geometric proof using the focal locus is provided for the classical Eckart-Young theorem, characterizing the $\binom{n}{r}$ critical points (obtained by singular value thresholding in the teacher's eigenbasis) and their indices.
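The Eckart-Young critical-point structure can be enumerated directly: each choice of $r$ of the teacher's $n$ eigenvalue/eigenvector pairs gives a critical point, and the global minimum keeps the $r$ largest in absolute value. A sketch (our own code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, r = 4, 2
A = rng.standard_normal((n, n))
T = (A + A.T) / 2                                # symmetric teacher

evals, V = np.linalg.eigh(T)

# One critical point per choice of r eigenvalue/eigenvector pairs to keep:
dists = {}
for S in combinations(range(n), r):
    X = sum(evals[i] * np.outer(V[:, i], V[:, i]) for i in S)
    dists[S] = np.linalg.norm(X - T)             # Frobenius distance

best = min(dists, key=dists.get)
top_abs = tuple(sorted(np.argsort(-np.abs(evals))[:r]))
print(len(dists), best, top_abs)                 # binom(4,2) = 6 critical points
```

The subset minimizing the distance coincides with the top-$|\lambda|$ choice, recovering the Eckart-Young optimum; the remaining subsets are the saddle points whose indices the paper characterizes.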
- Gaussian Norm: For norms induced by centered Gaussian data, the critical points are also shown to be $\binom{n}{r}$ in number and aligned with the teacher's eigenbasis, but the eigenvalues of the critical points are modified compared to the Frobenius case. The indices are also characterized.
- General i.i.d. Norms: In stark contrast, for non-Gaussian i.i.d. distributions, the number of critical points for rank-1 approximation can be exponentially large ($(3^n - 1)/2$ in some cases), demonstrating that findings based solely on Gaussian data distributions might not generalize and that the optimization landscape can be significantly more complex.
For practical implementation, the paper suggests that understanding the critical locus and branch locus can help identify parameter regions likely to contain non-global minima. The explicit gradient expressions derived for quadratic networks are directly applicable for implementing gradient-based optimization methods. The analysis of critical points for quadratic networks under Frobenius and Gaussian norms provides explicit target tensors and the qualitative nature of their corresponding critical points on the low-rank manifold, which could inform initialization or algorithm design for low-rank approximation tasks under these specific norms. The existence of discriminants suggests that the training data distribution is not just a matter of statistical generalization but also fundamentally impacts the computational difficulty of the optimization problem itself. The exponential number of critical points for non-Gaussian data distributions highlights a practical challenge for optimization methods that rely on landscapes resembling the Gaussian case.
Overall, the paper leverages algebraic and differential geometry to provide rigorous insights into the structure of the function space and optimization landscape of shallow polynomial networks, highlighting the critical role of network width and data distribution in determining the complexity of the optimization problem, with quadratic networks as the most fully worked-out case.