VarPro & Over-Parameterization
- The paper introduces a method that partitions parameters into linear and nonlinear blocks to analytically eliminate one block, improving convergence and conditioning.
- It combines closed-form or fast inner solves that eliminate one block with Gauss-Newton-type outer iterations, reducing the effective dimension of the optimization problem and enhancing efficiency across model classes.
- Applications include deep neural networks, structured low-rank approximations, time series analysis, polynomial GCD problems, and non-smooth regularized regressions.
A variable projection scheme based on over-parameterization refers to an optimization methodology that exploits problem structure by partitioning parameters into blocks (typically linear and nonlinear, or coupled and decoupled) and then marginalizing (analytically eliminating or solving for) one block given the other, usually enabled by smoothness, convexity, and/or low dimensionality of the "inner" block. Over-parameterization refers to the use of redundant or expanded parameterizations that provide flexibility, smoothness, or improved conditioning for the overall optimization process. This approach has seen rigorous development in DNN training, structured low-rank approximation, time series analysis, polynomial common-divisor problems, tensor algebra, and non-smooth regularized regression.
1. Key Concepts and Structural Formulation
Variable projection (VarPro) originated for separable nonlinear least squares, in which an objective $\Phi(W, V)$
is split such that $W$ parameterizes a nonlinear block (e.g., a DNN feature extractor) and $V$ a linear block (e.g., the last layer's weights) (Newman et al., 2020). Over-parameterization arises by treating $W$ and $V$ as independent and then explicitly minimizing over $V$ for fixed $W$:
$$V^*(W) = \arg\min_V \Phi(W, V),$$
yielding a reduced objective
$$\widetilde{\Phi}(W) = \Phi(W, V^*(W)).$$
Optimization then proceeds in $W$ only, with $V$ kept in analytic equilibrium at $V^*(W)$. This structure generalizes across domains (e.g., polynomial quotient/divisor form (Usevich et al., 2013), tensor transformation/representation (Newman et al., 2024), low-rank signal subspaces (Zvonarev et al., 2021), structured matrix approximation (Usevich et al., 2012), non-smooth convex problems (Poon et al., 2022)).
The over-parameterized aspect typically refers to expanding the parameter space so that analytic or semi-analytic elimination is feasible and the remaining variables enjoy better curvature or conditioning (e.g., eliminating vast numbers of linear coefficients while only iterating nonlinear weights).
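As a concrete illustration, the following sketch applies VarPro to a classic separable problem: fitting a sum of two decaying exponentials, where the decay rates form the nonlinear block and the amplitudes form the linear block eliminated by least squares. The model, data, and starting point are illustrative, not taken from any of the cited papers.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 100)
theta_true = np.array([0.5, 2.0])   # nonlinear block: decay rates
c_true = np.array([1.0, -0.7])      # linear block: amplitudes

def design(theta):
    # Columns depend nonlinearly on theta: A(theta)[i, j] = exp(-theta_j * t_i).
    return np.exp(-np.outer(t, theta))

y = design(theta_true) @ c_true + 0.01 * rng.standard_normal(t.size)

def reduced_objective(theta):
    # Inner solve: eliminate the linear coefficients in closed form.
    A = design(theta)
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = A @ c - y
    return 0.5 * (r @ r)

# Outer solve: iterate only over the nonlinear block.
res = minimize(reduced_objective, x0=np.array([0.3, 1.5]), method="BFGS")
theta_hat = np.sort(res.x)  # sort to fix the permutation ambiguity
print(theta_hat)            # should be close to [0.5, 2.0]
```

The outer iteration sees a two-dimensional problem regardless of how many linear amplitudes the model carries, which is the cost saving the reduced objective buys.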
2. Algorithmic Schemes and Computational Properties
Gauss-Newton Variable Projection (GNvpro)
In DNN training (Newman et al., 2020), the GNvpro method involves:
- Inner solve: for the least-squares loss, the linear block $V$ has a closed-form solution via the normal equations; for cross-entropy, a small Newton-Krylov trust-region solver computes $V^*(W)$.
- Outer solve: compute gradients, Jacobian-vector products, and Hessians (via the Gauss-Newton approximation) only with respect to $W$, exploiting the fact that the partial gradient with respect to $V$ vanishes at the inner optimum.
- Trust-region iteration: Krylov subspace methods (Arnoldi/GMRES) compute the step in $W$ within trust-region constraints.
This results in accelerated convergence (tens of passes, versus thousands for SGD), minimal cost for the inner block, and improved stability due to reduced coupling ill-conditioning.
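A minimal NumPy sketch of this inner/outer split for a one-hidden-layer network, simplifying GNvpro: plain L-BFGS on the reduced objective replaces the Gauss-Newton trust-region method, and the reduced gradient uses the envelope-theorem property noted above. All names and sizes are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))                # inputs
Y = np.sin(X @ rng.standard_normal((3, 2)))     # targets
h = 8                                           # hidden width

def reduced_loss_and_grad(w):
    W = w.reshape(3, h)
    Z = np.tanh(X @ W)                          # nonlinear features
    # Inner solve: closed-form elimination of the linear head V.
    V, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    R = Z @ V - Y
    loss = 0.5 * np.sum(R**2)
    # Because dPhi/dV = 0 at V*(W), the reduced gradient is just the
    # partial derivative with respect to W (envelope theorem).
    G = X.T @ ((R @ V.T) * (1.0 - Z**2))
    return loss, G.ravel()

w0 = 0.1 * rng.standard_normal(3 * h)
res = minimize(reduced_loss_and_grad, w0, jac=True, method="L-BFGS-B")
```

Only the 24 nonlinear weights are iterated; the linear head is re-solved exactly at every evaluation.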
Structured SLRA and Polynomial GCD
Structured low-rank approximations (affine mosaics, Hankel, Sylvester matrices) (Usevich et al., 2013, Usevich et al., 2012) employ over-parameterized representations—partitioning parameters into linear and nonlinear/annihilator blocks and eliminating the linear part using least-squares/least-norm solutions, yielding reduced cost functions with closed-form gradients and efficient block-banded Cholesky solves.
In approximate GCD computations, the image representation seeks a divisor $h$ and quotient $g$ with $f \approx h \ast g$; the quotient $g$ is eliminated for fixed $h$, and the reduced cost is optimized with guaranteed linear complexity in the small- and large-degree regimes (Usevich et al., 2013).
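The elimination step can be sketched as follows, using the standard convolution-matrix representation of polynomial multiplication; the polynomials and degrees here are illustrative.

```python
import numpy as np

def conv_matrix(h, n_g):
    # T(h) such that T(h) @ g == np.convolve(h, g)
    # (coefficient vectors, lowest degree first).
    m = len(h) + n_g - 1
    T = np.zeros((m, n_g))
    for j in range(n_g):
        T[j:j + len(h), j] = h
    return T

rng = np.random.default_rng(2)
h_true = np.array([1.0, -0.5])       # divisor h
g_true = np.array([2.0, 0.0, 1.0])   # quotient g
f = np.convolve(h_true, g_true) + 1e-3 * rng.standard_normal(4)  # noisy f ~ h*g

def reduced_cost(h):
    # Eliminate the quotient g in closed form for fixed divisor h.
    T = conv_matrix(h, len(g_true))
    g, *_ = np.linalg.lstsq(T, f, rcond=None)
    r = T @ g - f
    return 0.5 * (r @ r)

print(reduced_cost(h_true))                # near zero: h_true nearly divides f
print(reduced_cost(np.array([1.0, 0.9])))  # much larger: a poor divisor
```

The reduced cost measures how far $f$ is from the set of polynomials divisible by $h$, so optimizing it over $h$ alone recovers an approximate common divisor.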
Time Series and Signal Subspaces
Low-rank signal subspaces parameterized via Generalized Linear Recurrence Relations (GLRRs) exploit over-parameterization to represent constraints, with variable projection and FFT-regularized basis construction for stable, low-cost projection and gradient computation (Zvonarev et al., 2021).
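A simplified sketch of recurrence-based projection, using dense linear algebra for clarity; the cited work uses FFT-accelerated banded computations, and the recurrence here is illustrative.

```python
import numpy as np

def lrr_matrix(q, N):
    # Banded matrix whose rows apply the annihilating filter q at each shift.
    r = len(q) - 1
    T = np.zeros((N - r, N))
    for i in range(N - r):
        T[i, i:i + len(q)] = q
    return T

def project(y, q):
    # Orthogonal projection onto null(T): the series satisfying the recurrence.
    T = lrr_matrix(q, len(y))
    return y - T.T @ np.linalg.solve(T @ T.T, T @ y)

# A damped cosine satisfies a two-term linear recurrence; its
# characteristic filter is q = [rho^2, -2*rho*cos(omega), 1].
N = 100
n = np.arange(N)
rho, omega = 0.99, 0.3
s = rho**n * np.cos(omega * n)
q = np.array([rho**2, -2 * rho * np.cos(omega), 1.0])

rng = np.random.default_rng(3)
y = s + 0.1 * rng.standard_normal(N)
s_hat = project(y, q)   # projecting the noisy series denoises it
```

The projection lands on a two-dimensional subspace of length-100 series, so most of the noise energy is removed while the clean signal passes through unchanged.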
Non-Smooth Structured Regression
Over-parameterized Hadamard (U–V) formulations rewrite non-smooth penalties (e.g., group-lasso, TV) into smooth, albeit nonconvex, quadratic forms. Variable projection eliminates one variable block, yielding a smooth objective for quasi-Newton optimization, with dimension-free convergence rates and no need for parameter tuning (Poon et al., 2022).
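A sketch of the Hadamard over-parameterization for the lasso, using the standard variational identity ‖x‖₁ = min over u⊙v = x of ½(‖u‖² + ‖v‖²); the inner ridge solve eliminates u, and plain L-BFGS stands in for the quasi-Newton method of the cited work. Problem sizes are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d, lam = 50, 20, 0.5
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
b = A @ x_true + 0.05 * rng.standard_normal(n)

def inner_u(v):
    # For fixed v, minimizing over u is a ridge problem in closed form.
    B = A * v  # equals A @ diag(v)
    return np.linalg.solve(B.T @ B + lam * np.eye(d), B.T @ b)

def reduced(v):
    # Smooth reduced objective: 0.5*||A(u*v) - b||^2 + (lam/2)(||u||^2 + ||v||^2)
    # with u = u*(v) eliminated analytically.
    u = inner_u(v)
    r = (A * v) @ u - b
    return 0.5 * (r @ r) + 0.5 * lam * (u @ u + v @ v)

res = minimize(reduced, x0=np.ones(d), method="L-BFGS-B")
x_hat = inner_u(res.x) * res.x   # recover the lasso-type solution x = u * v
```

The three largest entries of `x_hat` should sit on the true support {2, 7, 11}: the smooth nonconvex surrogate replaces the non-smooth ℓ₁ penalty while still producing an effectively sparse solution.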
3. Theoretical Properties: Convergence, Conditioning, and Uniqueness
Variable projection with over-parameterization often improves the conditioning of the Hessian of the reduced objective by separating out ill-conditioned (typically linear) variables. Strict convexity or affine-linear structure in the eliminated block ensures that the reduced problem is well-posed (Newman et al., 2020, Usevich et al., 2012). This results in a lower-dimensional and smoother effective nonlinear optimization.
Convergence guarantees hold under mild regularity; e.g., in tensor frameworks, the Riemannian gradient method converges at a guaranteed rate (Newman et al., 2024). For GLRR signal models, diffeomorphic local parameterization ensures the existence of stable projections (Zvonarev et al., 2021). The structure of the reduced cost prevents the "peaking" phenomena observed in naive over-parameterized regressions (Huang et al., 2020).
Where uniqueness is analyzable, e.g., in tensor transforms (optimal tensor-SVD decompositions), solutions are unique up to row permutations and sign flips (Newman et al., 2024).
4. Applications Across Domains
Table: Representative Applications
| Domain | Variable Partition | Projection Task |
|---|---|---|
| Deep Neural Networks (Newman et al., 2020) | (W: nonlinear, V: linear) | Optimize feature extractor, eliminate affine head |
| Structured Polynomials (Usevich et al., 2013) | (h: divisor, g: quotient) | Approximate GCD via analytic elimination |
| Low-Rank Approximation (Usevich et al., 2012) | (R: annihilator, p: parameters) | Structured low-rank matrix projection |
| Signal Subspace (Zvonarev et al., 2021) | (s, θ: GLRR parameters) | Time-series denoising, energy subspace modeling |
| Tensor Decomposition (Newman et al., 2024) | (M: transform, X: tensor) | Learn matrix-mimetic tensor algebras |
| Non-smooth Regression (Poon et al., 2022) | (u: inner, v: scaling) | Group Lasso, TV, square-root Lasso, ℓ_q recovery |
Surrogate modeling with PDEs, hyperspectral segmentation, and image classification tasks benefit from GNvpro’s efficiency and superior generalization relative to stochastic gradient descent (Newman et al., 2020).
PCA-OLS and related projection estimators regularize high-dimensional regression, yielding robust estimates and avoiding the adversarial susceptibility seen in the naive overparameterized regime (Huang et al., 2020).
Matrix-mimetic tensor regression and compression achieve optimal transforms and representations, outperforming fixed heuristics such as DCT and vanilla SVD, with direct interpretability and generalization (Newman et al., 2024).
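The transform-domain truncation underlying such tensor methods can be sketched as follows: a facewise truncated SVD under a generic orthogonal transform. Learning the transform itself is omitted here, and all sizes are illustrative.

```python
import numpy as np

def mode3(X, M):
    # Apply the invertible transform M along the third (tube) mode:
    # out[a, b, k] = sum_j M[k, j] * X[a, b, j].
    return np.einsum('kj,abj->abk', M, X)

def tsvdm_truncate(X, M, r):
    # Truncated star-M tensor SVD: transform, truncate each frontal
    # slice to rank r by matrix SVD, then transform back.
    X_hat = mode3(X, M)
    out = np.empty_like(X_hat)
    for k in range(X_hat.shape[2]):
        U, s, Vt = np.linalg.svd(X_hat[:, :, k], full_matrices=False)
        out[:, :, k] = (U[:, :r] * s[:r]) @ Vt[:r]
    # M is orthogonal here, so inv(M) could also be written M.T.
    return mode3(out, np.linalg.inv(M))

rng = np.random.default_rng(6)
X = rng.standard_normal((8, 6, 4))
M, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # a generic orthogonal transform

errs = [np.linalg.norm(tsvdm_truncate(X, M, r) - X) for r in range(1, 7)]
print(np.round(errs, 3))  # nonincreasing; ~0 at full slice rank 6
```

Choosing M to minimize such truncation errors over training data is the learning problem the cited work solves, with this truncation as the inner projection.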
5. Computational Complexity and Implementation Strategies
The central computational advantage is the potential to marginalize the (typically) linear, convex block in closed form or via fast solvers, resulting in cost savings that scale linearly or near-linearly in problem size. For deep nets, the inner affine block is small compared to nonlinear layers, making its elimination cheap regardless of depth/width (Newman et al., 2020). Structured low-rank mosaic Hankel or block-banded matrices allow efficient Cholesky or FFT-based projections (Usevich et al., 2012, Zvonarev et al., 2021).
The outer optimization over the nonlinear variables then sees only the smooth reduced objective, which is tractable by trust-region, conjugate gradient, L-BFGS, or Riemannian gradient descent (for orthogonality-constrained manifolds).
Parameter selection (e.g., number of components in PCA-OLS) becomes a smooth, variable selection/projection task, with cross-validation or spectral-gap heuristics guiding dimension choice (Huang et al., 2020).
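A sketch of PCA-OLS with a spectral-gap heuristic for choosing the number of components, on synthetic low-rank-plus-noise data; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 200, 50, 5
# Low-rank-plus-noise design: the signal lives in a k-dimensional subspace.
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.1 * rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.1 * rng.standard_normal(n)

def pca_ols(X, y, k):
    # Project onto the top-k principal directions, then run OLS on the scores.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T
    gamma, *_ = np.linalg.lstsq(Xc @ Vk, y - y.mean(), rcond=None)
    return Vk @ gamma   # coefficients mapped back to the original space

beta_hat = pca_ols(X, y, k)

# Spectral-gap heuristic: pick k at the largest drop in singular values.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
k_est = int(np.argmax(s[:-1] / s[1:])) + 1
```

On this design the gap between the fifth and sixth singular values dominates, so the heuristic recovers the true subspace dimension without cross-validation.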
6. Limitations and Open Considerations
Over-parameterization via variable projection relies on strict convexity and smoothness of the inner block; least-squares and cross-entropy losses with smooth regularization satisfy this, while non-smooth combinations involving, e.g., ReLU require modification (Newman et al., 2020). Good conditioning of the eliminated system, in particular the absence of characteristic roots on the unit circle, is essential for efficient projection, particularly in FFT-based schemes (Usevich et al., 2013, Zvonarev et al., 2021).
A plausible implication is that while over-parameterization generally accelerates optimization by decoupling and smoothing, problem-specific algebraic singularities or poorly conditioned underlying linear systems may limit the effectiveness of this strategy.
7. Impact and Future Directions
Variable projection schemes grounded in over-parameterization have reshaped the effective practice and theory of high-dimensional function approximation, signal estimation, tensor decomposition, and structured regression. By enabling analytic elimination of vast blocks of parameters, these approaches deliver improved convergence, robustness, statistical generalization, and direct interpretability. Their application in matrix-mimetic tensor methods and DNN optimization foreshadows further integration into scalable machine learning and computational mathematics workflows—particularly as new structured models and problem representations emerge requiring flexible, smooth, and efficiently marginalizable parameterizations.
Empirical and theoretical advances across the referenced domains suggest continuing expansion and refinement, with further exploration warranted in nonconvex and non-smooth settings, manifold-constrained optimization, and broader regularized inverse problem classes.