
Diagonal Linear Networks (DLNs) Analysis

Updated 18 February 2026
  • Diagonal Linear Networks are neural architectures with linear activations and strictly diagonal parameter matrices that use coordinate-wise Hadamard multiplication for nonconvex parameterization.
  • Their gradient dynamics reveal mirror flows and implicit regularization effects that interpolate between ℓ1 and ℓ2 norms, significantly influencing convergence and sparsity.
  • Practical implications include applications in sparse recovery, incremental learning, and linear programming, with extensions to structured RNNs and high-dimensional optimization analyses.

Diagonal Linear Networks (DLNs) are a class of neural architectures characterized by linear activations and strictly diagonal parameter matrices in each layer. These networks generalize the standard linear model via a nonconvex, coordinate-wise parameterization—most commonly as the Hadamard (elementwise) product of parameter vectors across layers. Despite their simplicity, DLNs have emerged as a canonical setting for analyzing implicit regularization phenomena, convergence properties, and optimization-induced selection mechanisms across a wide spectrum of learning theory and optimization research. Their tractable yet expressive form enables sharp characterization of gradient dynamics, connections with $\ell_1$- and $\ell_p$-minimization, and a unified view on initialization dependence, incremental learning, and algorithmic bias.

1. Architecture and Parameterization

An $L$-layer Diagonal Linear Network represents the predictor as

$$\theta = w^{(1)} \odot w^{(2)} \odot \dotsb \odot w^{(L)} \in \mathbb{R}^d,$$

where each $w^{(\ell)} \in \mathbb{R}^d$ and $\odot$ denotes componentwise (Hadamard) multiplication. The network output for input $x \in \mathbb{R}^d$ is $f(x; w) = \langle x, \theta \rangle$, with the square loss or another convex objective as the training criterion. When $L = 2$, the parameterization is $\theta = u \odot v$, with $u, v \in \mathbb{R}^d$.
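As a concrete illustration, a minimal sketch of this parameterization in numpy (variable names and sizes are illustrative, not from any cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 5, 3
layers = [rng.normal(size=d) for _ in range(L)]   # w^(1), ..., w^(L)

# Effective linear predictor theta = w^(1) ⊙ ... ⊙ w^(L)
theta = np.ones(d)
for w in layers:
    theta = theta * w

def forward(x):
    """Network output f(x; w) = <x, theta>."""
    return float(x @ theta)

x = rng.normal(size=d)
# The composed network agrees with the plain linear model in theta.
assert np.isclose(forward(x), float(x @ np.prod(np.stack(layers), axis=0)))
```

The nonconvexity lives entirely in the map from the layer vectors to $\theta$; the model is still linear in the input $x$.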

More general parameterizations include sign decompositions (e.g., $\theta = w_+^2 - w_-^2$, as in (Pesme et al., 2021)) and powers/depth with homogeneity exponent $p$ (e.g., $\psi_\theta = \theta_+^p - \theta_-^p$ (Wind et al., 2023)). The parameterization induces a nonconvex landscape over $(w^{(1)}, \dotsc, w^{(L)})$ despite the final linearity in $x$.
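A quick sketch checking that the sign-split, depth-$p$ parameterization reaches any target vector (the helper names `decompose`/`compose` are illustrative, not from the cited papers):

```python
import numpy as np

def decompose(target, p):
    """Nonnegative factor vectors whose p-th-power difference equals `target`."""
    theta_plus = np.maximum(target, 0.0) ** (1.0 / p)
    theta_minus = np.maximum(-target, 0.0) ** (1.0 / p)
    return theta_plus, theta_minus

def compose(theta_plus, theta_minus, p):
    # psi = theta_+^p - theta_-^p, elementwise
    return theta_plus ** p - theta_minus ** p

target = np.array([1.5, -0.25, 0.0, -3.0])
for p in (2, 3, 4):   # homogeneity / depth exponent
    tp, tm = decompose(target, p)
    assert np.allclose(compose(tp, tm, p), target)
```

Expressivity is thus unchanged by depth; what changes is the geometry of the landscape and, as discussed below, the implicit bias.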

In DLN-based linear programming approaches, constraints or objectives may further be absorbed into the diagonally structured reparameterization (e.g., $x = u \circ u$ for $x \ge 0$ (Wang et al., 2023)). Diagonal RNNs employ the recurrence $h_t = \Lambda_t h_{t-1} + B_t x_t$ with diagonal $\Lambda_t$, and can be extended to FA architectures via fixed-point iteration (Movahedi et al., 13 Mar 2025).

2. Gradient Dynamics and Optimization-Induced Bias

Mirror Flow and Bregman Potentials

Continuous-time limit analysis of DLN training reveals that the effective predictor $\theta(t)$ follows a mirror flow under a convex potential $Q$ determined by the parameterization and initialization:

$$\frac{d}{dt}\left[\nabla Q(\theta(t))\right] + \nabla L(\theta(t)) = 0,$$

where $L(\theta)$ is the loss (Labarrière et al., 2024, Papazov et al., 2024). For $L = 2$, the potential $Q$ admits the "hyperbolic entropy" form:

$$Q(\theta) = \frac{1}{4} \sum_{i=1}^d \left[ 2\theta_i \operatorname{arcsinh}\!\frac{2\theta_i}{\Delta_i} - \sqrt{4\theta_i^2 + \Delta_i^2} + \Delta_i \right]$$

with $\Delta_i$ set by the initial parameters. Gradient flow thus selects, among all interpolators (i.e., solutions of $X\theta = y$), the one minimizing $Q$—leading to an implicit, data-independent regularization (Labarrière et al., 2024, Papazov et al., 2024).
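The interpolation behavior of this potential can be checked numerically. The sketch below (with illustrative values of $\theta$ and a shared scalar $\Delta$) confirms that $Q$ behaves like a rescaled squared $\ell_2$-norm for large $\Delta$ and like a rescaled $\ell_1$-norm for vanishing $\Delta$:

```python
import numpy as np

def Q(theta, delta):
    # Hyperbolic entropy potential of a 2-layer DLN, per the formula above.
    return 0.25 * np.sum(
        2 * theta * np.arcsinh(2 * theta / delta)
        - np.sqrt(4 * theta**2 + delta**2)
        + delta
    )

theta = np.array([1.0, -2.0, 0.5, 3.0])

# Large Delta ("lazy" regime): Q(theta) ≈ ||theta||_2^2 / (2 Delta).
delta_big = 100.0
ratio_l2 = Q(theta, delta_big) / (np.sum(theta**2) / (2 * delta_big))

# Vanishing Delta ("rich" regime): Q(theta) ≈ (||theta||_1 / 2) log(1/Delta).
delta_small = 1e-12
ratio_l1 = Q(theta, delta_small) / (
    0.5 * np.sum(np.abs(theta)) * np.log(1 / delta_small)
)

assert abs(ratio_l2 - 1) < 1e-3   # l2-like for large initialization
assert abs(ratio_l1 - 1) < 0.1    # l1-like for vanishing initialization
```

This is exactly the mechanism behind the initialization-dependent bias regimes described next.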

Depth, Initialization, and Bias Regimes

The implicit regularizer $Q$ interpolates between:

  • the $\ell_1$-norm for vanishing initialization (the "rich" regime): a strong sparsity bias;
  • the $\ell_2$-norm (minimum-norm solution) for large initialization (the "lazy" or NTK regime).

Network depth $L$ further modulates the bias: deep DLNs (homogeneity $p > 2$) select among the $\ell_1$-minimizers the one with large $\ell_{2/p}$-norm (hence more spread), while shallow ones ($p = 2$) select the maximum-entropy solution (Wind et al., 2023). Calibrating the initialization to a target $p_\mathrm{eff}$ allows controlled interpolation between the extreme regimes (Zhang et al., 25 Sep 2025).
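The rich/lazy dichotomy can be reproduced in a small experiment. The sketch below (problem sizes, seed, and learning rate are illustrative choices) trains the sign-decomposed 2-layer DLN $\theta = u \odot u - v \odot v$ by gradient descent from small versus large initialization on an overparameterized sparse regression problem; the small-initialization solution interpolates with a markedly smaller $\ell_1$-norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 20
X = rng.normal(size=(n, d))
theta_star = np.zeros(d)
theta_star[:3] = [2.0, -1.5, 1.0]            # sparse ground truth
y = X @ theta_star

def train(alpha, lr=0.01, steps=200_000):
    # Gradient descent on L = ||X theta - y||^2 / (2n), theta = u⊙u - v⊙v.
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        theta = u * u - v * v
        grad = X.T @ (X @ theta - y) / n     # dL/dtheta
        u -= lr * 2 * u * grad               # chain rule through u⊙u
        v += lr * 2 * v * grad               # chain rule through -v⊙v
    return u * u - v * v

theta_small = train(alpha=1e-3)  # "rich" regime: near min-l1 interpolator
theta_big = train(alpha=1.0)     # "lazy" regime: dense, l2-type interpolator

for th in (theta_small, theta_big):
    assert np.mean((X @ th - y) ** 2) < 1e-4           # both interpolate
assert np.sum(np.abs(theta_small)) < np.sum(np.abs(theta_big))
```

Only the initialization scale differs between the two runs; the data and the loss are identical, so the gap in $\ell_1$-norm is purely optimization-induced.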

3. Explicit Characterization of Training Trajectories and Regularization Path

Connection with Lasso Path

Full gradient-flow DLN solutions (with infinitesimal initialization) converge to the minimum-$\ell_1$ interpolator (Berthier, 23 Sep 2025). Remarkably, the entire time-averaged DLN training trajectory retraces the Lasso ($\ell_1$-regularized least squares) solution path as a function of an effective regularization parameter determined by the training time:

$$\bar{x}^\varepsilon(t) = \frac{1}{t} \int_0^t x^\varepsilon(u)\,du \to x(s), \quad s = \frac{2t}{\log(1/\varepsilon)} \implies \lambda = 1/s$$

(arXiv:2509.18766). Under a monotonicity condition on the regularization path, this correspondence is exact. Early stopping in DLN training thus acts as an implicit $\ell_1$ regularization, with the effective penalty controlled by the training time (Berthier, 23 Sep 2025, Berthier, 2022).

Saddle-to-Saddle Dynamics and Incremental Learning

Vanishing-initialization DLN gradient flow exhibits "saddle-to-saddle" dynamics, sequentially jumping between faces of the loss constrained to active coordinate sets—mirroring the LARS algorithm for the Lasso homotopy (Pesme et al., 2023). Each jump corresponds to incrementally adding features to the active support (Berthier, 2022). In overparameterized regimes, this process continues until attaining the unique minimum-$\ell_1$ solution; in underparameterized/anti-correlated settings, the support grows monotonically over time (Berthier, 2022).
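Incremental activation is easiest to see in the simplest possible design, $X = I$, where each coordinate evolves independently and switches on at a time governed by its signal strength. The sketch below is a toy illustration of that effect (identity design, thresholds, and step counts are illustrative, not the exact setups of the cited papers):

```python
import numpy as np

d = 6
theta_star = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0])
alpha, lr, tau = 1e-6, 0.01, 0.5   # init scale, step size, "active" threshold

u = np.full(d, alpha)
v = np.full(d, alpha)
first_active = {}                  # coordinate -> step at which |theta_i| > tau
for step in range(20_000):
    theta = u * u - v * v
    grad = theta - theta_star      # dL/dtheta for L = ||theta - theta_star||^2 / 2
    u -= lr * 2 * u * grad
    v += lr * 2 * v * grad
    for i in np.flatnonzero(np.abs(theta) > tau):
        first_active.setdefault(int(i), step)

# Only the true support ever activates, and it does so sequentially,
# ordered by signal magnitude (|2.0| before |-1.5| before |1.0|).
assert sorted(first_active) == [0, 1, 2]
assert first_active[0] < first_active[1] < first_active[2]
```

With correlated designs the jumps instead track the LARS/Lasso homotopy steps, but the staggered-activation picture is the same.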

4. Algorithmic Effects: Stochasticity, Momentum, and Sharpness-Aware Perturbations

Stochastic Gradient Dynamics

Stochastic gradient descent (SGD) amplifies the implicit $\ell_1$ bias relative to gradient flow (GF), owing to an effectively reduced initialization scale $\alpha_\mathrm{eff}$ (Pesme et al., 2021, Even et al., 2023). The degree of this amplification is inversely related to the convergence rate: slower training increases the bias towards sparser solutions. At the "edge of stability" (large step size), SGD retains a homogeneous bias, favoring recovery of the support, whereas GD develops "heterogeneous" weights penalizing the true support, leading to failure in sparse recovery (Even et al., 2023). Experimental evidence confirms improved validation loss and recovery for SGD in sparse regression regimes (Pesme et al., 2021, Even et al., 2023).

Momentum and Intrinsic Acceleration Parameter

Momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ is governed in the continuous-time limit by $\lambda = \gamma/(1-\beta)^2$, which uniquely determines the optimization trajectory modulo time scaling (Papazov et al., 2024). For small $\lambda$, the final solution exhibits lower balancedness $\Delta_\infty$ and stronger sparsity (i.e., a smaller $\ell_1$-norm) than simple gradient flow. Tuning $\lambda$ thus enables control of both the speed of convergence and the implicit regularization strength (Papazov et al., 2024).
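A trivial but useful consequence is that very different-looking $(\gamma, \beta)$ pairs can be equivalent. A minimal helper (the function name is illustrative):

```python
def intrinsic_lambda(gamma: float, beta: float) -> float:
    # Intrinsic momentum parameter: lambda = gamma / (1 - beta)^2.
    return gamma / (1.0 - beta) ** 2

# Two distinct hyperparameter pairs share lambda = 1.0, hence (up to time
# rescaling) the same limiting trajectory and implicit bias.
assert abs(intrinsic_lambda(0.01, 0.9) - 1.0) < 1e-12
assert abs(intrinsic_lambda(0.0025, 0.95) - 1.0) < 1e-12
```

In practice this means step size and momentum should be tuned jointly through $\lambda$, not independently.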

Sharpness-Aware and Noisy Perturbations

Stochastic Sharpness-Aware Minimization (S-SAM) introduces isotropic Gaussian noise into DLN weights at each step, yielding a regularizer equal to the average sharpness of the loss landscape (Clara et al., 14 Mar 2025). This imposes "balancing" across the diagonal factors—minimizing both PAC–Bayes average sharpness and the Hessian trace—and drives the iterates toward a soft-thresholded shrinkage of the true parameter. The shrinkage factor depends polynomially on the noise, with deeper networks and larger batch noise accelerating convergence to balanced, low-sharpness regions (Clara et al., 14 Mar 2025).
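A sketch of an S-SAM-style update on a 2-layer DLN: at each step the gradient is evaluated at isotropically Gaussian-perturbed weights (the noise scale, learning rate, and toy regression problem are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 8, 4
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 0.0, -0.5, 0.0])

def loss(u, v):
    r = X @ (u * v) - y
    return 0.5 * np.mean(r ** 2)

u = rng.normal(size=d) * 0.5
v = rng.normal(size=d) * 0.5
lr, sigma = 0.05, 1e-2

for _ in range(2_000):
    eu = rng.normal(size=d) * sigma      # fresh isotropic weight noise
    ev = rng.normal(size=d) * sigma
    up, vp = u + eu, v + ev              # perturbed weights
    r = X @ (up * vp) - y
    gu = vp * (X.T @ r) / n              # gradient of the loss at the
    gv = up * (X.T @ r) / n              # perturbed point
    u -= lr * gu
    v -= lr * gv

assert loss(u, v) < 0.01                 # perturbed descent still fits the data
```

On average the perturbation penalizes curvature, which is what drives the iterates toward balanced, low-sharpness factorizations.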

5. Extensions: Linear Programming, RNNs, and Structured Optimization

DLNs have been adapted as solvers for linear programming and basis pursuit via gradient descent on a nonconvex parameterization of the feasible set (e.g., $x = u \circ u$ for positivity) (Wang et al., 2023). Gradient descent over DLNs is shown to converge linearly (in iteration count) to the entropically regularized LP solution; the initialization controls the entropy penalty. Applications include optimal transport (via Sinkhorn-like initialization) and $\ell_1$-basis pursuit (Wang et al., 2023).
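The core trick—encoding a positivity constraint through $x = u \circ u$ so that plain gradient descent stays feasible—can be sketched on a toy nonnegative feasibility problem (a stand-in for the LP setting; sizes, seed, and rates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_var = 5, 8
A = rng.normal(size=(m, n_var))
x_star = np.abs(rng.normal(size=n_var))   # a nonnegative feasible point
b = A @ x_star

u = np.full(n_var, 0.5)
lr = 0.01
for _ in range(50_000):
    x = u * u                              # x >= 0 by construction
    grad_x = A.T @ (A @ x - b) / m         # gradient of ||Ax - b||^2 / (2m)
    u -= lr * 2 * u * grad_x               # chain rule through u ∘ u

x = u * u
assert np.min(x) >= 0                      # positivity is built in
assert np.mean((A @ x - b) ** 2) < 1e-6    # (approximately) feasible
```

No projection step is ever needed; the reparameterization makes the constraint set the image of an unconstrained domain, and the initialization of $u$ plays the role of the entropy penalty described above.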

Diagonal linear RNNs—employing channel-wise recurrence—admit efficient, parallelizable fixed-point iterations and can match the expressive power of dense RNNs using low-rank channel mixers, provided sufficient depth/iterations (Movahedi et al., 13 Mar 2025).
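A minimal sketch of the channel-wise recurrence (time-invariant weights for simplicity; names and sizes are illustrative). Because the recurrence matrix is diagonal, the state admits a closed form as a sum of elementwise-decayed inputs, which is what makes parallel prefix-scan implementations possible:

```python
import numpy as np

rng = np.random.default_rng(5)
d_h, d_x, T = 3, 2, 4
lam = rng.uniform(0.5, 0.9, size=d_h)     # diagonal of Lambda
B = rng.normal(size=(d_h, d_x))
xs = rng.normal(size=(T, d_x))

# Sequential scan: h_t = lam ⊙ h_{t-1} + B x_t
h = np.zeros(d_h)
for x_t in xs:
    h = lam * h + B @ x_t

# Closed form enabled by diagonality: h_T = sum_t lam^(T-1-t) ⊙ (B x_t)
h_closed = sum(lam ** (T - 1 - t) * (B @ xs[t]) for t in range(T))
assert np.allclose(h, h_closed)
```

A dense recurrence matrix would require matrix powers here; the diagonal structure reduces them to elementwise exponentiation.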

A general IRLS–DLN connection has also been established: alternating reweighting and least squares on DLN parameterizations unifies IRLS, lin-RFM, and AM variants, with asymptotic risk and support recovery precisely characterized in the high-dimensional Gaussian design using DMFT (Kaushik et al., 2024, Nishiyama et al., 2 Oct 2025).

6. High-Dimensional Theory, Scaling, and Mean-Field Limits

Dynamical Mean-Field Theory (DMFT) permits reduction of high-dimensional DLN gradient-flow to low-dimensional effective stochastic processes, capturing the full loss dynamics, generalization bias, and the speed/generality trade-off (Nishiyama et al., 2 Oct 2025). The convergence regime splits sharply between large and small initialization—corresponding to "lazy"/kernel and "rich"/feature-learning phases, with explicit timescale separation. Reduced initialization improves generalization (sparsity) but slows convergence.

Scaling laws for the full family of $\ell_r$-norms of DLN solutions under an $\ell_p$ bias have been rigorously characterized (Zhang et al., 25 Sep 2025). Elbow points ($n_*$) and a universal norm threshold ($r_* = 2(p-1)$) delineate which norms plateau and which grow, matching explicit minimum-$\ell_p$ interpolation in the overparameterized limit. Initialization scaling permits practitioners to tune the effective bias along the $\ell_p$ spectrum and select stable norm-based generalization metrics (Zhang et al., 25 Sep 2025).

7. Practical Implications and Research Significance

Diagonal Linear Networks provide a tractable testbed for uncovering mechanisms of implicit regularization, optimization geometry, and algorithmic parameterization effects in overparameterized settings. Their analysis precisely predicts phenomena such as incremental learning, early stopping regularization, sharpness minimization, and the impact of stochasticity/momentum on the regularization path. The insights into layer-wise balancing, shrinkage-thresholding, and sharpness-guided optimization contribute directly to the principled design of modern deep learning algorithms—clarifying the implicit effect of standard practices like momentum, step size tuning, random noise, and reparameterization on model generalization and capacity control.

The rigorous equivalence between DLN training dynamics and desirable convex regularization paths (notably with $\ell_1$ and entropically regularized objectives), and the proven correspondence with basis pursuit, optimal transport, and IRLS performance, situates DLNs at the intersection of statistical learning theory, convex optimization, and algorithmic deep learning research (Berthier, 23 Sep 2025, Wang et al., 2023, Kaushik et al., 2024). This suggests that further exploration of diagonalizable architectures and their optimization geometry will continue to yield transferable understanding for much richer classes of neural and structured models.
