Scaled Gradient Descent for Ill-Conditioned Low-Rank Matrix Recovery with Optimal Sampling Complexity

Published 31 Mar 2026 in stat.ML, cs.IT, and cs.LG | (2604.00060v1)

Abstract: The low-rank matrix recovery problem seeks to reconstruct an unknown $n_1 \times n_2$ rank-$r$ matrix from $m$ linear measurements, where $m\ll n_1n_2$. This problem has been extensively studied over the past few decades, leading to a variety of algorithms with solid theoretical guarantees. Among these, gradient descent based non-convex methods have become particularly popular due to their computational efficiency. However, these methods typically suffer from two key limitations: a sub-optimal sample complexity of $O((n_1 + n_2)r^2)$ and an iteration complexity of $O(\kappa\log(1/\epsilon))$ to achieve $\epsilon$-accuracy, resulting in slow convergence when the target matrix is ill-conditioned. Here, $\kappa$ denotes the condition number of the unknown matrix. Recent studies show that a preconditioned variant of GD, known as scaled gradient descent (ScaledGD), can significantly reduce the iteration complexity to $O(\log(1/\epsilon))$. Nonetheless, its sample complexity remains sub-optimal at $O((n_1 + n_2)r^2)$. In contrast, a delicate virtual sequence technique demonstrates that the standard GD in the positive semidefinite (PSD) setting achieves the optimal sample complexity $O((n_1 + n_2)r)$, but converges more slowly with an iteration complexity $O(\kappa^2 \log(1/\epsilon))$. In this paper, through a more refined analysis, we show that ScaledGD achieves both the optimal sample complexity $O((n_1 + n_2)r)$ and the improved iteration complexity $O(\log(1/\epsilon))$. Notably, our results extend beyond the PSD setting to the general low-rank matrix recovery problem. Numerical experiments further validate that ScaledGD accelerates convergence for ill-conditioned matrices with the optimal sampling complexity.


Authors (2)

Summary

The paper demonstrates that ScaledGD achieves the optimal $O((n_1 + n_2) r)$ sample complexity and an $O(\log(1/\epsilon))$ iteration complexity independent of the condition number, using a novel decoupling technique.
The method leverages refined spectral initialization and virtual sequence decoupling to overcome non-convexity limitations in recovering general (non-PSD) low-rank matrices.
Numerical experiments confirm that ScaledGD outperforms vanilla and Riemannian GD in both iteration count and runtime, especially under ill-conditioning.

Scaled Gradient Descent for Ill-Conditioned Low-Rank Matrix Recovery

Introduction and Problem Setting

The paper "Scaled Gradient Descent for Ill-Conditioned Low-Rank Matrix Recovery with Optimal Sampling Complexity" (2604.00060) provides an in-depth analysis of non-convex optimization for low-rank matrix recovery, targeting fundamental algorithmic limitations encountered in existing methods. The recovery problem is to reconstruct an unknown rank-$r$ matrix $X \in \mathbb{R}^{n_1 \times n_2}$ from $m \ll n_1 n_2$ linear measurements of the form $y = \mathcal{A}(X)$. This setting unifies various learning and signal processing tasks, with prior convex and non-convex solutions reaching strong but incomplete theoretical guarantees.
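To make the measurement model concrete, the following is a minimal sketch of how a synthetic instance could be generated under a Gaussian design. The helper name `make_problem`, the unnormalized $N(0, 1)$ sensing entries, and the linearly spaced singular values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_problem(n1, n2, r, m, kappa=1.0, seed=0):
    """Hypothetical helper: rank-r target X with condition number kappa,
    plus m Gaussian measurements y_i = <A_i, X>."""
    rng = np.random.default_rng(seed)
    # Orthonormal singular vectors from the QR factorization of Gaussian matrices.
    U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
    # Assumed spectrum: singular values spaced linearly from kappa down to 1,
    # so that sigma_1 / sigma_r = kappa (for r >= 2).
    sigma = np.linspace(kappa, 1.0, r)
    X = (U * sigma) @ V.T
    # Sensing matrices A_i with i.i.d. N(0, 1) entries, stacked as (m, n1, n2).
    A = rng.standard_normal((m, n1, n2))
    # y_i = <A_i, X> = sum_{j,k} A_i[j, k] * X[j, k]
    y = np.einsum("ijk,jk->i", A, X)
    return X, A, y

X, A, y = make_problem(n1=30, n2=20, r=2, m=400, kappa=10.0)
print(X.shape, A.shape, y.shape)
```

Here $m = 400$ is smaller than $n_1 n_2 = 600$, so the linear system alone is underdetermined and the rank constraint is what makes recovery possible.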

Gradient descent (GD) on a factorized parameterization, $X = LR^\top$, is computationally appealing but achieves suboptimal sample complexity $O((n_1 + n_2) r^2)$ and iteration complexity scaling linearly with the condition number $\kappa$ of $X$. Existing preconditioned (scaled) GD methods remedy the condition number dependence in convergence, attaining $O(\log(1/\epsilon))$ iteration complexity, but they do not break the suboptimal $r^2$ scaling in sample complexity. Meanwhile, recent advances demonstrate information-theoretic optimal $O((n_1 + n_2) r)$ sample complexity for GD in the positive semidefinite (PSD) setting, but at the cost of even worse iteration complexity, $O(\kappa^2 \log(1/\epsilon))$, and with limited applicability to general matrices.
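For reference, the two update rules being contrasted can be written side by side. This is the standard form of ScaledGD from the earlier literature, stated for an assumed least-squares loss $f$; the normalization by $m$ and the symbol $\eta$ for the step size are assumptions, and the paper's exact constants may differ.

```latex
% Assumed loss over the factors L (n_1 x r) and R (n_2 x r)
f(L, R) = \frac{1}{2m} \left\| \mathcal{A}(L R^\top) - y \right\|_2^2

% Vanilla GD
L_{t+1} = L_t - \eta \, \nabla_L f(L_t, R_t), \qquad
R_{t+1} = R_t - \eta \, \nabla_R f(L_t, R_t)

% ScaledGD: the same gradients, right-multiplied by r x r inverse Gram matrices
L_{t+1} = L_t - \eta \, \nabla_L f(L_t, R_t) \, (R_t^\top R_t)^{-1}, \qquad
R_{t+1} = R_t - \eta \, \nabla_R f(L_t, R_t) \, (L_t^\top L_t)^{-1}
```

The preconditioners are only $r \times r$, so each ScaledGD step costs little more than a GD step, while the rescaling removes the dependence of the usable step size on the singular values of the factors.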

Main Theoretical Contributions

The primary contribution of the paper is a rigorous analysis showing that the Scaled Gradient Descent (ScaledGD) algorithm achieves both optimal sample complexity $O((n_1 + n_2) r)$ and fast, condition-number-independent iteration complexity $O(\log(1/\epsilon))$ for general (asymmetric and non-PSD) low-rank matrix recovery under Gaussian design. This is enabled by a refined decoupling technique based on virtual sequences, extending and improving on earlier work limited to the PSD case or with higher per-iteration complexity.

Specifically, the authors prove that, with a number of Gaussian measurements $m$ scaling as $(n_1 + n_2) r$ (up to a factor depending on $\kappa$) and a suitable spectral initialization, the ScaledGD iterates converge linearly to $X$: the recovery error contracts by a constant factor at every iteration $t \ge 0$, where the constants in the bound are absolute and the step size $\eta$ is independent of $\kappa$. The analysis crucially leverages a virtual sequence construction, a family of auxiliary iterates decoupled from the data matrices used in each update, which enables tight operator-norm control and circumvents the traditional limitations of bounds based on the restricted isometry property (RIP) in non-convex settings.

This formally establishes that ScaledGD matches the optimal $O((n_1 + n_2) r)$ scaling in dimension and rank attained by convex (nuclear norm) methods while retaining the computational efficiency and flexibility of factorized non-convex optimization. Unlike prior approaches, the result holds for general low-rank matrices, not just PSD ones, and achieves convergence rates independent of the underlying conditioning.
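The pipeline described in this section, spectral initialization followed by preconditioned updates, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function name `scaled_gd`, the loss normalization by $m$, the step size `eta=0.5`, and the problem sizes in the demo are all hypothetical choices.

```python
import numpy as np

def scaled_gd(A, y, r, eta=0.5, iters=100):
    """Hypothetical ScaledGD sketch for y_i = <A_i, X>; A has shape (m, n1, n2)."""
    m, n1, n2 = A.shape
    A_flat = A.reshape(m, n1 * n2)  # row i is the flattened A_i
    # Spectral initialization: top-r SVD of (1/m) * sum_i y_i A_i, whose
    # expectation equals X when the entries of A_i are i.i.d. N(0, 1).
    M = (A_flat.T @ y).reshape(n1, n2) / m
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    L = U[:, :r] * np.sqrt(s[:r])
    R = Vt[:r].T * np.sqrt(s[:r])
    for _ in range(iters):
        # Residual e_i = <A_i, L R^T> - y_i
        resid = A_flat @ (L @ R.T).ravel() - y
        # G = (1/m) * sum_i e_i A_i, the gradient with respect to the product L R^T
        G = (A_flat.T @ resid).reshape(n1, n2) / m
        grad_L, grad_R = G @ R, G.T @ L
        # Preconditioned steps: grad_L (R^T R)^{-1} and grad_R (L^T L)^{-1},
        # computed with linear solves instead of explicit inverses.
        L_next = L - eta * np.linalg.solve(R.T @ R, grad_L.T).T
        R_next = R - eta * np.linalg.solve(L.T @ L, grad_R.T).T
        L, R = L_next, R_next
    return L, R

# Small synthetic demo (sizes and condition number are arbitrary choices).
rng = np.random.default_rng(0)
n1, n2, r, m, kappa = 20, 15, 2, 4000, 5.0
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X = (U0 * np.linspace(kappa, 1.0, r)) @ V0.T
A = rng.standard_normal((m, n1, n2))
y = np.einsum("ijk,jk->i", A, X)

L, R = scaled_gd(A, y, r)
rel_err = np.linalg.norm(L @ R.T - X) / np.linalg.norm(X)
print(f"relative error after 100 iterations: {rel_err:.2e}")
```

Both factors are updated from the same iterate `(L, R)`, which is why the new values are stored in `L_next` and `R_next` before being assigned. The demo uses far more measurements than the theory suggests is necessary, purely to keep the toy example robust.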

Numerical Results

The paper presents a series of numerical experiments comparing ScaledGD, vanilla GD, and Riemannian GD (RGD) in both well-conditioned and ill-conditioned regimes. The experiments show that ScaledGD not only achieves faster convergence per iteration but also matches or exceeds the practical efficiency of RGD, while outperforming standard GD as the condition number increases.

Figure 1: Relative error trajectories for ScaledGD, GD, and RGD versus iteration count (left) and versus runtime (right), for a single fixed setting of the problem parameters.

As visualized above, ScaledGD demonstrates superior convergence profiles, both in iteration-wise and time-wise metrics.

Figure 2: Computational cost of different algorithms under varying condition numbers $\kappa$; ScaledGD and RGD show stable cost, while GD scales poorly as $\kappa$ increases.

The robustness of ScaledGD to ill-conditioning is quantitatively established: its runtime to high-accuracy recovery remains nearly flat as $\kappa$ increases, in stark contrast to the linear increase exhibited by vanilla GD.
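The qualitative effect of conditioning can be reproduced in a toy setting. The sketch below is not the paper's experiment: it assumes full observation (the loss is $\frac{1}{2}\|LR^\top - X\|_F^2$, with no sensing operator) to isolate the role of $\kappa$, starts both methods from a noisy truncated SVD that mimics a spectral initialization, and uses a GD step size of $\eta/\sigma_1(X)$. All function names and parameter values are hypothetical.

```python
import numpy as np

def make_target(n1, n2, r, kappa, seed=0):
    """Rank-r matrix with singular values spaced linearly from kappa to 1."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
    V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
    return (U * np.linspace(kappa, 1.0, r)) @ V.T

def noisy_svd_init(X, r, rng, level=0.1):
    """Balanced factors from the top-r SVD of X plus noise whose spectral
    norm is roughly level * sigma_r(X)."""
    n1, n2 = X.shape
    sigma_r = np.linalg.svd(X, compute_uv=False)[r - 1]
    scale = level * sigma_r / (np.sqrt(n1) + np.sqrt(n2))
    U, s, Vt = np.linalg.svd(X + scale * rng.standard_normal((n1, n2)),
                             full_matrices=False)
    return U[:, :r] * np.sqrt(s[:r]), Vt[:r].T * np.sqrt(s[:r])

def count_iters(X, r, scaled, eta=0.5, tol=1e-8, max_iters=50000, seed=1):
    """Iterations until ||L R^T - X||_F <= tol * ||X||_F (fully observed loss)."""
    L, R = noisy_svd_init(X, r, np.random.default_rng(seed))
    sigma_1 = np.linalg.norm(X, 2)
    for t in range(max_iters):
        E = L @ R.T - X
        if np.linalg.norm(E) <= tol * np.linalg.norm(X):
            return t
        if scaled:
            # ScaledGD: gradients right-multiplied by inverse Gram matrices.
            L_next = L - eta * np.linalg.solve(R.T @ R, (E @ R).T).T
            R_next = R - eta * np.linalg.solve(L.T @ L, (E.T @ L).T).T
        else:
            # Vanilla GD: the step size must shrink with sigma_1 for stability.
            L_next = L - (eta / sigma_1) * (E @ R)
            R_next = R - (eta / sigma_1) * (E.T @ L)
        L, R = L_next, R_next
    return max_iters

for kappa in (1.0, 10.0, 100.0):
    X = make_target(30, 20, 3, kappa)
    gd_iters = count_iters(X, 3, scaled=False)
    sgd_iters = count_iters(X, 3, scaled=True)
    print(f"kappa={kappa:6.1f}  GD: {gd_iters:6d}  ScaledGD: {sgd_iters:4d}")
```

In this simplified setting the GD iteration count grows with $\kappa$ while the ScaledGD count stays roughly constant, which is the behavior the runtime comparison in Figure 2 is reported to show for the full measurement model.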

Figure 3: Phase transition diagram over number of measurements $m$ and target rank $r$; the linear dependency of the success threshold on $r$ confirms the theoretically predicted sample complexity.

The phase transition boundary matches the predicted linear dependence of $m$ on $r$, validating the optimality of the proved sample complexity in practice.
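A short parameter count, standard in the literature and not specific to this paper, indicates why a threshold linear in $r$ is the best one can hope for: the set of rank-$r$ matrices has $r(n_1 + n_2 - r)$ degrees of freedom, so no method can succeed in general with substantially fewer measurements.

```latex
% Degrees of freedom of a rank-r matrix, counted through its SVD X = U \Sigma V^\top
\dim \{ X \in \mathbb{R}^{n_1 \times n_2} : \operatorname{rank}(X) = r \}
  = r (n_1 + n_2 - r) \le (n_1 + n_2) \, r

% Example: n_1 = n_2 = 100, r = 5
r (n_1 + n_2 - r) = 5 \cdot 195 = 975 \ll n_1 n_2 = 10000
```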

Implications, Limitations, and Future Directions

This work decisively addresses two major obstacles in non-convex low-rank matrix recovery: suboptimal sample complexity due to factorization and slow convergence for ill-conditioned solutions. By demonstrating that ScaledGD, coupled with a careful spectral initialization and decoupling argument, simultaneously achieves optimal sample and iteration complexities, it narrows the theoretical gap with convex approaches while remaining computationally scalable.

The result further dismisses the folklore that information-theoretic sample optimality is necessarily accompanied by slower convergence in non-convex regimes. The removal of the PSD restriction expands applicability to a wide class of problems in signal processing, machine learning, and beyond.

However, the analysis retains an explicit $\kappa$ dependence in the required number of measurements, similar to prior non-convex methods. In contrast, convex optimization can reach sample complexity independent of $\kappa$. Bridging this remaining gap, potentially via improved initialization or more sophisticated regularization, remains an open avenue. Similarly, adaptation of the proof to random or small-norm initialization (more common in practice), and extension to overparameterized regimes where the search rank exceeds the true rank, present interesting directions for future research.

Conclusion

The paper rigorously establishes that Scaled Gradient Descent achieves both information-theoretic sample optimality and rapid, condition-number-independent convergence for general low-rank matrix recovery under Gaussian measurements. Its analytic innovations—particularly decoupling via virtual sequences—resolve longstanding theoretical limitations in non-convex matrix optimization, as substantiated by numerical experiments confirming both efficiency and robustness to ill-conditioning. The work signals the practical efficiency of scalable non-convex algorithms even at theoretical limits, while motivating further advances to eliminate residual condition number dependence in sampling and explore more general initialization schemes.
