
Proximal Gradient Descent: Theory & Applications

Updated 28 January 2026
  • Proximal Gradient Descent is a method for composite optimization, combining a gradient step on smooth functions with a proximal step for nonsmooth regularizers.
  • It enables practical solutions for high-dimensional problems in sparse regression, matrix factorization, and signal recovery through adaptive and accelerated variants.
  • Recent advances demonstrate rigorous convergence guarantees, distributed implementations, and integration with learned regularizers for enhanced performance.

Proximal gradient descent is a foundational algorithmic framework in convex and nonconvex optimization for minimizing composite objective functions of the form

$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + g(x),$$

where $f$ is typically smooth (i.e., differentiable with Lipschitz-continuous gradient) and $g$ is convex but potentially nonsmooth. The method iteratively combines explicit gradient descent on $f$ with implicit regularization via the proximal operator of $g$, which is essential in high-dimensional settings with structured priors such as sparsity or low rank. This approach undergirds algorithms for a wide array of regularized regression, signal recovery, matrix factorization, and machine learning applications, and recent work has established rigorous convergence rates, acceleration phenomena, distributed variants, and integration with learned regularizers.

1. Mathematical Formulation and Core Algorithms

Let $f: \mathbb{R}^n \to \mathbb{R}$ be convex and $L$-smooth ($\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$), and let $g: \mathbb{R}^n \to (-\infty, +\infty]$ be proper, closed, and convex (possibly nonsmooth). The proximal operator of $g$ at $v$ with parameter $\alpha > 0$ is defined as

$$\operatorname{prox}_{\alpha g}(v) = \arg\min_x \left\{ g(x) + \frac{1}{2\alpha} \|x - v\|^2 \right\}.$$

The proximal gradient iteration is

$$x^{k+1} = \operatorname{prox}_{\alpha g}\left(x^k - \alpha \nabla f(x^k)\right).$$

For regularizers like the $\ell^1$ norm, this reduces to soft-thresholding; for nuclear-norm penalties, the proximal step is singular value thresholding (Nikolovski et al., 2024, Zhang et al., 2015).
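As a concrete illustration, the iteration above with the soft-thresholding prox solves the lasso. The following is a minimal NumPy sketch; the problem sizes and regularization weight are arbitrary choices for the example:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(A, b, lam, n_iter=500):
    """Proximal gradient (ISTA) for the lasso:
    min_x 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    alpha = 1.0 / L                        # fixed step size 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)           # gradient step on the smooth part
        x = soft_threshold(x - alpha * grad, alpha * lam)  # proximal step
    return x

# Small synthetic sparse-recovery instance.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = proximal_gradient(A, b, lam=0.1)
```

The fixed step $\alpha = 1/L$ with $L = \|A\|_2^2$ guarantees descent; the adaptive schemes described below replace it with local estimates.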

Accelerated schemes (e.g., FISTA) and inertial variants (proximal heavy-ball, $\beta$-momentum) are formulated by supplementing the update with momentum terms or potential-function-based modifications, providing faster convergence in many cases (Sun et al., 2018, Chen et al., 2022, Bok et al., 2024).
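A FISTA-style accelerated variant adds one extrapolation step on top of the same prox update. A minimal sketch on the same kind of lasso instance (sizes and weights are again illustrative):

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding, the prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, b, lam, n_iter=300):
    """Accelerated proximal gradient (FISTA) for the lasso,
    attaining the O(1/k^2) objective rate."""
    L = np.linalg.norm(A, 2) ** 2
    alpha = 1.0 / L
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        x_next = soft_threshold(y - alpha * A.T @ (A @ y - b), alpha * lam)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # extrapolation
        x, t = x_next, t_next
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = fista(A, b, lam=0.1)
```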

2. Theoretical Guarantees and Rate Results

Proximal gradient descent enjoys the following canonical convergence properties:

  • Linear convergence under strong convexity: when $f$ is additionally $\mu$-strongly convex, each iteration, characterized by the gradient mapping

$$G_\alpha(x) = \frac{1}{\alpha}\left[x - \operatorname{prox}_{\alpha g}(x - \alpha \nabla f(x))\right],$$

contracts the distance to the minimizer exactly by a factor $p(\alpha) = \max\{|1 - L\alpha|, |1 - \mu\alpha|\}$ (Chen et al., 2022, Zhang et al., 2019). With the optimal step size $\alpha = 2/(L+\mu)$, this yields the tightest linear constant.

  • Under the Polyak–Łojasiewicz (PL) inequality, an improved objective convergence rate is attainable, specifically $(1-\eta\alpha)/(1+\eta\alpha)$ with appropriate $\eta$ (Zhang et al., 2019).

Recent work constructs potential-function frameworks leveraging norm-monotonicity of $G_\alpha(x)$ and refined descent lemmas, yielding tight $O(1/k)$ and accelerated $O(1/k^2)$ (function value) and $O(1/k^3)$ (mapping norm) rates for composite problems (Chen et al., 2022).

Adaptive step-size rules in ProxGD, driven by observed local gradient differences, permit larger per-iteration steps and require only local Lipschitzness, not a global $L$ (Malitsky et al., 2023, Nikolovski et al., 2024).
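The contraction claim above is easy to check numerically. The sketch below verifies the factor $p(\alpha) = \max\{|1-L\alpha|, |1-\mu\alpha|\}$ at the optimal step $\alpha = 2/(L+\mu)$ on a toy diagonal quadratic with $\ell^1$ regularization, chosen so that the minimizer is $x^* = 0$:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Diagonal quadratic f(x) = 0.5 * sum(q_i * x_i^2): mu = min(q), L = max(q).
q = np.array([1.0, 4.0, 10.0])
reg = 0.5                                  # weight of g = reg * ||x||_1
alpha = 2.0 / (q.min() + q.max())          # optimal step 2 / (L + mu)
p = max(abs(1.0 - q.max() * alpha), abs(1.0 - q.min() * alpha))

def pgd_step(x):
    return soft_threshold(x - alpha * q * x, alpha * reg)

# Here x* = 0 (0 lies in the subdifferential of f + g at 0), so ||x^k||
# is the distance to the minimizer; each step should shrink it by >= p.
x = np.array([5.0, -3.0, 2.0])
ratios = []
for _ in range(15):
    x_next = pgd_step(x)
    if np.linalg.norm(x) > 1e-12:
        ratios.append(np.linalg.norm(x_next) / np.linalg.norm(x))
    x = x_next
```

With $\mu = 1$, $L = 10$, the predicted factor is $p = 9/11 \approx 0.818$, and every observed per-step ratio stays below it.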

3. Variants: Acceleration, Momentum, and Adaptive Step Sizes

Several lines of research extend proximal gradient descent in the following directions:

  • Momentum and inertia: The proximal inertial gradient descent (PIGD) update,

$$x^{k+1} = \operatorname{prox}_{\alpha_k g}\left(x^k - \alpha_k \nabla f(x^k) + \beta_k(x^k - x^{k-1})\right),$$

achieves non-ergodic $O(1/k)$ rates for convex problems with constant momentum (under coercivity), and linear rates under error-bound conditions (Sun et al., 2018).

  • Accelerated step schedules: The "silver" stepsize schedule, based on a quasifractal pattern related to the silver ratio, accelerates vanilla PGD for smooth convex $f$ from $O(\varepsilon^{-1})$ iterations to $O(\varepsilon^{-0.7864})$ iterations without momentum or extrapolation. Under strong convexity, this generalizes to $O(\kappa^{0.7864} \log(1/\varepsilon))$ (Bok et al., 2024).
  • Adaptive step size: Step sizes estimated from local curvature, $\alpha_k = 1/L_k$ with $L_k$ computed from gradient differences, allow more aggressive updates while maintaining theoretical $O(1/k)$ rates without global Lipschitz constants (Malitsky et al., 2023, Nikolovski et al., 2024).
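A simplified sketch of such an adaptive rule on the lasso, estimating local curvature from successive gradient differences with a capped step-growth factor; the cited methods use carefully derived safeguards and constants, and the instance and initial step here are illustrative assumptions:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def adaptive_prox_gd(A, b, reg, n_iter=1000):
    """Proximal gradient for the lasso with step sizes estimated from
    observed gradient differences (a simplified sketch of the adaptive
    rule; no global Lipschitz constant is used)."""
    grad = lambda x: A.T @ (A @ x - b)
    x_prev = np.zeros(A.shape[1])
    g_prev = grad(x_prev)
    alpha_prev = alpha = 1e-4                 # small conservative initial step
    x = soft_threshold(x_prev - alpha * g_prev, alpha * reg)
    for _ in range(n_iter):
        g = grad(x)
        dx, dg = x - x_prev, g - g_prev
        cand = alpha * np.sqrt(1.0 + alpha / alpha_prev)   # capped growth
        if np.linalg.norm(dg) > 1e-12:
            # local inverse-curvature estimate ||dx|| / (2 ||dg||)
            cand = min(cand, np.linalg.norm(dx) / (2.0 * np.linalg.norm(dg)))
        alpha_prev, alpha = alpha, cand
        x_prev, g_prev = x, g
        x = soft_threshold(x - alpha * g, alpha * reg)
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = adaptive_prox_gd(A, b, reg=0.1)
```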

4. Distributed, Stochastic, and Variance-Reduced Proximal Schemes

For large-scale and distributed optimization, proximal schemes have been extended by:

  • Stochastic PGD: Replacing full gradients with stochastic (possibly minibatch) estimators enables scalable optimization for large data. Standard Prox-SGD achieves $O(1/\sqrt{T})$ (convex) or $O(1/T)$ (strongly convex) convergence (Zhang et al., 2015).
  • Variance reduction: Epoch-based schemes (SVRG, SAGA) incorporated into proximal stochastic updates yield linear convergence guarantees for empirical risk minimization (Huo et al., 2016). The DAP-SVRG algorithm achieves linear convergence for strongly convex problems by combining asynchronous variance-reduction, worker-side proximal steps, and elementwise server updates.
  • Decoupled asynchronous variants: To minimize bottlenecks, DAP-SGD and DAP-SVRG offload proximal operator computations to worker nodes, permitting highly parallel, lock-free updates. The master only aggregates elementwise corrections, allowing near-linear speedup with the number of workers and supporting composite regularizers such as group $\ell_2$, nuclear norm, and fused lasso (Li et al., 2016, Huo et al., 2016).
  • Low-rank and structure-exploiting approaches: For high-dimensional matrix problems, stochastic proximal updates leverage low-rank sketches of the gradient, yielding space complexity $O(m+n)$ rather than $O(mn)$ and convergence rates $O((\log T)/\sqrt{T})$ (convex) and $O((\log T)/T)$ (strongly convex) (Zhang et al., 2015).
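A minimal Prox-SGD sketch with minibatch gradients and a decaying step size; the schedule, batch size, and problem instance are illustrative choices, not those of any cited paper:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd(A, b, reg, n_epochs=200, batch=10, alpha0=0.05, seed=3):
    """Stochastic proximal gradient for
    min_x (1/2m)||Ax - b||^2 + reg * ||x||_1,
    using minibatch gradients and an O(1/sqrt(k)) step-size decay."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    k = 0
    for _ in range(n_epochs):
        perm = rng.permutation(m)
        for start in range(0, m, batch):
            idx = perm[start:start + batch]
            g = A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)  # minibatch gradient
            alpha = alpha0 / np.sqrt(1.0 + k)                # decaying step
            x = soft_threshold(x - alpha * g, alpha * reg)   # proximal step
            k += 1
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = prox_sgd(A, b, reg=0.01)
```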

5. Extensions: Bregman Geometry, Online Optimization, and Plug-and-Play

  • Bregman Proximal Gradient Descent (BPGD): PGD generalizes to problems endowed with a geometry induced by a strictly convex function $h$ via the Bregman divergence $D_h(x, y) = h(x) - h(y) - \langle \nabla h(y), x - y \rangle$. When $f$ is relatively smooth with respect to $h$, BPGD maintains descent with $O(1/k)$ rates and, under relative strong convexity or a Bregman-PL condition, achieves global linear convergence. Multilevel Bregman schemes (ML-BPGD) exploit hierarchical discretizations and accelerate convergence for imaging and inverse problems (Elshiaty et al., 4 Jun 2025).
  • Online and inexact proximal updates: In streaming or adversarial dynamic settings, proximal OGD tracks a shifting composite optimum and admits dynamic regret bounds scaling with the path length of the optimum and cumulative gradient error (Dixit et al., 2018).
  • Plug-and-Play and learned regularization: Proximal-gradient algorithms lend themselves to integration with learned proximal maps (e.g., deep denoisers). Plug-and-play (PnP) methods replace the analytic proximal operator with a pretrained denoiser, provided the denoiser is the proximal operator of some (possibly nonconvex, weakly convex) functional. Convergence can be recovered by introducing relaxation parameters into the update and controlling the step size, with sufficient conditions on the regularization and data-fidelity weights (Hurault et al., 2023, Luo et al., 2022, Mardani et al., 2018).
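As one concrete instance of the Bregman geometry above: with $h$ the negative entropy and $g$ the indicator of the probability simplex, the BPGD update has a closed-form exponentiated-gradient (multiplicative) step. A sketch on a toy least-squares problem over the simplex; the sizes, normalization, and step size are illustrative assumptions:

```python
import numpy as np

def bregman_prox_grad(A, b, n_iter=2000, alpha=0.2):
    """Bregman proximal gradient with h(x) = sum_i x_i log x_i (negative
    entropy) and g the indicator of the probability simplex; the update
    reduces to a multiplicative step followed by normalization."""
    n = A.shape[1]
    x = np.full(n, 1.0 / n)              # start at the simplex center
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)         # gradient of f(x) = 0.5||Ax - b||^2
        x = x * np.exp(-alpha * grad)    # mirror step in entropy geometry
        x /= x.sum()                     # closed-form Bregman prox onto simplex
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 5)) / np.sqrt(30)   # normalized so f is well-scaled
x_true = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # ground truth on the simplex
b = A @ x_true
x_hat = bregman_prox_grad(A, b)
```

The multiplicative form keeps iterates strictly positive and on the simplex without any Euclidean projection, which is the point of choosing the entropy geometry here.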

6. Practical Implementation and Applications

  • Implementation: Each iteration requires one gradient evaluation and one proximal step, both of which can be computed efficiently for many standard regularizers (soft-thresholding for $\ell^1$, groupwise shrinkage, or singular value thresholding) (Nikolovski et al., 2024). Adaptive and accelerated schemes require only minor bookkeeping or curvature estimation. Distributed and asynchronous execution (DAP-type algorithms) permits scalable training over large data and models with minimal master bottleneck.
  • Applications: Proximal gradient descent and its variants are core solvers for compressed sensing, sparse regression, low-rank matrix recovery, robust PCA, penalized multi-block CCA, and large-scale empirical risk minimization (Guan, 2022, Zhang et al., 2015, Nikolovski et al., 2024, Mardani et al., 2018, Luo et al., 2022).
  • Empirical findings: Variable-step-size and adaptive PGD frequently reduce wall time and iteration counts by 40–50% relative to conservative fixed-step methods (Malitsky et al., 2023, Nikolovski et al., 2024). Learned nonlinear proximals in imaging and inverse problems yield substantial empirical gains, e.g., a 3 dB PSNR improvement in MRI reconstruction over traditional $\ell^1$ regularization (Luo et al., 2022).
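The singular value thresholding prox mentioned above admits a very short implementation; a minimal NumPy sketch:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * ||.||_*
    (the nuclear norm), used as the prox step in low-rank matrix recovery."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Singular values 3, 1, 0.2 shrink to 2.5, 0.5, 0 under threshold 0.5.
X = svt(np.diag([3.0, 1.0, 0.2]), 0.5)
```

Plugging `svt` in place of soft-thresholding in the proximal gradient iteration yields the standard solver for nuclear-norm-regularized problems.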

Proximal gradient descent interpolates between pure first-order methods for smooth losses and nonsmooth minimization by subgradients or projections: unlike subgradient methods, it retains the fast rates of smooth gradient descent despite the nonsmooth term, and unlike projection-based methods, it handles general structured regularizers rather than only constraint sets.

Proximal gradient descent is thus the central algorithmic primitive for composite regularized optimization, underpinning scalable, robust, and adaptive solutions across theory and modern applications. The extensive theoretical and algorithmic toolkit guarantees performance across diverse regimes (accelerated, distributed, stochastic, adaptive), supported by rigorous and illustrative work in the literature (Nikolovski et al., 2024, Bok et al., 2024, Zhang et al., 2015, Malitsky et al., 2023, Chen et al., 2022, Sun et al., 2018, Li et al., 2016, Huo et al., 2016, Elshiaty et al., 4 Jun 2025, Zhang et al., 2019, Guan, 2022, Hurault et al., 2023, Luo et al., 2022, Dixit et al., 2018, Mardani et al., 2018).
