Decentralized Nonsmooth Nonconvex Optimization
- Decentralized nonsmooth nonconvex optimization is a framework for collaboratively minimizing nondifferentiable and nonconvex objectives across networked agents.
- The approach leverages first-order, proximal, and gradient-free techniques to handle structured problems, including minimax and nonlinear constraints, with robust convergence guarantees.
- Recent advancements demonstrate rate-optimal convergence to generalized stationary points while improving communication efficiency via consensus and spectral gap methods.
Decentralized nonsmooth nonconvex optimization addresses collaborative minimization and saddle-point problems with nonconvex, nonsmooth objectives distributed over a network of agents communicating under a specified topology. The area encompasses first-order, proximal, gradient-free, and stochastic methods, as well as minimax structures and nonlinear or coupled constraints. Recent advances have produced algorithms with provable global, and sometimes rate-optimal, convergence to generalized stationary points, often leveraging innovations in consensus control, randomized smoothing, subdifferential calculus, and online-to-nonconvex reduction.
1. Problem Formulation and Mathematical Setting
The canonical decentralized nonsmooth nonconvex optimization problem involves $n$ agents, each with access to a private local cost $f_i$ (possibly stochastic and nondifferentiable), seeking to minimize the aggregate

$$\min_{x \in \mathbb{R}^d} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x).$$
Communication is restricted by a network graph $\mathcal{G}$, often modeled by a symmetric, doubly stochastic mixing matrix $W$ with spectral gap $1 - \lambda_2(W)$, where $\lambda_2(W)$ is the second-largest eigenvalue modulus. Agents may only communicate with their direct neighbors. In the nonsmooth nonconvex regime, each $f_i$ is typically assumed Lipschitz continuous (possibly accessed via local stochastic samples), and optimization targets the Goldstein $(\delta,\epsilon)$-stationarity criterion

$$\mathrm{dist}\big(0, \partial_\delta f(x)\big) \le \epsilon,$$

where $\partial_\delta f(x)$ denotes the δ-Goldstein subdifferential, i.e., the convex hull of Clarke subgradients taken over a δ-ball around $x$ (Chen et al., 27 Jan 2026, Lin et al., 2023, Sahinoglu et al., 2024). For structured problems, such as nonconvex–strongly concave minimax or composite objectives, additional local regularizers $r_1$ and $r_2$ may be included, and constraints may involve nonlinear or nonconvex couplings (Xu, 2023, Yang et al., 2020).
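The Goldstein criterion can be made concrete in one dimension. A minimal numerical sketch (the function `goldstein_gap` and all parameters are our own illustration, not from any cited paper): for $f(x) = |x|$, the δ-Goldstein subdifferential at $x$ is the interval spanned by Clarke subgradients over the δ-ball.

```python
import numpy as np

# One-dimensional illustration: for f(x) = |x|, the delta-Goldstein
# subdifferential at x is the interval spanned by Clarke subgradients
# sign(y) over y in [x - delta, x + delta]; the (delta, eps) criterion
# asks how far 0 is from that convex hull.

def goldstein_gap(x, delta, n_samples=10001):
    ys = np.linspace(x - delta, x + delta, n_samples)
    grads = np.sign(ys)                # Clarke subgradients of |.|
    lo, hi = grads.min(), grads.max()  # 1-D convex hull is [lo, hi]
    if lo <= 0.0 <= hi:
        return 0.0
    return min(abs(lo), abs(hi))       # distance from 0 to the hull

# x = 0.3 is not Clarke-stationary (its gradient is 1), but it becomes
# (delta, 0)-Goldstein-stationary once the delta-ball covers the kink at 0.
print(goldstein_gap(0.3, 0.5))   # 0.0
print(goldstein_gap(0.3, 0.1))   # 1.0
```

This illustrates why Goldstein stationarity is a meaningful relaxation: points within δ of a kink can certify near-stationarity even where the Clarke subdifferential excludes zero.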
2. Algorithmic Methodologies
2.1. First-Order and Subgradient-Based Algorithms
Decentralized stochastic subgradient descent (DSGD) forms the primary baseline. Each agent $i$ updates its local copy $x_i^k$ via a local subgradient $g_i^k \in \partial f_i(x_i^k)$ (possibly stochastic) and mixes updates from neighbors via the matrix $W$:

$$x_i^{k+1} = \sum_{j=1}^{n} w_{ij} x_j^k - \alpha_k g_i^k,$$

with diminishing stepsizes $\alpha_k$ (Kungurtsev, 2019, Zhang et al., 2024). Ergodic convergence to Clarke-stationary points (or stable sets) is guaranteed under standard stochastic-approximation assumptions.
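The DSGD update above can be sketched on a toy problem (the ring topology, data, and stepsize schedule are our own illustration): $n$ agents jointly minimize $\frac{1}{n}\sum_i |x - b_i|$, whose minimizer is the median of the $b_i$.

```python
import numpy as np

# Minimal DSGD sketch: n agents on a ring minimize (1/n) * sum_i |x - b_i|,
# whose minimizer is the median of b.
n = 5
b = np.array([-2.0, -1.0, 0.0, 1.0, 4.0])    # local data; median is 0

# Symmetric doubly stochastic ring mixing matrix: self weight 1/2,
# each of the two neighbors 1/4.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
x = 5.0 * rng.normal(size=n)                 # local copies, far from consensus
for k in range(1, 3001):
    g = np.sign(x - b)                       # local subgradients of |x - b_i|
    x = W @ x - g / np.sqrt(k)               # gossip step + diminishing stepsize

print(np.ptp(x))                             # disagreement across agents: small
print(np.mean(x))                            # network average: near the median
```

Note the two error sources the theory must control: consensus disagreement (driven to zero by the mixing step) and optimality error (driven to zero by the diminishing stepsizes).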
2.2. Proximal and Composite Approaches
When the problem involves composite terms $f_i = s_i + r_i$, with $s_i$ smooth and $r_i$ nonsmooth (possibly nonconvex but proximable), Prox-DGD applies the operator splitting

$$x_i^{k+1} = \mathrm{prox}_{\alpha_k r_i}\Big( \sum_{j=1}^{n} w_{ij} x_j^k - \alpha_k \nabla s_i(x_i^k) \Big),$$

enabling treatment of $\ell_q$ quasi-norms ($0 < q < 1$), SCAD, MCP, and indicator functions of (possibly nonconvex) sets (Zeng et al., 2016).
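A minimal sketch of this splitting (the toy instance, graph, and parameters are our own; a convex $\ell_1$ regularizer is used for readability, though the scheme also admits nonconvex proximable terms):

```python
import numpy as np

# Prox-DGD sketch. Each agent holds a smooth term s_i(x) = 0.5 * (x - b_i)^2
# plus a shared proximable regularizer r(x) = lam * |x| (convex here for
# readability; SCAD or MCP proxes would slot in the same way).

def soft_threshold(z, t):
    # proximal operator of t * |.|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

n = 4
b = np.array([0.5, 1.0, 1.5, 2.0])
lam, alpha = 0.5, 0.1
W = np.full((n, n), 1.0 / n)           # complete graph: uniform averaging

x = np.zeros(n)
for _ in range(500):
    grad = x - b                        # gradients of the smooth parts s_i
    x = soft_threshold(W @ x - alpha * grad, alpha * lam)

# Centralized solution: argmin 0.5*mean((x - b_i)^2) + lam*|x|
#                     = soft_threshold(mean(b), lam) = 1.25 - 0.5 = 0.75.
print(np.mean(x))                       # network average near 0.75
```

Each iteration is a gossip step, a local gradient step, and a local prox, so the per-agent cost stays a single proximal evaluation.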
2.3. Gradient-Free and Zeroth-Order Methods
When only function-value oracles (not gradients) are available, two-point randomized-smoothing estimators are utilized. DGFM (Lin et al., 2023) and ME-DOL (Sahinoglu et al., 2024) build sufficiently accurate gradient surrogates via

$$\hat g = \frac{d}{2\delta}\big(f(x + \delta u) - f(x - \delta u)\big)\, u, \qquad u \sim \mathrm{Unif}(\mathbb{S}^{d-1}),$$

enabling decentralized, gradient-tracking–augmented updates. Variance-reduced variants (e.g., DGFM$^+$) further improve query complexity.
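A quick sanity check of the two-point estimator (the quadratic test function and sample counts are our own illustration): averaged over many sphere directions, the estimator approximates the gradient of the δ-smoothed surrogate, which for a smooth quadratic is essentially the gradient itself.

```python
import numpy as np

# Monte Carlo check of g_hat = (d/(2*delta)) * (f(x+delta*u) - f(x-delta*u)) * u
# with u uniform on the unit sphere.
rng = np.random.default_rng(1)
d, delta, m = 5, 1e-3, 20000
A = np.diag(np.arange(1.0, d + 1.0))            # f(x) = 0.5 * x^T A x
x = np.ones(d)

U = rng.normal(size=(m, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # m points on the unit sphere

def fvals(X):
    return 0.5 * np.einsum('ij,jk,ik->i', X, A, X)

diff = fvals(x + delta * U) - fvals(x - delta * U)
g_hat = np.mean((d / (2 * delta)) * diff[:, None] * U, axis=0)

print(np.linalg.norm(g_hat - A @ x))            # small Monte Carlo error
```

The dimension factor $d$ in front of the difference is exactly what makes the estimator unbiased for the smoothed gradient, and is the source of the polynomial dimension dependence in zeroth-order complexity bounds.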
2.4. Minimax and Nonlinear Constrained Problems
Structured minimax problems (e.g., decentralized nonconvex–strongly-concave) require simultaneous minimization over the primal variable $x$ and maximization over the dual variable $y$, with possible nonsmooth regularizers. The D-GDMax method (Xu, 2023) reformulates the saddle-point problem to absorb dual consensus constraints into Lagrange multipliers, enabling exact local maximization (not just a gradient-ascent step) and more aggressive stepsizes, decoupling consensus from nonsmoothness in the dual variables. For nonlinear constraints (e.g., coupled equalities $c(x) = 0$), methods such as PLDM use proximal linearization together with an augmented Lagrangian scheme to avoid exact local solves at each iteration (Yang et al., 2020).
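The exact-inner-maximization idea can be sketched on a scalar toy problem (our own illustration, not the full D-GDMax scheme): when the inner maximum has a closed form, the outer loop is plain gradient descent on the primal function $\phi(x) = \max_y f(x, y)$.

```python
import numpy as np

# Toy GDmax-style loop. f(x, y) = sin(x)*y - 0.5*y^2 is nonconvex in x and
# strongly concave in y; the inner maximization is exact (y*(x) = sin(x)),
# mirroring an exact local max step, and the outer loop is gradient descent
# on phi(x) = max_y f(x, y) = 0.5*sin(x)^2.
x, alpha = 1.0, 0.2
for _ in range(500):
    y = np.sin(x)                   # exact inner maximization
    x -= alpha * np.cos(x) * y      # gradient step on phi: phi'(x) = sin(x)cos(x)

print(abs(np.sin(x) * np.cos(x)))   # stationarity measure of phi, near zero
```

Exact inner maximization is what licenses the larger outer stepsizes: the outer iteration sees the true primal function rather than a lagged ascent iterate.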
3. Convergence, Complexity, and Theoretical Guarantees
3.1. Stationarity Measures
Owing to nonsmoothness and nonconvexity, stationarity is measured via:
- Clarke stationarity: $0 \in \partial f(x)$ (Kungurtsev, 2019, Zhang et al., 2024)
- Goldstein $(\delta,\epsilon)$-stationarity: $\mathrm{dist}(0, \partial_\delta f(x)) \le \epsilon$ (Chen et al., 27 Jan 2026, Lin et al., 2023, Sahinoglu et al., 2024)
- $\epsilon$-critical KKT residuals for constrained problems (Yang et al., 2020).
3.2. Complexity Bounds
Optimal (dimension-independent) sample complexity for Goldstein $(\delta,\epsilon)$-stationarity in the decentralized, nonsmooth, nonconvex, stochastic first-order setting is $O(\delta^{-1}\epsilon^{-3})$ (Sahinoglu et al., 2024, Chen et al., 27 Jan 2026). For zeroth-order/gradient-free settings, the best results match the centralized bound up to a polynomial dimension factor, e.g., $O(d^{3/2}\delta^{-1}\epsilon^{-3})$ with variance reduction (Lin et al., 2023). Communication complexity generally matches sample complexity times a network-dependent factor (inverse spectral gap). Chebyshev-accelerated gossip contracts consensus error by a factor of roughly $1 - \sqrt{\rho}$ per iteration, where $\rho$ is the mixing-matrix spectral gap (Chen et al., 27 Jan 2026).
3.3. Global and Local Convergence
Asymptotic convergence (without nonasymptotic rates) is established for decentralized stochastic subgradient methods under mild assumptions via perturbed differential inclusion and Lyapunov methods (Kungurtsev, 2019, Zhang et al., 2024). PLDM and related augmented Lagrangian/prox-linear methods guarantee convergence to critical points under the Kurdyka–Łojasiewicz property, with possible linear (or sublinear) rates depending on the KL exponent (Yang et al., 2020, Zeng et al., 2016).
3.4. Complexity Comparison Table
| Algorithm | Setting | Sample Complexity |
|---|---|---|
| ME-DOL | First-/zeroth-order | $O(\delta^{-1}\epsilon^{-3})$ (Sahinoglu et al., 2024) |
| DGFM$^+$ | Zeroth-order | $O(d^{3/2}\delta^{-1}\epsilon^{-3})$ (Lin et al., 2023) |
| DOCS | First-/zeroth-order | $O(\delta^{-1}\epsilon^{-3})$, with communication improved via Chebyshev-accelerated consensus (Chen et al., 27 Jan 2026) |
| D-GDMax | Minimax, NC–SC composite | polynomial in $\epsilon^{-1}$ (Xu, 2023) |
| Prox-DGD | Proximable composite | $O(1/k)$ ergodic rate (convex); asymptotic convergence otherwise (Zeng et al., 2016) |
4. Fundamental Techniques: Smoothing, Consensus, and Subdifferential Calculus
4.1. Randomized Smoothing
Randomized smoothing approximates a nonsmooth $f$ by $f_\delta(x) = \mathbb{E}_{u \sim \mathrm{Unif}(\mathbb{B})}[f(x + \delta u)]$, yielding a smooth surrogate whose gradient lies in the Goldstein subdifferential: $\nabla f_\delta(x) \in \partial_\delta f(x)$. This underpins both theoretical analysis and practical implementations in DGFM, ME-DOL, and related methods (Lin et al., 2023, Sahinoglu et al., 2024).
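The inclusion $\nabla f_\delta(x) \in \partial_\delta f(x)$ can be verified numerically in one dimension (our own illustration): for $f(x) = |x|$ and $u$ uniform on $[-1,1]$, the smoothed derivative is $x/\delta$ on $|x| \le \delta$, which stays inside the Goldstein interval $[-1,1]$.

```python
import numpy as np

# For f(x) = |x| and u ~ Uniform[-1, 1], the smoothed function
# f_delta(x) = E[f(x + delta*u)] has derivative
# f_delta'(x) = E[sign(x + delta*u)] = x/delta for |x| <= delta,
# which always lies in the delta-Goldstein subdifferential [-1, 1] of |.|.
rng = np.random.default_rng(2)
delta, x = 0.5, 0.2
u = rng.uniform(-1.0, 1.0, size=1_000_000)
mc = np.mean(np.sign(x + delta * u))   # Monte Carlo estimate of f_delta'(x)
print(mc)                              # close to x / delta = 0.4
```
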
4.2. Consensus Mechanisms
Almost all decentralized algorithms deploy spectral-mixing (gossip or Metropolis) matrices to control disagreement. Temporally decaying step-sizes and Chebyshev-accelerated consensus (especially for communication-critical settings) are essential for provable convergence in sparse or poorly connected graphs (Kungurtsev, 2019, Chen et al., 27 Jan 2026). The spectral gap determines the rate of consensus contraction.
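The link between spectral gap and contraction rate can be checked directly (the path graph and parameters are our own illustration): consensus error under repeated gossip contracts per round at roughly the second-largest eigenvalue modulus of $W$, i.e., one minus the spectral gap.

```python
import numpy as np

# Gossip on a path graph with Metropolis weights w_ij = 1/(1 + max(deg_i, deg_j)).
n = 10
deg = np.array([1] + [2] * (n - 2) + [1])    # path-graph degrees
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0 / (1 + max(deg[i], deg[i + 1]))
for i in range(n):
    W[i, i] = 1.0 - W[i].sum()               # make rows sum to one

lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]   # 2nd-largest modulus

rng = np.random.default_rng(3)
x = rng.normal(size=n)
err = [np.linalg.norm(x - x.mean())]
for _ in range(200):
    x = W @ x                                 # one gossip round
    err.append(np.linalg.norm(x - x.mean()))

rate = (err[-1] / err[-51]) ** (1 / 50)       # empirical contraction factor
print(lam2, rate)                             # the two numbers nearly coincide
```

For poorly connected graphs $\lambda_2$ is close to 1, which is exactly the regime where Chebyshev-type acceleration pays off.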
4.3. Gradient Tracking and Variance Reduction
Decentralized gradient-tracking adds auxiliary variables to enable the local recovery of global directional information, improving error contraction and sample complexity. Variance reduction via SPIDER or multi-batch schemes further improves efficiency in the stochastic zeroth-order regime (Lin et al., 2023).
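A minimal gradient-tracking sketch (our own toy instance) makes the mechanism concrete: the auxiliary variables $y_i$ preserve the invariant $\overline{y^k} = \overline{\nabla f_i(x_i^k)}$ at every iteration, so each agent locally tracks global first-order information.

```python
import numpy as np

# Gradient tracking on a smooth toy problem: f_i(x) = 0.5 * (x - b_i)^2.
n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])

def grad(x):
    return x - b                               # stacked per-agent gradients

W = np.full((n, n), 1.0 / n)                   # uniform averaging for simplicity
alpha = 0.3

x = np.zeros(n)
y = grad(x)                                    # initialize trackers
for _ in range(100):
    x_new = W @ x - alpha * y
    y = W @ y + grad(x_new) - grad(x)          # gradient-tracking update
    x = x_new

print(abs(y.mean() - grad(x).mean()))          # invariant holds to round-off
print(np.abs(x - b.mean()).max())              # all agents near x* = 2.5
```

The invariant follows because each update adds the change in local gradients while the mixing step preserves the mean of $y$.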
4.4. Subgradient and Set-Valued Analysis
Owing to nonsmoothness, the analysis requires careful use of generalized subdifferentials (Clarke, Goldstein, or conservative field mappings), together with SA-based or differential inclusion arguments for convergence (Zhang et al., 2024).
5. Structured Problem Classes and Applications
5.1. Composite Minimax Optimization
D-GDMax targets decentralized nonconvex–strongly-concave minimax games with convex nonsmooth terms in both variables. Reformulation introduces local copies and dual variables, allowing aligned maximization and decoupled nonsmoothness handling, achieving improved complexity and global convergence guarantees (Xu, 2023).
5.2. Nonlinear and Coupled Constraints
PLDM addresses decentralized problems with nonlinear equality and bound constraints by combining local proximal linearization with Gauss–Seidel updates and adaptive penalty Lagrangian mechanisms (Yang et al., 2020). This technique avoids heavy local solves required in ADMM-like frameworks and enables provable convergence in coupled settings.
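The proximal-linearization idea can be sketched on a centralized toy problem (our own simplification of the mechanism, not the decentralized scheme of Yang et al., 2020): each step linearizes the nonlinear constraint at the current iterate inside an augmented Lagrangian, so the primal update is a single damped gradient step rather than an exact local solve.

```python
import numpy as np

# Toy proximal-linearization / augmented-Lagrangian iteration:
#   minimize ||x - x0||^2  subject to  c(x) = ||x||^2 - 1 = 0.
# lam is the multiplier, rho the penalty, tau the primal stepsize.
x0 = np.array([2.0, 0.0])
x = np.array([1.5, 0.5])
lam, rho, tau = 0.0, 1.0, 0.05

for _ in range(500):
    c = x @ x - 1.0
    gc = 2.0 * x                                # gradient of the constraint c
    gf = 2.0 * (x - x0)                         # gradient of the objective f
    x = x - tau * (gf + (lam + rho * c) * gc)   # prox-linearized primal step
    lam = lam + rho * (x @ x - 1.0)             # multiplier (dual) update

print(x, x @ x - 1.0)                           # near the projection [1, 0], feasible
```

The iterate converges to the projection of $x_0$ onto the unit circle with multiplier $\lambda \to 1$, satisfying the KKT conditions without ever solving a constrained subproblem exactly.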
5.3. Empirical Benchmarks
Applications include:
- Distributionally robust logistic regression (minimax, D-GDMax) (Xu, 2023).
- Nonconvex SVM with capped-$\ell_1$ penalty and adversarial attacks (DGFM, ME-DOL, DOCS) (Lin et al., 2023, Chen et al., 27 Jan 2026, Sahinoglu et al., 2024).
- Federated deep neural network training (subgradient-based, ResNet on CIFAR, ReLU networks) (Zhang et al., 2024).
Empirical results consistently confirm the theoretical iteration/sample/communication advantages of the corresponding algorithms over previous baselines.
6. Open Directions and Future Work
Relevant challenges and future prospects include:
- Extension to general (merely concave or even nonconcave) dual variables in minimax problems (Xu, 2023).
- Acceleration via variance reduction, adaptive stepsizes, or momentum in nonsmooth/nonconvex decentralized regimes (Lin et al., 2023, Zhang et al., 2024).
- Precise trade-off analysis between communication, computation, and sample complexity, especially as network topology varies (Chen et al., 27 Jan 2026).
- Handling structured constraints using non-Euclidean prox setups, or relaxing smoothness/regularity conditions via advanced subgradient interpolation or bundle methods (Yang et al., 2020, Lin et al., 2023).
- Global nonasymptotic rates beyond purely asymptotic guarantees, particularly sublinear or linear rates for classes with additional structure (KL property, weak convexity) (Zeng et al., 2016, Yang et al., 2020).
Fundamental questions persist regarding information-theoretic lower bounds, robustness to heterogeneous stochasticity, and the design of adaptive, communication-efficient decentralized protocols capable of scaling to extremely large networks or high-dimensional Lipschitz nonconvex regimes.