Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under $(L_0,L_1)$-Smoothness

Published 22 Dec 2024 in math.OC | (2412.17050v2)

Abstract: The gradient descent (GD) method is a fundamental and likely the most popular optimization algorithm in ML, with a history tracing back to 1847 (Cauchy, 1847). It has been studied under various assumptions, including so-called $(L_0,L_1)$-smoothness, which has recently received noticeable attention in the ML community. In this paper, we provide a refined convergence analysis of gradient descent and its variants under generalized smoothness. In particular, we show that $(L_0,L_1)$-GD behaves as follows in the convex setup: as long as $\|\nabla f(x^k)\| \geq \frac{L_0}{L_1}$, the algorithm converges linearly in function suboptimality, and once $\|\nabla f(x^k)\| < \frac{L_0}{L_1}$, $(L_0,L_1)$-GD has the standard sublinear rate. Moreover, we show that this behavior is shared by its variants with different types of oracle: Normalized Gradient Descent and Clipped Gradient Descent (when the full gradient $\nabla f(x)$ is available); Random Coordinate Descent (when the gradient component $\nabla_{i} f(x)$ is available); and Random Coordinate Descent with Order Oracle (when only $\text{sign} [f(y) - f(x)]$ is available). Finally, we extend our analysis of $(L_0,L_1)$-GD to the strongly convex case.
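The two-regime behavior described in the abstract can be illustrated with a minimal sketch. The abstract does not state the step-size rule, so the code assumes $\eta_k = 1/(L_0 + L_1\|\nabla f(x^k)\|)$, a choice commonly paired with the usual definition of $(L_0,L_1)$-smoothness, $\|\nabla^2 f(x)\| \leq L_0 + L_1\|\nabla f(x)\|$. The test function $f(x) = \sum_i \cosh(x_i)$, the helper name `l0l1_gd`, and the constants $L_0 = L_1 = 1$ are illustrative assumptions, not taken from the paper.

```python
import math


def l0l1_gd(grad, x0, L0, L1, steps):
    """Gradient descent with the ASSUMED step size 1 / (L0 + L1 * ||grad f(x)||).

    Returns the final iterate and the gradient norm recorded at each step.
    """
    x = list(x0)
    grad_norms = []
    for _ in range(steps):
        g = grad(x)
        g_norm = math.sqrt(sum(gi * gi for gi in g))
        grad_norms.append(g_norm)
        eta = 1.0 / (L0 + L1 * g_norm)  # assumed rule, not stated in the abstract
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x, grad_norms


# Illustrative convex objective: f(x) = sum_i cosh(x_i), minimized at x = 0.
# Its Hessian is diag(cosh(x_i)) and cosh(t) <= 1 + |sinh(t)|, so it satisfies
# the smoothness condition above with L0 = L1 = 1.
def f(x):
    return sum(math.cosh(xi) for xi in x)


def grad_f(x):
    return [math.sinh(xi) for xi in x]


L0, L1 = 1.0, 1.0
x_final, grad_norms = l0l1_gd(grad_f, [5.0, -3.0], L0, L1, steps=200)
gap = f(x_final) - 2.0  # f* = 2 for two coordinates

# Index of the first iterate whose gradient norm is below the threshold L0 / L1.
switch = next(k for k, g in enumerate(grad_norms) if g < L0 / L1)
print("suboptimality gap:", gap)
print("first step with ||grad|| < L0/L1:", switch)
```

In this sketch the run starts with $\|\nabla f(x^0)\| > L_0/L_1$, the regime in which the abstract claims linear convergence, and `switch` marks the first iterate at which the gradient norm drops below the threshold and the sublinear guarantee applies instead.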