
Practical Inexact Proximal Quasi-Newton Method with Global Complexity Analysis

Published 26 Nov 2013 in cs.LG, math.OC, and stat.ML | (1311.6547v4)

Abstract: Recently several methods were proposed for sparse optimization which make careful use of second-order information [10, 28, 16, 3] to improve local convergence rates. These methods construct a composite quadratic approximation using Hessian information, optimize this approximation using a first-order method, such as coordinate descent, and employ a line search to ensure sufficient descent. Here we propose a general framework, which includes slightly modified versions of existing algorithms and also a new algorithm, which uses limited memory BFGS Hessian approximations, and provide a novel global convergence rate analysis, which covers methods that solve subproblems via coordinate descent.

Citations (79)

Summary

  • The paper introduces an inexact proximal quasi-Newton method utilizing randomized coordinate descent to achieve global sublinear convergence.
  • It demonstrates that modest increases in coordinate steps per iteration suffice to control stochastic subproblem errors effectively.
  • Empirical results on sparse inverse covariance and logistic regression tasks validate the method’s competitive efficiency.

Practical Inexact Proximal Quasi-Newton Methods: Complexity Theory and Algorithmic Insights

Overview

This paper presents a rigorous theoretical and practical investigation of inexact proximal quasi-Newton (QN) algorithms for composite convex optimization problems, particularly those that involve an $\ell_1$-norm regularizer. The core focus is on a general framework that encapsulates and extends existing methods, including variants using limited-memory BFGS (LBFGS) Hessian approximations and subproblem solvers based on randomized coordinate descent (RCD). A key contribution is the complete global complexity analysis that quantifies how inexactness in subproblem solutions and Hessian approximations affects convergence, and how those effects can be managed in a provably efficient manner.

Problem Formulation and Algorithmic Structure

The target convex optimization problem is of the form:

$$\min_{x \in \mathbb{R}^n} F(x) = f(x) + g(x),$$

where $f$ is a smooth convex function with Lipschitz-continuous gradient, and $g$ is a convex, possibly non-smooth, structured regularizer (with a particular focus on $g(x) = \lambda\|x\|_1$ for sparse optimization). Unlike much prior work, $g$ is only required to be efficiently solvable when composed with a general positive-definite quadratic term, not necessarily having a closed-form proximal mapping.
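For concreteness, a standard instance of this composite form is the lasso, with $f$ a least-squares loss and $g$ the $\ell_1$ regularizer (the data values below are purely illustrative):

```python
import numpy as np

# Illustrative lasso instance of F(x) = f(x) + g(x); A, b, lam are made up.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])
lam = 0.1

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # smooth, Lipschitz gradient
g = lambda x: lam * np.sum(np.abs(x))          # convex, non-smooth
F = lambda x: f(x) + g(x)
grad_f = lambda x: A.T @ (A @ x - b)
```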

The proposed method is an inexact proximal QN framework; each iteration $k$ proceeds as follows:

  1. Construct a composite quadratic model at the current iterate $x^k$ with Hessian approximation $H_k$ (diagonal, QN, or LBFGS).
  2. Approximately solve the subproblem

$$\min_x g(x) + \frac{1}{2}\|x - z\|_{H_k}^2$$

for $z = x^k - H_k^{-1}\nabla f(x^k)$, using a randomized coordinate descent inner loop to a specified tolerance.

  3. Update $x^{k+1}$ with the subproblem solution if a sufficient decrease criterion is met; otherwise, adaptively increase the prox parameter.

The framework supports both exact and inexact subproblem solutions, including random approximation errors introduced by stochastic solvers.
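Steps 1–3 can be sketched in stripped-down form. The code below (all names hypothetical) uses the simplest admissible Hessian approximation, $H_k = \mu_k I$, so the subproblem has a closed-form soft-thresholding solution; the paper's practical variant instead uses an LBFGS $H_k$ with an RCD inner solver:

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form solution of min_x t*|x| + 0.5*(x - z)^2, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def inexact_prox_qn(grad_f, f, lam, x0, mu0=1.0, rho=0.5, max_iter=200, tol=1e-10):
    """Minimal sketch of the framework with H_k = mu_k * I (not the paper's
    LBFGS variant). A sufficient-decrease test replaces the line search:
    if the accepted decrease is not achieved, the prox parameter mu grows."""
    x, mu = x0.copy(), mu0
    F = lambda v: f(v) + lam * np.sum(np.abs(v))
    for _ in range(max_iter):
        g = grad_f(x)
        while True:
            x_new = soft_threshold(x - g / mu, lam / mu)   # exact subproblem solve
            d = x_new - x
            # model decrease: g^T d + lam * (||x_new||_1 - ||x||_1)
            model_dec = g @ d + lam * (np.sum(np.abs(x_new)) - np.sum(np.abs(x)))
            if F(x_new) <= F(x) + rho * model_dec or np.linalg.norm(d) < tol:
                break
            mu *= 2.0   # prox-parameter update instead of Armijo backtracking
        if np.linalg.norm(d) < tol:
            break
        x = x_new
    return x
```

Doubling $\mu$ on rejection plays the role the paper assigns to the prox-parameter update: it guarantees sufficient decrease without a backtracking line search on the step itself.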

Theoretical Developments

Sublinear Global Convergence with Inexactness

The paper provides a nontrivial extension of complexity guarantees from classical first-order methods (ISTA, FISTA) to the inexact QN and LBFGS-proximal regimes. Standard proximal gradient arguments do not extend directly to QN frameworks; the analysis overcomes this by leveraging tools from smooth optimization [e.g., Nesterov, Polyak].

Key theoretical results include:

  • Global sublinear rate for the objective residual $F(x^k) - F^*$ of $O(1/k)$ under standard regularity assumptions, boundedness of Hessian eigenvalues, and appropriately decaying error tolerances for the subproblem solutions.
  • The required inexactness schedule: It is sufficient that the expected or high-probability subproblem error at iteration $k$, $\phi_k$, decays as $O(1/k^2)$ (or even geometrically in the inner loop), to retain sublinear convergence of the overall algorithm.
  • Randomized coordinate descent subproblems: Using the RCD complexity bounds of [Richtárik & Takáč], the analysis shows that a modestly increasing number of coordinate steps per subproblem (e.g., $O(k)$ or $O(\log k)$ per outer iteration, depending on conditioning) is sufficient to guarantee the necessary subproblem accuracy with high probability.
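The inner RCD solver can be sketched as follows (illustrative, with hypothetical names). Each coordinate step is an exact one-dimensional minimization; a caller implementing the schedule above would grow `num_steps` with the outer iteration counter, e.g. proportionally to $k$, to meet the $O(1/k^2)$ tolerance requirement:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def rcd_subproblem(H, z, lam, x0, num_steps):
    """Randomized coordinate descent on the piecewise-quadratic subproblem
        min_x lam * ||x||_1 + 0.5 * (x - z)^T H (x - z).
    Each step exactly minimizes over one uniformly random coordinate,
    which for quadratic-plus-l1 is a soft-thresholding update."""
    x = x0.copy()
    n = len(z)
    g = H @ (x - z)                     # maintained gradient of the smooth part
    for _ in range(num_steps):
        i = rng.integers(n)             # uniformly random coordinate
        xi_new = soft_threshold(x[i] - g[i] / H[i, i], lam / H[i, i])
        delta = xi_new - x[i]
        if delta != 0.0:
            g += H[:, i] * delta        # rank-one gradient refresh, O(n)
            x[i] = xi_new
    return x
```

When $H = I$ the coordinates decouple and the solver reproduces the exact proximal map $\mathrm{soft\_threshold}(z, \lambda)$, which gives a convenient correctness check.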

Practical Algorithm Variant

The framework is instantiated with limited-memory BFGS ($H_k$ as a diagonal plus low-rank matrix) and randomized coordinate descent. This variant leverages active-set heuristics to restrict the working set.
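The diagonal-plus-low-rank structure can be made concrete via the compact LBFGS representation of Byrd, Nocedal, and Schnabel, $B = \gamma I - W M^{-1} W^\top$, built from stored curvature pairs $(s_i, y_i)$. The sketch below forms $B$ densely purely for illustration; a real solver exploits the factored form and never materializes it:

```python
import numpy as np

def compact_lbfgs(S, Y, gamma):
    """Compact LBFGS Hessian approximation B = gamma*I - W M^{-1} W^T,
    with curvature pairs stored oldest-first as columns of S and Y.
    Equivalent to applying the BFGS updates recursively from gamma*I.
    Illustrative only: forming B densely defeats the limited-memory point."""
    n, m = S.shape
    StY = S.T @ Y
    L = np.tril(StY, k=-1)            # strictly lower part: s_i^T y_j, i > j
    D = np.diag(np.diag(StY))         # diagonal: s_i^T y_i (curvature terms)
    W = np.hstack([gamma * S, Y])     # n x 2m
    M = np.block([[gamma * (S.T @ S), L],
                  [L.T, -D]])         # 2m x 2m middle matrix
    return gamma * np.eye(n) - W @ np.linalg.solve(M, W.T)
```

A useful sanity check is the secant condition: the resulting $B$ satisfies $B s = y$ exactly for the most recent pair, as any BFGS update must.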

Notably, the algorithm does not assume exact or even high-quality Hessians, nor does it require strong convexity or smoothness beyond first-order differentiability, making the theory widely applicable. The line search mechanism of prior methods is replaced by a robust proximal parameter update mechanism, guaranteeing sufficient decrease and enabling the establishment of global complexity results even in the inexact / stochastic subproblem regime.

Numerical Results

Comprehensive experiments are performed on benchmark sparse inverse covariance selection (SICS) and sparse logistic regression (SLR) problems. The proposed algorithm (denoted LHAC and its line-search variant LHAC-L) is compared to QUIC and LIBLINEAR, state-of-the-art solvers for SICS and SLR, respectively.

Main empirical findings:

  • LHAC and its variants achieve comparable or better wall-clock convergence than QUIC and LIBLINEAR on large real-world datasets, especially in the high-sparsity/high-dimensional regime.
  • Randomized coordinate descent is demonstrated to be as efficient in practice as cyclic coordinate descent, with the additional advantage of rigorous complexity control.
  • The use of the prox-parameter update (as opposed to Armijo line search) leads to robust and efficient progress, substantiating the theoretical claims.

The iterative workload adapts automatically through the RCD iteration schedule, balancing computational effort and overall progress.

Implications and Future Directions

This work provides a unified analysis and practical instantiation for inexact, second-order, and coordinate-descent-based algorithms for composite convex objectives with structured non-smooth regularizers. The results are particularly relevant for large-scale sparse learning, where Hessian computation is infeasible but low-rank approximations and stochastic subproblem solvers are tractable.

Noteworthy implications include:

  • Scalable second-order methods become practical even with only inexact, stochastic subproblem solutions, broadening the applicability of QN approaches to massive-scale problems.
  • The complexity theory reaffirms that practitioner-friendly heuristics (coordinate descent, active sets, limited-memory updates) can be elevated by theoretical guarantees with careful analysis.
  • The separation of line search and prox-parameter update enables flexible algorithm design without sacrificing convergence properties.

Potential avenues for follow-up:

  • Extension to accelerated or momentum-based inexact QN frameworks under weaker conditions on Hessian updates.
  • Theoretical integration and empirical study of asynchronous or distributed coordinate descent.
  • Adaptive active-set selection strategies and their global complexity analysis.

Conclusion

The paper establishes new theoretical foundations for practical inexact proximal quasi-Newton methods by providing the first global complexity guarantees for such algorithms with randomized coordinate descent subproblem solvers. By broadening the class of admissible Hessian approximations and allowing inexactness in subproblem optimization, this work bridges the gap between scalable, heuristic QN variants and provably convergent large-scale sparse optimization algorithms. The resulting methodology is competitive with highly specialized solvers in real-world experiments and is poised for further extension to other classes of structured regularizers and stochastic settings (1311.6547).
