Orthant-Based Proximal SG (OBProx-SG)
- The paper presents OBProx-SG, a method that alternates proximal stochastic gradient and orthant projection steps to aggressively promote sparsity in ℓ1-regularized optimization.
- It establishes global convergence under nonconvexity and linear convergence in strongly convex settings through a carefully designed modulo switching schedule.
- Empirical evaluations demonstrate that OBProx-SG significantly reduces model density while maintaining predictive accuracy, outperforming traditional methods.
Orthant-Based Proximal Stochastic Gradient methods (OBProx-SG) address the efficient solution of ℓ1-regularized optimization problems, which arise in domains such as feature selection and model compression. The canonical form is min_x F(x) := f(x) + λ‖x‖₁, where f(x) = (1/N) Σ_{i=1}^{N} f_i(x) is an average over smooth, possibly nonconvex, component losses, and the ℓ1 regularization promotes sparsity. OBProx-SG unifies the global convergence properties of proximal stochastic gradient approaches with aggressive sparsity promotion via orthant projection, yielding solutions with substantially reduced support while maintaining per-iteration cost comparable to standard stochastic proximal methods (Chen et al., 2020).
1. Problem Setting and Motivation
The primary objective is to solve

min_{x ∈ ℝ^n} F(x) := (1/N) Σ_{i=1}^{N} f_i(x) + λ‖x‖₁,

where the f_i are smooth and λ > 0. In the convex scenario, each f_i is convex and L-smooth. If f is nonconvex, it is assumed to be continuously differentiable with Lipschitz gradients within a compact level set, and stochastic gradients possess bounded variance. The ℓ1 term induces sparsity by shrinking coefficients toward zero, which is crucial in high-dimensional settings for interpretability and computational efficiency.
2. OBProx-SG Algorithmic Structure
OBProx-SG alternates between two subroutines, a Proximal Stochastic Gradient (Prox-SG) Step and an Orthant Step, using a simple modulo switching schedule determined by user-specified integers N_P and N_O. The control flow is as follows:
- Input: Initial point x_0, step size α_0, N_P, N_O.
- For k = 0, 1, 2, …:
- If mod(k, N_P + N_O) < N_P, perform a Prox-SG Step;
- Else, perform an Orthant Step;
- Update the step size α_k.
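The switching rule itself is a one-line predicate; the following minimal sketch uses my own names (`use_prox_step`, `n_p`, `n_o`), not identifiers from the paper:

```python
def use_prox_step(k, n_p, n_o):
    """Modulo switching schedule: in every cycle of length n_p + n_o,
    the first n_p iterations use Prox-SG Steps, the rest Orthant Steps."""
    return k % (n_p + n_o) < n_p
```

For example, with n_p = 2 and n_o = 3, iterations 0 and 1 of each 5-iteration cycle are Prox-SG Steps and iterations 2 through 4 are Orthant Steps.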
2.1 Prox-SG Step
Given x_k and step size α_k, sample a mini-batch B_k ⊆ {1, …, N} and compute the stochastic gradient ∇f_{B_k}(x_k) = (1/|B_k|) Σ_{i ∈ B_k} ∇f_i(x_k). The update is

x_{k+1} = prox_{α_k λ‖·‖₁}( x_k - α_k ∇f_{B_k}(x_k) ).

This amounts to a coordinate-wise soft-thresholding operation promoting sparsity, with

[x_{k+1}]_i = sign(z_i) · max(|z_i| - α_k λ, 0),  where z = x_k - α_k ∇f_{B_k}(x_k).
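In NumPy, the soft-thresholding update can be sketched as follows (a minimal illustration under the setup above; function names are my own, and `stoch_grad` stands in for the mini-batch gradient):

```python
import numpy as np

def soft_threshold(z, t):
    # Coordinate-wise soft-thresholding, i.e. the proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_sg_step(x, stoch_grad, alpha, lam):
    # One Prox-SG Step: a stochastic gradient move followed by soft-thresholding.
    return soft_threshold(x - alpha * stoch_grad, alpha * lam)
```

Coordinates whose trial value falls within [-αλ, αλ] are set exactly to zero, which is the source of the (moderate) sparsity promotion of Prox-SG.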
2.2 Orthant Step
Define the index sets I⁰(x_k) = {i : [x_k]_i = 0}, I⁺(x_k) = {i : [x_k]_i > 0}, and I⁻(x_k) = {i : [x_k]_i < 0}, with x restricted to the spanning orthant face

O_k = {x ∈ ℝ^n : x_i = 0 for i ∈ I⁰(x_k), and x_i · [x_k]_i ≥ 0 otherwise},

where coordinates in I⁰(x_k) are fixed to zero and those in I⁺(x_k) ∪ I⁻(x_k) keep their sign. On O_k, the objective simplifies to a smooth function F̃(x) = f(x) + λ · sign(x_k)ᵀ x.
The step comprises:
- Stochastic gradient computation for F̃: ∇F̃_{B_k}(x_k) = ∇f_{B_k}(x_k) + λ · sign(x_k).
- Gradient descent update: x̂_{k+1} = x_k - α_k ∇F̃_{B_k}(x_k).
- Orthant projection: for each coordinate, [x_{k+1}]_i = [x̂_{k+1}]_i if sign([x̂_{k+1}]_i) = sign([x_k]_i), and 0 otherwise.
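The three sub-steps can be sketched as a dense NumPy routine (a minimal illustration; names are my own, and `stoch_grad` stands in for the mini-batch gradient of f at x):

```python
import numpy as np

def orthant_step(x, stoch_grad, alpha, lam):
    # One Orthant Step on the face spanned by the signs of the current iterate x.
    s = np.sign(x)               # sign vector; zero coordinates stay fixed at zero
    g = stoch_grad + lam * s     # gradient of the smooth surrogate f(x) + lam * s^T x
    trial = x - alpha * g        # plain gradient descent on the orthant face
    # Projection: any coordinate that left its orthant (changed sign) is set to zero.
    return np.where(np.sign(trial) == s, trial, 0.0)
```

Note that coordinates already at zero remain at zero: their sign is 0, so any nonzero trial value fails the sign test and is projected back.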
The switching schedule is governed by N_P and N_O; OBProx-SG+ is the variant with N_P finite and N_O = ∞, i.e., finitely many Prox-SG Steps followed by only Orthant Steps.
3. Convergence Guarantees
Analysis proceeds under the condition that f has L-Lipschitz-continuous gradients, that F is bounded below, and that the mini-batch gradient ∇f_{B_k} is unbiased with uniformly bounded variance.
Key Theoretical Results:
- Global Convergence: If the step sizes satisfy Σ_k α_k = ∞ and Σ_k α_k² < ∞, and the schedule performs infinitely many Prox-SG Steps, then lim inf_{k→∞} E‖G(x_k)‖ = 0 in general (possibly nonconvex) cases, where G(x) = (x - prox_{αλ‖·‖₁}(x - α ∇f(x))) / α is the proximal-gradient mapping.
- Linear Rate for Strong Convexity: If f is μ-strongly convex and the step size is a sufficiently small constant α, the following holds:

E[F(x_k) - F*] ≤ ρ^{S_k} (F(x_0) - F*) + O(α),

where F* is the optimal value, S_k counts the cumulative Prox-SG Steps up to iteration k, and the contraction factor ρ ∈ (0, 1) is determined by the starting level set.
- Support Identification and OBProx-SG+: When N_O = ∞, under mild local convexity and proper initialization, Orthant Steps alone can drive the proximal mapping norm to zero once the correct orthant is identified.
- High-Probability Support Identification: In the μ-strongly convex case, after sufficiently many Prox-SG Steps, the iterate lies, with high probability, in a neighborhood of the solution on which the correct support is identified.
4. Comparative Analysis with Related Methods
OBProx-SG is contrasted with prevalent stochastic schemes for ℓ1-regularized problems:
| Method | Sparsity Promotion | Convergence Rate/Cost |
|---|---|---|
| Prox-SG | Shrinkage region (moderate) | Sublinear in general, linear to a noise ball under strong convexity; slow support identification |
| RDA | Averaged gradient enlarges truncation region (aggressive) | More aggressive sparsity, but slower convergence |
| Prox-SVRG | Moderate (like Prox-SG) | Linear in convex case; needs full-gradient per epoch (costly) |
| OBProx-SG | Orthant face projection, much larger “zero region” (aggressive) | Linear in Prox-SG steps; one mini-batch per iteration |
The Orthant Step in OBProx-SG features a substantially larger per-coordinate zero region than soft-thresholding, leading to aggressive promotion of sparsity at low per-iteration computational cost (Chen et al., 2020).
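A one-dimensional illustration of the zero-region comparison (my own numbers, not from the paper): soft-thresholding zeroes a positive coordinate only when its trial point lands inside [-αλ, αλ], while the orthant projection zeroes it whenever the trial point crosses zero.

```python
import numpy as np

# A positive coordinate near zero, pushed toward zero by the gradient.
x, g, alpha, lam = 0.05, 1.0, 0.1, 0.1

# Prox-SG: zeroed only if the trial point falls inside [-alpha*lam, alpha*lam].
trial_prox = x - alpha * g   # -0.05, outside that interval
prox_result = np.sign(trial_prox) * max(abs(trial_prox) - alpha * lam, 0.0)

# Orthant Step: zeroed whenever the trial point leaves the positive orthant.
trial_orthant = x - alpha * (g + lam)   # -0.06, crossed zero
orthant_result = trial_orthant if trial_orthant > 0 else 0.0
```

Here Prox-SG keeps a small sign-flipped nonzero value (about -0.04), while the Orthant Step returns exactly zero: the half-line zero region strictly contains the soft-thresholding interval for a positive coordinate.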
5. Empirical Evaluation
Empirical studies evaluate OBProx-SG and OBProx-SG+ on convex and nonconvex ℓ1-regularized problems:
5.1 Convex Case
- Datasets: a9a, higgs, kdda, news20, real-sim, rcv1, url_combined, w8a.
- OBProx-SG and OBProx-SG+ achieve objective values on par with Prox-SG/Prox-SVRG and outperform RDA.
- Solution density (fraction of nonzeros): on the order of 80% for Prox-SG, 40% for RDA, and 3% for Prox-SVRG, versus as little as 0.1% for OBProx-SG, with OBProx-SG+ sparser still.
- Runtime: OBProx-SG, Prox-SG, and RDA are comparable; Prox-SVRG incurs at least 2× higher computational cost due to its per-epoch full-gradient evaluations.
5.2 Nonconvex Case
- Tasks: ℓ1-regularized MobileNetV1 and ResNet18 on CIFAR-10 and Fashion-MNIST.
- OBProx-SG and Prox-SG/Prox-SVRG obtain statistically equivalent objective values and test accuracy (within about 1%), with RDA showing degraded performance.
- Density: Prox-SG $5$– nonzeros, Prox-SVRG/RDA , OBProx-SG $2$–, OBProx-SG $0.3$–—leading to up to sparser networks without loss in accuracy.
- Sparsity evolution reveals a rapid decrease in density after the transition to Orthant Steps.
6. Implementation Considerations and Theoretical Rationale
- The Prox-SG analysis ensures an expected decrease in F, yielding convergence of the proximal-gradient mapping norm to zero, or convergence to a neighborhood of the optimum under strong convexity.
- Once the orthant face O_k covers the true solution support, the problem restricted to O_k is smooth, and the Orthant Step converges (with decaying step sizes) to the global optimum or a stationary point with the same support.
- High probability of correct orthant identification is established via concentration inequalities under strong convexity.
- Efficient implementation is facilitated: track only the current sign vector sign(x_k) and the nonzero index set I⁺(x_k) ∪ I⁻(x_k). In the Orthant Step, only the gradient over this support needs computation; coordinates outside the orthant face are projected to zero.
- The modulo switching schedule is lightweight, and both subroutines leverage the same mini-batch sampling, gradient, and projection logic.
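The considerations above suggest a sparse-storage Orthant Step along the following lines (a sketch, not the paper's implementation; `grad_fn` is a hypothetical helper returning the mini-batch gradient restricted to the given indices, and all names are my own):

```python
import numpy as np

def orthant_step_sparse(idx, vals, signs, grad_fn, alpha, lam):
    # Orthant Step over a sparse iterate: idx/vals/signs describe only the
    # nonzero coordinates; everything outside the face stays zero for free.
    g = grad_fn(idx)                          # gradient restricted to the support
    trial = vals - alpha * (g + lam * signs)  # descent on the smooth surrogate
    keep = np.sign(trial) == signs            # coordinates that stayed on the face
    return idx[keep], trial[keep], signs[keep]
```

Coordinates that cross zero are simply dropped from the support, so the cost of each Orthant Step scales with the current number of nonzeros rather than the full dimension.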
7. Synthesis and Impact
OBProx-SG demonstrates unified globalization of stochastic proximal gradient methods with aggressive support identification via orthant projection. This design enables provable convergence to global optima under convexity (or to stationary points for nonconvex f), linear rates under strong convexity, and empirical reduction in iterate density, all with low per-iteration complexity. In both convex and deep learning applications (e.g., MobileNetV1, ResNet18), OBProx-SG achieves markedly improved sparsity over established methods without degrading objective value or predictive performance (Chen et al., 2020).