Orthant-Based Proximal SG (OBProx-SG)
- The paper presents OBProx-SG, a method that alternates proximal stochastic gradient and orthant projection steps to aggressively promote sparsity in ℓ1-regularized optimization.
- It establishes global convergence under nonconvexity and linear convergence in strongly convex settings through a carefully designed modulo switching schedule.
- Empirical evaluations demonstrate that OBProx-SG significantly reduces model density while maintaining predictive accuracy, outperforming traditional methods.
Orthant-Based Proximal Stochastic Gradient methods (OBProx-SG) address the efficient solution of ℓ1-regularized optimization problems, which arise in domains such as feature selection and model compression. The canonical form is min_x F(x) := f(x) + λ‖x‖₁, where f(x) = (1/N) Σ_{i=1}^{N} f_i(x) is an average over smooth, possibly nonconvex, component losses, and the ℓ1 regularization promotes sparsity. OBProx-SG unifies the global convergence properties of proximal stochastic gradient approaches with aggressive sparsity promotion via orthant projection, yielding solutions with substantially reduced support while maintaining per-iteration cost comparable to standard stochastic proximal methods (Chen et al., 2020).
1. Problem Setting and Motivation
The primary objective is to solve

min_{x ∈ ℝ^n} F(x) := (1/N) Σ_{i=1}^{N} f_i(x) + λ‖x‖₁,

where the f_i are smooth and λ > 0. In the convex scenario, each f_i is convex and L-smooth. If f is nonconvex, it is assumed to be continuously differentiable with Lipschitz gradients within a compact level set, and stochastic gradients possess bounded variance. The ℓ1 term induces sparsity by shrinking coefficients toward zero, which is crucial in high-dimensional settings for interpretability and computational efficiency.
2. OBProx-SG Algorithmic Structure
OBProx-SG alternates between two subroutines, a Proximal Stochastic Gradient (Prox-SG) Step and an Orthant Step, using a simple modulo switching schedule determined by user-specified integers N_P and N_O. The control flow is as follows:
- Input: Initial point x_0, step size α_0, N_P, N_O.
- For k = 0, 1, 2, …:
- If mod(k, N_P + N_O) < N_P, perform a Prox-SG Step;
- Else, perform an Orthant Step;
- Update the step size α_k.
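The switching rule itself is a one-line predicate; the following minimal sketch uses my own names (`use_prox_step`, `n_p`, `n_o`), not identifiers from the paper:

```python
def use_prox_step(k, n_p, n_o):
    """Modulo switching schedule: in every cycle of length n_p + n_o,
    the first n_p iterations use Prox-SG Steps, the rest Orthant Steps."""
    return k % (n_p + n_o) < n_p
```

For example, with n_p = 2 and n_o = 3, iterations 0 and 1 of each 5-iteration cycle are Prox-SG Steps and iterations 2 through 4 are Orthant Steps.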
2.1 Prox-SG Step
Given x_k and step size α_k, sample a mini-batch B_k ⊆ {1, …, N} and compute the stochastic gradient ∇f_{B_k}(x_k) = (1/|B_k|) Σ_{i ∈ B_k} ∇f_i(x_k). The update is

x_{k+1} = prox_{α_k λ‖·‖₁}( x_k - α_k ∇f_{B_k}(x_k) ).

This amounts to a coordinate-wise soft-thresholding operation promoting sparsity, with

[x_{k+1}]_i = sign(z_i) · max(|z_i| - α_k λ, 0),  where z = x_k - α_k ∇f_{B_k}(x_k).
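In NumPy, the soft-thresholding update can be sketched as follows (a minimal illustration under the setup above; function names are my own, and `stoch_grad` stands in for the mini-batch gradient):

```python
import numpy as np

def soft_threshold(z, t):
    # Coordinate-wise soft-thresholding, i.e. the proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_sg_step(x, stoch_grad, alpha, lam):
    # One Prox-SG Step: a stochastic gradient move followed by soft-thresholding.
    return soft_threshold(x - alpha * stoch_grad, alpha * lam)
```

Coordinates whose trial value falls within [-αλ, αλ] are set exactly to zero, which is the source of the (moderate) sparsity promotion of Prox-SG.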
2.2 Orthant Step
Define the index sets I⁰(x_k) = {i : [x_k]_i = 0}, I⁺(x_k) = {i : [x_k]_i > 0}, and I⁻(x_k) = {i : [x_k]_i < 0}, with x restricted to the spanning orthant face

O_k = {x ∈ ℝ^n : x_i = 0 for i ∈ I⁰(x_k), and x_i · [x_k]_i ≥ 0 otherwise},

where coordinates in I⁰(x_k) are fixed to zero and those in I⁺(x_k) ∪ I⁻(x_k) keep their sign. On O_k, the objective simplifies to a smooth function F̃(x) = f(x) + λ · sign(x_k)ᵀ x.
The step comprises:
- Stochastic gradient computation for F̃: ∇F̃_{B_k}(x_k) = ∇f_{B_k}(x_k) + λ · sign(x_k).
- Gradient descent update: x̂_{k+1} = x_k - α_k ∇F̃_{B_k}(x_k).
- Orthant projection: for each coordinate, [x_{k+1}]_i = [x̂_{k+1}]_i if sign([x̂_{k+1}]_i) = sign([x_k]_i), and 0 otherwise.
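The three sub-steps can be sketched as a dense NumPy routine (a minimal illustration; names are my own, and `stoch_grad` stands in for the mini-batch gradient of f at x):

```python
import numpy as np

def orthant_step(x, stoch_grad, alpha, lam):
    # One Orthant Step on the face spanned by the signs of the current iterate x.
    s = np.sign(x)               # sign vector; zero coordinates stay fixed at zero
    g = stoch_grad + lam * s     # gradient of the smooth surrogate f(x) + lam * s^T x
    trial = x - alpha * g        # plain gradient descent on the orthant face
    # Projection: any coordinate that left its orthant (changed sign) is set to zero.
    return np.where(np.sign(trial) == s, trial, 0.0)
```

Note that coordinates already at zero remain at zero: their sign is 0, so any nonzero trial value fails the sign test and is projected back.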
The switching schedule is governed by N_P and N_O; OBProx-SG+ is the variant with N_P finite and N_O = ∞, i.e., finitely many Prox-SG Steps followed by only Orthant Steps.
3. Convergence Guarantees
Analysis proceeds under the condition that f has L-Lipschitz-continuous gradients, that F is bounded below, and that the mini-batch gradient ∇f_{B_k} is unbiased with uniformly bounded variance.
Key Theoretical Results:
- Global Convergence: If the step sizes satisfy Σ_k α_k = ∞ and Σ_k α_k² < ∞, and the schedule performs infinitely many Prox-SG Steps, then lim inf_{k→∞} E‖G(x_k)‖ = 0 in general (possibly nonconvex) cases, where G(x) = (x - prox_{αλ‖·‖₁}(x - α ∇f(x))) / α is the proximal-gradient mapping.
- Linear Rate for Strong Convexity: If f is μ-strongly convex and the step size is a sufficiently small constant α, the following holds:

E[F(x_k) - F*] ≤ ρ^{S_k} (F(x_0) - F*) + O(α),

where F* is the optimal value, S_k counts the cumulative Prox-SG Steps up to iteration k, and the contraction factor ρ ∈ (0, 1) is determined by the starting level set.
- Support Identification and OBProx-SG+: When N_O = ∞, under mild local convexity and proper initialization, Orthant Steps alone can drive the proximal mapping norm to zero once the correct orthant is identified.
- High-Probability Support Identification: In the μ-strongly convex case, after sufficiently many Prox-SG Steps, the iterate lies, with high probability, in a neighborhood of the solution on which the correct support is identified.
4. Comparative Analysis with Related Methods
OBProx-SG is contrasted with prevalent stochastic schemes for ℓ1-regularized problems:
| Method | Sparsity Promotion | Convergence Rate/Cost |
|---|---|---|
| Prox-SG | Shrinkage region (moderate) | Sublinear in general, linear to a noise ball under strong convexity; slow support identification |
| RDA | Averaged gradient enlarges truncation region (aggressive) | More aggressive sparsity, but slower convergence |
| Prox-SVRG | Moderate (like Prox-SG) | Linear in convex case; needs full-gradient per epoch (costly) |
| OBProx-SG | Orthant face projection, much larger “zero region” (aggressive) | Linear in Prox-SG steps; one mini-batch per iteration |
The Orthant Step in OBProx-SG features a substantially larger per-coordinate zero region than soft-thresholding, leading to aggressive promotion of sparsity at low per-iteration computational cost (Chen et al., 2020).
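A one-dimensional illustration of the zero-region comparison (my own numbers, not from the paper): soft-thresholding zeroes a positive coordinate only when its trial point lands inside [-αλ, αλ], while the orthant projection zeroes it whenever the trial point crosses zero.

```python
import numpy as np

# A positive coordinate near zero, pushed toward zero by the gradient.
x, g, alpha, lam = 0.05, 1.0, 0.1, 0.1

# Prox-SG: zeroed only if the trial point falls inside [-alpha*lam, alpha*lam].
trial_prox = x - alpha * g   # -0.05, outside that interval
prox_result = np.sign(trial_prox) * max(abs(trial_prox) - alpha * lam, 0.0)

# Orthant Step: zeroed whenever the trial point leaves the positive orthant.
trial_orthant = x - alpha * (g + lam)   # -0.06, crossed zero
orthant_result = trial_orthant if trial_orthant > 0 else 0.0
```

Here Prox-SG keeps a small sign-flipped nonzero value (about -0.04), while the Orthant Step returns exactly zero: the half-line zero region strictly contains the soft-thresholding interval for a positive coordinate.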
5. Empirical Evaluation
Empirical studies evaluate OBProx-SG and OBProx-SG+ on convex and nonconvex ℓ1-regularized problems:
5.1 Convex Case
- Datasets: a9a, higgs, kdda, news20, real-sim, rcv1, url_combined, w8a.
- OBProx-SG and OBProx-SG+ achieve objective values on par with Prox-SG/Prox-SVRG and outperform RDA.
- Solution density (fraction of nonzeros): on the order of 80% for Prox-SG, 40% for RDA, and 3% for Prox-SVRG, versus as little as 0.1% for OBProx-SG, with OBProx-SG+ sparser still.
- Runtime: OBProx-SG, Prox-SG, and RDA are comparable; Prox-SVRG incurs at least 2× higher computational cost due to its per-epoch full-gradient evaluations.
5.2 Nonconvex Case
- Tasks: ℓ1-regularized MobileNetV1 and ResNet18 on CIFAR-10 and Fashion-MNIST.
- OBProx-SG and Prox-SG/Prox-SVRG obtain statistically equivalent objective values and test accuracy (within about 1%), with RDA showing degraded performance.
- Density: Prox-SG $5$– nonzeros, Prox-SVRG/RDA , OBProx-SG $2$–, OBProx-SG $0.3$–—leading to up to sparser networks without loss in accuracy.
- Sparsity evolution reveals a rapid decrease in density after the transition to Orthant Steps.
6. Implementation Considerations and Theoretical Rationale
- The Prox-SG analysis ensures an expected decrease in F, yielding convergence of the proximal-gradient mapping norm to zero, or convergence to a neighborhood of the optimum under strong convexity.
- Once the orthant face O_k covers the true solution support, the problem restricted to O_k is smooth, and the Orthant Step converges (with decaying step sizes) to the global optimum or a stationary point with the same support.
- High probability of correct orthant identification is established via concentration inequalities under strong convexity.
- Efficient implementation is facilitated: track only the current sign vector sign(x_k) and the nonzero index set I⁺(x_k) ∪ I⁻(x_k). In the Orthant Step, only the gradient over this support needs computation; coordinates outside the orthant face are projected to zero.
- The modulo switching schedule is lightweight, and both subroutines leverage the same mini-batch sampling, gradient, and projection logic.
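The considerations above suggest a sparse-storage Orthant Step along the following lines (a sketch, not the paper's implementation; `grad_fn` is a hypothetical helper returning the mini-batch gradient restricted to the given indices, and all names are my own):

```python
import numpy as np

def orthant_step_sparse(idx, vals, signs, grad_fn, alpha, lam):
    # Orthant Step over a sparse iterate: idx/vals/signs describe only the
    # nonzero coordinates; everything outside the face stays zero for free.
    g = grad_fn(idx)                          # gradient restricted to the support
    trial = vals - alpha * (g + lam * signs)  # descent on the smooth surrogate
    keep = np.sign(trial) == signs            # coordinates that stayed on the face
    return idx[keep], trial[keep], signs[keep]
```

Coordinates that cross zero are simply dropped from the support, so the cost of each Orthant Step scales with the current number of nonzeros rather than the full dimension.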
7. Synthesis and Impact
OBProx-SG demonstrates unified globalization of stochastic proximal gradient methods with aggressive support identification via orthant projection. This design enables provable convergence to global optima under convexity (or to stationary points for nonconvex f), linear rates under strong convexity, and empirical reduction in iterate density, all with low per-iteration complexity. In both convex and deep learning applications (e.g., MobileNetV1, ResNet18), OBProx-SG achieves markedly improved sparsity over established methods without degrading objective value or predictive performance (Chen et al., 2020).