Orthant-Based Proximal SG (OBProx-SG)

Updated 29 January 2026
  • The paper presents OBProx-SG, a method that alternates proximal stochastic gradient steps and orthant projection steps to aggressively promote sparsity in $\ell_1$-regularized optimization.
  • It establishes global convergence under nonconvexity and linear convergence in strongly convex settings through a carefully designed modulo switching schedule.
  • Empirical evaluations demonstrate that OBProx-SG significantly reduces model density while maintaining predictive accuracy, outperforming traditional methods.

Orthant-Based Proximal Stochastic Gradient methods (OBProx-SG) address the efficient solution of $\ell_1$-regularized optimization problems, which arise in domains such as feature selection and model compression. The canonical form is $\min_x F(x) = f(x) + \lambda\|x\|_1$, where $f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x)$ is an average over smooth, possibly nonconvex, component losses, and the $\ell_1$ regularization promotes sparsity. OBProx-SG unifies the global convergence properties of proximal stochastic gradient approaches with aggressive sparsity promotion via orthant projection, yielding solutions with substantially reduced support while maintaining per-iteration cost comparable to standard stochastic proximal methods (Chen et al., 2020).

1. Problem Setting and Motivation

The primary objective is to solve

$$\min_x F(x) = f(x) + \lambda\|x\|_1, \qquad f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x), \qquad \lambda > 0,$$

where $f_i : \mathbb{R}^n \to \mathbb{R}$ are smooth and $N, n \gg 1$. In the convex scenario, each $f_i$ is convex and $L$-smooth. If $f$ is nonconvex, it is assumed to be continuously differentiable with Lipschitz gradients within a compact level set, and the stochastic gradients are assumed to have bounded variance. The $\ell_1$ term induces sparsity by shrinking coefficients toward zero, which is crucial in high-dimensional settings for interpretability and computational efficiency.
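As a concrete (illustrative, not from the paper) instance of this problem class, the following sketch builds $F(x)$ with least-squares components $f_i(x) = \tfrac{1}{2}(a_i^\top x - b_i)^2$ and an $\ell_1$ penalty; the data `A`, `b` and the value of `lam` are arbitrary choices for demonstration.

```python
import numpy as np

# Toy instance of F(x) = (1/N) * sum_i f_i(x) + lam * ||x||_1 with
# least-squares components f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
N, n = 100, 5
A = rng.normal(size=(N, n))   # rows a_i
b = rng.normal(size=N)        # targets b_i
lam = 0.1

def F(x):
    """Full objective: smooth average loss plus the l1 penalty."""
    return 0.5 * np.mean((A @ x - b) ** 2) + lam * np.abs(x).sum()
```

At $x = 0$ the penalty vanishes and $F(0)$ reduces to the mean squared target, which is a quick sanity check on the implementation.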

2. OBProx-SG Algorithmic Structure

OBProx-SG alternates between two subroutines, a Proximal Stochastic Gradient (Prox-SG) Step and an Orthant Step, using a simple modulo switching schedule determined by user-specified integers $N_P$ and $N_O$. The control flow is as follows:

  • Input: initial point $x_0 \in \mathbb{R}^n$, step size $\alpha_0 > 0$, and integers $N_P, N_O > 0$.
  • For $k = 0, 1, 2, \dots$:
    • If $(k \bmod (N_P + N_O)) < N_P$, perform a Prox-SG Step;
    • else, perform an Orthant Step;
    • update the step size $\alpha_{k+1}$.
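The modulo schedule above can be sketched in a few lines; `step_type` is a hypothetical helper name, and the choices $N_P = 3$, $N_O = 2$ are arbitrary illustrations.

```python
def step_type(k, N_P, N_O):
    """Which subroutine iteration k runs under the modulo switching schedule."""
    return "prox_sg" if k % (N_P + N_O) < N_P else "orthant"

# With N_P = 3 and N_O = 2, each period of N_P + N_O = 5 iterations runs
# 3 Prox-SG Steps followed by 2 Orthant Steps.
schedule = [step_type(k, N_P=3, N_O=2) for k in range(10)]
```

Setting $N_O = \infty$ (in code, e.g., a sentinel such as `float("inf")`) recovers the OBProx-SG$^+$ variant described later, where only Orthant Steps run after the first $N_P$ iterations.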

2.1 Prox-SG Step

Given $x_k$ and step size $\alpha_k$, sample a mini-batch $\mathcal{B}_k$ and compute the stochastic gradient $g_k = \nabla f_{\mathcal{B}_k}(x_k)$. The update is

$$x_{k+1} = \operatorname{prox}_{\alpha_k\lambda\|\cdot\|_1}\left(x_k - \alpha_k g_k\right).$$

This amounts to a coordinate-wise soft-thresholding operation promoting sparsity, with

$$[x_{k+1}]_i = \begin{cases} [x_k - \alpha_k g_k]_i - \alpha_k\lambda & \text{if } [x_k - \alpha_k g_k]_i > \alpha_k\lambda, \\ 0 & \text{if } |[x_k - \alpha_k g_k]_i| \leq \alpha_k\lambda, \\ [x_k - \alpha_k g_k]_i + \alpha_k\lambda & \text{if } [x_k - \alpha_k g_k]_i < -\alpha_k\lambda. \end{cases}$$
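The three-case update is exactly coordinate-wise soft-thresholding, which vectorizes cleanly; the helper names below are illustrative, not from the paper's code.

```python
import numpy as np

def soft_threshold(z, t):
    """Coordinate-wise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_sg_step(x, g, alpha, lam):
    """One Prox-SG Step given a stochastic gradient g of f at x."""
    return soft_threshold(x - alpha * g, alpha * lam)
```

Coordinates of $x_k - \alpha_k g_k$ with magnitude at most $\alpha_k\lambda$ are mapped to zero, which is the mechanism by which Prox-SG promotes (moderate) sparsity.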

2.2 Orthant Step

Define the index sets $\mathcal{I}^0(x) = \{i : x_i = 0\}$, $\mathcal{I}^+(x) = \{i : x_i > 0\}$, and $\mathcal{I}^-(x) = \{i : x_i < 0\}$, and restrict $x$ to the orthant face spanned by $x_k$,

$$\mathcal{O}_k = \{x : \operatorname{sign}(x_i) = \operatorname{sign}([x_k]_i) \text{ or } x_i = 0,\ \forall i\},$$

where coordinates in $\mathcal{I}^0(x_k)$ are fixed to zero and those in $\mathcal{I}^{\neq 0}(x_k) = \mathcal{I}^+(x_k) \cup \mathcal{I}^-(x_k)$ keep their sign. On $\mathcal{O}_k$, the objective simplifies to the smooth function $\widetilde F(x) = f(x) + \lambda \operatorname{sign}(x_k)^\top x$.

The step comprises:

  • Stochastic gradient computation for $\widetilde F$:

$$\hat{g}_k = \frac{1}{|\mathcal{B}_k|}\sum_{i \in \mathcal{B}_k} \left(\nabla f_i(x_k) + \lambda \operatorname{sign}(x_k)\right)$$

  • Gradient descent update: $\hat x = x_k - \alpha_k \hat{g}_k$.
  • Orthant projection: for each coordinate,

$$[x_{k+1}]_i = \begin{cases} [\hat x]_i & \text{if } \operatorname{sign}([\hat x]_i) = \operatorname{sign}([x_k]_i), \\ 0 & \text{otherwise}. \end{cases}$$
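Putting the surrogate gradient, descent update, and projection together, an Orthant Step can be sketched as follows; `grad_f` stands in for a mini-batch gradient of $f$ at $x_k$, and the function name is an assumption for illustration.

```python
import numpy as np

def orthant_step(x, grad_f, alpha, lam):
    """One Orthant Step: descend on the smooth surrogate
    F~(x) = f(x) + lam * sign(x_k)^T x, then project onto the
    orthant face O_k of the current iterate x_k."""
    s = np.sign(x)                          # signs defining the current orthant
    x_hat = x - alpha * (grad_f + lam * s)  # gradient descent on F~
    keep = np.sign(x_hat) == s              # coordinates that stay in O_k
    return np.where(keep, x_hat, 0.0)       # sign flips are projected to zero
```

Note that coordinates with $[x_k]_i = 0$ have $s_i = 0$, so they are projected back to zero whenever the trial point moves them off zero; this is what makes the Orthant Step's "zero region" much larger than soft-thresholding's.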

The switching schedule is governed by $N_P$ and $N_O$; OBProx-SG$^+$ is the variant with $N_P < \infty$ and $N_O = \infty$, i.e., finitely many Prox-SG Steps followed by only Orthant Steps.

3. Convergence Guarantees

The analysis proceeds under the conditions that $f$ has $L$-Lipschitz stochastic gradients and is bounded below, and that $\nabla\widetilde F_{\mathcal{B}}(x)$ is uniformly bounded and unbiased with variance $\sigma^2$.

Key Theoretical Results:

  • Global Convergence: If $\alpha_k = O(1/k)$, then $\liminf_{k\to\infty} \mathbb{E}\|\mathcal{G}_{\alpha_k}(x_k)\|^2 = 0$ in general (possibly nonconvex) settings, where $\mathcal{G}_\eta(x)$ denotes the proximal-gradient mapping.
  • Linear Rate under Strong Convexity: If $f$ is $\mu$-strongly convex and $\alpha_k \equiv \alpha < \min\{1/(2\mu), 1/L\}$, then

$$\mathbb{E}[F(x_{k+1}) - F^*] \leq (1 - 2\alpha\mu)^{\kappa_P(k)}\,[F(x_0) - F^*] + \frac{L C^2}{2\mu}\,\alpha,$$

where $\kappa_P(k)$ counts the cumulative number of Prox-SG Steps through iteration $k$ and $C$ is a constant determined by the initial level set.

  • Support Identification and OBProx-SG$^+$: When $N_P < \infty$ and $N_O = \infty$, under mild local convexity and proper initialization, Orthant Steps alone drive the proximal mapping norm to zero once the correct orthant is identified.
  • High-Probability Support Identification: In the $\mu$-strongly convex case, after $K = O\!\left(\log\!\big((F(x_0) - F^*)/\mathrm{poly}(\tau, \delta_1, \alpha, \text{batch size})\big)/\log(1 - 2\mu\alpha)\right)$ Prox-SG Steps, the iterate lies in a neighborhood of the correct support with high probability.

4. Comparison with Related Stochastic Methods

OBProx-SG is contrasted with prevalent stochastic schemes for $\ell_1$-regularized problems:

| Method | Sparsity Promotion | Convergence Rate / Cost |
| --- | --- | --- |
| Prox-SG | Shrinkage region $[-\alpha\lambda, +\alpha\lambda]$ (moderate) | $O(1/\sqrt{k})$, or linear under strong convexity; slow |
| RDA | Averaged gradient enlarges truncation region (aggressive) | More aggressive sparsity, but slower convergence |
| Prox-SVRG | Moderate (like Prox-SG) | Linear in the convex case; needs a full gradient per epoch (costly) |
| OBProx-SG | Orthant face projection, much larger "zero region" (aggressive) | Linear in Prox-SG Steps; one mini-batch per iteration |

The Orthant Step in OBProx-SG features a substantially larger zero region per coordinate, leading to aggressive promotion of sparsity at a low computational burden per iteration (Chen et al., 2020).

5. Empirical Evaluation

Empirical studies evaluate OBProx-SG and OBProx-SG$^+$ on convex and nonconvex $\ell_1$-regularized problems:

5.1 Convex Case

  • Datasets: a9a, higgs, kdda, news20, real-sim, rcv1, url_combined, w8a; $\lambda = 1/N$.
  • OBProx-SG and OBProx-SG$^+$ achieve objective values on par with Prox-SG/Prox-SVRG and outperform RDA.
  • Solution density (fraction of nonzeros): Prox-SG 80–99%, RDA 40–90%, Prox-SVRG 3–30%, OBProx-SG 0.1–80%; OBProx-SG$^+$ reaches up to $2\times$ lower density than OBProx-SG.
  • Runtime: OBProx-SG, Prox-SG, and RDA are comparable; Prox-SVRG incurs $2$–$3\times$ higher computational cost.

5.2 Nonconvex Case

  • Tasks: $\ell_1$-regularized MobileNetV1 and ResNet18 on CIFAR-10 and Fashion-MNIST.
  • OBProx-SG and Prox-SG/Prox-SVRG obtain statistically equivalent objective values and test accuracy (within 1–2%), with RDA showing degraded performance.
  • Density: Prox-SG 5–15% nonzeros, Prox-SVRG/RDA $>30\%$, OBProx-SG 2–10%, OBProx-SG$^+$ 0.3–3%, yielding up to $20\times$ sparser networks without loss in accuracy.
  • The sparsity evolution shows a rapid density decrease after the transition to Orthant Steps.

6. Implementation Considerations and Theoretical Rationale

  • The Prox-SG analysis ensures an expected decrease in $F(x)$, yielding convergence to a vanishing proximal-mapping norm, or to a neighborhood of the optimum under strong convexity.
  • Once $\mathcal{O}_k$ covers the support of the true solution, the restricted problem is smooth, and the Orthant Step converges (with decaying step sizes) to the global or stationary point with the same support.
  • High probability of correct orthant identification is established via concentration inequalities under strong convexity.
  • Efficient implementation is facilitated: track only the current sign vector $s = \operatorname{sign}(x_k)$ and the nonzero index set $\mathcal{I}^{\neq 0}(x_k)$. In the Orthant Step, only the gradient over the support needs computation; coordinates outside the orthant are projected to zero.
  • The modulo switching schedule is lightweight, and both subroutines share the same mini-batch sampling, gradient, and projection logic.
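The sign-vector bookkeeping mentioned above amounts to very little state between iterations; the representation below is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

# Lightweight state for the Orthant Step: only the sign vector and the
# nonzero index set of the current iterate need to be tracked.
x = np.array([0.7, 0.0, -0.2, 0.0])
s = np.sign(x)                 # current sign vector sign(x_k)
support = np.flatnonzero(s)    # nonzero index set I^{!=0}(x_k)
# Only coordinates in `support` need surrogate gradients; all other
# coordinates are held at zero by the orthant projection.
```

Restricting gradient computation to `support` is what keeps the per-iteration cost of the Orthant Step comparable to a standard Prox-SG step even as the iterate becomes very sparse.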

7. Synthesis and Impact

OBProx-SG combines the globalization machinery of stochastic proximal gradient methods with aggressive support identification via orthant projection. This design enables provable convergence to global optima under convexity (or to stationary points for nonconvex $f$), linear rates under strong convexity, and empirical reduction in iterate density, all with low per-iteration complexity. In both convex and deep learning applications (e.g., MobileNetV1, ResNet18), OBProx-SG achieves markedly improved sparsity over established methods without degrading objective value or predictive performance (Chen et al., 2020).
