Sketching for Regularized Optimization
- SRO is a class of randomized algorithms that project high-dimensional data into lower-dimensional subspaces to accelerate regularized optimization.
- It supports both convex and nonconvex regularizers, offering efficient solutions for least squares, regression, and inverse problems with rigorous error guarantees.
- Key modalities include sketch-and-solve, sketched preconditioning, iterative methods, and distributed processing, ensuring near-minimax statistical efficiency.
Sketching for Regularized Optimization (SRO) refers to a class of randomized algorithms that accelerate regularized optimization—especially in large-scale least squares, regression, and inverse problems—by projecting the data, residuals, or gradients into lower-dimensional subspaces via random "sketches." The objective is to solve regularized optimization problems with high dimensionality or ill-conditioned design using computational resources proportional to an intrinsic complexity parameter, such as the effective dimension or statistical dimension, rather than the ambient data size. SRO frameworks encompass both convex and nonconvex regularizers, support adaptive and distributed computation, and yield rigorous error guarantees. Key algorithmic modalities include direct sketch-and-solve, sketched preconditioning, iterative and distributed sketching, and model averaging. The structure and theoretical analysis of SRO span fast solvers for Tikhonov/ridge regression, sparse learning (including $\ell_1$-regularized, nonconvex-penalized, and logistic problems), and sketch-aware second-order methods.
1. Mathematical Foundations and Problem Formulations
Regularized optimization problems are typically posed as minimizing an objective of the form
$$\min_{x \in \mathbb{R}^d} \; \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda R(x),$$
where $A \in \mathbb{R}^{n \times d}$ encodes features, $y \in \mathbb{R}^n$ denotes measurements, and $R$ is a penalty or regularizer, which may be convex (e.g., $\ell_1$, squared $\ell_2$) or nonconvex (e.g., SCAD, MCP) (Yang et al., 2023). In matrix ridge regression, which generalizes to multi-response and structured prediction, the objective is $\min_{X} \|AX - Y\|_F^2 + \lambda \|X\|_F^2$ (Wang et al., 2017).
SRO introduces a random sketching matrix $S \in \mathbb{R}^{m \times n}$, $m \ll n$, drawn from an oblivious subspace-embedding family (such as Gaussian, SRHT, CountSketch) and replaces the data, the Hessian, or the gradient by its sketched version. The resulting sketched problems are solvable with considerably reduced computational and memory demands.
For general convex/nonconvex regularizers, SRO operates on the sketched objective
$$\min_{x \in \mathbb{R}^d} \; \tfrac{1}{2}\|S(Ax - y)\|_2^2 + \lambda R(x),$$
admitting error bounds and minimax estimation rates under mild assumptions (Yang et al., 2023). In iterative schemes (e.g., Iterative Hessian Sketch), sketching and optimization steps are alternated, reducing the error with geometric contraction at each round (Wang et al., 2022).
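As a concrete illustration, the following minimal Python sketch applies the classical sketch to an $\ell_1$-regularized least-squares problem and solves the sketched objective with plain ISTA. Function names, problem sizes, and the choice of $\lambda$ are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sketched_lasso(A, y, lam, m, iters=500, seed=0):
    """Classical sketch-and-solve: replace (A, y) by (SA, Sy) with a dense
    Gaussian sketch S of size m x n, then run ISTA on the sketched objective
    (1/2)||S(Ax - y)||^2 + lam * ||x||_1."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA, Sy = S @ A, S @ y
    L = np.linalg.norm(SA, 2) ** 2       # Lipschitz constant of the sketched least-squares gradient
    x = np.zeros(d)
    for _ in range(iters):
        grad = SA.T @ (SA @ x - Sy)
        x = soft_threshold(x - grad / L, lam / L)
    return x

rng = np.random.default_rng(0)
n, d = 5000, 200
x_true = np.zeros(d); x_true[:10] = 1.0
A = rng.standard_normal((n, d))
y = A @ x_true + 0.1 * rng.standard_normal(n)
x_hat = sketched_lasso(A, y, lam=20.0, m=1000)   # m = 1000 << n = 5000
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```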
2. Sketching Constructions and Complexity Parameters
Sketching methods used in SRO include the following (a minimal construction example in Python follows the list):
- Subgaussian random projections: Dense Gaussian, Bernoulli (Chen et al., 2020).
- Structured transforms: SRHT, CountSketch, OSNAP (Avron et al., 2016, Lacotte et al., 2020).
- Multi-level hash-based: CountMin sketches for $\ell_1$ and logistic regression (Munteanu et al., 2023).
- Tensor-structured sketches: Row-wise tensorized sub-Gaussian matrices compatible with tensor product structures (Chen et al., 2020).
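For concreteness, two of the constructions above can be written in a few lines of Python; dense matrices are used for readability, whereas in practice CountSketch would be applied in $O(\mathrm{nnz}(A))$ time via sparse data structures. Helper names are illustrative.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """Dense sub-Gaussian sketch: entries N(0, 1/m), so that E[S^T S] = I."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def countsketch(m, n, rng):
    """CountSketch: each of the n rows of A is hashed to one of m buckets
    with a random sign."""
    rows = rng.integers(0, m, size=n)          # hash bucket for each row of A
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign for each row of A
    S = np.zeros((m, n))
    S[rows, np.arange(n)] = signs
    return S

rng = np.random.default_rng(0)
n, d, m = 5000, 50, 400
A = rng.standard_normal((n, d))
SA_gauss = gaussian_sketch(m, n, rng) @ A      # m x d sketched data
SA_count = countsketch(m, n, rng) @ A
```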
A critical insight of SRO is that the sketching dimension need not scale with $n$ or $d$, but with the "effective dimension" or "statistical dimension"
$$d_\lambda(A) = \operatorname{tr}\!\big(A^\top A\,(A^\top A + \lambda I)^{-1}\big) = \sum_i \frac{\sigma_i^2}{\sigma_i^2 + \lambda},$$
which quantifies the information content after regularization (Lacotte et al., 2021, Avron et al., 2016, Lin et al., 2018). For kernel and Hilbert-space problems, analogous capacity numbers and source conditions govern sketch sizes and convergence rates (Lin et al., 2018).
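A short Python helper (illustrative only) makes the definition concrete by evaluating $d_\lambda(A)$ from the singular values of $A$; note how $d_\lambda$ shrinks as $\lambda$ grows.

```python
import numpy as np

def statistical_dimension(A, lam):
    """Effective (statistical) dimension d_lambda = sum_i s_i^2 / (s_i^2 + lambda),
    computed from the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s**2 / (s**2 + lam)))

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 300)) @ np.diag(0.9 ** np.arange(300))  # decaying spectrum
for lam in (1e-3, 1e-1, 1e1):
    print(lam, statistical_dimension(A, lam))   # d_lambda decreases as lambda grows
```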
Key complexity parameters are summarized below:
| Problem class | Complexity scaling | Sketch size |
|---|---|---|
| Ridge regression | Statistical dimension $d_\lambda(A)$ | $m \gtrsim d_\lambda(A)/\varepsilon^2$ |
| Low-rank / RPCA / CCA | Minimum of rank and $d_\lambda$ | $m$ proportional to that minimum (up to log factors) |
| General convex constraints | Gaussian width, M-complexity of the constraint cone | $m \gtrsim w^2(\mathcal{C})/\varepsilon^2$ |
| $\ell_1$, logistic regression | Data complexity $\mu$, ambient dimension $d$ | $m$ polynomial in $\mu$ and $d$, sublinear in $n$ |
| Kernel/HS regression | Capacity $\mathcal{N}(\lambda)$ | $m \gtrsim \mathcal{N}(\lambda)$ |
3. Algorithmic Modalities: Sketch-and-Solve, Preconditioning, Iterative and Distributed SRO
Direct sketch-and-solve
Solve the regularized problem on sketched data (classical sketch), optionally with constraints or general regularizers; best when the sample size dominates the dimension, $n \gg d$ (Yang et al., 2023, Avron et al., 2016, Chen et al., 2020). For unconstrained least squares, this achieves a $(1+\varepsilon)$-approximation at sketch sizes scaling with the rank or effective dimension.
Sketched preconditioning
Use sketches to build a preconditioner for iterative solvers (e.g., LSQR, CG), yielding convergence independent of the condition number (Meier et al., 2022, Lacotte et al., 2020, Lacotte et al., 2021). Inner–outer schemes apply sketch-to-precondition options for IRN/LSQR solvers in ill-posed or $\ell_p$-regularized problems (Landman et al., 13 Oct 2025).
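The following Python sketch illustrates the sketch-to-precondition idea for ridge regression, recast as an augmented least-squares system and solved with LSQR; the dense Gaussian sketch and all parameter choices are simplifying assumptions rather than the exact pipelines of the cited works.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(0)
n, d, m, lam = 10000, 200, 1000, 1e-3
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, -4, d))   # ill-conditioned design
y = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

# Ridge as an augmented least-squares problem: min ||[A; sqrt(lam) I] x - [y; 0]||^2.
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([y, np.zeros(d)])

# Sketch A, append the regularization block, and take the R factor as a right
# preconditioner: with high probability A_aug R^{-1} is well-conditioned.
S = rng.standard_normal((m, n)) / np.sqrt(m)
R = np.linalg.qr(np.vstack([S @ A, np.sqrt(lam) * np.eye(d)]), mode="r")

M = LinearOperator(
    (n + d, d),
    matvec=lambda z: A_aug @ np.linalg.solve(R, z),
    rmatvec=lambda w: np.linalg.solve(R.T, A_aug.T @ w),
)
z = lsqr(M, y_aug, atol=1e-12, btol=1e-12)[0]
x = np.linalg.solve(R, z)                    # undo the change of variables x = R^{-1} z

x_direct = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
print(np.linalg.norm(x - x_direct) / np.linalg.norm(x_direct))
```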
Iterative Hessian Sketch (IHS) and adaptive schemes
IHS iteratively solves sketched subproblems, with geometric error contraction and low memory. Adaptive sketching algorithms increase $m$ only as required, tuning to the effective dimension without prior knowledge (Lacotte et al., 2021, Lacotte et al., 2020). Error bounds and complexity are optimal in terms of intrinsic dimensions.
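A minimal Python rendering of the IHS loop for plain least squares follows; a ridge variant would add $\lambda I$ to the sketched Hessian and $-\lambda x$ to the gradient. Names and sizes are illustrative.

```python
import numpy as np

def iterative_hessian_sketch(A, y, m, iters=8, seed=0):
    """IHS for least squares: each round draws a fresh sketch, forms the
    sketched Hessian (SA)^T (SA), and takes a Newton-type step using the
    exact gradient A^T (y - A x).  The error contracts geometrically."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        SA = (rng.standard_normal((m, n)) / np.sqrt(m)) @ A
        H = SA.T @ SA                        # sketched Hessian, d x d
        g = A.T @ (y - A @ x)                # exact least-squares gradient at x
        x = x + np.linalg.solve(H, g)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((10000, 100))
y = A @ rng.standard_normal(100) + 0.1 * rng.standard_normal(10000)
x_ihs = iterative_hessian_sketch(A, y, m=500)
x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.linalg.norm(x_ihs - x_ls) / np.linalg.norm(x_ls))
```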
Distributed and debiased SRO
In cluster or federated settings, each worker solves a sketched local subproblem and the parameters are averaged. Standard averaging is biased; surrogate sketching and careful rescaling of the regularization parameter $\lambda$ yield unbiased Newton steps, with bias and variance decreasing as the sketch size and the number of workers grow (Dereziński et al., 2020, Bartan et al., 2022). Determinantal point processes (DPPs) are employed for surrogate sketches that provide exact bias formulas.
Model averaging
Averaging over $g$ independent sketches reduces statistical risk by nearly a factor of $1/g$, driving optimization and statistical error close to unsketched levels in parallel settings (Wang et al., 2017).
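The averaging scheme can be sketched in Python as below (classical sketch with an unadjusted $\lambda$; the cited analyses additionally rescale the regularization to control bias). The per-sketch computation inside the loop is also what each worker would perform in the distributed setting above.

```python
import numpy as np

def averaged_sketched_ridge(A, y, lam, m, g, seed=0):
    """Average g independent sketch-and-solve ridge estimates.  The variance
    shrinks roughly like 1/g; the sketch-induced bias is unaffected."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    estimates = []
    for _ in range(g):
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA, Sy = S @ A, S @ y
        estimates.append(np.linalg.solve(SA.T @ SA + lam * np.eye(d), SA.T @ Sy))
    return np.mean(estimates, axis=0)

rng = np.random.default_rng(2)
A = rng.standard_normal((10000, 50))
y = A @ rng.standard_normal(50) + rng.standard_normal(10000)
x_full = np.linalg.solve(A.T @ A + 1.0 * np.eye(50), A.T @ y)
for g in (1, 4, 16):
    x_avg = averaged_sketched_ridge(A, y, lam=1.0, m=400, g=g)
    print(g, np.linalg.norm(x_avg - x_full) / np.linalg.norm(x_full))
```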
Flexible Krylov methods and sketching in sparse/ill-posed settings
Randomized flexible Krylov methods use sketch-and-solve or sketch-to-precondition strategies within iteratively reweighted least squares and MM schemes for regularization and sparse recovery (Landman et al., 13 Oct 2025).
4. Theoretical Guarantees: Error Bounds, Rates, and Trade-offs
SRO frameworks provide rigorous approximation, statistical, and computational guarantees:
- Relative error: For convex regularizers, the function-value gap and the parameter-estimation error of the sketched solution are bounded by a relative factor that shrinks as the sketch size grows (Yang et al., 2023).
- Minimax rates: Sparse convex/nonconvex estimators obtained via SRO match minimax rates under restricted-eigenvalue conditions (Yang et al., 2023).
- Bias–variance analysis: The classical sketch inflates variance (roughly by a factor $n/m$), the Hessian sketch increases bias (unless regularization is large); model averaging and a sufficient sketch size mitigate both (Wang et al., 2017). A small Monte Carlo illustration follows this list.
- Distributed/unbiased averaging: With surrogate sketches and properly scaled regularization, the averaged parameter is unbiased and converges at the same rate as full Newton (Dereziński et al., 2020, Bartan et al., 2022).
- Iterative and adaptive error contraction: Iterative SRO achieves geometric error reduction per step; adaptive algorithms automatically match the effective dimension (Lacotte et al., 2020, Lacotte et al., 2021).
- Optimal learning rates: Sketched kernel/Hilbert-space regression achieves optimal learning rates without saturation, provided the sketch size matches the statistical capacity, thereby attaining minimax efficiency (Lin et al., 2018).
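The bias–variance trade-off above can be probed empirically with a small Monte Carlo over repeated sketches; the parameters below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam, trials = 5000, 50, 200, 10.0, 200
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_full = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)   # unsketched ridge solution

classical, hessian = [], []
for _ in range(trials):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = S @ A
    # classical sketch: both A and y are sketched
    classical.append(np.linalg.solve(SA.T @ SA + lam * np.eye(d), SA.T @ (S @ y)))
    # Hessian sketch: sketched quadratic term, exact linear term A^T y
    hessian.append(np.linalg.solve(SA.T @ SA + lam * np.eye(d), A.T @ y))

for name, est in (("classical", np.array(classical)), ("hessian", np.array(hessian))):
    bias = np.linalg.norm(est.mean(axis=0) - x_full)
    spread = np.mean(np.linalg.norm(est - est.mean(axis=0), axis=1) ** 2)
    print(f"{name}: bias={bias:.3f}, variance={spread:.3f}")
```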
5. Extensions: Structured Data, Nonconvex Regularization, and Specialized Domains
SRO generalizes to
- Tensor-structured sketching: Exploits low-dimensional tensor-product/Kronecker structures for massive acceleration and low overhead; the sketching-dimension bound scales with the natural Gaussian width/M-complexity of the constraint cone (Chen et al., 2020).
- Nonconvex regularizers: Even for Fréchet-subdifferentiable, nonconvex $R$, SRO still yields controlled approximation error, provided subgradient Lipschitzness and a spectral-gap condition (Yang et al., 2023).
- Image sketching and inverse problems: Multiresolution stochastic sketching (e.g., ImaSk in tomographic reconstruction) replaces per-iteration full-image operations with random image-domain sketches, resulting in linear convergence and improved wall-clock performance (Perelli et al., 2024).
- Regularization paths: Sketching the Krylov subspace enables efficient computation of ridge regularization paths for multiple $\lambda$ values simultaneously, using a binomial decomposition on the sketch basis (Wang et al., 2022).
- Primal–dual and variance reduction: SEGA sketch-and-project schemes generalize stochastic gradient methods through variance-reduced unbiased gradient estimation (Hanzely et al., 2018); a coordinate-sketch instance is sketched below.
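A hedged Python sketch of SEGA with coordinate sketches on an $\ell_1$-regularized least-squares objective is given below; the step size and problem sizes are illustrative and not tuned to the theory of the cited paper.

```python
import numpy as np

def sega_coordinate_lasso(A, y, lam, iters=20000, seed=0):
    """SEGA with coordinate sketches for (1/2)||Ax - y||^2 + lam * ||x||_1.
    A running estimate h of the gradient is updated from one observed
    coordinate per step; g is an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.linalg.norm(A, 2) ** 2            # smoothness constant of the quadratic part
    alpha = 1.0 / (d * L)                    # conservative step size for coordinate sketches
    x, h = np.zeros(d), np.zeros(d)
    for _ in range(iters):
        i = rng.integers(d)
        grad_i = A[:, i] @ (A @ x - y)       # one coordinate of the exact gradient
        g = h.copy()
        g[i] += d * (grad_i - h[i])          # unbiased: E[g] = A^T (A x - y)
        h[i] = grad_i                        # sketch-and-project update of h
        v = x - alpha * g                    # proximal (soft-thresholding) step
        x = np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((300, 20))
y = A @ np.r_[np.ones(5), np.zeros(15)] + 0.1 * rng.standard_normal(300)
print(sega_coordinate_lasso(A, y, lam=5.0)[:8])
```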
6. Practical Guidelines and Empirical Observations
Best practices and empirical insights extracted from the literature include:
- Sketch size selection: For optimization accuracy $\varepsilon$, take the sketch size $m$ on the order of the effective dimension scaled by $1/\varepsilon^2$; in kernel/Hilbert-space problems, use $m \gtrsim \mathcal{N}(\lambda)$ (Lin et al., 2018, Lacotte et al., 2020).
- Regularization tuning: To control bias (especially in Hessian/sketched Newton), ensure the regularization is large enough relative to the sketch-induced perturbation, and rescale $\lambda$ in distributed settings according to the scaled-regularization prescriptions of (Wang et al., 2017, Dereziński et al., 2020).
- Adaptive sketching: Start with a small sketch size $m_0$ and double $m$ whenever progress stalls, until the target contraction is reached; this discovers the effective dimension without prior knowledge (Lacotte et al., 2021, Lacotte et al., 2020). A minimal doubling heuristic is sketched after this list.
- Distributed/parallel computation: Use surrogate sketches and debiasing for unbiased parameter averaging; combine with model averaging for risk control (Bartan et al., 2022, Dereziński et al., 2020).
- Sparse/recovery problems: For sparsity-promoting constraints, the sketching dimension scales as $O(s \log(d/s))$ for $s$-sparse signals (Chen et al., 2020).
- Multiresolution imaging: Randomized multilevel image-domain sketches (e.g., block-averaging/downsampling) efficiently trade memory for reduced per-iteration cost, retaining fast convergence (Perelli et al., 2024).
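The doubling heuristic for adaptive sketching can be rendered as the following simplified Python stand-in (not the exact adaptive schemes of Lacotte et al.); parameters are illustrative.

```python
import numpy as np

def adaptive_ihs(A, y, m0=50, target=0.5, iters=25, seed=0):
    """Iterative Hessian Sketch whose sketch size m doubles whenever the
    gradient-norm contraction is worse than `target`; m settles near the
    level needed for stable contraction, without knowing it in advance."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m, x = m0, np.zeros(d)
    prev = np.linalg.norm(A.T @ (A @ x - y))
    for _ in range(iters):
        SA = (rng.standard_normal((m, n)) / np.sqrt(m)) @ A
        step = np.linalg.lstsq(SA.T @ SA, A.T @ (y - A @ x), rcond=None)[0]
        res = np.linalg.norm(A.T @ (A @ (x + step) - y))
        if res > target * prev:              # contraction stalled: enlarge the sketch
            m = min(2 * m, n)
        if res < prev:                       # accept the step only if it made progress
            x, prev = x + step, res
    return x, m

rng = np.random.default_rng(4)
A = rng.standard_normal((8000, 100))
y = A @ rng.standard_normal(100) + 0.1 * rng.standard_normal(8000)
x_hat, m_final = adaptive_ihs(A, y)
print(m_final, np.linalg.norm(A.T @ (A @ x_hat - y)))
```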
7. Impact, Limitations, and Ongoing Directions
SRO methods have demonstrated substantial speedups and reductions in memory while preserving statistical accuracy and approximation guarantees in convex and nonconvex regularized optimization (Avron et al., 2016, Wang et al., 2022, Landman et al., 13 Oct 2025). Ongoing areas of investigation include more efficient sketch constructions leveraging data structure, improved nonconvex analysis, robust adaptive mechanisms, extensions to online and streaming inference, and further empirical validation in scientific imaging and large-scale machine learning (Perelli et al., 2024, Munteanu et al., 2023).
The principal limitation is the need to guarantee unbiasedness or sufficiently bounded error distortion for more complex regularizers and constraint geometries. Both statistical and computational complexity of the sketched problem hinge on accurate estimation of intrinsic dimensions or structural parameters of the data and regularizer.
In summary, SRO is a unifying paradigm for randomized, scalable regularized optimization, enabling near-minimax statistical efficiency and optimal computational complexity across a broad spectrum of problem domains and regularizer choices.