Subspace-Guided Optimization

Updated 21 January 2026
  • Subspace-guided optimization is a methodology that decomposes high-dimensional problems into adaptive low-dimensional subspaces for efficient computation.
  • It employs various strategies—ranging from historical gradients to random embeddings and reinforcement learning—to select informative subspaces.
  • These techniques enhance scalability and robustness in applications such as Bayesian optimization, continual learning, and large-model fine-tuning.

Subspace-guided optimization is a class of methodologies that decompose high- or infinite-dimensional optimization problems into a sequence of low-dimensional subproblems, each solved within an adaptively chosen subspace. This paradigm is motivated by both computational efficiency—enabling second-order or derivative-free algorithms to be run at scale—and by the observation that many large-scale objectives exhibit a low effective intrinsic dimension, or concentrate their curvature and descent directions in specific subspaces. Over the last decade, this strategy has permeated unconstrained optimization, Bayesian optimization, continual learning, constrained and saddle-point optimization, and large-model fine-tuning, leading to substantial advances in efficiency, robustness, and sample complexity.

1. Core Principles and Mathematical Formulation

Subspace-guided optimization, in its canonical form, replaces the traditional update rule for minimizing $f(x)$ over $x \in \mathbb{R}^n$,

$x_{k+1} = x_k - \eta_k \nabla f(x_k),$

with a two-stage process:

  • At iteration $k$, construct a subspace $P_k \in \mathbb{R}^{n \times d}$ ($d \ll n$), whose columns define $d$ search directions.
  • Solve the subspace subproblem:

$\alpha_k = \arg\min_{\alpha \in \mathbb{R}^d} f(x_k + P_k \alpha),$

and update $x_{k+1} = x_k + P_k \alpha_k$.

This basic template admits immense flexibility. The subspace $P_k$ may be fixed or chosen adaptively (via recent steps, gradients, SVDs of stale curvature, or learning-based policies), and the subspace subproblem may be tackled with first- or second-order solvers, or even zeroth-order methods. Critically, the cost per outer iteration is $O(nd)$, and the inner mechanics (e.g., BFGS on the subspace, or L-BFGS-style memory) can efficiently harness curvature or history (Choukroun et al., 2021).
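The two-stage template above can be sketched in a few lines. The toy implementation below is illustrative only (it is not taken from any of the cited papers): it draws a fresh random orthonormal basis each outer iteration and uses SciPy's BFGS for the inner $d$-dimensional solve, on an ill-conditioned quadratic.

```python
import numpy as np
from scipy.optimize import minimize

def subspace_step(f, x, P):
    """One outer iteration: solve the d-dimensional subproblem
    min_alpha f(x + P @ alpha), then move along the basis P."""
    res = minimize(lambda a: f(x + P @ a), np.zeros(P.shape[1]), method="BFGS")
    return x + P @ res.x

# Toy ill-conditioned quadratic in n = 500 dimensions.
rng = np.random.default_rng(0)
n, d = 500, 5
scales = np.logspace(0, 3, n)
f = lambda x: 0.5 * np.sum(scales * x * x)

x0 = rng.standard_normal(n)
x = x0
for _ in range(40):
    # A fresh random orthonormal basis each iteration; adaptive schemes
    # would instead build P from gradients or step history.
    P = np.linalg.qr(rng.standard_normal((n, d)))[0]
    x = subspace_step(f, x, P)
```

Because $\alpha = 0$ is always feasible, each outer iteration is monotone non-increasing in $f$; the per-iteration cost is dominated by the $O(nd)$ products with $P$.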

Variants include random embedding for dimension reduction in black-box optimization (Ngo et al., 2024, Shilton et al., 2020, Zhan et al., 2024), alternating subspaces for manifold optimization (Yu et al., 25 Jan 2025), orthogonality constraints for continual learning (Cheng et al., 17 May 2025), and dynamic or probabilistic subspace selection (Menickelly, 2024, Choukroun et al., 2021).

2. Algorithmic Frameworks and Meta-optimization

A key distinction among subspace-guided strategies is between classical, rule-based subspace determination and meta-learned or reinforcement-driven subspace policies.

  • Rule-based and Sliding-window Policies: Early and classical schemes (SESOP, ORTH, CG) utilize subspaces made up of recent gradients, steps, or search directions, with updates such as FIFO removal or selection based on the relative magnitude of coefficients from the last subspace step. For example, removing the subspace direction with the smallest $|\alpha_k^i|$ exploits the empirical importance of previous directions (Choukroun et al., 2021, Richardson et al., 2016).
  • Meta Subspace Optimization (MSO): The MSO framework parameterizes the subspace update policy as either a rule or a small learned module $\pi(S_k, \Omega_k)$, where $S_k$ is the history of $d-1$ step-vectors and $\Omega_k$ can be (for instance) the latest subspace coefficients $\alpha_k$. The meta-policy determines, at each step, which directions to discard and which new ones to accept, facilitating targeted adaptation to the geometry of $f$ (Choukroun et al., 2021).
  • Reinforcement Learning (RL) Policies: MSO generalizes the subspace update as a Markov decision process, where the state is a stack of recent $\alpha$-vectors, the action is the subspace column to prune, and the reward is the instantaneous relative decrease in function value. The RL policy is parameterized (e.g., via an MLP) and trained with REINFORCE, achieving meta-adaptive removal patterns that surpass both fixed and rule-based competitors (Choukroun et al., 2021).
  • Bandit-augmented Subspace Selection: Recent approaches recast the subspace direction selection as a linear bandit problem, maximizing lower bounds on expected descent in a dynamic gradient field. This enables systematic exploration/exploitation tradeoffs and provable dynamic regret guarantees, especially in directional derivative–limited or derivative-free contexts (Menickelly, 2024).
  • Continual Learning via Sequential Orthogonal Subspaces: In continual learning, as in CoSO, new tasks are learned by projecting updates into a dynamically determined sequence of low-rank subspaces orthogonal to historical update spaces, with subspace bases updated via SVD and Frequent Directions (Cheng et al., 17 May 2025).
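As a concrete illustration of the rule-based sliding-window policy described above (smallest-$|\alpha|$ removal), here is a minimal sketch; the function name, the alignment of coefficients with window slots after orthogonalization, and the test problem are illustrative simplifications, not taken from the cited implementations.

```python
import numpy as np
from scipy.optimize import minimize

def sliding_window_subspace(f, grad, x0, d=5, iters=60):
    """Rule-based subspace policy (sketch): keep a window of at most d
    directions; after each inner solve, prune the direction whose
    coefficient |alpha_i| was smallest, then append the fresh gradient."""
    x = x0.copy()
    dirs = [grad(x)]                                  # seed with the gradient
    for _ in range(iters):
        P = np.linalg.qr(np.column_stack(dirs))[0]    # orthonormalize window
        res = minimize(lambda a: f(x + P @ a), np.zeros(P.shape[1]))
        x = x + P @ res.x
        if len(dirs) == d:
            dirs.pop(int(np.argmin(np.abs(res.x))))   # smallest-|alpha| removal
        dirs.append(grad(x))
    return x

# Ill-conditioned quadratic with condition number 100.
n = 80
scales = np.logspace(0, 2, n)
f = lambda x: 0.5 * np.sum(scales * x * x)
grad = lambda x: scales * x

x_star = sliding_window_subspace(f, grad, np.ones(n))
```

Since the current gradient always lies in the window, each outer step is at least as good as an exact line search along the gradient, and the accumulated history typically yields much faster, CG-like progress.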

3. Subspace Construction: Deterministic, Random, and Data-driven

Various strategies exist for constructing effective subspaces:

  • Historical (Deterministic) Subspaces: The basis consists of recent gradients, steps, or Hessian-vector products, possibly orthogonalized (Gram-Schmidt) and managed with a fixed horizon (window). This facilitates implicit preconditioning and memory accumulation (modeled after SESOP/SEBOOST) (Choukroun et al., 2021, Richardson et al., 2016).
  • Random and Sketch-based Subspaces: To mitigate the cost of full gradient or Hessian computation, subspaces may be defined as random projections of the ambient space (e.g., Johnson-Lindenstrauss sketches). This approach is especially prominent in derivative-free large-scale optimization and allows provable norm/angle preservation for key vectors (gradients, active constraints) (Nozawa et al., 2023, Miyaishi et al., 2024, Cartis et al., 2024).
  • Functionally-aligned or Data-driven Subspaces: In Bayesian functional optimization, subspaces are defined by draws from a prior GP that encode experimenter intuition (length scale, smoothness). The sequential, affine subspace centers at the current best and spans functions likely to contain the optimum (Shilton et al., 2020). In measure-transport and sOED, likelihood-informed subspaces are computed from Fisher information matrices or Jacobians to accelerate high-dimensional amortized inference (Cui et al., 27 Feb 2025).
  • SVD/Subspace-from-Gradient Spectrum: For continual learning and parameter-efficient fine-tuning, subspaces are computed as the leading singular vectors of accumulated gradients or parameter updates, concentrating adaptation in highly informative directions, e.g., as in CoSO and PESO-LoRA (Cheng et al., 17 May 2025, Lou et al., 1 Dec 2025).
  • Problem-specific or Domain-encoded Subspaces: In Riemannian and manifold optimization, row and column subspaces of gradient matrices are separately adapted using coordinatewise statistics (e.g., via LSTMs), yielding shape-agnostic and memory-light meta-optimizers (Yu et al., 25 Jan 2025).
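A minimal sketch of the SVD-from-gradient-spectrum construction: the leading left singular vectors of a matrix of recent gradient samples recover a planted low-dimensional subspace. The helper name and synthetic data are illustrative only, not the construction used in any cited method.

```python
import numpy as np

def gradient_spectrum_subspace(grads, d):
    """Data-driven basis (sketch): the leading d left singular vectors of
    a matrix whose columns are recent gradient samples, concentrating the
    search where the gradient field carries the most energy."""
    G = np.column_stack(grads)                 # n x m matrix of gradients
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :d]                            # n x d orthonormal basis

rng = np.random.default_rng(1)
n = 200
# Synthetic gradients living mostly in a planted 3-dimensional subspace.
B = np.linalg.qr(rng.standard_normal((n, 3)))[0]
grads = [B @ rng.standard_normal(3) + 0.01 * rng.standard_normal(n)
         for _ in range(20)]
P = gradient_spectrum_subspace(grads, d=3)
```

With the noise level well below the signal, the recovered basis $P$ is nearly perfectly aligned with the planted subspace $B$; streaming variants replace the full SVD with sketches such as Frequent Directions.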

4. Theoretical Guarantees and Convergence Rates

Subspace-guided approaches provide rigorous theoretical guarantees, contingent on model class and subspace selection policy.

  • Linear and Second-order Convergence: Under strong convexity and sufficient subspace alignment (often realized with random projections), subspace Newton/quasi-Newton and trust-region methods match the optimal rates of their full-dimensional analogs (e.g., $O(1/\sqrt{k})$ stationarity violation in nonconvex settings, or $O(1/k)$ in convex settings), with dimension dependence scaling with the subspace size (Cartis et al., 2024, Miyaishi et al., 2024, Nozawa et al., 2023, Li et al., 2020).
  • Dynamic Regret and RL Policies: Bandit-augmented subspace choice strategies achieve sublinear dynamic regret under slow drift of gradients and bounded reward, with finite-memory schemes giving $O(K^{7/8})$ regret (the per-iteration optimality gap vanishes) (Menickelly, 2024).
  • Sublinear Regret and Coverage: In functional Bayesian optimization, sequential subspaces guarantee sublinear regret provided the effective manifold dimension is finite and the random draws (from the prior) eventually cover it. This ensures that after a sufficient number of subspace searches, the optimum is closely approached (Shilton et al., 2020).
  • Dimension-free Meta-optimization: In MSO, all meta-decision steps (RL states/actions, rule-based heuristics) are posed in $O(d)$ or $O(dh)$ dimensions, invariant to the ambient dimension $n$. Learning and search in this meta-space thus scale independently of problem size (Choukroun et al., 2021).

5. Applications and Empirical Performance

Subspace-guided optimization is deployed across a wide range of applications, often with state-of-the-art empirical results.

  • Classical and Deep Learning: On ill-conditioned quadratic and nonlinear problems, MSO outperforms SESOP, conjugate gradient, and ORTH by 5–15% in convergence speed; RL-enhanced MSO further improves by 5–10%. In MNIST and robust regression, learned policies generalize across tasks and network scales, matching or exceeding hand-tuned baselines and removing the need for learning rate tuning (Choukroun et al., 2021, Richardson et al., 2016).
  • High-Dimensional Bayesian Optimization: BOIDS and sequential subspace Bayesian optimization demonstrate that appropriately managed line-based or random-functional subspaces can dramatically accelerate optimization in $d = 100$–$500$ dimensions—often outperforming state-of-the-art batch or dimensionality-reduction BO (e.g., LineBO, BAxUS, ALEBO, SAASBO). Adaptive subspace adjustment aligns with the problem’s intrinsic dimension, leading to near-linear scaling in batch size and robustness to parameterization (Ngo et al., 2024, Shilton et al., 2020, Zhan et al., 2024).
  • Continual and Lifelong Learning: CoSO sequentially augments its null-space basis of historical gradient directions to guarantee zero interference while learning new tasks. Ablations confirm that strict subspace orthogonality is essential, and empirical results on ImageNet-R show a 2.8% gain in retained performance over the strongest fixed low-rank adaptation baseline (Cheng et al., 17 May 2025).
  • Parameter-Efficient Model Adaptation: In LLM fine-tuning, PESO and its LoRA instantiations treat parameter-efficient adaptation as an exploration–exploitation process in sequentially updated low-rank subspaces. PESO-LoRA closes the gap to full fine-tuning within LoRA’s memory budget, outperforming all tested LoRA variants (Lou et al., 1 Dec 2025).
  • Saddle-point and Constrained Problems: In large-scale min–max and constrained optimization, sequential subspace methods using both primal and dual directions (PD-SESOP) yield order-of-magnitude faster convergence and greater robustness on bilinear games, ADMM problems, and GANs than first-order competitors (Choukroun et al., 2020). For general constraints, randomized subspace projection (via Gaussian random embedding) enables large step sizes and $\tilde O((n/d)\epsilon^{-2})$ complexity with only $O(d)$ directional gradients per iteration (Nozawa et al., 2023).
  • Derivative-free and “Black-box” Regimes: Randomized subspace model-based methods (e.g., RSDFO-Q, ZO-SAH) deliver complexity nearly matching full-space DFO but with much lower dimension dependence. Low-dimensional quadratic modeling within sequential random subspaces allows scaling to $n \sim 10^3$–$10^4$, achieving superior accuracy and convergence on benchmark suites (Cartis et al., 2024, Kim et al., 8 Jul 2025, Xie et al., 2023).
  • Meta-optimization and Manifold Problems: Subspace-structured meta-optimizers enable scalable and memory-efficient learning of Riemannian updates, with orders-of-magnitude lower memory cost than full-matrix meta-learners—enabling deployment in large neural architectures and on hardware-constrained devices (Yu et al., 25 Jan 2025).
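The zero-interference mechanism behind CoSO-style continual learning reduces, in its simplest form, to projecting each new task's update onto the orthogonal complement of the historical subspace. The sketch below assumes an orthonormal historical basis and illustrative names; it is not the paper's implementation.

```python
import numpy as np

def project_orthogonal(update, H):
    """Project an update onto the orthogonal complement of span(H), where
    H (n x k) has orthonormal columns spanning historical directions, so
    the new task's step cannot interfere with previously learned tasks."""
    return update - H @ (H.T @ update)

rng = np.random.default_rng(2)
n = 100
H = np.linalg.qr(rng.standard_normal((n, 10)))[0]   # historical basis
g = rng.standard_normal(n)                           # candidate update
g_safe = project_orthogonal(g, H)
```

The projected update has, up to floating point, no component along any historical direction, which is the algebraic content of the strict-orthogonality requirement discussed above.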

6. Practical Considerations and Limitations

While subspace-guided optimization offers scalability and robustness, careful design choices are required:

  • Subspace dimension trade-offs: Larger subspaces capture more curvature or history but increase per-iteration cost; moderate dimensions (e.g., $d = 10$–$100$) typically suffice except on highly ill-conditioned problems (Choukroun et al., 2021, Cartis et al., 2024).
  • Subspace selection policy: Both deterministic and learned schemes must robustly identify directions that encode both descent and critical curvature; failure leads to slow convergence or ill-conditioning.
  • Hyperparameter tuning: Choices such as the frequency of subspace update, regularization strengths, and SVD ranks must be cross-validated or annealed for best performance.
  • Randomness and adaptation: Random embedding–based methods may fail if the function does not concentrate in a low-dimensional manifold, or if the random subspaces miss critical directions over practical runtime (Cartis et al., 2024, Nozawa et al., 2023).
  • Applicability: Certain structured problems (e.g., highly nonsmooth objectives or highly nonseparable constraints) may not admit efficient or informative subspace decompositions.

7. Emerging Directions and Theoretical Connections

Recent work suggests several active areas of expansion:

  • Dynamic and context-aware subspace adaptation, possibly integrating problem structure, RL, and bandit principles.
  • Measure-transport and amortized inference for Bayesian inverse problems, using subspace acceleration to enable real-time posterior sampling in high dimension $m$ (Cui et al., 27 Feb 2025).
  • Fine-grained meta-optimization over manifold and submanifold structures via coordinatewise statistics and learned LSTMs (Yu et al., 25 Jan 2025).
  • Unlearning and interpretability: Projections and constrained optimization within data-driven subspaces (e.g., constructed from sparse autoencoders) enable targeted knowledge removal and robust model editing (Wang et al., 30 May 2025).
  • Interleaving exploration and exploitation across subspaces, as in PESO or BOIDS, leveraging sequential adaptation to balance global search and local refinement (Lou et al., 1 Dec 2025, Ngo et al., 2024).

Subspace-guided optimization therefore offers a unifying and versatile toolkit, with formal guarantees and empirical efficacy across foundational domains in continuous optimization, Bayesian search, meta-learning, and large-model adaptation. Its continued development is likely to fundamentally shape scalable optimization in the high-dimensional and data-driven era.
