
Layer-Projected Coordinate Descent (LPCD)

Updated 3 December 2025
  • LPCD is a block-based optimization method that alternates relaxation and projection steps on variable groups to efficiently solve high-dimensional problems.
  • In deep network classification, LPCD leverages convexity in output layers and gradient-based updates in hidden layers to improve convergence and generalization.
  • LPCD unifies submodule quantization strategies by optimizing and projecting on grouped parameters, outperforming traditional methods in speed and accuracy.

Layer-Projected Coordinate Descent (LPCD) is a family of coordinate-based optimization algorithms that operate on blocks or "layers" of variables, performing joint optimization or projection steps on these blocks rather than on single coordinates. In modern applications, LPCD is used both as a robust method for high-dimensional regression/classification and as a unifying strategy for submodule- or layer-wise quantization in deep neural networks, leveraging convexity, projection techniques, and relaxations for tractable large-scale optimization and quantization tasks (Patel et al., 2020, Ichikawa et al., 1 Dec 2025, Jin et al., 2022).

1. Fundamental Principle and Mathematical Formulation

LPCD generalizes coordinate descent by moving along high-dimensional subspaces or blocks (which may correspond to network layers, submodules, or arbitrary groups of coordinates), optimizing or projecting onto their solution manifolds, then iteratively updating the solution state. If the global objective admits separable or partially convex structure, this approach can realize dramatic computational and convergence benefits.

Generalizing to $R$ blocks, for each block $r$ with parameters or variables $M_r \in \mathbb{R}^{N_r \times K_r}$ and a chosen "feasible set" (e.g., quantization grid or parameter domain) $\mathcal{Q}^{N_r \times K_r}$, LPCD alternately "relaxes" (optimizes in a continuous space) and "projects" (maps onto the feasible or quantized domain) as follows:

Relaxation step (block $r$ at iteration $t$):

$$\overline{M}_r^{(t)} = \arg\min_{U \in \mathbb{R}^{N_r \times K_r}} L(\dots, U, \dots)$$

where $L$ is the relevant loss (e.g., cross-entropy for classification, mean squared error for quantization fitting).

Projection step:

$$\widehat{M}_r^{(t+1)} = \Pi_{\mathcal{Q}}^{(r)}(\overline{M}_r^{(t)})$$

where $\Pi_{\mathcal{Q}}^{(r)}$ is a suitable projection/quantization operator (layer-wise, activation-aware, etc.) (Ichikawa et al., 1 Dec 2025).

This alternating block update proceeds cycle by cycle and block by block, generating a sequence of feasible solutions that, in practice, monotonically reduces the global objective.
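The relax-project alternation above can be sketched on a toy quadratic objective with an entrywise quantization grid. This is an illustrative sketch only: the block structure, least-squares relaxation, and nearest-grid projection are simplifying assumptions, not the papers' exact setup.

```python
import numpy as np

def project(M, grid):
    """Entrywise projection onto the nearest point of a finite grid."""
    return grid[np.argmin(np.abs(M[..., None] - grid), axis=-1)]

def lpcd(X_blocks, y, grid, T=10):
    """Toy R-block LPCD for L(M_1,...,M_R) = ||sum_r X_r M_r - y||^2:
    cyclically relax one block in continuous space (least squares, with
    the other blocks held at their quantized values), then project it
    onto the feasible grid."""
    R = len(X_blocks)
    # Initialize each block with round-to-nearest of its unconstrained fit
    M_hat = [project(np.linalg.lstsq(X, y, rcond=None)[0], grid)
             for X in X_blocks]
    for _ in range(T):
        for r in range(R):
            # Residual with all other blocks held fixed (quantized)
            resid = y - sum(X_blocks[j] @ M_hat[j]
                            for j in range(R) if j != r)
            # Relaxation step: continuous least-squares over block r
            M_bar = np.linalg.lstsq(X_blocks[r], resid, rcond=None)[0]
            # Projection step: map block r onto the quantized domain
            M_hat[r] = project(M_bar, grid)
    return M_hat
```

Each inner iteration is exactly one relaxation step followed by one projection step in the notation above, with the remaining blocks frozen at their current quantized values.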

2. LPCD for Deep Network Classification

LPCD arises naturally in deep networks whose structure admits partial convexity, specifically when minimizing cross-entropy loss with respect to hidden weights $W_\text{hidden}$ and output/linear weights $W_\text{output}$:

  • Block 1: Linear/output layer weights
    • Objective for fixed hidden layers: minimize cross-entropy, which is globally convex in $W_\text{output}$.
    • Perform an exact Newton step using closed-form gradient and Hessian for cross-entropy loss:

    $$W_{\text{output}}^{(t+1)} = W_{\text{output}}^{(t)} - H^{-1} g$$

    where $g$ and $H$ are the gradient and Hessian of the cross-entropy loss, computed over the design matrix of hidden activations $Z$ and the softmax outputs $p$ (Patel et al., 2020).

  • Block 2: Hidden layer weights

    • For fixed $W_\text{output}$, the objective is non-convex; update via a single gradient-descent or adaptive-optimizer step.

This alternated (block-coordinate) scheme leverages convexity where available (output layer) and maintains tractability elsewhere (hidden layers). The Hessian for block 1 is small: its dimension scales only with the product of the number of classes and hidden units, not the full model parameter count, so inverting it is cheap.
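A minimal sketch of the block-1 Newton step, using the standard multinomial-logistic gradient and Hessian; the small ridge term, the dense Hessian construction, and all sizes are assumptions added for a self-contained example:

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def newton_step_output(W, Zh, Y, ridge=1e-3):
    """One exact Newton step on the cross-entropy loss in the output
    weights W (B x C), holding the hidden activations Zh (n x B) fixed.
    The objective is convex in W; g and H below are the standard
    multinomial-logistic gradient and Hessian (dimension C*B)."""
    n, B = Zh.shape
    C = Y.shape[1]
    P = softmax(Zh @ W)                    # n x C class probabilities
    g = (Zh.T @ (P - Y)).T.reshape(-1)     # gradient, flattened class-major
    H = np.zeros((B * C, B * C))           # (CB) x (CB) block Hessian
    for i in range(n):
        S = np.diag(P[i]) - np.outer(P[i], P[i])
        H += np.kron(S, np.outer(Zh[i], Zh[i]))
    H += ridge * np.eye(B * C)             # small ridge for invertibility
    delta = np.linalg.solve(H, g)
    return W - delta.reshape(C, B).T
```

The explicit per-sample loop makes the Kronecker structure of the Hessian visible; a practical implementation would batch it.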

3. LPCD in Quantization: A Unified Submodule Framework

LPCD generalizes and unifies classical and modern post-training quantization (PTQ) schemes by extending the optimization from single-layer (or single-parameter) quantization to blocks corresponding to arbitrary submodules (e.g., multi-layer attention heads, residual blocks).

  • General PTQ Objective:

$$\min_{\widehat{M}_r \in \mathcal{Q}^{N_r \times K_r}\ \forall r} L(\widehat{M}_1, \dots, \widehat{M}_R)$$

capturing the error between quantized and full-precision outputs over a calibration set.

  • LPCD step: For each quantization block, solve a relaxed quadratic (often least-squares) problem over the continuous variable, then project to the quantized space, e.g., via activation-aware rounding or layer-wise operators from GPTQ/AWQ (Ichikawa et al., 1 Dec 2025).
  • Submodule quantization: By grouping functionally coupled parameters (e.g., Q/K/V or MLP up/down layers), LPCD allows the continuous relaxation step to capture block-level error propagation, normalization, or residual connectivity.

This approach recovers QEP, LoaQ, classical GPTQ, or RTN as special cases depending on partition choices and iteration counts.
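To illustrate the submodule view, the following sketch quantizes a two-layer linear module and lets the second layer's relaxation absorb the first layer's quantization error before projecting. The linear layers, nearest-grid projection, and least-squares relaxation are simplifying assumptions standing in for the activation-aware operators used in practice.

```python
import numpy as np

def project(W, grid):
    """Entrywise nearest-grid-point quantization."""
    return grid[np.argmin(np.abs(W[..., None] - grid), axis=-1)]

def quantize_two_layer(X, W1, W2, grid):
    """Sketch of submodule-aware PTQ for a two-layer linear module
    y = X W1 W2 on calibration data X. Layer 1 is quantized first; the
    relaxation step for layer 2 then fits the *quantized* layer-1 output
    to the full-precision module output before projecting, so block-level
    error propagation is absorbed rather than ignored."""
    target = X @ W1 @ W2                   # full-precision module output
    W1_hat = project(W1, grid)             # project layer 1
    Z = X @ W1_hat                         # propagated quantized activations
    # Relaxation: least-squares fit of layer 2 to the module-level target
    W2_bar = np.linalg.lstsq(Z, target, rcond=None)[0]
    W2_hat = project(W2_bar, grid)         # project layer 2
    return W1_hat, W2_hat
```

Quantizing each layer independently would instead round W2 against the full-precision activations, ignoring the error already introduced in layer 1.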

4. Projection Mechanics and Block Updates

LPCD block updates involve projection onto subspaces or intersections of constraints. For linear least-squares, this corresponds to projecting the current iterate onto the intersection of several hyperplanes determined by the chosen coordinates (the "layer"). The general block update for a layer $L$ of indices $j_1, \dots, j_s$ is:

$$x_{k+1} = x_k + E_k G_k^{-1} U_k^\top (b - A x_k)$$

with $U_k = [A_{(j_1)}, \dots, A_{(j_s)}]$ (the selected columns of $A$), $G_k = U_k^\top U_k$, and $E_k$ the embedding that maps the $s$ block coordinates back into the full variable, as in the layer-size-$s$ extension of standard coordinate descent. This generalizes the $s=1$ classical method (Kaczmarz/Gauss–Southwell) and the $s=2$ Gram–Schmidt LPCD update (Jin et al., 2022).

The two-coordinate case is particularly advantageous in settings with high column coherence, as joint projection onto intersected constraints can break local dependence and yield dramatic speedups over purely coordinate-wise updates, as evidenced by numerical results on synthetic highly coherent linear systems.
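One step of the layer update above can be sketched directly for a linear least-squares system; variable names mirror the formula, and the block index set J is chosen by the caller:

```python
import numpy as np

def layer_cd_step(A, b, x, J):
    """One layer (block) update of coordinate descent for min ||Ax - b||^2:
    x <- x + E G^{-1} U^T (b - A x), where U = A[:, J] holds the selected
    columns, G = U^T U is the block Gram matrix, and E embeds the |J|
    chosen coordinates back into the full variable."""
    U = A[:, J]                            # columns of the selected layer
    r = b - A @ x                          # current residual
    delta = np.linalg.solve(U.T @ U, U.T @ r)
    x = x.copy()
    x[J] += delta                          # E applies the update to coords J
    return x
```

After the step, the residual is exactly orthogonal to the span of the selected columns, which is the joint-projection property that helps on highly coherent systems.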

5. Algorithmic Workflow and Implementation

A canonical LPCD workflow for $R$ blocks and $T$ iterations is:

Input: blocks M_1, …, M_R; loss L; projection ops {Π^{(j)}}; iterations T

Initialize: for each j, set the quantized baseline M̂_j^(0) = Π^{(j)}(M_j)
for t = 0, …, T-1:
    for j = 1, …, R:
        // Relax block j (optionally with a proximal penalty)
        M̄_j ← argmin_U L(…, U, …)
        // Project onto the quantization grid
        M̂_j^(t+1) ← Π^{(j)}(M̄_j)
return (M̂_1^(T), …, M̂_R^(T))
(Ichikawa et al., 1 Dec 2025)

For large blocks, the relaxed least-squares problem is solved approximately via gradient descent due to computational constraints, especially in LLMs and large-scale models. The procedure is compatible with existing PTQ pipelines; only the continuous optimization and projection steps need modification.
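The approximate relaxation mentioned above might look as follows for a quadratic subproblem. The step-size rule derived from the Lipschitz constant of the gradient is a standard choice, not necessarily the papers' exact schedule:

```python
import numpy as np

def relax_by_gd(X, target, W_init, steps=100, lr=None):
    """Approximate relaxation step: minimize ||X W - target||_F^2 by plain
    gradient descent instead of a closed-form solve, as one would for
    large blocks. The gradient is 2 X^T (X W - target), whose Lipschitz
    constant is 2 ||X^T X||_2; lr = 1 / (2 ||X^T X||_2) guarantees descent."""
    if lr is None:
        lr = 1.0 / (2 * np.linalg.norm(X.T @ X, 2))
    W = W_init.copy()
    for _ in range(steps):
        W -= lr * 2 * X.T @ (X @ W - target)
    return W
```

With enough steps this converges to the same minimizer as the closed-form least-squares solve, at a per-step cost linear in the block size.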

6. Convergence, Complexity, and Empirical Results

  • Convergence: For convex relaxation steps (e.g., least-squares, cross-entropy in the linear layer), block coordinate descent guarantees that each block update finds a global minimum for the local subproblem, and all limit points of the iterates are stationary for the full objective under standard smoothness and bounded-level-set assumptions (Patel et al., 2020, Ichikawa et al., 1 Dec 2025). Empirically, for high-coherence problems and in model quantization, convergence is rapid and robust.
  • Computational complexity: For deep-learning training, block Hessian inversion scales as $O((CB)^3)$ (for $C$ classes and hidden dimension $B$), far less than $O(P^3)$ for full-model Newton methods, making second-order steps tractable. PTQ-LPCD for submodule blocks requires $O(TN^2)$ or comparable effort for gradient-based subproblem solves, but this is amortized over small calibration sets (Patel et al., 2020, Ichikawa et al., 1 Dec 2025).
  • Empirical observations:
    • In deep networks, applying Newton updates to the final layer dramatically improves convergence and generalization: e.g., on CIFAR-10, LPCD reaches higher accuracy in one-fourth the epochs and yields smoother hidden representations than Adam or SGD alone (Patel et al., 2020).
    • For quantization, LPCD consistently outperforms QEP and LoaQ across 4-, 3-, and 2-bit regimes. On LLaMA3 8B at 3 bits, LPCD achieves ≈9.81 PPL, compared to QEP (≈25.39) and LoaQ (≈14.15); for Qwen3 8B at 2 bits, LPCD reaches 58.8 PPL while alternatives degrade to 165.7 or 550.3 (Ichikawa et al., 1 Dec 2025).
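The complexity gap between the output-block Newton step and a full-model second-order method can be made concrete with a back-of-the-envelope calculation; the network sizes below are purely illustrative:

```python
# Hypothetical sizes: C classes, hidden width B (e.g. a CIFAR-10-style
# head), and a total parameter count P for the whole network.
C, B = 10, 512
P = 3_000_000

# Cubic cost of solving the respective Hessian systems.
block_cost = (C * B) ** 3     # output-block Newton: O((C*B)^3)
full_cost = P ** 3            # full-model Newton:   O(P^3)

print(f"block/full cost ratio: {block_cost / full_cost:.2e}")
```

Even for a modest model, the output-block solve is many orders of magnitude cheaper than a full-model second-order step.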

7. Extensions, Limitations, and Generalization

LPCD inherently includes standard coordinate descent (block size 1), block/orthogonalized schemes with explicit Gram–Schmidt projection for small block size, and generalized submodule quantization for arbitrary blockings (Jin et al., 2022, Ichikawa et al., 1 Dec 2025). The flexibility in block and projection selection allows LPCD to model error propagation, normalization, and residual connections, achieving output-level optimality where analytic projection is feasible.

Limitations include the need for good block partitioning (especially in PTQ), potential computational bottlenecks for very large submodules, and challenges in extending beyond linear/quadratic/convex submodules (e.g., nonlinearities such as softmax or GELU require heuristic relaxations). Convergence rate and empirical behavior depend on problem structure (column coherence, calibration set quality, bit-width, etc.) and on the choice of projection operator.

LPCD provides a comprehensive methodological and algorithmic substrate for both optimization in machine learning and post-training quantization across modern neural network architectures, unifying a broad class of existing approaches and enabling efficient, tractable, and performance-optimal block/submodule optimization (Patel et al., 2020, Ichikawa et al., 1 Dec 2025, Jin et al., 2022).
