
Layer-Projected Coordinate Descent (LPCD)

Updated 3 December 2025
  • LPCD is a block-based optimization method that alternates relaxation and projection steps on variable groups to efficiently solve high-dimensional problems.
  • In deep network classification, LPCD leverages convexity in output layers and gradient-based updates in hidden layers to improve convergence and generalization.
  • LPCD unifies submodule quantization strategies by optimizing and projecting on grouped parameters, outperforming traditional methods in speed and accuracy.

Layer-Projected Coordinate Descent (LPCD) is a family of coordinate-based optimization algorithms that operate on blocks or "layers" of variables, performing joint optimization or projection steps on these blocks rather than on single coordinates. In modern applications, LPCD is used both as a robust method for high-dimensional regression/classification and as a unifying strategy for submodule- or layer-wise quantization in deep neural networks, leveraging convexity, projection techniques, and relaxations for tractable large-scale optimization and quantization tasks (Patel et al., 2020, Ichikawa et al., 1 Dec 2025, Jin et al., 2022).

1. Fundamental Principle and Mathematical Formulation

LPCD generalizes coordinate descent by moving along high-dimensional subspaces or blocks (which may correspond to network layers, submodules, or arbitrary groups of coordinates), optimizing or projecting onto their solution manifolds, then iteratively updating the solution state. If the global objective admits separable or partially convex structure, this approach can realize dramatic computational and convergence benefits.

Generalizing to $R$ blocks, for each block $r$ with parameters or variables $M_r \in \mathbb{R}^{N_r \times K_r}$ and a chosen "feasible set" (e.g., quantization grid or parameter domain) $\mathcal{Q}^{N_r \times K_r}$, LPCD alternately "relaxes" (optimizes in a continuous space) and "projects" (maps onto the feasible or quantized domain) as follows:

Relaxation step (block $r$ at iteration $t$):

$$\overline{M}_r^{(t)} = \arg\min_{U \in \mathbb{R}^{N_r \times K_r}} L(\dots, U, \dots)$$

where $L$ is the relevant loss (e.g., cross-entropy for classification, mean squared error for quantization fitting).

Projection step:

$$\widehat{M}_r^{(t+1)} = \Pi_{\mathcal{Q}}^{(r)}(\overline{M}_r^{(t)})$$

where $\Pi_{\mathcal{Q}}^{(r)}$ is a suitable projection/quantization operator (layer-wise, activation-aware, etc.) (Ichikawa et al., 1 Dec 2025).

This alternating block update proceeds cycle by cycle and block by block, generating a sequence of feasible solutions that, in practice, monotonically reduces the global objective.
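The relax-project alternation above can be sketched on a toy quadratic objective with an entrywise quantization grid. This is an illustrative sketch only: the block structure, least-squares relaxation, and nearest-grid projection are simplifying assumptions, not the papers' exact setup.

```python
import numpy as np

def project(M, grid):
    """Entrywise projection onto the nearest point of a finite grid."""
    return grid[np.argmin(np.abs(M[..., None] - grid), axis=-1)]

def lpcd(X_blocks, y, grid, T=10):
    """Toy R-block LPCD for L(M_1,...,M_R) = ||sum_r X_r M_r - y||^2:
    cyclically relax one block in continuous space (least squares, with
    the other blocks held at their quantized values), then project it
    onto the feasible grid."""
    R = len(X_blocks)
    # Initialize each block with round-to-nearest of its unconstrained fit
    M_hat = [project(np.linalg.lstsq(X, y, rcond=None)[0], grid)
             for X in X_blocks]
    for _ in range(T):
        for r in range(R):
            # Residual with all other blocks held fixed (quantized)
            resid = y - sum(X_blocks[j] @ M_hat[j]
                            for j in range(R) if j != r)
            # Relaxation step: continuous least-squares over block r
            M_bar = np.linalg.lstsq(X_blocks[r], resid, rcond=None)[0]
            # Projection step: map block r onto the quantized domain
            M_hat[r] = project(M_bar, grid)
    return M_hat
```

Each inner iteration is exactly one relaxation step followed by one projection step in the notation above, with the remaining blocks frozen at their current quantized values.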

2. LPCD for Deep Network Classification

LPCD arises naturally in deep networks whose structure admits partial convexity, specifically when minimizing cross-entropy loss with respect to hidden weights $W_\text{hidden}$ and output/linear weights $W_\text{output}$:

  • Block 1: Linear/output layer weights
    • Objective for fixed hidden layers: minimize cross-entropy, which is globally convex in $W_\text{output}$.
    • Perform an exact Newton step using closed-form gradient and Hessian for cross-entropy loss:

    $$W_{\text{output}}^{(t+1)} = W_{\text{output}}^{(t)} - H^{-1} g$$

    where $g$ and $H$ are the gradient and Hessian of the cross-entropy loss, computed over the design matrix of hidden activations $Z$ and the softmax outputs $p$ (Patel et al., 2020).

  • Block 2: Hidden layer weights

    • For fixed $W_\text{output}$, the objective is non-convex; update via a single gradient-descent or adaptive-optimizer step.

This alternated (block-coordinate) scheme leverages convexity where available (output layer) and maintains tractability elsewhere (hidden layers). The Hessian for block 1 is small: its dimension scales only with the product of the number of classes and hidden units, not the full model parameter count, so inverting it is cheap.
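A minimal sketch of the block-1 Newton step, using the standard multinomial-logistic gradient and Hessian; the small ridge term, the dense Hessian construction, and all sizes are assumptions added for a self-contained example:

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def newton_step_output(W, Zh, Y, ridge=1e-3):
    """One exact Newton step on the cross-entropy loss in the output
    weights W (B x C), holding the hidden activations Zh (n x B) fixed.
    The objective is convex in W; g and H below are the standard
    multinomial-logistic gradient and Hessian (dimension C*B)."""
    n, B = Zh.shape
    C = Y.shape[1]
    P = softmax(Zh @ W)                    # n x C class probabilities
    g = (Zh.T @ (P - Y)).T.reshape(-1)     # gradient, flattened class-major
    H = np.zeros((B * C, B * C))           # (CB) x (CB) block Hessian
    for i in range(n):
        S = np.diag(P[i]) - np.outer(P[i], P[i])
        H += np.kron(S, np.outer(Zh[i], Zh[i]))
    H += ridge * np.eye(B * C)             # small ridge for invertibility
    delta = np.linalg.solve(H, g)
    return W - delta.reshape(C, B).T
```

The explicit per-sample loop makes the Kronecker structure of the Hessian visible; a practical implementation would batch it.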

3. LPCD in Quantization: A Unified Submodule Framework

LPCD generalizes and unifies classical and modern post-training quantization (PTQ) schemes by extending the optimization from single-layer (or single-parameter) quantization to blocks corresponding to arbitrary submodules (e.g., multi-layer attention heads, residual blocks).

  • General PTQ Objective:

$$\min_{\widehat{M}_r \in \mathcal{Q}^{N_r \times K_r}\ \forall r} L(\widehat{M}_1, \dots, \widehat{M}_R)$$

capturing the error between quantized and full-precision outputs over a calibration set.

  • LPCD step: For each quantization block, solve a relaxed quadratic (often least-squares) problem over the continuous variable, then project to the quantized space, e.g., via activation-aware rounding or layer-wise operators from GPTQ/AWQ (Ichikawa et al., 1 Dec 2025).
  • Submodule quantization: By grouping functionally coupled parameters (e.g., Q/K/V or MLP up/down layers), LPCD allows the continuous relaxation step to capture block-level error propagation, normalization, or residual connectivity.

This approach recovers QEP, LoaQ, classical GPTQ, or RTN as special cases depending on partition choices and iteration counts.
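To illustrate the submodule view, the following sketch quantizes a two-layer linear module and lets the second layer's relaxation absorb the first layer's quantization error before projecting. The linear layers, nearest-grid projection, and least-squares relaxation are simplifying assumptions standing in for the activation-aware operators used in practice.

```python
import numpy as np

def project(W, grid):
    """Entrywise nearest-grid-point quantization."""
    return grid[np.argmin(np.abs(W[..., None] - grid), axis=-1)]

def quantize_two_layer(X, W1, W2, grid):
    """Sketch of submodule-aware PTQ for a two-layer linear module
    y = X W1 W2 on calibration data X. Layer 1 is quantized first; the
    relaxation step for layer 2 then fits the *quantized* layer-1 output
    to the full-precision module output before projecting, so block-level
    error propagation is absorbed rather than ignored."""
    target = X @ W1 @ W2                   # full-precision module output
    W1_hat = project(W1, grid)             # project layer 1
    Z = X @ W1_hat                         # propagated quantized activations
    # Relaxation: least-squares fit of layer 2 to the module-level target
    W2_bar = np.linalg.lstsq(Z, target, rcond=None)[0]
    W2_hat = project(W2_bar, grid)         # project layer 2
    return W1_hat, W2_hat
```

Quantizing each layer independently would instead round W2 against the full-precision activations, ignoring the error already introduced in layer 1.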

4. Projection Mechanics and Block Updates

LPCD block updates involve projection onto subspaces or intersections of constraints. For linear least-squares, this corresponds to projecting the current iterate onto the intersection of several hyperplanes determined by the chosen coordinates (the "layer"). The general block update for a layer $L$ of indices $j_1, \dots, j_s$ is:

$$x_{k+1} = x_k + E_k G_k^{-1} U_k^\top (b - A x_k)$$

with $U_k = [A_{(j_1)}, \dots, A_{(j_s)}]$ (the selected columns of $A$), $G_k = U_k^\top U_k$, and $E_k$ the embedding that maps the $s$ block coordinates back into the full variable, as in the layer-size-$s$ extension of standard coordinate descent. This generalizes the $s=1$ classical method (Kaczmarz/Gauss–Southwell) and the $s=2$ Gram–Schmidt LPCD update (Jin et al., 2022).

The two-coordinate case is particularly advantageous in settings with high column coherence, as joint projection onto intersected constraints can break local dependence and yield dramatic speedups over purely coordinate-wise updates, as evidenced by numerical results on synthetic highly coherent linear systems.
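One step of the layer update above can be sketched directly for a linear least-squares system; variable names mirror the formula, and the block index set J is chosen by the caller:

```python
import numpy as np

def layer_cd_step(A, b, x, J):
    """One layer (block) update of coordinate descent for min ||Ax - b||^2:
    x <- x + E G^{-1} U^T (b - A x), where U = A[:, J] holds the selected
    columns, G = U^T U is the block Gram matrix, and E embeds the |J|
    chosen coordinates back into the full variable."""
    U = A[:, J]                            # columns of the selected layer
    r = b - A @ x                          # current residual
    delta = np.linalg.solve(U.T @ U, U.T @ r)
    x = x.copy()
    x[J] += delta                          # E applies the update to coords J
    return x
```

After the step, the residual is exactly orthogonal to the span of the selected columns, which is the joint-projection property that helps on highly coherent systems.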

5. Algorithmic Workflow and Implementation

A canonical LPCD workflow for $R$ blocks and $T$ iterations is:

Input: blocks M_1, …, M_R; loss L; projection ops {Π^{(j)}}; iterations T

Initialize: for each j, set the quantized baseline M̂_j^(0) = Π^{(j)}(M_j)
for t = 0, …, T-1:
    for j = 1, …, R:
        // Relax block j (optionally with a proximal penalty)
        M̄_j ← argmin_U L(…, U, …)
        // Project onto the quantization grid
        M̂_j^(t+1) ← Π^{(j)}(M̄_j)
return (M̂_1^(T), …, M̂_R^(T))
(Ichikawa et al., 1 Dec 2025)

For large blocks, the relaxed least-squares problem is solved approximately via gradient descent due to computational constraints, especially in LLMs and large-scale models. The procedure is compatible with existing PTQ pipelines; only the continuous optimization and projection steps need modification.
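The approximate relaxation mentioned above might look as follows for a quadratic subproblem. The step-size rule derived from the Lipschitz constant of the gradient is a standard choice, not necessarily the papers' exact schedule:

```python
import numpy as np

def relax_by_gd(X, target, W_init, steps=100, lr=None):
    """Approximate relaxation step: minimize ||X W - target||_F^2 by plain
    gradient descent instead of a closed-form solve, as one would for
    large blocks. The gradient is 2 X^T (X W - target), whose Lipschitz
    constant is 2 ||X^T X||_2; lr = 1 / (2 ||X^T X||_2) guarantees descent."""
    if lr is None:
        lr = 1.0 / (2 * np.linalg.norm(X.T @ X, 2))
    W = W_init.copy()
    for _ in range(steps):
        W -= lr * 2 * X.T @ (X @ W - target)
    return W
```

With enough steps this converges to the same minimizer as the closed-form least-squares solve, at a per-step cost linear in the block size.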

6. Convergence, Complexity, and Empirical Results

  • Convergence: For convex relaxation steps (e.g., least-squares, cross-entropy in the linear layer), block coordinate descent guarantees that each block update finds a global minimum for the local subproblem, and all limit points of the iterates are stationary for the full objective under standard smoothness and bounded-level-set assumptions (Patel et al., 2020, Ichikawa et al., 1 Dec 2025). Empirically, for high-coherence problems and in model quantization, convergence is rapid and robust.
  • Computational complexity: For deep-learning training, block Hessian inversion scales as $O((CB)^3)$ (for $C$ classes and hidden dimension $B$), far less than $O(P^3)$ for full-model Newton methods, making second-order steps tractable. PTQ-LPCD for submodule blocks requires $O(TN^2)$ or comparable effort for gradient-based subproblem solves, but this is amortized over small calibration sets (Patel et al., 2020, Ichikawa et al., 1 Dec 2025).
  • Empirical observations:
    • In deep networks, applying Newton updates to the final layer dramatically improves convergence and generalization: e.g., on CIFAR-10, LPCD reaches higher accuracy in one-fourth the epochs and yields smoother hidden representations than Adam or SGD alone (Patel et al., 2020).
    • For quantization, LPCD consistently outperforms QEP and LoaQ across 4-, 3-, and 2-bit regimes. On LLaMA3 8B at 3 bits, LPCD achieves ≈9.81 PPL, compared to QEP (≈25.39) and LoaQ (≈14.15); for Qwen3 8B at 2 bits, LPCD reaches 58.8 PPL while alternatives degrade to 165.7 or 550.3 (Ichikawa et al., 1 Dec 2025).
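The complexity gap between the output-block Newton step and a full-model second-order method can be made concrete with a back-of-the-envelope calculation; the network sizes below are purely illustrative:

```python
# Hypothetical sizes: C classes, hidden width B (e.g. a CIFAR-10-style
# head), and a total parameter count P for the whole network.
C, B = 10, 512
P = 3_000_000

# Cubic cost of solving the respective Hessian systems.
block_cost = (C * B) ** 3     # output-block Newton: O((C*B)^3)
full_cost = P ** 3            # full-model Newton:   O(P^3)

print(f"block/full cost ratio: {block_cost / full_cost:.2e}")
```

Even for a modest model, the output-block solve is many orders of magnitude cheaper than a full-model second-order step.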

7. Extensions, Limitations, and Generalization

LPCD inherently includes standard coordinate descent (block size 1), block/orthogonalized schemes with explicit Gram–Schmidt projection for small block size, and generalized submodule quantization for arbitrary blockings (Jin et al., 2022, Ichikawa et al., 1 Dec 2025). The flexibility in block and projection selection allows LPCD to model error propagation, normalization, and residual connections, achieving output-level optimality where analytic projection is feasible.

Limitations include the need for good block partitioning (especially in PTQ), potential computational bottlenecks for very large submodules, and challenges in extending beyond linear/quadratic/convex submodules (e.g., nonlinearities such as softmax or GELU require heuristic relaxations). Convergence rate and empirical behavior depend on problem structure (column coherence, calibration set quality, bit-width, etc.) and on the choice of projection operator.

LPCD provides a comprehensive methodological and algorithmic substrate for both optimization in machine learning and post-training quantization across modern neural network architectures, unifying a broad class of existing approaches and enabling efficient, tractable, and performance-optimal block/submodule optimization (Patel et al., 2020, Ichikawa et al., 1 Dec 2025, Jin et al., 2022).
