Batched Conjugate Gradients (mBCG)

Updated 10 February 2026
  • Batched Conjugate Gradients (mBCG) is a matrix-oriented extension of the classic CG method that solves multiple symmetric positive-definite systems in parallel to accelerate Gaussian process inference.
  • mBCG leverages batched matrix operations and GPU-friendly algorithms, reducing the per-iteration computational cost from O(n³) to O(n²), and admits effective preconditioning.
  • Empirical benchmarks demonstrate up to 32× speedup and seamless integration with libraries like GPyTorch, facilitating scalable and hardware-efficient GP training and inference.

The batched conjugate gradients (mBCG) algorithm is a matrix-oriented extension of the classic conjugate gradients (CG) method. It solves multiple symmetric positive-definite linear systems KX = B for a shared matrix K and several right-hand sides (RHS) in parallel. mBCG is fundamental to Blackbox Matrix–Matrix (BBMM) inference, enabling scalable and hardware-efficient Gaussian process (GP) inference by leveraging modern GPU architectures and reducing the computational bottleneck of standard GP methods from O(n³) to O(n²) per iteration. mBCG also underpins efficient stochastic estimation of traces and log-determinants, crucial for kernel hyperparameter optimization, and is closely related to cooperative and block CG methods for parallel architectures (Gardner et al., 2018; Bhaya et al., 2012).

1. Classical and Batched Conjugate Gradients

The standard CG algorithm solves Kx = b, where K ∈ ℝ^{n×n} is symmetric positive-definite. It iteratively builds approximations in the Krylov subspace, minimizing the quadratic φ(x) = ½ x^T K x − b^T x. CG is optimal for a single RHS; however, modern applications such as GP inference require solutions for multiple RHS. mBCG extends CG to matrix equations KX = B, where B ∈ ℝ^{n×t}, by maintaining batched iterates X^{(i)}, R^{(i)}, P^{(i)} ∈ ℝ^{n×t} and performing simultaneous updates:

  • Batched matrix–matrix multiplies: W = K P^{(i)}
  • Per-column step sizes: α^{(i)} = ρ^{(i)} ./ σ^{(i)} (elementwise division)
  • Elementwise and diagonal-matrix updates for all batched variables.

These operations allow t independent systems to be solved efficiently within a single kernel on GPU hardware. Since t is typically small (~10–20), all batched operations are highly parallelizable (Gardner et al., 2018).

2. Algorithmic Formulation and Pseudocode

mBCG takes as input a kernel matmul routine and a matrix of t RHS vectors B. The pseudocode below accentuates GPU-friendly, batched operations:

X = zeros(n, t)                  # batched iterates, one column per RHS
R = B - M(X)                     # residuals; M applies the kernel matmul K @ (.)
P = R                            # batched search directions
rho = colwise_dot(R, R)          # per-column squared residual norms
for i in range(p):
    W = M(P)                     # single batched matrix-matrix multiply
    sigma = colwise_dot(P, W)
    alpha = rho / sigma          # per-column step sizes
    X = X + P * diag(alpha)      # advance each column by its own step size
    R = R - W * diag(alpha)
    rho_new = colwise_dot(R, R)
    beta = rho_new / rho         # per-column conjugacy coefficients
    P = R + P * diag(beta)
    rho = rho_new

Here, colwise_dot computes per-column inner products. All matrix–matrix multiplications and batchwise vector operations are GPU-amenable (Gardner et al., 2018).
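The loop above can be transcribed directly into NumPy and checked against a dense solve. This is a reference sketch for exposition, not GPyTorch's implementation; M can be any black-box routine computing K @ X:

```python
import numpy as np

def mbcg(M, B, p):
    """Solve K X = B for t RHS simultaneously; M(X) computes K @ X."""
    n, t = B.shape
    X = np.zeros((n, t))
    R = B - M(X)                          # initial residuals
    P = R.copy()                          # initial search directions
    rho = np.sum(R * R, axis=0)           # colwise_dot(R, R)
    for _ in range(p):
        W = M(P)                          # batched kernel matmul
        sigma = np.sum(P * W, axis=0)     # colwise_dot(P, W)
        alpha = rho / sigma               # per-column step sizes
        X += P * alpha                    # broadcasting == P @ diag(alpha)
        R -= W * alpha
        rho_new = np.sum(R * R, axis=0)
        beta = rho_new / rho
        P = R + P * beta
        rho = rho_new
    return X

# Verify against a dense solve on a small, well-conditioned SPD system.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
K = A @ A.T + 50 * np.eye(50)             # SPD, moderate condition number
B = rng.standard_normal((50, 4))          # t = 4 right-hand sides
X = mbcg(lambda Z: K @ Z, B, p=30)
assert np.allclose(X, np.linalg.solve(K, B))
```

Only the kernel matmul touches K itself, which is what makes the method "blackbox": structured kernels can supply a faster M without changing the loop.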

3. Preconditioning and Convergence Acceleration

To improve convergence, mBCG applies preconditioning, most effectively implemented via a low-rank pivoted Cholesky factorization:

K ≈ L_k L_k^T,   M = L_k L_k^T + σ²I.

Preconditioned mBCG solves M^{-1}Kx = M^{-1}b, where applications of M^{-1} and the quantity log|M| can be computed in O(nk²) with k ≪ n. The preconditioned updates require additional matrix solves, which are batched across t columns.

Preconditioning, especially with rank k = 5–10, reduces the required CG iterations by more than 10× for highly correlated kernels such as deep RBF or Matérn, typically lowering iteration counts from 20 to 2–4 for competitive inference accuracy (Gardner et al., 2018).
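The O(nk²) costs follow from the Woodbury identity and the matrix determinant lemma. A sketch, assuming a low-rank factor L is available (here a random matrix standing in for the pivoted Cholesky factor L_k; the helper names are mine):

```python
import numpy as np

def precond_apply(L, sigma2, V):
    """Apply M^{-1} = (L L^T + sigma2*I)^{-1} to columns of V via Woodbury.

    Only k x k systems are solved, so the cost is O(n k^2 + n k t)."""
    k = L.shape[1]
    S = sigma2 * np.eye(k) + L.T @ L          # k x k capacitance matrix
    return (V - L @ np.linalg.solve(S, L.T @ V)) / sigma2

def precond_logdet(L, sigma2, n):
    """log|M| via the matrix determinant lemma, O(n k^2)."""
    k = L.shape[1]
    S = sigma2 * np.eye(k) + L.T @ L
    return (n - k) * np.log(sigma2) + np.linalg.slogdet(S)[1]

# Check both quantities against dense computations.
rng = np.random.default_rng(1)
n, k = 40, 5
L = rng.standard_normal((n, k))               # stand-in for pivoted Cholesky L_k
sigma2 = 0.5
M = L @ L.T + sigma2 * np.eye(n)
V = rng.standard_normal((n, 3))
assert np.allclose(precond_apply(L, sigma2, V), np.linalg.solve(M, V))
assert np.isclose(precond_logdet(L, sigma2, n), np.linalg.slogdet(M)[1])
```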

4. Extraction of Gaussian Process Inference Quantities

A core property of mBCG is that it enables computation of all terms needed for GP training and inference in a single call:

  • Solves: K^{-1}y for posterior means.
  • Stochastic trace estimates: Tr[K^{-1} ∂K/∂θ] via probe vectors u_i and their corresponding solves K^{-1}u_i.
  • Log-determinant: approximated by stochastic Lanczos quadrature using the CG coefficients α^{(i)}, β^{(i)}, which yield a tridiagonal matrix T_i for each probe. The log-determinant is then log|K| ≈ (1/t) Σ_i ‖u_i‖² e_1^T log(T_i) e_1.

Consequently, there is no need for separate Lanczos or additional iterative runs (Gardner et al., 2018).
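The tridiagonal matrices T_i can be assembled directly from the CG step sizes and conjugacy coefficients via the standard CG–Lanczos correspondence. A single-probe sketch (a simplified illustration, run to the full dimension n so the quadrature identity is exact up to rounding; helper names are mine):

```python
import numpy as np

def cg_tridiag(K, b, p):
    """Run CG on K x = b; return the Lanczos tridiagonal built from alpha, beta."""
    x = np.zeros_like(b)
    r = b - K @ x
    d = r.copy()
    alphas, betas = [], []
    rho = r @ r
    for _ in range(p):
        w = K @ d
        alpha = rho / (d @ w)
        x += alpha * d
        r -= alpha * w
        rho_new = r @ r
        beta = rho_new / rho
        alphas.append(alpha)
        betas.append(beta)
        d = r + beta * d
        rho = rho_new
    # Standard CG-to-Lanczos mapping of the recurrence coefficients.
    T = np.zeros((p, p))
    for j in range(p):
        T[j, j] = 1.0 / alphas[j] + (betas[j-1] / alphas[j-1] if j > 0 else 0.0)
        if j < p - 1:
            T[j, j+1] = T[j+1, j] = np.sqrt(betas[j]) / alphas[j]
    return T

# For a probe u: u^T log(K) u ~= ||u||^2 * e_1^T log(T) e_1.
rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                    # small SPD test matrix
u = rng.standard_normal(n)
T = cg_tridiag(K, u, p=n)
wT, UT = np.linalg.eigh(T)
est = (u @ u) * np.sum(UT[0, :]**2 * np.log(wT))
wK, UK = np.linalg.eigh(K)
exact = u @ (UK * np.log(wK)) @ UK.T @ u
assert np.isclose(est, exact)
```

In practice only p ≪ n iterations are run and several probes are averaged, which is exactly the quantity the mBCG call already produces as a by-product.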

5. Relation to Cooperative and Block CG Methods

mBCG is related to block and cooperative CG paradigms, such as cCG (Bhaya et al., 2012), in which p "agents" simultaneously maintain their iterates, directions, and residuals, leveraging matrix-valued step sizes α_k and β_k for coordination. In cCG, inner Gram matrices R_k^T D_k and D_k^T A D_k (size p×p) are computed at every iteration, and updates occur by linear combinations across all agents. This yields finite termination in at most ⌈n/p⌉ steps and nearly O(p) speedup in wall-clock time on multicore hardware for large n, under exact arithmetic and full-rank assumptions. cCG minimizes per-agent flop counts to O(n^{2+1/3}) at optimal p* ≈ n^{2/3}, compared to O(n³) for serial CG (Bhaya et al., 2012). A plausible implication is that mBCG, as a GPU-focused batched method, realizes similar efficiencies by organizing all probe systems as "agents" over matrix operations.
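For intuition, the block-CG recursion underlying this family of methods replaces scalar step sizes with p×p matrices computed from the Gram matrices mentioned above. A sketch of the standard block CG recursion (not Bhaya et al.'s exact multithreaded formulation; helper names are mine):

```python
import numpy as np

def block_cg(K, B, tol=1e-10, max_iter=None):
    """Solve K X = B for p RHS at once using p x p matrix-valued step sizes."""
    n, p = B.shape
    X = np.zeros((n, p))
    R = B - K @ X
    D = R.copy()                                   # block search directions
    rho = R.T @ R                                  # p x p Gram matrix
    for _ in range(max_iter or n):
        W = K @ D
        alpha = np.linalg.solve(D.T @ W, rho)      # matrix-valued step size
        X += D @ alpha
        R -= W @ alpha
        if np.linalg.norm(R) < tol:                # all agents converged
            break
        rho_new = R.T @ R
        beta = np.linalg.solve(rho, rho_new)       # matrix-valued conjugacy coeff.
        D = R + D @ beta
        rho = rho_new
    return X

rng = np.random.default_rng(3)
n, p = 24, 4
G = rng.standard_normal((n, n))
K = G @ G.T / n + n * np.eye(n)                    # well-conditioned SPD matrix
B = rng.standard_normal((n, p))
X = block_cg(K, B)
assert np.allclose(X, np.linalg.solve(K, B))       # converges well before n steps
```

Note the contrast with mBCG: block CG couples the columns through the p×p solves, while mBCG keeps the t systems independent (diagonal α, β), which is what makes its updates purely elementwise and GPU-friendly.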

6. Complexity, Implementation, and Empirical Benchmarks

The computational cost per mBCG iteration is dominated by the (potentially black-box) kernel matrix–matrix multiply K·P, which costs O(n²t) if K is dense. Memory requirements are O(nt) for storing the batched variables. The total cost is O(pn²t) for p CG iterations and t probe vectors. In comparison, Cholesky decomposition costs O(n³) plus multiple O(n²) solves.

When kernel structure enables further acceleration (e.g., SKI or SoR), the cost per iteration is further reduced. Pivoted Cholesky preconditioning imposes negligible overhead for knk \ll n.

Empirically, GPyTorch’s mBCG implementation achieves:

  • 20–32× speedup for exact GPs (n ≈ 3k)
  • 10–15× for SGPR (n ≈ 50k, m = 300)
  • 20× for SKI (n ≈ 500k, m = 10k) on high-end GPUs, outperforming previous CPU and GPU implementations (Gardner et al., 2018).

7. Implementation in GPyTorch and Practical Considerations

In GPyTorch, kernel objects provide routines for both KX and (∂K/∂θ)X. All batched operations reside on the GPU, with minimal data transfer between CPU and device. Pivoted-Cholesky preconditioning is performed efficiently for low ranks and is automatically differentiable via PyTorch's autograd mechanism. This permits backpropagation of the GP marginal likelihood through kernel and preconditioner computations "for free." The main mBCG loop is expressed using high-level batched matrix algebra (e.g., torch.bmm, einsum), aligning well with hardware accelerators (Gardner et al., 2018).


References:

  • (Gardner et al., 2018) "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration"
  • (Bhaya et al., 2012) "A cooperative conjugate gradient method for linear systems permitting multithread implementation of low complexity"
