Batched Conjugate Gradients (mBCG)

Updated 10 February 2026
  • Batched Conjugate Gradients (mBCG) is a matrix-oriented extension of the classic CG method that solves multiple symmetric positive-definite systems in parallel to accelerate Gaussian process inference.
  • mBCG leverages batched matrix operations and GPU-friendly algorithms, reducing the per-iteration computational cost from O(n³) to O(n²), and admits effective preconditioning.
  • Empirical benchmarks demonstrate up to 32× speedup and seamless integration with libraries like GPyTorch, facilitating scalable and hardware-efficient GP training and inference.

The batched conjugate gradients (mBCG) algorithm is a matrix-oriented extension of the classic conjugate gradients (CG) method. It solves multiple symmetric positive-definite linear systems KX = B for a shared matrix K and several right-hand sides (RHS) in parallel. mBCG is fundamental to Blackbox Matrix–Matrix (BBMM) inference, enabling scalable and hardware-efficient Gaussian process (GP) inference by leveraging modern GPU architectures and reducing the computational bottleneck of standard GP methods from O(n³) to O(n²) per iteration. mBCG also underpins efficient stochastic estimation of traces and log-determinants, crucial for kernel hyperparameter optimization, and is closely related to cooperative and block CG methods for parallel architectures (Gardner et al., 2018; Bhaya et al., 2012).

1. Classical and Batched Conjugate Gradients

The standard CG algorithm solves Kx = b, where K ∈ ℝ^{n×n} is symmetric positive-definite. It iteratively builds approximations in the Krylov subspace, minimizing the quadratic φ(x) = ½ x^T K x − b^T x. CG is optimal for a single RHS; however, modern applications such as GP inference require solutions for multiple RHS. mBCG extends CG to matrix equations KX = B, where B ∈ ℝ^{n×t}, by maintaining batched iterates X^{(i)}, R^{(i)}, P^{(i)} ∈ ℝ^{n×t} and performing simultaneous updates:

  • Batched matrix–matrix multiplies: W = K P^{(i)}
  • Per-column step sizes: α^{(i)} = ρ^{(i)} ./ σ^{(i)} (elementwise division)
  • Elementwise and diagonal-matrix updates for all batched variables.

These operations allow t independent systems to be solved efficiently within a single kernel on GPU hardware. Since t is typically small (~10–20), all batched operations are highly parallelizable (Gardner et al., 2018).

2. Algorithmic Formulation and Pseudocode

mBCG takes as input a kernel matmul routine and a matrix of t RHS vectors B. The pseudocode below accentuates GPU-friendly, batched operations:

X = zeros(n, t)                  # batched iterates, one column per RHS
R = B - M(X)                     # residuals; M applies the kernel matmul K @ (.)
P = R                            # batched search directions
rho = colwise_dot(R, R)          # per-column squared residual norms
for i in range(p):
    W = M(P)                     # single batched matrix-matrix multiply
    sigma = colwise_dot(P, W)
    alpha = rho / sigma          # per-column step sizes
    X = X + P * diag(alpha)      # advance each column by its own step size
    R = R - W * diag(alpha)
    rho_new = colwise_dot(R, R)
    beta = rho_new / rho         # per-column conjugacy coefficients
    P = R + P * diag(beta)
    rho = rho_new

Here, colwise_dot computes per-column inner products. All matrix–matrix multiplications and batchwise vector operations are GPU-amenable (Gardner et al., 2018).
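The loop above can be transcribed directly into NumPy and checked against a dense solve. This is a reference sketch for exposition, not GPyTorch's implementation; M can be any black-box routine computing K @ X:

```python
import numpy as np

def mbcg(M, B, p):
    """Solve K X = B for t RHS simultaneously; M(X) computes K @ X."""
    n, t = B.shape
    X = np.zeros((n, t))
    R = B - M(X)                          # initial residuals
    P = R.copy()                          # initial search directions
    rho = np.sum(R * R, axis=0)           # colwise_dot(R, R)
    for _ in range(p):
        W = M(P)                          # batched kernel matmul
        sigma = np.sum(P * W, axis=0)     # colwise_dot(P, W)
        alpha = rho / sigma               # per-column step sizes
        X += P * alpha                    # broadcasting == P @ diag(alpha)
        R -= W * alpha
        rho_new = np.sum(R * R, axis=0)
        beta = rho_new / rho
        P = R + P * beta
        rho = rho_new
    return X

# Verify against a dense solve on a small, well-conditioned SPD system.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
K = A @ A.T + 50 * np.eye(50)             # SPD, moderate condition number
B = rng.standard_normal((50, 4))          # t = 4 right-hand sides
X = mbcg(lambda Z: K @ Z, B, p=30)
assert np.allclose(X, np.linalg.solve(K, B))
```

Only the kernel matmul touches K itself, which is what makes the method "blackbox": structured kernels can supply a faster M without changing the loop.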

3. Preconditioning and Convergence Acceleration

To improve convergence, mBCG applies preconditioning, most effectively implemented via a low-rank pivoted Cholesky factorization:

K ≈ L_k L_k^T,   M = L_k L_k^T + σ²I.

Preconditioned mBCG solves M^{-1}Kx = M^{-1}b, where applications of M^{-1} and the quantity log|M| can be computed in O(nk²) with k ≪ n. The preconditioned updates require additional matrix solves, which are batched across t columns.

Preconditioning, especially with rank k = 5–10, reduces the required CG iterations by more than 10× for highly correlated kernels such as deep RBF or Matérn, typically lowering iteration counts from 20 to 2–4 for competitive inference accuracy (Gardner et al., 2018).
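The O(nk²) costs follow from the Woodbury identity and the matrix determinant lemma. A sketch, assuming a low-rank factor L is available (here a random matrix standing in for the pivoted Cholesky factor L_k; the helper names are mine):

```python
import numpy as np

def precond_apply(L, sigma2, V):
    """Apply M^{-1} = (L L^T + sigma2*I)^{-1} to columns of V via Woodbury.

    Only k x k systems are solved, so the cost is O(n k^2 + n k t)."""
    k = L.shape[1]
    S = sigma2 * np.eye(k) + L.T @ L          # k x k capacitance matrix
    return (V - L @ np.linalg.solve(S, L.T @ V)) / sigma2

def precond_logdet(L, sigma2, n):
    """log|M| via the matrix determinant lemma, O(n k^2)."""
    k = L.shape[1]
    S = sigma2 * np.eye(k) + L.T @ L
    return (n - k) * np.log(sigma2) + np.linalg.slogdet(S)[1]

# Check both quantities against dense computations.
rng = np.random.default_rng(1)
n, k = 40, 5
L = rng.standard_normal((n, k))               # stand-in for pivoted Cholesky L_k
sigma2 = 0.5
M = L @ L.T + sigma2 * np.eye(n)
V = rng.standard_normal((n, 3))
assert np.allclose(precond_apply(L, sigma2, V), np.linalg.solve(M, V))
assert np.isclose(precond_logdet(L, sigma2, n), np.linalg.slogdet(M)[1])
```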

4. Extraction of Gaussian Process Inference Quantities

A core property of mBCG is that it enables computation of all terms needed for GP training and inference in a single call:

  • Solves: K^{-1}y for posterior means.
  • Stochastic trace estimates: Tr[K^{-1} ∂K/∂θ] via probe vectors u_i and their corresponding solves K^{-1}u_i.
  • Log-determinant: approximated by stochastic Lanczos quadrature using the CG coefficients α^{(i)}, β^{(i)}, which yield a tridiagonal matrix T_i for each probe. The log-determinant is then log|K| ≈ (1/t) Σ_i ‖u_i‖² e_1^T log(T_i) e_1.

Consequently, there is no need for separate Lanczos or additional iterative runs (Gardner et al., 2018).
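The tridiagonal matrices T_i can be assembled directly from the CG step sizes and conjugacy coefficients via the standard CG–Lanczos correspondence. A single-probe sketch (a simplified illustration, run to the full dimension n so the quadrature identity is exact up to rounding; helper names are mine):

```python
import numpy as np

def cg_tridiag(K, b, p):
    """Run CG on K x = b; return the Lanczos tridiagonal built from alpha, beta."""
    x = np.zeros_like(b)
    r = b - K @ x
    d = r.copy()
    alphas, betas = [], []
    rho = r @ r
    for _ in range(p):
        w = K @ d
        alpha = rho / (d @ w)
        x += alpha * d
        r -= alpha * w
        rho_new = r @ r
        beta = rho_new / rho
        alphas.append(alpha)
        betas.append(beta)
        d = r + beta * d
        rho = rho_new
    # Standard CG-to-Lanczos mapping of the recurrence coefficients.
    T = np.zeros((p, p))
    for j in range(p):
        T[j, j] = 1.0 / alphas[j] + (betas[j-1] / alphas[j-1] if j > 0 else 0.0)
        if j < p - 1:
            T[j, j+1] = T[j+1, j] = np.sqrt(betas[j]) / alphas[j]
    return T

# For a probe u: u^T log(K) u ~= ||u||^2 * e_1^T log(T) e_1.
rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                    # small SPD test matrix
u = rng.standard_normal(n)
T = cg_tridiag(K, u, p=n)
wT, UT = np.linalg.eigh(T)
est = (u @ u) * np.sum(UT[0, :]**2 * np.log(wT))
wK, UK = np.linalg.eigh(K)
exact = u @ (UK * np.log(wK)) @ UK.T @ u
assert np.isclose(est, exact)
```

In practice only p ≪ n iterations are run and several probes are averaged, which is exactly the quantity the mBCG call already produces as a by-product.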

5. Relation to Cooperative and Block CG Methods

mBCG is related to block and cooperative CG paradigms, such as cCG (Bhaya et al., 2012), in which p "agents" simultaneously maintain their iterates, directions, and residuals, leveraging matrix-valued step sizes α_k and β_k for coordination. In cCG, inner Gram matrices R_k^T D_k and D_k^T A D_k (size p×p) are computed at every iteration, and updates occur by linear combinations across all agents. This yields finite termination in at most ⌈n/p⌉ steps and nearly O(p) speedup in wall-clock time on multicore hardware for large n, under exact arithmetic and full-rank assumptions. cCG minimizes per-agent flop counts to O(n^{2+1/3}) at optimal p* ≈ n^{2/3}, compared to O(n³) for serial CG (Bhaya et al., 2012). A plausible implication is that mBCG, as a GPU-focused batched method, realizes similar efficiencies by organizing all probe systems as "agents" over matrix operations.
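For intuition, the block-CG recursion underlying this family of methods replaces scalar step sizes with p×p matrices computed from the Gram matrices mentioned above. A sketch of the standard block CG recursion (not Bhaya et al.'s exact multithreaded formulation; helper names are mine):

```python
import numpy as np

def block_cg(K, B, tol=1e-10, max_iter=None):
    """Solve K X = B for p RHS at once using p x p matrix-valued step sizes."""
    n, p = B.shape
    X = np.zeros((n, p))
    R = B - K @ X
    D = R.copy()                                   # block search directions
    rho = R.T @ R                                  # p x p Gram matrix
    for _ in range(max_iter or n):
        W = K @ D
        alpha = np.linalg.solve(D.T @ W, rho)      # matrix-valued step size
        X += D @ alpha
        R -= W @ alpha
        if np.linalg.norm(R) < tol:                # all agents converged
            break
        rho_new = R.T @ R
        beta = np.linalg.solve(rho, rho_new)       # matrix-valued conjugacy coeff.
        D = R + D @ beta
        rho = rho_new
    return X

rng = np.random.default_rng(3)
n, p = 24, 4
G = rng.standard_normal((n, n))
K = G @ G.T / n + n * np.eye(n)                    # well-conditioned SPD matrix
B = rng.standard_normal((n, p))
X = block_cg(K, B)
assert np.allclose(X, np.linalg.solve(K, B))       # converges well before n steps
```

Note the contrast with mBCG: block CG couples the columns through the p×p solves, while mBCG keeps the t systems independent (diagonal α, β), which is what makes its updates purely elementwise and GPU-friendly.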

6. Complexity, Implementation, and Empirical Benchmarks

The computational cost per mBCG iteration is dominated by the (potentially black-box) kernel matrix–matrix multiply K·P, which costs O(n²t) if K is dense. Memory requirements are O(nt) for storing the batched variables. The total cost is O(pn²t) for p CG iterations and t probe vectors. In comparison, Cholesky decomposition costs O(n³) plus multiple O(n²) solves.

When kernel structure enables further acceleration (e.g., SKI or SoR), the cost per iteration is further reduced. Pivoted Cholesky preconditioning imposes negligible overhead for knk \ll n.

Empirically, GPyTorch’s mBCG implementation achieves:

  • 20–32× speedup for exact GPs (n ≈ 3k)
  • 10–15× for SGPR (n ≈ 50k, m = 300)
  • 20× for SKI (n ≈ 500k, m = 10k) on high-end GPUs, outperforming previous CPU and GPU implementations (Gardner et al., 2018).

7. Implementation in GPyTorch and Practical Considerations

In GPyTorch, kernel objects provide routines for both KX and (∂K/∂θ)X. All batched operations reside on the GPU, with minimal data transfer between CPU and device. Pivoted-Cholesky preconditioning is performed efficiently for low ranks and is automatically differentiable via PyTorch's autograd mechanism. This permits backpropagation of the GP marginal likelihood through kernel and preconditioner computations "for free." The main mBCG loop is expressed using high-level batched matrix algebra (e.g., torch.bmm, einsum), aligning well with hardware accelerators (Gardner et al., 2018).


References:

  • (Gardner et al., 2018) "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration"
  • (Bhaya et al., 2012) "A cooperative conjugate gradient method for linear systems permitting multithread implementation of low complexity"
