Block Diagonal Averaging
- Block Diagonal Averaging refers to a family of methods that partition matrices into blocks to preserve critical off-diagonal dependencies while reducing computational complexity.
- It employs block mean approximation to transform large-scale matrix operations into manageable computations, reducing inversion costs from O(d³) to O(L³).
- This approach is applied in second-order optimization and Bayesian model averaging, balancing accuracy with efficiency in high-dimensional settings.
Block Diagonal Averaging encompasses a family of methods in statistical computation and matrix approximation that exploit block-diagonal or block-structured representations to enable efficient inference, optimization, and model selection. These methods replace a large dense matrix with a block-wise summarized structure, dramatically reducing the computational complexity of matrix operations, while still capturing key statistical dependencies that are ignored by fully diagonal approximations. Block diagonal and block mean schemes are influential in contexts ranging from second-order optimization in machine learning to variable selection in high-dimensional linear models.
1. Matrix Partitioning and Block Mean Approximation
For a square matrix $A \in \mathbb{R}^{d \times d}$, block diagonal/dense averaging begins by partitioning the rows and columns into $L$ contiguous groups of sizes $d_1, \dots, d_L$, forming a partition vector $p = (d_1, \dots, d_L)$. This yields an $L \times L$ block matrix in which each block $A_{ij}$ has size $d_i \times d_j$. Block Mean Approximation (BMA) further summarizes each block to one or two scalars:
- Off-diagonal blocks ($i \neq j$): Approximated by replacing all entries with their mean, $\mu_{ij} = \frac{1}{d_i d_j}\sum_{k,l}(A_{ij})_{kl}$.
- Diagonal blocks ($i = j$): Approximated by two parameters: the mean $\nu_i$ of the diagonal entries and the mean $\mu_{ii}$ of the off-diagonal entries of $A_{ii}$.
Replacing each block by its mean is Frobenius-optimal: the resulting $\hat{A}$ minimizes $\|A - \hat{A}\|_F$ over all matrices that are constant within each block.
The total structure is captured as $\hat{A} = \Lambda + U M U^{\top}$, where $\Lambda$ is a diagonal matrix carrying $\nu_i - \mu_{ii}$ on the indices of block $i$, $U \in \{0,1\}^{d \times L}$ is the block-membership indicator matrix, and $M \in \mathbb{R}^{L \times L}$ contains the $\mu_{ij}$ (Lu et al., 2018). This scheme maintains crucial off-diagonal information otherwise lost in diagonal-only approaches.
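A minimal NumPy sketch of this construction, returning the factors of $\hat{A} = \mathrm{diag}(\lambda) + U M U^{\top}$ (the function name and interface are ours, not the authors' implementation):

```python
import numpy as np

def block_mean_approx(A, sizes):
    """Block Mean Approximation of a square matrix A.

    `sizes` is the partition vector (d_1, ..., d_L). Each off-diagonal
    block is replaced by its mean; each diagonal block by two scalars
    (mean of its diagonal and mean of its off-diagonal entries).
    Returns (lam, U, M) with A_hat = diag(lam) + U @ M @ U.T.
    """
    d, L = sum(sizes), len(sizes)
    starts = np.cumsum([0] + list(sizes))
    U = np.zeros((d, L))        # block-membership indicator matrix
    M = np.zeros((L, L))        # block means mu_ij
    lam = np.zeros(d)           # nu_i - mu_ii on block i's indices
    for i in range(L):
        U[starts[i]:starts[i + 1], i] = 1.0
    for i in range(L):
        ri = slice(starts[i], starts[i + 1])
        for j in range(L):
            rj = slice(starts[j], starts[j + 1])
            blk = A[ri, rj]
            if i == j:
                n = sizes[i]
                diag_mean = blk.diagonal().mean()
                off_mean = (blk.sum() - blk.trace()) / (n * n - n) if n > 1 else 0.0
                M[i, i] = off_mean
                lam[ri] = diag_mean - off_mean
            else:
                M[i, j] = blk.mean()
    return lam, U, M
```

`np.diag(lam) + U @ M @ U.T` reconstructs $\hat{A}$; in practice the factors are kept separate precisely so this dense matrix is never formed.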
2. Efficient Matrix Inversion and Complexity
The key computational advantage arises in the inversion and root computations of the approximated matrix. Rather than directly inverting the full matrix, BMA enables these operations to be computed using only small matrices. Specifically, for inversion of the structure $\hat{A} = \Lambda + U M U^{\top}$, the Woodbury identity gives
$$\hat{A}^{-1} = \Lambda^{-1} - \Lambda^{-1} U C^{-1} U^{\top} \Lambda^{-1},$$
where $C = M^{-1} + U^{\top}\Lambda^{-1}U$ with $U^{\top}\Lambda^{-1}U = \mathrm{diag}(d_1/\lambda_1, \dots, d_L/\lambda_L)$ and $\lambda_i = \nu_i - \mu_{ii}$. The dominant operation is the inversion or eigendecomposition of an $L \times L$ matrix, reducing the cost from $O(d^3)$ to $O(L^3)$, plus $O(d)$ for applying the result to a vector (Lu et al., 2018). Similar reductions hold for square root and inverse square root operations, fundamental for preconditioning in optimization and for computing update directions.
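A minimal sketch of the Woodbury-based application of $\hat{A}^{-1}$ to a vector, assuming the factors $(\Lambda, U, M)$ of the block mean approximation $\hat{A} = \Lambda + U M U^{\top}$ are given and both $\Lambda$ and $M$ are invertible (the function name is ours):

```python
import numpy as np

def bma_inverse_apply(lam, U, M, v):
    """Apply (diag(lam) + U M U^T)^{-1} to a vector v via Woodbury.

    Only an L x L system is solved, so the cost is O(L^3) + O(d)
    instead of the O(d^3) needed to invert the full d x d matrix.
    """
    li = v / lam                                   # Lam^{-1} v, O(d)
    Li_U = U / lam[:, None]                        # Lam^{-1} U
    C = np.linalg.inv(M) + U.T @ Li_U              # L x L matrix
    return li - Li_U @ np.linalg.solve(C, U.T @ li)
```

The small solve against `C` is the only cubic-cost step, and it is cubic in $L$, not $d$.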
3. Application in Second-Order Optimization
Block mean and block diagonal methods directly address challenges in advanced optimization methods such as the Newton method and AdaGrad, which require inversion or square roots of large second-order derivative (Fisher or Hessian) matrices. Whereas diagonal approximations capture no covariance structure and full-matrix computations are intractable for large $d$, the block mean approach captures cluster-wise dependencies at affordable computational cost.
Empirical studies show that block-mean-based AdaGrad (AdaGrad-BMA), which assigns each neural network layer to its own block, can achieve convergence close to that of full-matrix AdaGrad while being significantly cheaper per iteration. In a $322$-dimensional setting:
- AdaGrad-full: $16.85$ ms/iteration
- AdaGrad-diag: $5.70$ ms/iteration
- AdaGrad-BMA: $10.07$ ms/iteration
BMA captures off-diagonal structure ignored by the diagonal, substantially improving convergence over purely diagonal schemes (Lu et al., 2018).
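To make the idea concrete, the following toy sketch preconditions AdaGrad with the BMA of the accumulated gradient outer products on a small quadratic. It is illustrative only: the names are ours, and the inverse square root is formed densely here, whereas Lu et al. (2018) obtain it from an $L \times L$ eigenproblem.

```python
import numpy as np

def bma(G, sizes):
    """Dense Block Mean Approximation of G: off-diagonal blocks -> their
    mean; diagonal blocks -> diagonal mean plus off-diagonal mean."""
    starts = np.cumsum([0] + list(sizes))
    H = np.zeros_like(G)
    for i, di in enumerate(sizes):
        ri = slice(starts[i], starts[i + 1])
        for j, dj in enumerate(sizes):
            rj = slice(starts[j], starts[j + 1])
            blk = G[ri, rj]
            if i == j and di > 1:
                off = (blk.sum() - blk.trace()) / (di * di - di)
                H[ri, rj] = off
                H[ri, rj] += (blk.diagonal().mean() - off) * np.eye(di)
            else:
                H[ri, rj] = blk.mean()
    return H

def adagrad_bma_step(x, g, G, sizes, lr=0.5, eps=1e-6):
    """One AdaGrad step preconditioned by the BMA of the accumulated
    gradient outer products (inverse square root formed densely,
    for illustration only)."""
    G += np.outer(g, g)
    w, V = np.linalg.eigh(bma(G, sizes) + eps * np.eye(len(x)))
    x_new = x - lr * (V @ ((V.T @ g) / np.sqrt(np.maximum(w, eps))))
    return x_new, G
```

On a quadratic $f(x) = \tfrac12 x^{\top} A x$, iterating this step with gradient $g = Ax$ drives the loss toward zero while only the blockwise summary of the accumulator is ever decomposed.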
4. Block Diagonal Approaches in Bayesian Model Averaging
Block-diagonal averaging is also foundational in scalable Bayesian variable selection and model averaging under block-orthogonal designs. When the Gram matrix $X^{\top}X$ of a regression design is block diagonal, statistical inference can be performed independently for each block. Given the linear model
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \phi I),$$
with $X^{\top}X$ block diagonal and positive definite, the variables are partitioned into $K$ blocks with no cross-covariance. Posterior computations—including marginal likelihoods, variable inclusion probabilities, and model-averaged coefficients—factorize by block.
All required integrals per block, e.g., for the marginal likelihood of a model $\gamma$, reduce to a single one-dimensional quadrature problem over the error variance $\phi$:
$$p(y \mid \gamma) = \int_0^{\infty} \Big[\prod_{b=1}^{K} g_b(\phi;\, q_b, s_b)\Big]\, p(\phi)\, d\phi,$$
where each blockwise factor $g_b$ depends on $q_b$, the number of active predictors in block $b$, and $s_b$, the block-specific residual sum of squares (Papaspiliopoulos et al., 2016).
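The factorization can be checked on the simplest blockwise summary, least-squares estimates: when the groups of predictors are mutually orthogonal (block-diagonal Gram matrix), solving each block separately reproduces the joint solution. A NumPy illustration (synthetic data, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two groups of predictors, made exactly orthogonal to each other
# so that X^T X is block diagonal.
X1 = rng.standard_normal((n, 3))
X2 = rng.standard_normal((n, 2))
X2 -= X1 @ np.linalg.lstsq(X1, X2, rcond=None)[0]   # residualize
X = np.hstack([X1, X2])
y = rng.standard_normal(n)

# Joint least squares vs. independent per-block solves.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_blocks = np.concatenate([
    np.linalg.lstsq(X1, y, rcond=None)[0],
    np.linalg.lstsq(X2, y, rcond=None)[0],
])
```

The full Bayesian computation adds priors and the one-dimensional integral over $\phi$ on top of this blockwise structure, but the independence across blocks is the same.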
5. Model Selection, Averaging, and Computational Scaling
In the context of Bayesian model averaging under block-diagonal designs, both exhaustive best-subset search and model probability integration are tractable if blocks are moderately sized. The BD-select algorithm enumerates all $2^{d_b}$ models within each block (feasible for small block sizes $d_b$), and then combines blockwise selections via convolution. Overall complexity scales linearly with the number of blocks and exponentially with the largest block size.
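A hypothetical sketch of the blockwise enumeration, using BIC as a stand-in score (BD-select itself works with posterior model probabilities, and the convolution step that combines blocks is omitted here):

```python
import numpy as np
from itertools import combinations

def best_subset_per_block(X_blocks, y):
    """Exhaustive best-subset search run independently within each
    block. Cost is sum_b 2^{d_b} model fits, not 2^{sum_b d_b}."""
    n = len(y)
    chosen = []
    for Xb in X_blocks:
        best_score, best_S = np.inf, ()
        for r in range(Xb.shape[1] + 1):
            for S in combinations(range(Xb.shape[1]), r):
                if S:
                    coef, *_ = np.linalg.lstsq(Xb[:, list(S)], y, rcond=None)
                    rss = np.sum((y - Xb[:, list(S)] @ coef) ** 2)
                else:
                    rss = np.sum(y ** 2)
                score = n * np.log(rss / n) + len(S) * np.log(n)  # BIC
                if score < best_score:
                    best_score, best_S = score, S
        chosen.append(best_S)
    return chosen
```

With two blocks of sizes 3 and 2, this fits $2^3 + 2^2 = 12$ models instead of $2^5 = 32$; the gap grows exponentially with dimension.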
For general, non-block-diagonal $X^{\top}X$, spectral clustering of the correlation matrix is used to approximate block structure, enabling the block machinery to operate efficiently as a heuristic (Papaspiliopoulos et al., 2016).
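A minimal numpy-only version of this heuristic for two blocks, splitting on the sign of the Fiedler vector of the graph built from absolute correlations (the general $K$-way case would use the first $K$ eigenvectors plus k-means; names are ours):

```python
import numpy as np

def spectral_block_partition(R):
    """Two-way spectral partition of variables from the absolute
    correlation matrix R: sign split of the Fiedler vector of the
    symmetric normalized graph Laplacian."""
    W = np.abs(R).copy()
    np.fill_diagonal(W, 0.0)                       # no self-edges
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L_sym = np.eye(len(R)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, V = np.linalg.eigh(L_sym)                   # ascending eigenvalues
    return (V[:, 1] > 0).astype(int)               # Fiedler-vector signs
```

Variables whose correlations cluster into two nearly uncoupled groups are assigned matching labels within each group, which can then serve as the block partition for BD-select.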
| Approach | Memory / Complexity | Captures Off-Diagonals? | Notes |
|---|---|---|---|
| Diagonal approx | $O(d)$ memory, $O(d)$ inversion | No | Fast, poor structural fidelity |
| Block mean (BMA) | $O(d + L^2)$ memory, $O(L^3)$ inversion | Yes, within/between blocks | Tunable trade-off via block size |
| Full-matrix | $O(d^2)$ memory, $O(d^3)$ inversion | Yes | Intractable for high $d$ |
| Block-diagonal (BMS) | $O(2^{d_b})$ models per block | No, between blocks | Efficient, tractable Bayesian inference |
6. Principles for Block Partitioning and Trade-offs
Partitioning strategy is central to block diagonal averaging:
- Coarse partition (small $L$, large blocks): Each block averages more structure, lowering computational cost but inducing higher approximation error.
- Fine partition (large $L$, small blocks): Approximates the original structure more closely (in the limit $L = d$, the full matrix), albeit with increased cost.
- Heuristics: In neural networks, grouping parameters by layer or by type (weights vs. biases) is natural. In regression, spectral clustering can reveal blocks of highly correlated variables.
A plausible implication is that method performance is determined by the fidelity of the block structure to the true dependency graph among parameters, and empirical or domain-informed partitioning can yield substantial gains.
7. Software and Practical Considerations
The R package mombf implements all block-diagonal Bayesian selection and model-averaging algorithms (Papaspiliopoulos et al., 2016). It automates selection, model averaging, and block discovery (via spectral clustering) for regression variable selection. All numerical integration is blockwise and performed with adaptive 1D quadrature, ensuring scalability as long as individual block sizes remain modest.
Experiments consistently demonstrate that block-mean and block-diagonal schemes provide a spectrum of tunable trade-offs, balancing computational feasibility with statistical fidelity: significant accuracy improvements over diagonal methods and tractability in settings where the full-matrix approach is prohibitive. This methodology is widely adopted for efficient second-order optimization and scalable Bayesian model selection in high-dimensional regimes (Lu et al., 2018, Papaspiliopoulos et al., 2016).