
Pairwise Divergence Matrices

Updated 5 February 2026
  • Pairwise divergence matrices are structured data representations that quantify dissimilarities between distributions using measures like KL, JS, and Jeffreys.
  • They facilitate model-free exploration by revealing block patterns that indicate phase transitions and regime changes in various scientific datasets.
  • The construction process involves histogram normalization, escort weighting, and optimized computations, supporting applications in imaging, scattering experiments, and SPD data analysis.

Pairwise divergence matrices are data structures in which every entry quantifies the statistical dissimilarity, according to a selected divergence measure, between distinct elements from a finite collection of probability distributions or structured objects. These matrices serve as comprehensive tools for model-free exploration of statistical changes, detection of phase transitions, and downstream tasks in various domains, including scattering experiments, imaging analysis, and information geometry. Their entries are populated by pairwise evaluations of divergences—such as the Kullback-Leibler (KL), Jeffreys, Jensen-Shannon (JS), their antisymmetric variants, or, for matrix-valued data, generalizations like the αβ-log-det divergence—computed between normalized empirical histograms, random variable distributions, or structured objects such as positive-definite matrices (Coles et al., 29 Jan 2026, Levada, 2022).

1. Foundational Divergence Measures and Formulas

Let $P=\{p_i\}$ and $Q=\{q_i\}$ denote two discrete probability distributions. Four central divergences underpin pairwise divergence matrices in contemporary applications:

  • Kullback-Leibler Divergence (KL):

$$D_{KL}(P\|Q) = \sum_i p_i \ln\frac{p_i}{q_i}$$

Properties: Non-negative, asymmetric ($D_{KL}(P\|Q)\neq D_{KL}(Q\|P)$), not a metric.

  • Jeffreys Divergence (J):

$$D_J(P,Q) = D_{KL}(P\|Q) + D_{KL}(Q\|P)$$

Properties: Symmetric, unbounded, non-negative, zero iff $P=Q$.

  • Jensen-Shannon Divergence (JS): with $M = (P+Q)/2$,

$$D_{JS}(P,Q)=\frac12 D_{KL}(P\|M) + \frac12 D_{KL}(Q\|M)$$

Properties: Symmetric, bounded ($0 \leq D_{JS} \leq \ln 2$ with natural logarithms), and $\sqrt{D_{JS}}$ is a metric.

  • Antisymmetric Kullback-Leibler Divergence:

$$D_{\text{asym}}(P,Q) = D_{KL}(P\|Q) - D_{KL}(Q\|P)$$

Properties: Antisymmetric; $D_{\text{asym}}(Q,P) = -D_{\text{asym}}(P,Q)$.
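For concreteness, the four discrete divergences above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the cited papers; inputs are assumed to be normalized histograms, and the `eps` floor guarding against `log(0)` is an implementation assumption:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """D_KL(P||Q) between discrete distributions, natural log.

    Terms with p_i == 0 contribute nothing; eps guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    ratio = np.maximum(p, eps) / np.maximum(q, eps)
    return float(np.sum(np.where(p > 0, p * np.log(ratio), 0.0)))

def jeffreys(p, q):
    """Symmetrized KL: D_J(P,Q) = D_KL(P||Q) + D_KL(Q||P)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Bounded, symmetric JS divergence via the mixture M = (P+Q)/2."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def antisym_kl(p, q):
    """Antisymmetric part: D_KL(P||Q) - D_KL(Q||P)."""
    return kl(p, q) - kl(q, p)
```

The JS value always falls in $[0, \ln 2]$, which is what makes it convenient for consistently scaled matrix visualizations.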

For positive-definite matrix data, the αβ-log-det divergence provides a unifying generalization:

$$D_{\alpha,\beta}(X\|Y) = \frac{1}{\alpha\beta}\log\det\left(\frac{\alpha (X Y^{-1})^{\beta} + \beta (X Y^{-1})^{-\alpha}}{\alpha+\beta}\right)$$

where $X,Y$ are SPD matrices, and $(\alpha,\beta)$ tune the nature of the divergence (recovering, for instance, the affine-invariant Riemannian metric or the Jeffreys divergence as special cases) (Cherian et al., 2017).
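Because $X Y^{-1}$ is similar to the symmetric pencil of $(X, Y)$, the divergence reduces to a sum over generalized eigenvalues. The sketch below assumes SciPy is available and $\alpha, \beta \neq 0$ with $\alpha + \beta \neq 0$; it is an illustrative implementation, not the authors' reference code:

```python
import numpy as np
from scipy.linalg import eigh

def ab_logdet_div(X, Y, alpha, beta):
    """αβ-log-det divergence between SPD matrices X and Y.

    The eigenvalues of X @ inv(Y) equal the generalized eigenvalues of
    (X, Y), which scipy.linalg.eigh computes without forming the inverse."""
    lam = eigh(X, Y, eigvals_only=True)          # positive for SPD X, Y
    terms = (alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta)
    return float(np.sum(np.log(terms)) / (alpha * beta))
```

At $X = Y$ every eigenvalue is 1 and the divergence vanishes; for $\alpha = \beta$ the eigenvalue map is invariant under $\lambda \mapsto 1/\lambda$, making the divergence symmetric in its arguments.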

2. Mathematical Structure and Properties of Divergence Matrices

The choice of divergence function determines the structure and interpretation of the resulting matrix:

  • Symmetry: Symmetric divergences (Jeffreys, Jensen-Shannon) produce symmetric matrices, which are especially suitable for visualization and downstream spectral methods. Asymmetric divergences (KL, antisymmetric KL) yield non-symmetric or even antisymmetric matrices; these may capture directional change, such as the evolution of order.
  • Bounds and Interpretation: Jensen-Shannon divergence is bounded, allowing for consistent scaling across experiments; Jeffreys and KL are unbounded. The metric property of $\sqrt{D_{JS}}$ enables distance-based reasoning, whereas KL and Jeffreys lack the triangle inequality.
  • Matrix Block Structure: Phases or regimes of statistical similarity in the data manifest as low-divergence blocks along the diagonal. Abrupt boundaries between blocks signal sharp transitions in the underlying data-generating process.

The positive semidefiniteness (PSD) of variable-indexed mutual-information matrices also depends delicately on the generating $f$-divergence. For the $f$-divergence-induced mutual information $I_f(X;Y)$, the local theorem states that the pairwise matrix $[I_f(X_i;X_j)]_{i,j}$ is PSD for all weak dependencies and all $n$ if and only if $f$ is analytic near $1$ and the Taylor coefficients of $(t-1)^m$, $m\geq 2$, are all nonnegative (Roberston, 13 Jan 2026). KL and JS divergences fail this condition due to alternating-sign coefficients, implying that Shannon mutual-information matrices can be non-PSD even for weakly dependent variables.
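The alternating-sign claim for the KL generator $f(t) = t\ln t$ can be checked with exact arithmetic: writing $u = t-1$ and multiplying the series for $\ln(1+u)$ by $(1+u)$ yields the coefficients of $u^m$ directly. A stdlib-only sketch:

```python
from fractions import Fraction

def kl_generator_coeffs(order):
    """Coefficients of u^m (u = t-1), m = 2..order, in t*log(t) = (1+u)*log(1+u).

    Uses log(1+u) = sum_{k>=1} (-1)^(k+1) u^k / k; multiplying by (1+u)
    makes the u^m coefficient log_c[m] + log_c[m-1]."""
    log_c = [Fraction(0)] + [Fraction((-1)**(k + 1), k) for k in range(1, order + 1)]
    return [log_c[m] + log_c[m - 1] for m in range(2, order + 1)]

print(kl_generator_coeffs(6))  # [1/2, -1/6, 1/12, -1/20, 1/30]
```

The signs alternate from $m = 2$ onward, so the KL generator violates the nonnegative-coefficient condition of the local PSD theorem, as stated above.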

3. Construction Methodologies for Pairwise Divergence Matrices

Pairwise divergence matrices are constructed via the following general procedure:

  1. Normalization: Raw data (e.g., image intensities, scattering counts) are transformed into probability histograms:

$$p_i^{(k)} = \frac{I_i^{(k)}}{\sum_j I_j^{(k)}},\qquad \sum_i p_i^{(k)}=1$$

  2. Escort Weighting (for feature sensitivity): The escort transformation enhances the influence of features over background:

$$\tilde p_i^{(k)}(n) = \frac{[p_i^{(k)}]^n}{\sum_j [p_j^{(k)}]^n}$$

where $n>0$, with $n=1$ recovering the original histogram. The transformation can be interpreted as setting an artificial temperature $T_a=1/n$.

  3. Matrix Assembly: For $N$ entities, the $N\times N$ matrix is populated by computing the divergence between each pair:

$$[M_{KL}]_{k\ell} = D_{KL}\left(\tilde{p}^{(k)}\,\|\,\tilde{p}^{(\ell)}\right)$$

This step is repeated for each selected divergence.

  4. Extension to Structured Data: For matrix-valued data (e.g., SPD matrices in computer vision), divergences such as $D_{\alpha,\beta}(X_i\|X_j)$ are used analogously.

This general schema is applicable to histograms, GMRF parameters, and SPD matrices, with the divergence measure dictating the technical specifics (Coles et al., 29 Jan 2026, Cherian et al., 2017, Levada, 2022).
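The histogram-based steps of this schema can be sketched end-to-end in NumPy. This is a minimal illustration under the definitions above, not the cited papers' code; the `eps` regularizer inside the KL term is an implementation assumption:

```python
import numpy as np

def normalize(counts):
    """Step 1: raw counts/intensities -> probability histogram."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def escort(p, n):
    """Step 2: escort transformation with exponent n (n = 1 is the identity)."""
    pn = p**n
    return pn / pn.sum()

def kl(p, q, eps=1e-12):
    """D_KL(p||q) with an eps floor to avoid log(0)."""
    ratio = np.maximum(p, eps) / np.maximum(q, eps)
    return float(np.sum(np.where(p > 0, p * np.log(ratio), 0.0)))

def divergence_matrix(histograms, n=1.0):
    """Step 3: assemble the N x N pairwise KL matrix from escort-weighted histograms."""
    ps = [escort(normalize(h), n) for h in histograms]
    N = len(ps)
    M = np.empty((N, N))
    for k in range(N):
        for l in range(N):
            M[k, l] = kl(ps[k], ps[l])
    return M
```

Swapping `kl` for a Jeffreys, JS, or antisymmetric evaluation changes only the inner call; for symmetric divergences, only the upper triangle need be evaluated.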

4. Interpretation: Block Structures and Statistical Phase Transitions

Block-diagonal or near-block patterns in divergence matrices have diagnostic significance:

  • Single-Phase Domains: Regions of small entries signal clusters of statistically similar frames, corresponding to a homogeneous phase or regime.
  • Phase Boundaries: Sharp transitions between blocks (as seen by step-like increases in divergences) indicate underlying phase transitions or abrupt statistical changes in physical or biological systems.

Empirical illustrations include:

| System / Observable | Matrix Block Structure Interpretation | Divergences Used | Reference |
|---|---|---|---|
| Eu₃Sn₂S₇ neutron diffraction (T sweep) | Large low-divergence block below the ~6 K transition, sharp boundary at the transition | $D_{KL}$, $D_J$, $D_{JS}$, $D_{\rm asym}$ | (Coles et al., 29 Jan 2026) |
| Cd₂Re₂O₇ X-ray diffuse scattering (T sweep) | Multiple blocks at T ≈ 100 K, 130 K, 200 K, 250 K, tracking structural transitions | All four above | (Coles et al., 29 Jan 2026) |
| Fe₃GeTe₂ Lorentz-TEM images (T sweep) | Block cutoff marking skyrmion order crossover | $D_J$, $D_{JS}$, $D_{KL}$ | (Coles et al., 29 Jan 2026) |

This matrix-based approach eliminates the requirement for explicit order parameters or domain-specific feature extraction, enabling direct interrogation of high-dimensional observational data. In magnetic and structural phase transition studies, the precise block onsets in DJD_J and DJSD_{JS} matrices track critical temperatures and fields.
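One simple way to read block onsets off a divergence matrix is to scan the first superdiagonal for jumps between consecutive measurements. This heuristic is not taken from the cited papers, and the `threshold` parameter is an assumption; in practice it would be set from the noise floor of within-phase divergences:

```python
import numpy as np

def transition_indices(D, threshold):
    """Flag candidate phase boundaries in a pairwise divergence matrix D.

    Returns indices i where the divergence between measurement i and i+1
    exceeds the threshold, i.e. where a block boundary likely sits."""
    consecutive = np.array([D[i, i + 1] for i in range(len(D) - 1)])
    return np.flatnonzero(consecutive > threshold)
```

On a matrix with two clean low-divergence blocks, this returns the single index separating them.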

5. Extension to Matrix-Valued and Model-Based Data

For higher-order objects such as SPD matrices or Gaussian-Markov random fields (GMRFs):

  • SPD Data: The pairwise αβ-log-det divergence matrix, defined for matrices $X_1,\dots,X_n$, captures geometric dissimilarities relevant to visual data, with the choice of $(\alpha,\beta)$ interpolating between classical divergences. Direct computation scales as $O(n^2 d^3)$ for $n$ matrices of size $d\times d$, often mitigated via dictionary-based embeddings (Cherian et al., 2017).
  • GMRFs: For $n$-dimensional GMRFs parametrized by $\beta_i$, the divergence matrix $D_{ij} = \frac{n}{2}\left[\beta_j/\beta_i - 1 - \ln(\beta_j/\beta_i)\right]$ is available in closed form, allowing efficient measurement of field similarity in graphical models and image analysis (Levada, 2022).

Such generalizations enable the application of divergence matrix methods to a wide array of statistical models and structured data forms.
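The GMRF closed form above is particularly cheap to evaluate: the full pairwise matrix follows from a single broadcast over the parameter vector. A minimal sketch, assuming the $\beta_i$ are collected in an array:

```python
import numpy as np

def gmrf_divergence_matrix(betas, n):
    """Closed-form pairwise GMRF divergence matrix.

    D[i, j] = (n/2) * [beta_j/beta_i - 1 - ln(beta_j/beta_i)],
    computed for all pairs at once via NumPy broadcasting."""
    b = np.asarray(betas, dtype=float)
    ratio = b[None, :] / b[:, None]       # ratio[i, j] = beta_j / beta_i
    return 0.5 * n * (ratio - 1.0 - np.log(ratio))
```

Since $x - 1 - \ln x \geq 0$ with equality only at $x = 1$, the matrix is entrywise nonnegative with a zero diagonal, and it is asymmetric, reflecting the KL-type directionality of the underlying divergence.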

6. Computational Considerations and Acceleration

The construction of pairwise divergence matrices incurs $O(N^2 M)$ complexity for $N$ samples and $M$ features (histogram bins, vector size, or matrix dimension):

  • Symmetry Exploitation: For symmetric divergences, only the upper (or lower) triangle needs evaluation.
  • Vectorization and Parallelism: Inner products and divergence computations are vectorized or parallelized, leveraging GPU or multi-threaded CPU computation.
  • Region-of-Interest (ROI) Restriction: Pre-cropping or downsampling high-dimensional arrays reduces MM.
  • Memory Policies: Computing divergences on-the-fly during parameter sweeps circumvents storage of all intermediate histograms.
  • Dictionary-based Embeddings: For SPD data, reduction from $O(n^2 d^3)$ to $O(n M d^3)$ is achieved by projecting data onto a learned dictionary, preserving most discriminative power (Cherian et al., 2017).
  • Streaming Implementation: Running means for divergence calculation facilitate online or "on-the-fly" analytic workflows (Coles et al., 29 Jan 2026).

Practical implementations thus scale to large experiment sizes, dimensions, or sample collections.
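Symmetry exploitation and vectorization are both available off the shelf for the JS case: recent SciPy versions expose `'jensenshannon'` as a `pdist` metric, which evaluates only the $N(N-1)/2$ upper-triangle pairs and returns the JS *distance* $\sqrt{D_{JS}}$ (natural-log base by default), so squaring recovers the divergence. A sketch on synthetic histograms:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# N = 50 histograms with M = 32 bins each, rows normalized to sum to 1.
rng = np.random.default_rng(0)
H = rng.random((50, 32))
H /= H.sum(axis=1, keepdims=True)

# pdist computes only the condensed upper triangle; squaring the JS
# distance gives the divergence, and squareform expands the full matrix.
js_dist = pdist(H, metric='jensenshannon')
M_js = squareform(js_dist**2)
```

The resulting matrix is symmetric with a zero diagonal and all entries in $[0, \ln 2]$, matching the bounds stated in Section 1.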

7. Theoretical and Methodological Distinctions

The properties of pairwise divergence matrices differ fundamentally from Hilbertian or metric properties on the space of distributions:

  • A divergence (e.g., $\sqrt{D_{JS}}$) may be a metric on distributions but fail to yield a PSD pairwise mutual information matrix across weakly dependent random variables.
  • Conversely, the χ²-divergence produces PSD matrices even though it is not a metric on distributions (Roberston, 13 Jan 2026).
  • The analytic expansion and sign of Taylor coefficients of the generating function ff are determinative for local PSD in variable-indexed mutual information matrices.
  • For specific divergences and parameter regimes, explicit counterexamples to PSD exist for arbitrary weak dependency, confirming the precision and restrictiveness of the local characterization theorem (Roberston, 13 Jan 2026).

A plausible implication is that the choice of divergence for constructing pairwise matrices must be tailored to both the structural and analytic properties required by the application—such as clustering, spectral analysis, or interpretation as a Gram matrix.


Pairwise divergence matrices thus provide a rigorous, flexible, and interpretable apparatus for the analysis of complex, high-dimensional datasets, enabling both practical scientific discovery and foundational advances in information theory and statistical learning.
