Wasserstein Dependency Measures
- Distance-maximizing Wasserstein dependency measures form a rigorous framework that quantifies statistical dependence via optimal transport metrics, with clear thresholds for independence and maximal dependence.
- They leverage convex duality and efficient computational algorithms, such as linear programming and entropic regularization, to estimate dependence in high-dimensional settings.
- Applications span reinforcement learning, robust statistics, and Bayesian nonparametrics, offering actionable insights for representation learning and model evaluation.
Distance-maximizing Wasserstein dependency measures constitute a rigorous framework in which statistical dependence between random variables, groups, or more general random structures is quantified via optimal transport metrics. By formulating dependence as a distance in an appropriate metric geometry, either from independence or from maximal dependence, these measures provide a conceptually unified and metrically sensitive approach to characterizing, estimating, and optimizing statistical relationships. The framework is instantiated in a variety of contexts, including classical dependence assessment, unsupervised representation learning, reinforcement learning, robust statistics, Bayesian nonparametrics, and the geometric analysis of random measures.
1. Core Definitions and Formalism
Let $X$ and $Y$ be random variables with joint law $\mu_{XY}$ on a Polish space $\mathcal{X} \times \mathcal{Y}$, and marginals $\mu_X$, $\mu_Y$. The $p$-Wasserstein distance between probability measures $\mu, \nu$ on a metric space $(\mathcal{Z}, d)$ is

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{Z} \times \mathcal{Z}} d(z, z')^p \, \mathrm{d}\pi(z, z') \right)^{1/p},$$

where $\Pi(\mu, \nu)$ is the set of couplings with the given marginals.
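For discrete measures, the infimum over couplings is a finite linear program, which makes the definition directly computable. Below is a minimal sketch in Python; the helper name `wasserstein_lp` and the toy data are illustrative choices, not from the cited papers.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_lp(mu, nu, cost, p=1):
    """W_p between discrete measures: mu (m,), nu (n,), cost[i, j] = d(x_i, y_j)."""
    m, n = cost.shape
    c = (cost ** p).ravel()                    # objective: sum_ij pi_ij d^p
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # row marginal: sum_j pi_ij = mu_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0                # column marginal: sum_i pi_ij = nu_j
    b_eq = np.concatenate([mu, nu])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)

# Two uniform measures on the line: the optimal plan sends 0 -> 0 and 1 -> 2.
x, y = np.array([0.0, 1.0]), np.array([0.0, 2.0])
print(wasserstein_lp(np.array([0.5, 0.5]), np.array([0.5, 0.5]),
                     np.abs(x[:, None] - y[None, :])))   # 0.5
```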
Distance to Independence and Maximal Dependence
- Distance to Independence:

  $$D_{\perp}(X, Y) = W_p\big(\mu_{XY}, \, \mu_X \otimes \mu_Y\big),$$

  which is zero if and only if $X$ and $Y$ are independent (Catalano et al., 7 Oct 2025).
- Distance to Maximal Dependence: Given a set $\mathcal{C}_{\max}$ of “maximally dependent” couplings (e.g., laws of $(X, T(X))$ for measurable maps $T$), the distance is

  $$D_{\max}(X, Y) = \inf_{\gamma \in \mathcal{C}_{\max}} W_p\big(\mu_{XY}, \, \gamma\big),$$

  and can be normalized as

  $$I(X, Y) = 1 - \frac{D_{\max}(X, Y)}{D_{\max}^{\perp}}, \qquad D_{\max}^{\perp} := \inf_{\gamma \in \mathcal{C}_{\max}} W_p\big(\mu_X \otimes \mu_Y, \, \gamma\big),$$

  which yields $1$ at maximal dependence (e.g., $Y = T(X)$) and $0$ at independence (Catalano et al., 7 Oct 2025); a discrete numerical sketch follows this list.
- Kantorovich–Rubinstein Duality: For $p = 1$, the Wasserstein distance admits the dual formulation

  $$W_1(\mu, \nu) = \sup_{\mathrm{Lip}(f) \le 1} \left\{ \int f \, \mathrm{d}\mu - \int f \, \mathrm{d}\nu \right\},$$

  which can be verified numerically in the second sketch below.
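On a finite product space, $D_{\perp}$ can be computed by feeding the joint pmf and the product of its marginals to the same LP. A minimal sketch reusing `wasserstein_lp` from above; the grid, ground metric, and pmfs are arbitrary toy choices:

```python
import numpy as np

def distance_to_independence(joint, x, y, p=1):
    """W_p between a discrete joint pmf and the product of its marginals."""
    mu_x, mu_y = joint.sum(axis=1), joint.sum(axis=0)
    product = np.outer(mu_x, mu_y)
    pts = np.array([(a, b) for a in x for b in y])              # common support grid
    cost = np.abs(pts[:, None, :] - pts[None, :, :]).sum(-1)    # l1 ground metric
    return wasserstein_lp(joint.ravel(), product.ravel(), cost, p)

x = y = np.array([0.0, 1.0])
comonotone = np.diag([0.5, 0.5])                   # Y = X: maximal dependence
print(distance_to_independence(comonotone, x, y))              # 0.5
print(distance_to_independence(np.full((2, 2), 0.25), x, y))   # 0.0 (independent)
```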
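Kantorovich–Rubinstein duality can likewise be checked numerically: on a finite space the Lipschitz constraint becomes finitely many linear inequalities, and the dual LP matches the primal coupling LP. A sketch under the same toy assumptions, again reusing `wasserstein_lp`:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
z = np.sort(rng.uniform(size=6))              # common support on the line
mu, nu = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
D = np.abs(z[:, None] - z[None, :])           # ground metric d(z_i, z_j)

# Dual LP: maximize sum_i f_i (mu_i - nu_i) s.t. f_i - f_j <= D_ij for all i != j.
pairs = [(i, j) for i in range(6) for j in range(6) if i != j]
A_ub = np.zeros((len(pairs), 6))
for r, (i, j) in enumerate(pairs):
    A_ub[r, i], A_ub[r, j] = 1.0, -1.0
b_ub = np.array([D[i, j] for i, j in pairs])
dual = linprog(-(mu - nu), A_ub=A_ub, b_ub=b_ub, bounds=(None, None),
               method="highs")
print(-dual.fun, wasserstein_lp(mu, nu, D))   # the two values coincide
```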
2. Theoretical Properties and Characterizations
- Independence and Maximal Dependence:
- $D_{\perp}(X, Y) = 0$ if and only if $\mu_{XY} = \mu_X \otimes \mu_Y$ (independence).
- Normalized indices (e.g., $I$ above, and $I_W$ for random measures) attain $1$ exactly at maximal dependence and $0$ at independence (Catalano et al., 2021, Catalano et al., 7 Oct 2025).
- Metric Invariance:
- Wasserstein distances are invariant under isometries of the ground metric. In copula-based settings, indices can be made invariant under monotone reparametrizations of the marginals (Catalano et al., 7 Oct 2025).
- Robustness:
- Joint distances (e.g., $D_{\perp}$) are Lipschitz continuous with respect to the underlying Wasserstein metric, hence stable under weak convergence of probability measures; they respond linearly to small contamination with independent noise (Catalano et al., 7 Oct 2025).
- Sample Complexity:
- For empirical plug-in estimators, $W_1(\hat{\mu}^n_{XY}, \hat{\mu}^n_X \otimes \hat{\mu}^n_Y)$ achieves the parametric rate $O(n^{-1/2})$ in low dimensions; rates degrade with the ambient dimension $d$ as $O(n^{-1/d})$ (Catalano et al., 7 Oct 2025). A one-dimensional simulation check follows this list.
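The one-dimensional case of this rate is easy to probe by simulation. The sketch below (sample sizes and the large-reference-sample stand-in for the true law are arbitrary choices) shows the empirical $W_1$ error tracking $n^{-1/2}$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
ref = rng.normal(size=200_000)                 # large stand-in for the true law
for n in [100, 1_000, 10_000]:
    errs = [wasserstein_distance(rng.normal(size=n), ref) for _ in range(20)]
    print(f"n={n:6d}  mean W1 error ~ {np.mean(errs):.4f}   n**-0.5 = {n**-0.5:.4f}")
```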
3. Optimization and Computation
Convex and Dual Formulations
Distance-maximizing Wasserstein dependency measures often reduce to convex or saddle-point problems:
- Piecewise-Algebraic Structure: For discrete variables, the distance to independence is the solution to a bilevel linear program, whose minimizer has an explicit piecewise algebraic structure governed by the geometry of the Lipschitz polytope and its dual unit ball. For the case of a Segre variety independence model, the explicit solution is computable via duality and polyhedral geometry (Çelik et al., 2020).
- Numerical Quadrature: For dependent random measures, the Wasserstein index of dependence $I_W$ reduces to a one-dimensional quadrature on tail integrals of the Lévy measures, making it practical even when the number $d$ of random measures is large (Catalano et al., 2021); the one-dimensional identity underlying such reductions is sketched after this list.
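The workhorse behind such one-dimensional reductions is the identity $W_1(\mu, \nu) = \int_{\mathbb{R}} |F_\mu(t) - F_\nu(t)| \, \mathrm{d}t$; the Lévy-measure quadrature of Catalano et al. (2021) is a more elaborate instance of the same principle. A minimal numerical check on two Gaussians, whose $W_1$ distance equals the mean shift:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

t = np.linspace(-10.0, 11.0, 20001)
gap = np.abs(norm.cdf(t, loc=0.0) - norm.cdf(t, loc=1.0))
print(trapezoid(gap, t))   # ~ 1.0, the mean shift between N(0,1) and N(1,1)
```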
Efficient Algorithms
- Linear Programming: For discrete measures with finite support, Wasserstein distances are computable via network-flow LPs; complexity is $O(n^3 \log n)$ for $n$ atoms (Catalano et al., 7 Oct 2025).
- Entropic Regularization: Sinkhorn’s algorithm solves regularized OT with complexity $O(n^2/\varepsilon^2)$ to reach $\varepsilon$-accuracy (Paty et al., 2019, Catalano et al., 7 Oct 2025); see the sketch after this list.
- Frank–Wolfe with Eigen-Decomposition: For subspace-robust Wasserstein distances, maximization over low-rank projections is carried out using Frank–Wolfe and eigen-decomposition, keeping computation tractable in moderate ambient dimensions $d$ (Paty et al., 2019).
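A minimal plain-scaling Sinkhorn sketch in Python (production code typically iterates in the log domain for small $\varepsilon$; the grid, masses, and $\varepsilon$ below are illustrative):

```python
import numpy as np

def sinkhorn(mu, nu, cost, eps=0.05, iters=500):
    """Entropic OT via Sinkhorn scaling; returns <pi, cost> and the coupling."""
    K = np.exp(-cost / eps)                  # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)                   # match column marginals
        u = mu / (K @ v)                     # match row marginals
    pi = u[:, None] * K * v[None, :]         # regularized optimal coupling
    return (pi * cost).sum(), pi

x, y = np.linspace(0, 1, 50), np.linspace(0, 1, 50) ** 2
cost = np.abs(x[:, None] - y[None, :])
w_eps, _ = sinkhorn(np.full(50, 0.02), np.full(50, 0.02), cost)
print(w_eps)      # tends to W_1 as eps -> 0, at the cost of more iterations
```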
4. Specialized Distance-Maximizing Constructions
Wasserstein Distance Maximizing Intrinsic Control
In reinforcement learning, the “Wasserstein Distance Maximizing Intrinsic Control” (WIC) objective,
$$\max_{\pi} \; \mathbb{E}_{z \sim p(z)} \Big[ W_1\big( \delta_{s_0}, \, \rho_{\pi}^{z} \big) \Big],$$
formalizes skill learning as maximizing the expected Wasserstein distance between the start-state Dirac measure $\delta_{s_0}$ and the skill-conditioned state-visitation measure $\rho_{\pi}^{z}$. This differs from mutual information–based methods (VIC/DIAYN) by explicitly incentivizing coverage of maximal state-space distance rather than mere distinguishability of skills. The dual is estimated via a neural 1-Lipschitz critic, which provides a shaped reward for policy optimization; a schematic of this reward appears below. Empirical results in various Atari environments demonstrate that WIC yields coverage and returns superior to KL-based diversity baselines, especially in sparse-reward or exploration-heavy environments (Durugkar et al., 2021).
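The mechanics of the shaped reward can be illustrated with a deliberately crude stand-in for the learned critic, here a fixed unit-norm linear map rather than the network of Durugkar et al.; the point is only that the per-step critic increments telescope to the dual objective $f(s_T) - f(s_0)$:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=4)
w /= np.linalg.norm(w)                       # unit norm => f is 1-Lipschitz

def f(states):                               # toy linear critic f(s) = <w, s>
    return states @ w

def shaped_rewards(traj):                    # traj: (T, 4) array of states
    return np.diff(f(traj))                  # r_t = f(s_{t+1}) - f(s_t)

traj = np.cumsum(rng.normal(size=(20, 4)), axis=0)            # a fake trajectory
print(shaped_rewards(traj).sum(), f(traj[-1]) - f(traj[0]))   # telescopes
```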
Subspace Robust Wasserstein Distances
Robustness and interpretability in high dimensions are enhanced by the subspace-robust Wasserstein (SRW) distance:
$$S_k(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \Big( \sum_{i=1}^{k} \lambda_i(V_\pi) \Big)^{1/2}, \qquad V_\pi = \int (x - y)(x - y)^{\top} \, \mathrm{d}\pi(x, y),$$

where $V_\pi$ is the second-moment displacement matrix under coupling $\pi$ and $\lambda_1(V_\pi) \ge \cdots \ge \lambda_d(V_\pi)$ are its ordered eigenvalues. By focusing on the $k$ largest-variance directions, SRW down-weights noise and reveals the intrinsic dimensionality of dependence. It induces an increasing and concave map $k \mapsto S_k^2(\mu, \nu)$, with an observable “elbow” at the effective subspace dimension of the dependency (Paty et al., 2019); a sketch of the underlying spectral step follows.
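The spectral step at the heart of SRW is easy to isolate: for a fixed coupling (here the independent one, for simplicity), the worst-case $k$-dimensional projection of the transport cost is the sum of the top-$k$ eigenvalues of $V_\pi$. A sketch on illustrative toy data; the full SRW alternates this step with OT re-solves inside Frank–Wolfe:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 10, 500, 2
X = rng.normal(size=(n, d))
Y = X @ np.diag([3.0, 2.0] + [1.0] * (d - 2)) + 0.1 * rng.normal(size=(n, d))

# Independent (product) coupling: V_pi = E[(x - y)(x - y)^T] over all pairs.
diffs = X[:, None, :] - Y[None, :, :]
V = np.einsum('ijk,ijl->kl', diffs, diffs) / (n * n)

eig = np.sort(np.linalg.eigvalsh(V))[::-1]       # ordered eigenvalues of V_pi
print("top-k projected cost:", eig[:k].sum())    # S_k^2 candidate for this pi
print("k -> cumulative mass:", np.cumsum(eig))   # increasing and concave in k
```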
Wasserstein Dependency Measures for Representation Learning
For unsupervised representation learning, replacing the KL-divergence in mutual information objectives with the Wasserstein distance yields the Wasserstein Dependency Measure (WDM):
$$\mathrm{WDM}(X, Y) = W_1\big( \mu_{XY}, \, \mu_X \otimes \mu_Y \big).$$
Practical estimation leverages 1-Lipschitz neural critics with a gradient penalty; a minimal sketch follows. Unlike KL-based bounds, which suffer from exponential sample complexity at high mutual information, WDM-based estimation provides stable and complete representation learning, as confirmed in multi-factor synthetic and image-based benchmark tasks (Ozair et al., 2019).
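A compact PyTorch sketch of this estimator; the architecture, batch size, penalty weight, and the one-sided gradient penalty are illustrative choices rather than the configuration of Ozair et al.:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for _ in range(500):
    x = torch.randn(256, 1)
    y = x + 0.1 * torch.randn(256, 1)           # a dependent toy pair (x, y)
    y_shuffled = y[torch.randperm(256)]         # sample from mu_X (x) mu_Y
    joint = torch.cat([x, y], dim=1)
    indep = torch.cat([x, y_shuffled], dim=1)
    wdm = critic(joint).mean() - critic(indep).mean()   # dual objective

    # Gradient penalty on interpolates, pushing toward ||grad f|| <= 1.
    a = torch.rand(256, 1)
    z = (a * joint + (1 - a) * indep).requires_grad_(True)
    grad = torch.autograd.grad(critic(z).sum(), z, create_graph=True)[0]
    penalty = ((grad.norm(dim=1) - 1).clamp(min=0) ** 2).mean()

    loss = -wdm + 10.0 * penalty
    opt.zero_grad()
    loss.backward()
    opt.step()

print("estimated WDM:", wdm.item())
```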
5. Wasserstein Dependence Indices for Random Measures
The Wasserstein index of dependence for random measures, $I_W$, is defined for infinitely active completely random vectors (CRVs) on a Polish space, parameterized by their Lévy measures. Given the joint Lévy measure $\nu$, the comonotonic Lévy measure $\nu_{\mathrm{co}}$ (maximal dependence, concentrated on the diagonal), and the Lévy measure $\nu_{\perp}$ of independent (i.i.d.) components, the index is

$$I_W = 1 - \frac{\mathcal{W}_*(\nu, \nu_{\mathrm{co}})}{\mathcal{W}_*(\nu_{\perp}, \nu_{\mathrm{co}})},$$

where $\mathcal{W}_*$ is an extended Wasserstein metric between possibly infinite-mass Lévy measures, allowing coupling mass to and from the null vector. $I_W$ uniquely characterizes independence ($I_W = 0$) and maximal dependence ($I_W = 1$), and is jointly sensitive to all $d$ random measures without reduction to pairwise metrics. Numerical computation involves one-dimensional quadratures over transformations of the Lévy measure’s marginal and sum distributions (cf. the one-dimensional quadrature identity sketched in §3), enabling practical criterion-based tuning of dependence in prior specification and fair comparison of Bayesian nonparametric models (Catalano et al., 2021).
6. Practical Applications and Empirical Results
Applications of distance-maximizing Wasserstein dependency measures include:
- Skill discovery and exploration in RL: Wasserstein-based objectives provide effective intrinsic control mechanisms, outperforming mutual-information approaches in diverse environments (Durugkar et al., 2021).
- Robust high-dimensional inference: Subspace robust distances stabilize dependency measurement and clustering under noise and irrelevant features (Paty et al., 2019).
- Representation learning: Wasserstein dependency objectives yield more expressive features for downstream prediction, especially where standard mutual information maximization underfits because of its prohibitive sample complexity (Ozair et al., 2019).
- Random measure modeling in Bayesian nonparametrics: The Wasserstein index of dependence enables prior specification and model comparison on a common dependence scale, confirmed by posterior comparisons when priors are matched for $I_W$ (Catalano et al., 2021).
- Algebraic and geometric study of dependence: In discrete models, the explicit polyhedral and algebraic structure of the distance-to-independence functional allows detailed combinatorial and analytic investigation of dependency (Çelik et al., 2020).
- General statistical dependency measure: Wasserstein-based indices systematically generalize classical correlation by quantifying both weak and strong, linear and nonlinear, dependence across a broad range of metric spaces (Catalano et al., 7 Oct 2025).
7. Limitations and Open Directions
Challenges include:
- The need to solve nontrivial saddle point problems or to navigate the combinatorial complexity of polyhedral structures in large finite spaces.
- The bias-variance tradeoff in entropic regularization and the selection of regularization parameters.
- Strict 1-Lipschitzness is difficult to enforce exactly in deep neural critics; improved regularization techniques are sought.
- Dependence indices are sensitive to the choice of ground metric $d$, which must be defensibly specified for each application domain.
- Computation of the full algebraic structure of dependency measures becomes demanding at scale, and further work seeks to optimize such algorithms and assess empirical degeneracy in practice (Paty et al., 2019, Çelik et al., 2020, Ozair et al., 2019, Catalano et al., 2021, Catalano et al., 7 Oct 2025).
Distance-maximizing Wasserstein dependency measures thus provide a unified, flexible, and metrically faithful approach to quantifying and optimizing dependence in diverse statistical and machine learning contexts, underpinned by principled, computable convex duality and by distinct geometric and probabilistic interpretations.