
Entropically Regularized Optimal Transport (EOT)

Updated 9 February 2026
  • Entropically Regularized Optimal Transport (EOT) is a method that introduces an entropy term to the classical optimal transport problem, ensuring smoothness and efficient computation via the Sinkhorn algorithm.
  • The Sinkhorn algorithm, along with its variants like Greenkhorn, iteratively updates scaling factors to enforce marginal constraints with linear convergence and reduced per-iteration cost.
  • EOT comes with rigorous statistical tools, including central limit theorems and bootstrap methods, supporting valid inference and explicit bias–variance control in large-scale data analysis.

Entropically Regularized Optimal Transport (EOT) is a computational and statistical framework for optimal transport (OT) between probability distributions in which the Kantorovich objective is modified by an additional entropy (or relative entropy) term. This regularization yields a strictly convex cost that facilitates efficient computation via the Sinkhorn algorithm and provides strong smoothness and differentiability properties. EOT and its associated Sinkhorn divergences/metrics have become fundamental tools in large-scale data analysis, machine learning, and statistics due to their algorithmic tractability and well-understood inferential properties.

1. Mathematical Formulation: Primal, Dual, and Centered Loss

On a finite metric space $X = \{x_1, \ldots, x_N\}$ with cost matrix $C \in \mathbb{R}_+^{N \times N}$ specified by $c_{ij} = d(x_i, x_j)^p$, the EOT cost between probability weights $a, b \in \Sigma_N := \{a \in \mathbb{R}_+^N : \sum_i a_i = 1\}$ is

$$W_{p,\varepsilon}^p(a, b) := \min_{T \in U(a,b)} \langle T, C \rangle + \varepsilon H(T \,|\, a \otimes b),$$

where $U(a, b)$ is the transport polytope of couplings with prescribed marginals $a$ and $b$, and $H(T \,|\, a \otimes b) = \sum_{i,j} t_{ij} \log\bigl(\tfrac{t_{ij}}{a_i b_j}\bigr)$ is the relative entropy.

The Fenchel dual is

$$W_{p, \varepsilon}^p(a, b) = \max_{u, v \in \mathbb{R}^N} \; u^\top a + v^\top b - \varepsilon \sum_{i,j} \exp\Big(-\frac{c_{ij} - u_i - v_j}{\varepsilon}\Big) a_i b_j.$$

The dual optimizers are unique up to additive constants (shifting $u$ by $+c$ and $v$ by $-c$ leaves the objective unchanged).

The EOT cost $W_{p,\varepsilon}^p$ is not a true metric since $W_{p,\varepsilon}^p(a, a) > 0$ in general. The centered Sinkhorn loss is defined as

$$S_{p, \varepsilon}(a, b) := W_{p, \varepsilon}^p(a, b) - \frac{1}{2}\Big(W_{p,\varepsilon}^p(a, a) + W_{p,\varepsilon}^p(b, b)\Big),$$

which is non-negative and vanishes if and only if $a = b$. As $\varepsilon \to 0$, $S_{p,\varepsilon}(a,b)$ converges to the (unregularized) $p$-Wasserstein cost $W_p^p(a,b)$ (Bigot et al., 2017).

2. Algorithmic Computation: Sinkhorn and Greedy Methods

Given the Gibbs kernel $K = \exp(-C/\varepsilon)$ (componentwise exponential), the unique optimal coupling has the factorized form

$$T^* = \mathrm{Diag}(u)\, K\, \mathrm{Diag}(v),$$

where $u, v \in \mathbb{R}_+^N$ are scaling vectors. They are obtained by alternately enforcing the marginal constraints:
$$u^{(\ell+1)} = a \,./\, (K v^{(\ell)}), \qquad v^{(\ell+1)} = b \,./\, (K^\top u^{(\ell+1)}),$$
where $./$ denotes componentwise division. This Sinkhorn algorithm converges geometrically (linearly) whenever $K$ has positive entries, at $O(N^2)$ arithmetic cost per iteration; $50$–$200$ iterations are typical for moderate $N$.
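
A minimal NumPy sketch of these alternating updates (the toy grid, function name, and tolerances are illustrative, not from the paper):

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=2000, tol=1e-12):
    """Alternating Sinkhorn scaling: returns the coupling
    T* = Diag(u) K Diag(v) and the transport part <T*, C>."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u_prev = u
        u = a / (K @ v)                  # enforce row marginals
        v = b / (K.T @ u)                # enforce column marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    T = u[:, None] * K * v[None, :]
    return T, (T * C).sum()

# Toy example on a 1-D grid with squared-distance cost (p = 2).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2
a = np.full(50, 1 / 50)
b = rng.random(50); b /= b.sum()
T, cost = sinkhorn(a, b, C, eps=0.05)
```

For very small $\varepsilon$, the updates are usually stabilized in the log domain to avoid under/overflow in $K$.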

Variants such as Greenkhorn and Greedy Stochastic Sinkhorn update only the most violated row or column at each step; this reduces the per-iteration cost, and in favorable regimes they outperform standard Sinkhorn in wall-clock time (Abid et al., 2018).
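
A sketch of the greedy row/column idea, measuring violation by absolute deviation (published Greenkhorn variants select coordinates via a Bregman-type divergence; this simplification and all names here are illustrative):

```python
import numpy as np

def greenkhorn(a, b, C, eps, n_iter=50000, tol=1e-9):
    """Greedy Sinkhorn: rescale only the single most violated
    row or column per step, maintaining running marginals in O(N)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    T = u[:, None] * K * v[None, :]
    r, c = T.sum(axis=1), T.sum(axis=0)       # running marginals
    for _ in range(n_iter):
        i = np.argmax(np.abs(r - a))          # worst row violation
        j = np.argmax(np.abs(c - b))          # worst column violation
        if max(abs(r[i] - a[i]), abs(c[j] - b[j])) < tol:
            break
        if abs(r[i] - a[i]) >= abs(c[j] - b[j]):
            old = T[i].copy()
            u[i] = a[i] / (K[i] @ v)          # O(N) row rescale
            T[i] = u[i] * K[i] * v
            r[i] = a[i]; c += T[i] - old
        else:
            old = T[:, j].copy()
            v[j] = b[j] / (K[:, j] @ u)       # O(N) column rescale
            T[:, j] = u * K[:, j] * v[j]
            c[j] = b[j]; r += T[:, j] - old
    return T

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
C = np.abs(x[:, None] - x[None, :])
a = rng.random(20); a /= a.sum()
b = rng.random(20); b /= b.sum()
T = greenkhorn(a, b, C, eps=0.2)
```

Each step touches a single row or column of $T$, which is what makes the per-iteration cost $O(N)$ rather than $O(N^2)$.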

3. Statistical Properties: Central Limit Theorems and Bootstrap

The map $(a, b) \mapsto W^p_{p,\varepsilon}(a, b)$ is Fréchet-differentiable on $\Sigma_N \times \Sigma_N$, with derivative

$$\nabla W^p_{p,\varepsilon}(a, b)(h_1, h_2) = \langle u_\varepsilon, h_1 \rangle + \langle v_\varepsilon, h_2 \rangle,$$

where $(u_\varepsilon, v_\varepsilon)$ is any pair of dual optimizers.
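
The derivative formula can be checked numerically: recover $(u_\varepsilon, v_\varepsilon)$ from the Sinkhorn scaling vectors and compare $\langle u_\varepsilon, h_1 \rangle$ against a finite difference along a simplex-tangent direction. This is a sketch; the recovery $u_\varepsilon = \varepsilon \log(u/a)$ assumes the relative-entropy formulation above, and the grid and helper names are illustrative:

```python
import numpy as np

def eot_cost_and_duals(a, b, C, eps, n_iter=3000):
    """Sinkhorn solve; returns the EOT cost in the relative-entropy
    form and the dual potentials (u_eps, v_eps)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]
    cost = (T * C).sum() + eps * (T * np.log(T / np.outer(a, b))).sum()
    # Duals of the relative-entropy dual problem, up to a constant.
    return cost, eps * np.log(u / a), eps * np.log(v / b)

rng = np.random.default_rng(1)
n = 20
x = np.linspace(0.0, 1.0, n)
C = np.abs(x[:, None] - x[None, :])
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()

W, u_eps, v_eps = eot_cost_and_duals(a, b, C, eps=0.1)
h = rng.standard_normal(n); h -= h.mean()    # tangent: sums to zero
t = 1e-6
W_t, _, _ = eot_cost_and_duals(a + t * h, b, C, eps=0.1)
fd = (W_t - W) / t                           # finite difference in direction h
```

The finite difference `fd` should agree with the linear functional $\langle u_\varepsilon, h \rangle$ up to discretization error.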

Let $\hat a_n$ and $\hat b_m$ be empirical measures built from $n$ and $m$ i.i.d. samples. With $m/(n+m) \to \gamma \in (0,1)$ and multinomial covariance matrices $\Sigma(a), \Sigma(b)$, asymptotic normality holds:

  • For the (non-centered) Sinkhorn divergence,

$$\sqrt{n}\,\Big(W_{p,\varepsilon}^p(\hat a_n, b) - W_{p,\varepsilon}^p(a, b)\Big) \to_d \langle G, u_\varepsilon \rangle,$$

and for the two-sample statistic,

$$\rho_{n,m}\,\Big(W_{p,\varepsilon}^p(\hat a_n, \hat b_m) - W_{p,\varepsilon}^p(a, b)\Big) \to_d \sqrt{\gamma}\,\langle G, u_\varepsilon \rangle + \sqrt{1-\gamma}\,\langle H, v_\varepsilon \rangle,$$

where $\rho_{n,m} := \sqrt{nm/(n+m)}$ and $G \sim \mathcal{N}(0, \Sigma(a))$, $H \sim \mathcal{N}(0, \Sigma(b))$ are independent Gaussian limits.

  • For the centered Sinkhorn loss $S_{p,\varepsilon}$, similar central limit theorems describe both the alternative ($a \neq b$) and null ($a = b$) regimes. Under $a = b$, the limit is non-Gaussian: a weighted mixture of chi-square variables determined by the Hessian of $S_{p,\varepsilon}$ at $(a, a)$ (Bigot et al., 2017).

These results extend the statistical theory of OT to the regularized regime, and the limit laws are essential for valid hypothesis tests and confidence intervals.

Bootstrap procedures enable practical inference:

  • Under $a \neq b$, the law of $\sqrt{n}\,(W_{p,\varepsilon}^p(\hat a_n^*, b) - W_{p,\varepsilon}^p(\hat a_n, b))$, where $\hat a_n^*$ is a bootstrap resample of $\hat a_n$, converges to the correct asymptotic distribution.
  • Under $a = b$, the standard bootstrap fails due to first-order degeneracy; a second-order correction (Babu correction) recovers consistency.
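
Under $a \neq b$, the recentred one-sample bootstrap can be sketched as follows (the grid, sample sizes, number of resamples, and the helper `sinkhorn_cost` are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps, n_iter=1000):
    """EOT cost W_eps(a, b) in the relative-entropy form."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]
    return (T * C).sum() + eps * (T * np.log(T / np.outer(a, b))).sum()

rng = np.random.default_rng(2)
N, n, eps = 15, 2000, 0.1
x = np.linspace(0.0, 1.0, N)
C = (x[:, None] - x[None, :]) ** 2
b = np.full(N, 1 / N)                       # reference measure
a = np.linspace(1.0, 2.0, N); a /= a.sum()  # true measure, a != b

samples = rng.choice(N, size=n, p=a)
a_hat = np.bincount(samples, minlength=N) / n

# Bootstrap: resample from a_hat, recentre at the plug-in statistic.
stat_hat = sinkhorn_cost(a_hat, b, C, eps)
boot = []
for _ in range(200):
    res = rng.choice(N, size=n, p=a_hat)
    a_star = np.bincount(res, minlength=N) / n
    boot.append(np.sqrt(n) * (sinkhorn_cost(a_star, b, C, eps) - stat_hat))
lo, hi = np.percentile(boot, [2.5, 97.5])
ci = (stat_hat - hi / np.sqrt(n), stat_hat - lo / np.sqrt(n))  # 95% CI
```

The bootstrap quantiles of the recentred statistic stand in for the (unknown) limiting law $\langle G, u_\varepsilon \rangle$ when forming confidence intervals.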

4. Limit Behavior: The $\varepsilon \to 0$ Asymptotics and Recovery of OT

As the regularization parameter vanishes, EOT recovers the classical Kantorovich OT problem. Under the mild decay condition $\sqrt{n}\,\varepsilon_n \log(1/\varepsilon_n) \to 0$, the central limit theorem for EOT converges to the classical OT central limit law, and the regularized dual optimizers converge to (possibly non-unique) Kantorovich duals (Bigot et al., 2017).

This formalizes the exact sense in which EOT interpolates between maximum-entropy (fully regularized) couplings and the singular, possibly non-unique unregularized OT solutions, providing a controlled, smooth approximation valid in both computational and statistical limits.
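
This interpolation can be observed numerically on the real line, where the unregularized $W_1$ has a closed form via CDF differences (the grid and the chosen $\varepsilon$ values are illustrative):

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps, n_iter=5000):
    """EOT cost W_eps(a, b) = <T*, C> + eps * H(T* | a x b)."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]
    return (T * C).sum() + eps * (T * np.log(T / np.outer(a, b))).sum()

rng = np.random.default_rng(3)
N = 30
x = np.linspace(0.0, 1.0, N)
C = np.abs(x[:, None] - x[None, :])            # p = 1 cost on the line
a = rng.random(N); a /= a.sum()
b = rng.random(N); b /= b.sum()

# Exact (unregularized) W_1 from the CDF formula on the real line.
W_exact = np.sum(np.abs(np.cumsum(a) - np.cumsum(b))[:-1] * np.diff(x))

# Entropic bias shrinks monotonically as eps decreases toward zero.
errs = [sinkhorn_cost(a, b, C, eps) - W_exact for eps in (0.5, 0.2, 0.05)]
```

Since the relative-entropy term is non-negative, $W_{1,\varepsilon} \geq W_1$ for every $\varepsilon > 0$, and the gap decreases as $\varepsilon \to 0$.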

5. Practical Applications and Empirical Behavior

Empirical studies confirm theory in both synthetic and real-data regimes:

  • On $L \times L$ discrete grids in $\mathbb{R}^2$ (e.g., $L = 5, 10, 20$), the empirical distribution of $\sqrt{n}\,(S_{p,\varepsilon}(\hat a_n, b) - S_{p,\varepsilon}(a, b))$ matches the CLT prediction even for moderate $n \approx 10^3$–$10^4$.
  • For the test $H_0: a = b$ on color histograms of autumn vs. winter image sets (3D histograms on a $16^3$ grid), two-sample bootstrap tests with $\varepsilon = 10, 100$ rejected well beyond the $95\%$ bootstrap band, detecting subtle discrepancies that a classical $\chi^2$ test missed.
  • Power analysis for one-sample tests (uniform $a$ vs. $b$ with a linear trend) shows the rejection rate rising rapidly as the signal departs from the null.
  • Relative-entropy and plain-entropy forms of regularization yield comparable discriminative performance.

The Sinkhorn divergence and its centered variant thus provide computationally practical and statistically effective tools for measuring discrepancies, performing clustering, hypothesis testing, and other inferential tasks on high-dimensional discrete distributions (Bigot et al., 2017).

6. Key Theoretical and Methodological Insights

The combination of the following structural properties makes EOT particularly attractive:

  • Efficient computation: $O(N^2)$ arithmetic per Sinkhorn iteration with linear convergence and practical scalability to large discrete domains.
  • Differentiability: Fréchet-differentiable everywhere, enabling optimization and gradient-based learning.
  • Rigorous inference: Established CLTs and bootstrap for both the uncentered and centered losses, covering both the alternative and null regimes.
  • Bias–variance control: the regularization parameter $\varepsilon$ explicitly tunes the trade-off between approximation of unregularized OT and numerical/statistical stability.
  • Limit compatibility: as $\varepsilon \to 0$, the smoothness and statistical properties recover those of the original, unregularized OT.

These features ensure that entropy-regularized OT and its Sinkhorn divergence yield a full analytic toolkit for statistical inference, learning, and testing on discrete distributions in high-dimensional spaces, with robust guarantees and practical performance (Bigot et al., 2017).
