Entropically Regularized Optimal Transport
- Entropically regularized optimal transport is defined by adding an entropy term to the classical optimal transport cost, ensuring existence, uniqueness, and enhanced computational tractability.
- The Sinkhorn algorithm efficiently solves the regularized OT problem through iterative diagonal scaling, with proven convergence in Wasserstein distance and KL divergence.
- This methodology improves statistical estimation and machine learning applications by mitigating the curse of dimensionality and reducing variance in high-dimensional settings.
Entropically regularized optimal transport (EOT) is a fundamental variant of the classical optimal transport problem, in which an entropy term is added to the transport cost functional to induce strict convexity, enforce absolute continuity of couplings, and enhance computational tractability. The entropic regularization forms a bridge between nonlinear transport theory, matrix scaling algorithms, large deviations, and a broad spectrum of statistical and machine learning applications. EOT is both a theoretical object—revealing deep connections to Schrödinger bridges, variational analysis, and asymptotic statistics—and an algorithmic workhorse, undergirded by scalable iterative solvers such as the Sinkhorn algorithm.
1. Mathematical Formulation and Duality
Given Polish probability spaces $(\mathsf{X}, \mu)$ and $(\mathsf{Y}, \nu)$ and a continuous, integrable cost $c : \mathsf{X} \times \mathsf{Y} \to \mathbb{R}$, the entropically regularized optimal transport problem with parameter $\varepsilon > 0$ seeks a coupling $\pi \in \Pi(\mu, \nu)$ minimizing
$$\mathrm{OT}_\varepsilon(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int c \, d\pi + \varepsilon\, H(\pi \,|\, \mu \otimes \nu),$$
where $H(\pi \,|\, \mu \otimes \nu) = \int \log \frac{d\pi}{d(\mu \otimes \nu)} \, d\pi$ is the relative entropy with respect to the product measure. Strict convexity of the entropy term guarantees the existence and uniqueness of the minimizer $\pi_\varepsilon$ for all $\varepsilon > 0$.
The dual variational form, derived by Fenchel–Legendre conjugation, reads
$$\mathrm{OT}_\varepsilon(\mu,\nu) = \sup_{f \in L^1(\mu),\, g \in L^1(\nu)} \int f \, d\mu + \int g \, d\nu - \varepsilon \int \left( e^{(f \oplus g - c)/\varepsilon} - 1 \right) d(\mu \otimes \nu),$$
where $(f \oplus g)(x,y) = f(x) + g(y)$, with the supremum attained at a unique pair of Schrödinger potentials $(f_\varepsilon, g_\varepsilon)$, up to additive constants, and normalized, e.g., such that $\int f_\varepsilon \, d\mu = 0$. The corresponding optimal coupling has the Gibbs form
$$\frac{d\pi_\varepsilon}{d(\mu \otimes \nu)}(x,y) = \exp\!\left( \frac{f_\varepsilon(x) + g_\varepsilon(y) - c(x,y)}{\varepsilon} \right).$$
This factorization underpins the matrix scaling approach and forms the analytic basis for the Sinkhorn algorithm (Nutz et al., 2021).
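In the discrete case, the Gibbs form says the optimal coupling is a diagonal rescaling $\pi_\varepsilon = \operatorname{diag}(u)\,K\,\operatorname{diag}(v)$ of the kernel $K = e^{-C/\varepsilon}$, where $u, v$ are the exponentiated potentials. A minimal NumPy sketch (the random cost and uniform marginals are illustrative assumptions, not taken from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 5, 0.5
C = rng.random((n, n))        # cost matrix C[i, j] = c(x_i, y_j)
mu = np.full(n, 1 / n)        # source marginal
nu = np.full(n, 1 / n)        # target marginal

K = np.exp(-C / eps)          # Gibbs kernel e^{-c/eps}
u = np.ones(n)
for _ in range(2000):         # alternate diagonal scalings (Sinkhorn)
    v = nu / (K.T @ u)
    u = mu / (K @ v)

# Optimal entropic coupling in factored (Gibbs) form: diag(u) K diag(v)
pi = u[:, None] * K * v[None, :]
assert np.allclose(pi.sum(axis=1), mu)   # row marginals match mu
assert np.allclose(pi.sum(axis=0), nu)   # column marginals match nu
```

The two scaling vectors are the only unknowns, which is why the regularized problem reduces to matrix balancing rather than a full linear program.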
2. Convergence to Classical Optimal Transport and Schrödinger Potentials
As $\varepsilon \to 0$, the entropic problem recovers the classical Monge–Kantorovich transport cost
$$\mathrm{OT}_0(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int c \, d\pi,$$
whose dual is the convex problem
$$\sup \left\{ \int f \, d\mu + \int g \, d\nu \;:\; f(x) + g(y) \le c(x,y) \right\}.$$
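The small-$\varepsilon$ limit can be observed numerically: with uniform marginals on $n$ atoms, the unregularized optimum reduces to an assignment problem, and the transport cost of the entropic plan approaches it as $\varepsilon$ shrinks. The sizes and $\varepsilon$ values below are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n = 8
C = rng.random((n, n))
mu = nu = np.full(n, 1 / n)

# Unregularized OT value: with uniform marginals, an optimal coupling is a
# permutation matrix (Birkhoff), so the LP reduces to an assignment problem.
row, col = linear_sum_assignment(C)
ot0 = C[row, col].mean()

def entropic_plan_cost(eps, iters=5000):
    """Transport cost <C, pi_eps> of the Sinkhorn-computed entropic plan."""
    K = np.exp(-C / eps)
    u = np.ones(n)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    pi = u[:, None] * K * v[None, :]
    return float((pi * C).sum())

# Entropy pulls the plan toward the independent coupling, inflating the
# transport cost; the inflation vanishes as eps -> 0.
gap_large = entropic_plan_cost(1.0) - ot0
gap_small = entropic_plan_cost(0.2) - ot0
assert gap_large > gap_small >= -1e-9
```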
The theory of EOT establishes the following rigorous convergence principles for Schrödinger potentials:
| Structural property | Description | Source |
|---|---|---|
| Compactness of potentials | The Schrödinger potentials $(f_\varepsilon, g_\varepsilon)$ are uniformly bounded in $L^1(\mu) \times L^1(\nu)$ along any sequence $\varepsilon_n \to 0$ | (Nutz et al., 2021) |
| Limits are Kantorovich potentials | Any cluster point $(f_0, g_0)$ solves the classical dual and satisfies $f_0 \oplus g_0 = c$ a.e. on the support of the optimal plan, i.e., is a pair of Kantorovich potentials | (Nutz et al., 2021) |
| Strong convergence | If the classical Kantorovich dual solution is unique, then $(f_\varepsilon, g_\varepsilon) \to (f_0, g_0)$ in $L^1(\mu) \times L^1(\nu)$ as $\varepsilon \to 0$ | (Nutz et al., 2021) |
The entropy term imparts strict convexity in the primal and strict concavity in the dual, ensuring unique potentials and aiding numerical stability and convergence.
3. Algorithmic Structure: Iterative Methods and Sinkhorn Algorithm
The entropic regularizer enables diagonal scaling (matrix balancing) methods, notably the Sinkhorn algorithm, to solve the regularized OT problem efficiently. The potentials are updated alternately as
$$f^{(k+1)}(x) = -\varepsilon \log \int e^{\left(g^{(k)}(y) - c(x,y)\right)/\varepsilon} \, \nu(dy), \qquad g^{(k+1)}(y) = -\varepsilon \log \int e^{\left(f^{(k+1)}(x) - c(x,y)\right)/\varepsilon} \, \mu(dx),$$
iterating until the marginal constraints are satisfied. Convergence in Wasserstein distance and KL divergence to the unique entropic-OT coupling is established under moment and regularity conditions (Eckstein et al., 2021). In discrete settings, Sinkhorn scaling rapidly solves problems of substantial size (Bigot et al., 2017).
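The potential-form updates transcribe directly into the log domain, where `logsumexp` keeps the kernel $e^{-c/\varepsilon}$ from under- or overflowing. The discrete example below (uniform marginals, random cost) is an illustrative sketch, not a reference implementation from the cited works:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
n, eps = 6, 0.2
C = rng.random((n, n))                     # cost c(x_i, y_j)
log_mu = np.full(n, -np.log(n))            # log of uniform source marginal
log_nu = np.full(n, -np.log(n))            # log of uniform target marginal

f = np.zeros(n)
g = np.zeros(n)
for _ in range(5000):
    # f(x) = -eps * log \int exp((g(y) - c(x,y))/eps) nu(dy)
    f = -eps * logsumexp((g[None, :] - C) / eps + log_nu[None, :], axis=1)
    # g(y) = -eps * log \int exp((f(x) - c(x,y))/eps) mu(dx)
    g = -eps * logsumexp((f[:, None] - C) / eps + log_mu[:, None], axis=0)

# Recover the coupling from the Gibbs form and check the marginal constraints.
log_pi = (f[:, None] + g[None, :] - C) / eps + log_mu[:, None] + log_nu[None, :]
pi = np.exp(log_pi)
assert np.allclose(pi.sum(axis=0), 1 / n, atol=1e-6)
assert np.allclose(pi.sum(axis=1), 1 / n, atol=1e-6)
```

Each half-update makes one set of marginal constraints hold exactly, which is the analytic content of "iterative proportional fitting."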
Advances include greedy and stochastic scaling (Greenkhorn, block methods), extragradient saddle-point solvers, and neural network-based dual estimators for high-dimensional empirical OT (Li et al., 2023, Wang et al., 2024). The regularization parameter $\varepsilon$ governs the trade-off between bias, smoothness, and computational speed.
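One way to realize the greedy (Greenkhorn-style) idea is to rescale, at each step, only the single row or column whose marginal violation is currently largest. The selection rule below (absolute violation rather than the KL-type distance used in the literature) and the problem data are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 6, 0.5
C = rng.random((n, n))
mu = np.full(n, 1 / n)
nu = np.full(n, 1 / n)

K = np.exp(-C / eps)
u = np.ones(n)
v = np.ones(n)
for _ in range(20000):
    pi = u[:, None] * K * v[None, :]
    r_err = np.abs(pi.sum(axis=1) - mu)    # row-marginal violations
    c_err = np.abs(pi.sum(axis=0) - nu)    # column-marginal violations
    if max(r_err.max(), c_err.max()) < 1e-9:
        break
    if r_err.max() >= c_err.max():
        i = int(r_err.argmax())
        u[i] = mu[i] / (K[i, :] @ v)       # fix the worst row exactly
    else:
        j = int(c_err.argmax())
        v[j] = nu[j] / (K[:, j] @ u)       # fix the worst column exactly

pi = u[:, None] * K * v[None, :]
assert abs(pi.sum() - 1.0) < 1e-6
```

Each greedy step is an exact coordinate-ascent update on the dual, which is why touching one row or column at a time still converges.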
4. Statistical and Variational Implications
The entropic regularization crucially alters the statistical behavior of empirical OT estimators. Classical empirical OT suffers from a severe curse of dimensionality, with convergence rates of order $n^{-1/d}$. EOT, with the centered Sinkhorn divergence, achieves parametric $n^{-1/2}$ rates under mild conditions, bringing scalable OT-based statistics to high dimensions (Bigot et al., 2017, Barrio et al., 2020, Wang et al., 2024).
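The centered Sinkhorn divergence referred to here is $S_\varepsilon(\mu,\nu) = \mathrm{OT}_\varepsilon(\mu,\nu) - \tfrac12 \mathrm{OT}_\varepsilon(\mu,\mu) - \tfrac12 \mathrm{OT}_\varepsilon(\nu,\nu)$, which vanishes at $\mu = \nu$ and removes the leading entropic bias. A sketch on empirical point clouds (sample sizes, geometry, and $\varepsilon$ are illustrative choices):

```python
import numpy as np

def ot_eps(X, Y, eps=0.5, iters=5000):
    """Entropic OT cost between uniform empirical measures on points X, Y."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    m, n = C.shape
    mu, nu = np.full(m, 1 / m), np.full(n, 1 / n)
    K = np.exp(-C / eps)
    u = np.ones(m)
    for _ in range(iters):                               # Sinkhorn scaling
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    pi = u[:, None] * K * v[None, :]
    kl = float((pi * np.log(pi / (mu[:, None] * nu[None, :]))).sum())
    return float((pi * C).sum()) + eps * kl              # primal value

def sinkhorn_divergence(X, Y, eps=0.5):
    # Centering removes the entropic bias, so S_eps(mu, mu) = 0.
    return ot_eps(X, Y, eps) - 0.5 * ot_eps(X, X, eps) - 0.5 * ot_eps(Y, Y, eps)

rng = np.random.default_rng(4)
X = rng.random((20, 2))          # sample from mu, in the unit square
Y = rng.random((20, 2)) + 0.5    # shifted sample from nu
assert abs(sinkhorn_divergence(X, X)) < 1e-8   # zero at mu = nu
assert sinkhorn_divergence(X, Y) > 0           # separates distinct measures
```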
Central limit theorems quantify the limiting distributions of empirical regularized OT losses, enabling the construction of robust two-sample tests and bootstrap quantile procedures. The bias introduced by regularization is of order $\varepsilon \log(1/\varepsilon)$, and the variance reduction is directly linked to statistical consistency and inferential tractability (Barrio et al., 2020, Eckstein et al., 2022). Quantization and martingale coupling techniques yield sharp convergence rates for general cost structures; specifically, for quadratic cost in $\mathbb{R}^d$,
$$\mathrm{OT}_\varepsilon(\mu,\nu) - \mathrm{OT}_0(\mu,\nu) = \frac{d}{2}\, \varepsilon \log(1/\varepsilon) + O(\varepsilon),$$
which is dimensionally sharp (Eckstein et al., 2022).
5. Generalizations: Multi-Marginal and Robust OT, Orlicz Analysis
The EOT paradigm generalizes to multi-marginal settings, non-standard costs, and problems with additional linear or martingale constraints. The variational analysis, including $\Gamma$-convergence, shows that, under suitable continuity and summability hypotheses, EOT converges to OT (possibly with a “relaxed” cost if the original cost function is not continuous almost everywhere) (Brizzi et al., 7 Jan 2025, Clason et al., 2019, Hiew et al., 2024).
Orlicz space theory is leveraged to analyze existence and regularity of entropic minimizers in continuous settings. The entropy regularizer forces absolute continuity with respect to reference measures (often Lebesgue or product marginals); existence of a minimizer is guaranteed if and only if the marginals have finite entropy (Clason et al., 2019). Extensions include entropy-regularized robust OT, incorporating entropic penalties on marginals for enhanced regularization and robustness (Dahyot et al., 2019).
6. Geometry, Stability, and Schrödinger Bridges
EOT exhibits deep geometric structure via cyclical invariance, a density-based analogue of $c$-cyclical monotonicity. This structure guarantees stability: entropic OT couplings, potentials, and values are quantitatively Lipschitz or Hölder continuous under perturbations of the marginals and cost function (Eckstein et al., 2021, Ghosal et al., 2021). Exponential tail estimates and large deviations characterize the rate at which the entropic minimizer concentrates onto the support of the classical OT solution in the small-$\varepsilon$ regime (Bernton et al., 2021). In the Schrödinger bridge framework, EOT corresponds to the small-noise limit, linking stochastic transport and deterministic OT (Nutz et al., 2021, Bernton et al., 2021).
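Cyclical invariance follows algebraically from the Gibbs form: for any two source points and two target points, $\pi(x_1,y_1)\,\pi(x_2,y_2) = \pi(x_1,y_2)\,\pi(x_2,y_1)\, e^{\left(c(x_1,y_2)+c(x_2,y_1)-c(x_1,y_1)-c(x_2,y_2)\right)/\varepsilon}$, so swapping targets changes the product of coupling masses by exactly the Gibbs factor of the cost swap. A numerical check on a small discrete example (setup illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, eps = 5, 0.5
C = rng.random((n, n))
mu = np.full(n, 1 / n)
nu = np.full(n, 1 / n)

K = np.exp(-C / eps)
u = np.ones(n)
for _ in range(2000):                 # Sinkhorn scaling
    v = nu / (K.T @ u)
    u = mu / (K @ v)
pi = u[:, None] * K * v[None, :]      # Gibbs-form coupling diag(u) K diag(v)

# Swapping targets k <-> l between sources i, j changes the product of
# coupling masses by exactly the Gibbs factor of the cost exchange.
for i, j, k, l in [(0, 1, 2, 3), (4, 0, 1, 2), (3, 3, 0, 4)]:
    lhs = pi[i, k] * pi[j, l]
    rhs = pi[i, l] * pi[j, k] * np.exp((C[i, l] + C[j, k] - C[i, k] - C[j, l]) / eps)
    assert np.isclose(lhs, rhs)
```

Note that the identity holds for any diagonal scaling of $K$, not just the converged one; it is a structural property of the Gibbs factorization.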
7. Computational and Applied Perspectives
The entropic regularization allows for scalable algorithms tailored to modern high-dimensional data domains. Sinkhorn iterations, stochastic block solvers, and neural estimation methods scale polynomially with data size and permit parallelization and GPU acceleration (Liu et al., 2018, Wang et al., 2024, Abid et al., 2018). Domain decomposition, adaptive sparsity, multiscale, and parallel processing further amplify tractability for applications such as massive image comparison, generative modeling, and domain adaptation (Bonafini et al., 2020, Liu et al., 2018).
Sinkhorn divergence and entropy-regularized Wasserstein losses have been adopted in discriminative learning, generative adversarial networks, color transfer, and multi-sample testing, supported by rigorous theoretical and empirical advances. The machine learning community leverages these regularized formulations to sidestep non-smoothness and statistical inefficiencies inherent in classical OT.
In summary, entropically regularized optimal transport provides a robust, scalable, and theoretically justified basis for both computation and inference in transport-based modeling. Its dual structure, convergence theory, and statistical properties make it foundational for modern applications in analysis, statistics, and machine learning.