
Sinkhorn Distances: Entropic-Regularized OT

Updated 22 February 2026
  • Sinkhorn distances are entropic-regularized optimal transport measures that interpolate between true Wasserstein metrics and kernel MMDs, offering scalable and differentiable computations.
  • The approach employs the Sinkhorn–Knopp iteration to solve a strictly convex program with debiasing terms, ensuring geometric convergence and efficient GPU acceleration.
  • Applications span generative modeling, imitation learning, and manifold statistics, balancing theoretical rigor with practical scalability in high-dimensional settings.

Sinkhorn distances (also known as entropic-regularized optimal transport distances or regularized Wasserstein metrics) are a family of geometric divergences between probability measures that arise from adding an entropic penalty to the classical optimal transport (OT) problem and are computed efficiently by matrix scaling algorithms (notably the Sinkhorn–Knopp algorithm). They interpolate between the true Wasserstein metric and kernel-based Maximum Mean Discrepancies (MMDs), offering computational tractability and differentiability at scale. Beyond their foundational role in computational optimal transport, Sinkhorn distances are used in diverse areas such as generative modeling, imitation learning, geometric statistics on manifolds, scalable high-dimensional data analysis, and efficient streaming evaluation of OT. They are also the basis for numerous algorithmic and theoretical advances in large-scale statistical inference, machine learning, and computational geometry.

1. Definition, Formulation, and Theoretical Foundation

Let $(X,d)$ be a compact metric space or a bounded subset of $\mathbb{R}^d$ equipped with a continuous ground cost $c: X \times X \to \mathbb{R}_+$. For probability measures $\mu, \nu \in \mathcal{P}(X)$, the entropy-regularized OT (Sinkhorn) cost with regularization parameter $\varepsilon > 0$ is

$$\mathrm{OT}_\varepsilon(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \left\{ \iint c(x, y)\, d\pi(x, y) + \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu)\right\}$$

where $\Pi(\mu, \nu)$ is the set of couplings of $\mu$ and $\nu$, and $\mathrm{KL}$ is the Kullback–Leibler divergence. The dual form reads

$$\mathrm{OT}_\varepsilon(\mu, \nu) = \sup_{f, g \in C(X)} \int f\, d\mu + \int g\, d\nu - \varepsilon \iint \left( e^{(f(x) + g(y) - c(x, y))/\varepsilon} - 1 \right) d\mu(x)\, d\nu(y)$$

Since $\mathrm{OT}_\varepsilon(\mu, \mu) \neq 0$ for $\varepsilon > 0$, an unbiased Sinkhorn divergence is constructed as

$$S_\varepsilon(\mu, \nu) = \mathrm{OT}_\varepsilon(\mu, \nu) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\mu, \mu) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\nu, \nu)$$

This debiased divergence is symmetric, non-negative, convex in each of its arguments, and, under mild universal kernel hypotheses, satisfies $S_\varepsilon(\mu, \nu) = 0$ if and only if $\mu = \nu$ (Feydy et al., 2018). As $\varepsilon \to 0$, $S_\varepsilon(\mu, \nu) \to \mathrm{OT}_0(\mu, \nu)$, i.e., the true Wasserstein distance.

Key properties include separate convexity, smoothness, metrization of weak convergence, and connections between bias/variance and $\varepsilon$ (Feydy et al., 2018, Lavenant et al., 2024).
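For discrete measures, the debiased divergence can be sketched directly from its definition. In the NumPy illustration below, the function names and toy data are mine, the entropic cost is taken as the transport term $\langle P, C \rangle$ at the Sinkhorn fixed point, and a fixed iteration count stands in for a proper convergence test:

```python
import numpy as np

def ot_eps(a, x, b, y, eps, n_iter=500):
    """Entropic OT cost between weighted 1D point clouds (squared-distance cost)."""
    C = (x[:, None] - y[None, :]) ** 2
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                # Sinkhorn-Knopp scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    return float(np.sum(P * C))

def sinkhorn_divergence(a, x, b, y, eps):
    """Debiased divergence: OT_eps(mu,nu) - (OT_eps(mu,mu) + OT_eps(nu,nu)) / 2."""
    return (ot_eps(a, x, b, y, eps)
            - 0.5 * ot_eps(a, x, a, x, eps)
            - 0.5 * ot_eps(b, y, b, y, eps))

x = np.array([0.0, 1.0]); y = np.array([2.0, 3.0])
a = np.full(2, 0.5); b = np.full(2, 0.5)
s_self = sinkhorn_divergence(a, x, a, x, eps=0.5)   # vanishes at equality
s_cross = sinkhorn_divergence(a, x, b, y, eps=0.5)  # positive for separated clouds
```

The self-terms cancel exactly when $\mu = \nu$, which is precisely the bias the debiasing corrects.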

2. Algorithmic Framework and Sinkhorn–Knopp Iteration

For discrete measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$, $\nu = \sum_{j=1}^m b_j \delta_{y_j}$, and a cost matrix $C_{ij} = c(x_i, y_j)$, the entropic-regularized OT becomes a strictly convex program,

$$\min_{P \in U(a, b)} \; \langle P, C \rangle + \varepsilon \sum_{i,j} P_{ij} \log \frac{P_{ij}}{a_i b_j}$$

where $U(a, b) = \{P \in \mathbb{R}_+^{n \times m} : P\mathbf{1} = a,\ P^\top\mathbf{1} = b\}$. The solution $P^\star$ has the factorization $P^\star = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ with $K_{ij} = e^{-C_{ij}/\varepsilon}$, and $(u, v)$ found by iterating the Sinkhorn–Knopp scheme $u \leftarrow a \oslash (Kv)$, $v \leftarrow b \oslash (K^\top u)$ until marginal errors are below a threshold (Cuturi, 2013, Altschuler et al., 2017). Each update costs $O(nm)$, converges geometrically under mild conditions, and is stable for moderate $\varepsilon$. Large-scale GPU acceleration is enabled via batched two-sided mat-vec products (Feydy et al., 2018).
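The scaling scheme above fits in a few lines of NumPy. This is a minimal dense sketch (function name and toy data are illustrative; a log-domain variant is needed for very small $\varepsilon$):

```python
import numpy as np

def sinkhorn_knopp(a, b, C, eps, max_iter=1000, tol=1e-9):
    """Entropic OT between discrete measures a, b under cost matrix C.

    Returns the coupling P and the transport cost <P, C>.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(max_iter):
        u = a / (K @ v)                   # enforce row marginals
        v = b / (K.T @ u)                 # enforce column marginals
        # after the v-update the column marginals are exact; check the rows
        if np.max(np.abs(u * (K @ v) - a)) < tol:
            break
    P = u[:, None] * K * v[None, :]
    return P, float(np.sum(P * C))

# toy example: two weighted point clouds on the line, squared-distance cost
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
C = (x[:, None] - y[None, :]) ** 2
a = np.full(3, 1 / 3)
b = np.full(2, 1 / 2)
P, cost = sinkhorn_knopp(a, b, C, eps=0.5)
```

Only mat-vec products with $K$ appear in the loop, which is what makes GPU batching straightforward.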

For square cost matrices, greedy coordinate variants such as Greenkhorn (Altschuler et al., 2017) and Newton-type accelerations (Tang et al., 2024) further optimize performance.
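The greedy idea can be sketched as follows. This is an illustrative Greenkhorn-style variant, not the paper's exact algorithm: coordinate selection here uses absolute marginal deviation for simplicity, whereas Altschuler et al. analyze a $\rho$-divergence criterion:

```python
import numpy as np

def greenkhorn(a, b, C, eps, max_iter=5000, tol=1e-6):
    """Greedy Sinkhorn: instead of sweeping every row and column,
    rescale only the single marginal that currently deviates most."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(max_iter):
        r = u * (K @ v)                    # current row marginals
        c = v * (K.T @ u)                  # current column marginals
        dr, dc = np.abs(r - a), np.abs(c - b)
        i, j = int(np.argmax(dr)), int(np.argmax(dc))
        if max(dr[i], dc[j]) < tol:
            break
        if dr[i] >= dc[j]:
            u[i] = a[i] / (K[i] @ v)       # fix the worst row exactly
        else:
            v[j] = b[j] / (K[:, j] @ u)    # fix the worst column exactly
    return u[:, None] * K * v[None, :]

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
C = (x[:, None] - y[None, :]) ** 2
a, b = np.full(3, 1 / 3), np.full(2, 1 / 2)
P = greenkhorn(a, b, C, eps=0.5)
```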

3. Statistical, Geometric, and Computational Properties

Sinkhorn divergences interpolate between Wasserstein geometry ($\varepsilon \to 0$) and kernel MMD ($\varepsilon \to \infty$) (Feydy et al., 2018). As $\varepsilon$ increases, the estimator becomes more regularized (biased) but exhibits improved sample complexity (from $O(n^{-1/d})$ for unregularized OT to $O(n^{-1/2})$ in the large-$\varepsilon$ regime) (Feydy et al., 2018, Chizat et al., 2020). For moderate $\varepsilon$, $S_\varepsilon$ preserves OT-type geometry but is computationally tractable and differentiable at scale.

In the large-$n$ regime, low-rank kernel compression methods (Nyström (Altschuler et al., 2018), hierarchical (Motamed, 2020)) and graph-based geodesic approximations (Huguet et al., 2022) achieve linear or near-linear complexity while maintaining controlled approximation error.
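The Nyström idea can be illustrated on the Gibbs kernel itself: factor $K \approx L W^{+} L^\top$ through $m \ll n$ landmark points so that each Sinkhorn mat-vec costs $O(nm)$ rather than $O(n^2)$. In this sketch the sample data, landmark count, and bandwidth are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 2000, 50, 0.5
x = rng.normal(size=n)

def gibbs(xa, xb):
    """Gibbs kernel exp(-|x - y|^2 / eps) between 1D point sets."""
    return np.exp(-((xa[:, None] - xb[None, :]) ** 2) / eps)

# Nystrom factorization through m landmark points: K ~ L @ pinv(W) @ L.T
landmarks = x[rng.choice(n, size=m, replace=False)]
L = gibbs(x, landmarks)                     # n x m
W_pinv = np.linalg.pinv(gibbs(landmarks, landmarks))

v = rng.random(n)
Kv_approx = L @ (W_pinv @ (L.T @ v))        # O(nm) mat-vec
Kv_exact = gibbs(x, x) @ v                  # O(n^2) reference
rel_err = np.linalg.norm(Kv_approx - Kv_exact) / np.linalg.norm(Kv_exact)
```

Because the Gibbs kernel's spectrum decays rapidly for moderate $\varepsilon$, a small number of landmarks typically suffices for an accurate mat-vec.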

Theoretical results include non-asymptotic error and complexity bounds, bias-variance trade-offs for $S_\varepsilon$ versus plug-in estimators, and bounds on the regularization-induced error as a function of $\varepsilon$ and data smoothness (Chizat et al., 2020).

Notably, the Hessian of $S_\varepsilon$ with respect to the measure defines a Riemannian metric on the probability simplex, inducing a geodesic distance metrizing the weak-* topology and connecting to the structure of RKHS (Lavenant et al., 2024).

4. Extensions: Streaming, Stochastic, and Online Sinkhorn

Sinkhorn divergences enable extensions to streaming and online learning contexts where classic batch OT is infeasible. The online Sinkhorn algorithm (Mensch et al., 2020) maintains kernel mixture representations of dual potentials, updating them incrementally as new samples arrive, yielding nearly consistent estimation of regularized OT from streams. Further, compressed online Sinkhorn (Wang et al., 2023) introduces moment-preserving compression to stabilize memory and computation, achieving convergence rates matching the best nonparametric online schemes and leveraging efficient test function families (Fourier features, Gaussian quadrature).
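The streaming setting can be caricatured without the kernel-mixture machinery. The sketch below is not the algorithm of Mensch et al. (2020): it simply appends each incoming minibatch to a stored support (so memory grows with the stream, unlike compressed schemes) and warm-starts a few log-domain Sinkhorn sweeps from the previous dual potentials, returning the dual estimate $\int f\, d\mu + \int g\, d\nu$:

```python
import numpy as np

def lse(z):
    """Numerically stable log-sum-exp."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

class StreamingSinkhorn:
    """Simplified streaming estimator of OT_eps between two 1D sample streams."""
    def __init__(self, eps, sweeps=5):
        self.eps, self.sweeps = eps, sweeps
        self.x = np.empty(0); self.y = np.empty(0)
        self.f = np.empty(0); self.g = np.empty(0)

    def update(self, x_batch, y_batch):
        # grow the stored support; new potentials start at zero
        self.x = np.concatenate([self.x, x_batch])
        self.y = np.concatenate([self.y, y_batch])
        self.f = np.concatenate([self.f, np.zeros(len(x_batch))])
        self.g = np.concatenate([self.g, np.zeros(len(y_batch))])
        C = (self.x[:, None] - self.y[None, :]) ** 2
        la, lb = -np.log(len(self.x)), -np.log(len(self.y))  # uniform log-weights
        for _ in range(self.sweeps):       # warm-started log-domain sweeps
            self.f = np.array([-self.eps * lse((self.g - C[i]) / self.eps + lb)
                               for i in range(len(self.x))])
            self.g = np.array([-self.eps * lse((self.f - C[:, j]) / self.eps + la)
                               for j in range(len(self.y))])
        return self.f.mean() + self.g.mean()  # dual estimate of OT_eps

rng = np.random.default_rng(0)
stream = StreamingSinkhorn(eps=0.5)
for _ in range(3):
    est = stream.update(rng.normal(0.0, 0.1, 16), rng.normal(3.0, 0.1, 16))
```

The compressed online schemes cited above exist precisely to avoid this sketch's growing support and full cost-matrix rebuild.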

These algorithms allow OT-based distances to be used as loss functions in continuous deep generative modeling, domain adaptation, and other settings where large or dynamically growing datasets prevent materializing the full cost matrix or kernel.

5. Applications and Generalizations

Sinkhorn divergences underpin numerous machine learning and statistical applications:

  • Generative modeling: Sinkhorn Autoencoders (SAE) (Patrini et al., 2018) and WAE-OT models leverage $S_\varepsilon$ as matching metrics in latent spaces, yielding geometry-aware, differentiable training that generalizes to non-Euclidean supports and priors.
  • Imitation learning: SIL (Sinkhorn imitation learning) (Papagiannis et al., 2020) replaces the adversarial divergence objective with a Sinkhorn distance, computed under an adversarially learned ground metric, between occupancy measures in an RL setting, providing superior discriminative power and stable training compared to GAN-based approaches.
  • Representation learning: Sinkhorn divergences regularize unsupervised learning of structured audio representations, enhancing additivity and interpretability (Mimilakis et al., 2020).
  • Mixture modeling: Chain-Rule OT and its Sinkhorn regularization extend to divergences between statistical mixtures, including Renyi and KL divergences on GMMs, with guarantees on smoothness and upper bounds (Nielsen et al., 2018).
  • Statistics on manifolds: Geodesic Sinkhorn utilizes graph Laplacian-based heat kernels to match empirically-defined distributions on non-Euclidean supports, capturing geometries inaccessible to classical (Euclidean-cost) Sinkhorn (Huguet et al., 2022).
  • Stochastic processes: Nested Sinkhorn divergences efficiently compute entropy-regularized (multistage) OT for stochastic processes, offering sharply reduced computational costs compared to full nested LPs (Pichler et al., 2021).

6. Limitations, Riemannian Geometry, and Open Problems

Despite positive definiteness and separate convexity, Sinkhorn divergences are not jointly convex in $(\mu, \nu)$, and their square root fails the triangle inequality, so $S_\varepsilon$ does not define a metric (Lavenant et al., 2024, Feydy et al., 2018). The Riemannian structure defined by the Hessian of $S_\varepsilon$ yields a geodesic distance metrizing the weak-* topology; however, geometric properties such as geodesic uniqueness or compatibility with classical OT geodesics can diverge in the large-$\varepsilon$ regime. The debiasing terms are necessary to avoid entropic fixed points that can conflict with true metric properties.

Ongoing challenges include extending sharp complexity bounds to high-dimensional and manifold scenarios (especially for extremely small $\varepsilon$), further improving approximation schemes for streaming and online contexts, and elucidating the geometry and topology of Sinkhorn-induced metric structures (Lavenant et al., 2024, Huguet et al., 2022). Asymptotic statistical efficiency, bias correction (e.g., via Richardson extrapolation (Chizat et al., 2020)), and compatibility with scalable GPU infrastructure remain active research directions.

7. Comparative Table: Key Properties and Regimes

Property                | Classical OT       | Sinkhorn ($S_\varepsilon$) | MMD ($\varepsilon \to \infty$)
Positive definite       | Yes                | Yes                        | Yes
(Joint) convexity       | Yes                | No                         | Yes
Metric (triangle ineq.) | Yes                | Not for $S_\varepsilon$    | Yes
Geometry                | Wasserstein        | Interpolates OT/MMD        | Kernel
Stochastic computation  | Infeasible         | Online/streaming feasible  | SGD/minibatch
GPU/parallelization     | Hard (LP)          | Easy (mat-vec/scaling)     | Easiest
Applicability           | Small/moderate $n$ | Large scale                | Massive scale

This delineates the tractable, geometry-aware, and scalable regime occupied by Sinkhorn distances, central to modern computational optimal transport (Cuturi, 2013, Feydy et al., 2018, Chizat et al., 2020).
