Sliced Wasserstein Loss Function
- The Sliced Wasserstein loss decomposes high-dimensional optimal transport into a collection of tractable 1D problems using random projections and Monte Carlo integration.
- It leverages closed-form 1D Wasserstein computations and differentiable sorting to efficiently compute gradients and improve sample complexity in deep learning tasks.
- This loss function has wide applications in generative modeling, variational inference, domain adaptation, and privacy-preserving learning, offering robust performance in high-dimensional settings.
The Sliced Wasserstein loss function provides a computationally tractable alternative to the standard Wasserstein distance for comparing probability distributions, particularly in the context of high-dimensional data and deep generative modeling. By decomposing the optimal transport computation into a collection of one-dimensional problems via projection onto random directions, Sliced Wasserstein losses (often denoted as SW or SWD) enable efficient, provably metric-based training and evaluation of probabilistic models. This framework admits fast Monte Carlo approximation, closed-form gradients, and advantageous sample complexity properties, making it foundational in modern optimal transport-based algorithms across deep learning, domain adaptation, privacy-preserving learning, and more.
1. Mathematical Definition and Fundamental Properties
Given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ with finite $p$-th moments, the $p$-Sliced Wasserstein distance is defined as
$$\mathrm{SW}_p^p(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_p^p\big(P^\theta_{\#}\mu,\; P^\theta_{\#}\nu\big)\, \mathrm{d}\sigma(\theta),$$
where $P^\theta \colon x \mapsto \langle \theta, x \rangle$ is the orthogonal projection onto the direction $\theta$, $\sigma$ is the uniform measure on the unit sphere $\mathbb{S}^{d-1}$, and $W_p$ denotes the $p$-Wasserstein distance between one-dimensional distributions (i.e., the push-forwards $P^\theta_{\#}\mu$ and $P^\theta_{\#}\nu$) (Vauthier et al., 10 Feb 2025, Tanguy, 2023, Kolouri et al., 2018).
Notably, in one dimension, $W_p$ admits a closed form based on quantile functions:
$$W_p^p(\mu, \nu) = \int_0^1 \big|F_\mu^{-1}(t) - F_\nu^{-1}(t)\big|^p \, \mathrm{d}t,$$
with $F^{-1}$ the quantile function. This enables SW to sidestep the exponential complexity in the dimension $d$ typically incurred in multi-dimensional OT.
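For empirical measures with equal numbers of samples, the quantile formula reduces to sorting both sample lists and averaging the powered gaps. A minimal NumPy sketch:

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Closed-form p-Wasserstein distance (to the p-th power) between two
    1D empirical measures with equal sample counts: sorting realizes the
    optimal monotone coupling in one dimension, so we just average the
    p-th powers of the coordinate-wise gaps between sorted lists."""
    x_sorted = np.sort(x)
    y_sorted = np.sort(y)
    return np.mean(np.abs(x_sorted - y_sorted) ** p)

rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = rng.normal(size=2000)
print(wasserstein_1d(a, b))        # close to 0: same underlying distribution
print(wasserstein_1d(a, b + 3.0))  # roughly 9 = 3**2 for a pure shift
```

The $O(n \log n)$ sort replaces the linear program needed in higher dimensions, which is what makes the slicing decomposition pay off.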
The SW metric satisfies non-negativity, symmetry, and, for $p \ge 1$, the triangle inequality, and it metrizes weak convergence together with convergence of $p$-th moments (Yi et al., 2022, Nguyen et al., 2023).
2. Computational Formulation, Monte Carlo Approximation, and Gradients
Direct evaluation of the SW integral is infeasible in high dimensions. SW losses are approximated via Monte Carlo integration:
$$\widehat{\mathrm{SW}}_p^p(\mu, \nu) = \frac{1}{L} \sum_{l=1}^{L} W_p^p\big(P^{\theta_l}_{\#}\mu,\; P^{\theta_l}_{\#}\nu\big),$$
where $\theta_1, \dots, \theta_L$ are independent random directions drawn uniformly on $\mathbb{S}^{d-1}$ (Rakotomamonjy et al., 2021, Rodríguez-Vítores et al., 3 Feb 2025).
For empirical measures with supports of size $n$ and $m$ (taken equal, $n = m$, for simplicity), one computes the $L$ one-dimensional projections, sorts the projected lists, and matches sorted elements, leading to $O(n \log n)$ cost per direction and $O(L\,n(d + \log n))$ total complexity.
Gradient calculation leverages the differentiability of the 1D Wasserstein loss (almost everywhere outside ties) and the permutation assignment generated by sort indices.
- For $p = 2$ and empirical measures of equal size $n$, the gradient with respect to a sample $x_i$ reads
$$\nabla_{x_i} \widehat{\mathrm{SW}}_2^2 = \frac{2}{Ln} \sum_{l=1}^{L} \big(\langle x_i, \theta_l \rangle - \langle y_{\sigma_l(i)}, \theta_l \rangle\big)\, \theta_l,$$
where $\sigma_l$ is the permutation that matches the rank of the projection of $x_i$ along $\theta_l$ to the equally ranked sorted projection of the $y_j$ (Rakotomamonjy et al., 2021).
These properties are efficient to implement using autodifferentiation frameworks that support backpropagation through permutation/sorting operators (Heitz et al., 2020, Kolouri et al., 2018).
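The sort-and-match gradient can be sketched directly in NumPy by holding the sort permutations fixed, which is effectively what autodiff through a sorting operator computes; a finite-difference check validates the formula away from ties:

```python
import numpy as np

def sw2_loss_and_grad(X, Y, theta):
    """Monte Carlo SW_2^2 loss for equal-size empirical measures, with the
    analytic gradient w.r.t. X obtained by freezing the sort permutations
    (valid almost everywhere, away from ties). A sketch, not a library API."""
    n, L = X.shape[0], theta.shape[0]
    Xp, Yp = X @ theta.T, Y @ theta.T                # (n, L) projections
    ix = np.argsort(Xp, axis=0)
    iy = np.argsort(Yp, axis=0)
    diff = (np.take_along_axis(Xp, ix, axis=0)
            - np.take_along_axis(Yp, iy, axis=0))    # sorted-to-sorted gaps
    loss = np.mean(diff ** 2)
    grad = np.zeros_like(X)
    for l in range(L):
        g = np.zeros(n)
        g[ix[:, l]] = diff[:, l]                     # scatter gaps back to original indices
        grad += (2.0 / (L * n)) * g[:, None] * theta[l]
    return loss, grad
```

Perturbing a single coordinate of `X` and comparing the loss change against the reported gradient entry reproduces the closed-form expression to numerical precision.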
3. Optimization, Theoretical Properties, and Convergence Results
Minimization of Sliced Wasserstein losses appears in parameter estimation, generative model training, and variational inference.
Key optimization facts:
- The SW loss is non-convex globally but convex in each 1D slice; the overall landscape may exhibit saddle points and is only piecewise differentiable due to tie-breaking in sorting (Vauthier et al., 10 Feb 2025, Tanguy, 2023).
- Gradient flows with SW loss, both on empirical and continuous spaces, exhibit strict decrease with suitable step size, and do not admit stable minimizers supported on low-dimensional sets (e.g., segments), preventing collapse to degenerate configurations (Vauthier et al., 10 Feb 2025).
- Discrete-time stochastic gradient descent under mild regularity assumptions converges almost surely to critical points in the sense of Clarke's generalized subdifferential (Tanguy, 2023, Tanguy et al., 2023).
- Monte Carlo approximation of the direction integral converges uniformly over compact sets, and the critical points of the empirical losses converge to those of the true functional as the number of directions $L \to \infty$ (Tanguy et al., 2023).
The sample complexity of the empirical SW estimator is $O(n^{-1/2})$ and is independent of the ambient dimension $d$, in sharp contrast to the dimension-dependent $O(n^{-1/d})$ rate that the full Wasserstein metric suffers under the curse of dimensionality (Rakotomamonjy et al., 2021).
4. Variants and Extensions: Learnable Projections, Energy-Based and Unbalanced SW Losses
Advanced SW loss variants target improved discriminative power, adaptability, and robustness.
- Learnable Orthogonal Projections: Rather than sampling random directions, SW can be approximated with a small set of learnable orthonormal bases, optimized end-to-end with generative models. As shown in autoencoder and GAN frameworks, this approach reduces the projection budget (e.g., 4 × 128 vs. thousands of random projections) and can yield superior performance (Wu et al., 2017).
- Energy-Based Sliced Wasserstein (EBSW): Instead of averaging over the uniform sphere, EBSW employs an “energy”-weighted distribution of directions,
$$\sigma_f(\theta) \propto f\big(W_p^p(P^\theta_{\#}\mu,\; P^\theta_{\#}\nu)\big),$$
with $f$ increasing (e.g., $f = \exp$), concentrating more on directions with high discrepancy. Gradient estimation is performed via importance sampling or Markov chain Monte Carlo over the sphere, leading to better flows in generative models and point-cloud reconstruction (Nguyen et al., 2023).
- Unbalanced SW Loss: For positive measures of unequal mass, unbalanced SW losses apply OT with relaxed marginal constraints in each slice by augmenting with divergence penalties. In the balanced-mass limit, these recover classical SW; the framework is GPU-friendly and admits differentiable Frank-Wolfe-like algorithms (Bonet et al., 2023).
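As a toy illustration of the learnable-projection idea above, the following sketch parametrizes a small set of orthonormal directions via a QR decomposition of an unconstrained matrix. The parametrization and the surrounding training loop are illustrative assumptions, not the exact construction of Wu et al.:

```python
import numpy as np

def orthonormal_directions(W):
    """Map an unconstrained parameter matrix W of shape (K, d) to K
    orthonormal directions via QR. In an end-to-end setup W would be a
    learnable parameter updated jointly with the generative model;
    here we only show the constraint-satisfying forward map (a sketch)."""
    Q, _ = np.linalg.qr(W.T)   # (d, K) with orthonormal columns
    return Q.T                 # (K, d): rows are orthonormal directions

def sw2_fixed_directions(X, Y, theta):
    """SW_2^2 evaluated on a fixed (e.g., learned) direction set instead
    of random samples from the sphere."""
    Xp = np.sort(X @ theta.T, axis=0)
    Yp = np.sort(Y @ theta.T, axis=0)
    return np.mean((Xp - Yp) ** 2)
```

Because the direction set is small and reused, the projection budget drops sharply compared with thousands of random draws, at the cost of optimizing the directions themselves.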
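The energy-weighted slice distribution of EBSW can be approximated with self-normalized importance sampling from the uniform proposal; the choice $f = \exp$ and the estimator form below are illustrative assumptions rather than the paper's exact algorithm:

```python
import numpy as np

def energy_based_sw(X, Y, n_projections=200, p=2, rng=None):
    """Self-normalized importance-sampling sketch of an energy-based SW:
    draw uniform directions, then reweight each slice by exp of its own
    1D discrepancy so that high-discrepancy directions dominate the
    average (f = exp is one admissible energy; details are hedged)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    Xp = np.sort(X @ theta.T, axis=0)
    Yp = np.sort(Y @ theta.T, axis=0)
    slice_w = np.mean(np.abs(Xp - Yp) ** p, axis=0)  # per-direction W_p^p
    w = np.exp(slice_w - slice_w.max())              # stabilized energy weights
    return np.sum(w * slice_w) / np.sum(w)
```

Since the weights increase with the slice discrepancy, the energy-weighted value always upper-bounds the plain uniform average over the same directions, which is the sense in which EBSW is more discriminative.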
5. Applications in Generative Modeling, Statistical Inference, and Deep Learning
Sliced Wasserstein loss has seen broad adoption in deep generative modeling:
- Autoencoders: Regularizing an autoencoder via SW between the latent code distribution and a fixed prior enables prior-agnostic, adversary-free training, as in Sliced Wasserstein Autoencoders (SWAE) (Kolouri et al., 2018).
- Adversarial Generative Models: SW loss is used in the generator’s objective or as a critic; learnable projections and block structures further enrich SW-based architectures (Wu et al., 2017).
- Variational Inference: SW loss offers a tractable alternative to KL-divergence in variational inference tasks, improving multimodality and support coverage in posterior approximations (Yi et al., 2022).
- Neural Texture Synthesis: SWD serves as an alternative to Gram-matrix losses in feature space, offering better capture of higher-order feature statistics and resolution-consistent texture transfer (Heitz et al., 2020).
- Domain Adaptation and Distributional Regression: SW loss is adopted for robust metric-based alignment in distributional regression, including applications to climate and finance data (Chen et al., 2023).
Additionally, SW losses are compatible with various manifold data structures, e.g., Cartan–Hadamard manifolds via geodesic and horospherical projection-based definitions; they retain topological consistency and can be optimized using particle schemes (Bonet et al., 2024).
6. Privacy-Preserving Learning and Differential Privacy
The SW loss admits natural adaptation to differential privacy (DP), with the sensitivity of the loss to a single input point explicitly bounded and decaying with the dataset size $n$. Gaussian noise can be added directly to projected samples or to the gradients for DP guarantees, yielding the “smoothed” SWD (DP-SWD), which remains a metric and supports rigorous privacy guarantees (Rakotomamonjy et al., 2021, Rodríguez-Vítores et al., 3 Feb 2025).
Algorithmic implementations incorporate per-sample activation/Jacobian clipping and noise calibration via Rényi or Gaussian DP accountants. Sample complexity, sensitivity, and computational costs are all analytically controlled (Rodríguez-Vítores et al., 3 Feb 2025).
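A minimal sketch of the noise-on-projections mechanism, assuming Gaussian noise added directly to the projected samples; calibrating `noise_std` to an $(\varepsilon, \delta)$ budget via a Rényi or Gaussian DP accountant is a separate step omitted here:

```python
import numpy as np

def dp_sliced_wasserstein(X, Y, n_projections=100, noise_std=0.1, rng=None):
    """Sketch of a smoothed / DP sliced Wasserstein estimate: Gaussian
    noise is injected into the projected private samples before the 1D
    distances are computed, so downstream computations only ever see
    noised projections. noise_std must be calibrated from the projection
    sensitivity and the privacy budget (calibration omitted)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    Xp = X @ theta.T + rng.normal(scale=noise_std, size=(X.shape[0], n_projections))
    Yp = Y @ theta.T + rng.normal(scale=noise_std, size=(Y.shape[0], n_projections))
    return np.mean((np.sort(Xp, axis=0) - np.sort(Yp, axis=0)) ** 2)
```

With the noise scale set to zero the estimator reduces to the ordinary Monte Carlo SW of Section 2, which makes the privacy/utility trade-off explicit in a single parameter.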
7. Empirical and Practical Considerations
Monte Carlo approximation with up to a few hundred slices (on the order of $200$) balances bias and variance effectively for most tasks. The sorting cost per direction scales as $O(n \log n)$. Modern autodifferentiation frameworks implement differentiable sorting, simplifying gradient-based optimization (Bonet et al., 2023).
Selecting projection directions: learnable projections can be more efficient than random sampling in certain settings (Wu et al., 2017). Energy-based importance sampling further targets the most informative slices (Nguyen et al., 2023).
Comparisons against classical Wasserstein loss indicate that SW maintains topological equivalence and advantageously avoids the curse of dimensionality, although it may overlook certain high-dimensional structural discrepancies. For texture and style synthesis, SW loss has empirically demonstrated superior fidelity and perceptual attributes compared to Gram-matrix-based metrics (Heitz et al., 2020).
In sum, the Sliced Wasserstein loss function is a cornerstone objective for optimal transport in high-dimensional and data-driven settings. It balances mathematical rigor, computational tractability, sample complexity advantages, and algorithmic flexibility, facilitating its integration into modern probabilistic machine learning pipelines (Vauthier et al., 10 Feb 2025, Tanguy, 2023, Wu et al., 2017, Kolouri et al., 2018, Bonet et al., 2023, Nguyen et al., 2023).