Sliced Mixture Wasserstein (SMix-W)
- Sliced Mixture Wasserstein (SMix-W) is a distance metric that compares Gaussian mixture models via 1D projections, offering both efficiency and theoretical soundness.
- It integrates mixture Wasserstein costs over random or adaptive slice distributions to reduce the computational complexity while preserving discriminative power.
- Empirical studies show that SMix-W achieves significant speedups and reliable performance in clustering, generative modeling, and domain adaptation tasks.
The Sliced Mixture Wasserstein (SMix-W) distance is a computationally efficient metric for comparing Gaussian mixture models (GMMs) and, more generally, high-dimensional probability distributions. SMix-W, also known as SMW or integrated-slice mixture Wasserstein, serves as a lower-complexity alternative to the original Mixture Wasserstein (MW) distance, offering significant computational speedups while maintaining discriminative capacity and key metric properties. The SMix-W and its max-sliced variant (MixSW) have been established as practical and theoretically sound metrics for clustering, generative modeling, domain adaptation, and related large-scale machine learning tasks involving GMMs and empirical measures (Piening et al., 11 Apr 2025, Ohana et al., 2022).
1. Foundations: Mixture and Sliced Wasserstein Metrics
Given two GMMs in $\mathbb{R}^d$,
$$\mu = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(m_k, \Sigma_k), \qquad \nu = \sum_{l=1}^{L} \beta_l\, \mathcal{N}(n_l, T_l),$$
with mixture weights $\alpha_k, \beta_l$, means $m_k, n_l$, and covariances $\Sigma_k, T_l$, the Mixture Wasserstein distance is defined as
$$\mathrm{MW}_2^2(\mu, \nu) = \min_{\pi \in \Pi(\alpha, \beta)} \sum_{k=1}^{K} \sum_{l=1}^{L} \pi_{kl}\, W_2^2\big(\mathcal{N}(m_k, \Sigma_k),\, \mathcal{N}(n_l, T_l)\big),$$
with $W_2$ the closed-form 2-Wasserstein distance between two Gaussians and $\pi \in \Pi(\alpha, \beta)$ a coupling of the mixture weights.
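This definition translates directly into code: compute the pairwise closed-form Gaussian $W_2^2$ costs, then solve a small linear program over couplings of the weights. A minimal sketch (function names are illustrative, not from the cited papers; assumes `numpy` and `scipy`):

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def gaussian_w2_sq(m1, S1, m2, S2):
    """Closed-form squared 2-Wasserstein distance between two Gaussians:
    ||m1 - m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    rS2 = sqrtm(S2)
    cross = np.real(sqrtm(rS2 @ S1 @ rS2))  # discard tiny imaginary round-off
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

def mixture_wasserstein_sq(alphas, means1, covs1, betas, means2, covs2):
    """MW^2: optimal coupling of component weights under pairwise Gaussian W2^2 costs."""
    K, L = len(alphas), len(betas)
    cost = np.array([[gaussian_w2_sq(means1[k], covs1[k], means2[l], covs2[l])
                      for l in range(L)] for k in range(K)])
    # Linear program over couplings pi >= 0 with marginals alpha (rows) and beta (cols).
    A_eq = np.zeros((K + L, K * L))
    for k in range(K):
        A_eq[k, k * L:(k + 1) * L] = 1.0   # row sums equal alpha_k
    for l in range(L):
        A_eq[K + l, l::L] = 1.0            # column sums equal beta_l
    b_eq = np.concatenate([alphas, betas])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun
```

For two identical mixtures the optimal coupling is diagonal and the distance is zero; for two single Gaussians the LP is trivial and the value reduces to `gaussian_w2_sq`.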
The Sliced Wasserstein distance (SW) between measures $\mu, \nu$ is given by
$$\mathrm{SW}_2^2(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, du(\theta),$$
where $P_{\theta\#}\mu$ denotes the projection of $\mu$ onto the line with direction $\theta$, and $u$ is the uniform measure on the sphere $\mathbb{S}^{d-1}$.
The Sliced Mixture Wasserstein (SMix-W or SMW) interpolates between these constructions, integrating the mixture Wasserstein cost over random or uniform projections, yielding significant computational efficiencies.
2. Definitions: SMix-W, Mix-SW, and Variants
The SMix-W distance is defined for GMMs via integration over the unit sphere:
$$\mathrm{SMW}_2^2(\mu, \nu) = \int_{\mathbb{S}^{d-1}} \mathrm{MW}_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, du(\theta),$$
where $\mathrm{MW}_2(P_{\theta\#}\mu, P_{\theta\#}\nu)$ denotes the mixture Wasserstein distance between the projected (1D) GMMs.
The max-sliced variant, MixSW, is given by
$$\mathrm{MixSW}_2(\mu, \nu) = \max_{\theta \in \mathbb{S}^{d-1}} \mathrm{MW}_2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big),$$
obtained by concentrating the entire slice distribution $\sigma$ on the maximal direction.
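The two definitions above suggest a common Monte Carlo estimator: sample directions, project both mixtures to 1D, and either average (SMW) or take the maximum over sampled directions (approximating MixSW). A simplified sketch, assuming uniform weights and equal component counts so that the coupling LP reduces to a linear assignment (names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_params(theta, means, covs):
    """1D parameters of projected components: theta . m_k and sqrt(theta^T Sigma_k theta)."""
    mu = means @ theta
    sd = np.sqrt(np.einsum('i,kij,j->k', theta, covs, theta))
    return mu, sd

def mw1d_sq_uniform(mu1, sd1, mu2, sd2):
    """1D MW^2 for uniform weights: the coupling LP reduces to a linear assignment,
    with the closed-form 1D Gaussian cost (dm)^2 + (dsd)^2 per pair."""
    cost = (mu1[:, None] - mu2[None, :]) ** 2 + (sd1[:, None] - sd2[None, :]) ** 2
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def smw_and_maxsw(means1, covs1, means2, covs2, n_proj=200, rng=None):
    """Monte Carlo SMW^2 (average over slices) and a sampled max-sliced surrogate."""
    rng = np.random.default_rng(rng)
    d = means1.shape[1]
    vals = []
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)       # uniform direction on the sphere
        vals.append(mw1d_sq_uniform(*projected_params(theta, means1, covs1),
                                    *projected_params(theta, means2, covs2)))
    vals = np.array(vals)
    return vals.mean(), vals.max()
```

Note that the true MixSW maximizes over all of $\mathbb{S}^{d-1}$; taking the maximum over sampled directions is only a lower-bounding surrogate.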
Table 1 summarizes key variants and their measures of integration:
| Name | Slice Distribution $\sigma$ | Operation |
|---|---|---|
| SMW | Uniform on $\mathbb{S}^{d-1}$ | Averaging/integral |
| MixSW | Dirac at best direction | Maximum |
| Random-slice MW | Empirical (randomized) | Sampled averaging or max |
3. Theoretical Properties and Metric Relations
The SMW, MixSW, and related sliced mixture metrics are bona fide distances and satisfy the following monotonicity chain (Piening et al., 11 Apr 2025):
$$\mathrm{DSMW}_2 \;\le\; \mathrm{SMW}_2 \;\le\; \mathrm{MSW}_2 \;\le\; \mathrm{MW}_2,$$
where DSMW refers to the integral over sliced SW costs, and MSW is a minimization over couplings of SW costs between components.
Additional equivalence properties include:
- Strong two-sided bounds between SMW and DSMW on compact sets of GMMs, i.e., there exist constants $c, C > 0$ such that for all GMMs $\mu, \nu$ in a compact set $\mathcal{K}$, $c\, \mathrm{DSMW}_2(\mu, \nu) \le \mathrm{SMW}_2(\mu, \nu) \le C\, \mathrm{DSMW}_2(\mu, \nu)$.
- No universal two-sided bound between MW and its max- or averaged-sliced versions, but always $\mathrm{SMW}_2 \le \mathrm{MixSW}_2 \le \mathrm{MW}_2$.
This structure ensures that SMW acts as a tight, tractable lower bound to MW for practical purposes.
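The lower-bound half of these relations follows from the fact that projection onto a line is a 1-Lipschitz map, a step worth making explicit:

```latex
% For unit theta and any coupling of mu and nu,
% |\langle \theta, x - y \rangle| \le \|x - y\|, hence projecting the coupling shows
W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big) \;\le\; W_2^2(\mu, \nu)
\quad \text{for every } \theta \in \mathbb{S}^{d-1}.
% Applying this to each Gaussian pair inside the mixture coupling gives
\mathrm{MW}_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big) \;\le\; \mathrm{MW}_2^2(\mu, \nu),
% and both averaging over theta (SMW) and maximizing over theta (MixSW)
% preserve the bound.
```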
4. Computational Complexity and Algorithms
The computational advantage of SMW and MixSW arises from two sources:
- Projection to 1D: Each projection reduces the comparison of two high-dimensional GMMs to that of their 1D projections, which admits closed-form computation using 1D Gaussian formulas and efficient assignment solvers.
- Sampling Efficiency: Methods operate by sampling $P$ directions $\theta_1, \dots, \theta_P$ uniformly on $\mathbb{S}^{d-1}$ and computing $\mathrm{MW}_2^2(P_{\theta_p\#}\mu, P_{\theta_p\#}\nu)$ for each, approximating SMW via averaging, and MixSW via maximization.
The resulting algorithm requires only $O(Kd)$ work per slice for the projections plus a closed-form 1D assignment over components, versus the $O(d^3)$ matrix square roots per component pair required by traditional MW evaluated via high-dimensional linear assignment. In practice this yields 10–1000× speedups for moderately sized mixtures in high ambient dimensions.
5. Practical Considerations and Experimental Validation
Empirically, SMW and MixSW perform on par with MW across applications:
- Clustering and cluster-number detection: The MixSW distance graph between GMMs of sizes $K$ and $K'$ exhibits a pronounced drop at the true number of clusters.
- Perceptual metrics: Replacing MW by MixSW in GMM-based perceptual distances, including adaptations of WaM (Wasserstein-means), closely reproduces FID curves with a substantial runtime reduction.
- GMM quantization and minimization: Auto-differentiation through MixSW enables direct optimization of GMM approximations, with rapid convergence and step costs well below a second per iteration for moderately sized mixtures.
Choice of hyperparameters impacts practical performance:
- Number of projections $P$ (100–500 typical)
- Uniform versus quasi–Monte Carlo slice sampling for variance reduction
- Stopping criteria (e.g., stabilization of the running Monte Carlo estimate within a given tolerance)
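A stabilization-based stopping rule of this kind can be sketched as follows. This is illustrative only; `slice_cost` stands for a user-supplied function returning the 1D mixture Wasserstein cost for a sampled direction:

```python
import numpy as np

def smw_running_estimate(slice_cost, d, tol=1e-3, max_proj=500, window=25, rng=None):
    """Draw directions until the running SMW^2 estimate changes by less than a
    relative `tol` over the last `window` draws, or `max_proj` is reached."""
    rng = np.random.default_rng(rng)
    total, history = 0.0, []
    for p in range(1, max_proj + 1):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)        # uniform direction on the sphere
        total += slice_cost(theta)
        history.append(total / p)             # running Monte Carlo mean
        if p > window and abs(history[-1] - history[-1 - window]) < tol * max(history[-1], 1e-12):
            break
    return history[-1], p
```

For well-behaved slice costs this terminates well before `max_proj`, so the effective number of projections adapts to the difficulty of the comparison.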
6. Adaptive Slicing and Learning the Slice Distribution
The SMix-W formalism generalizes to arbitrary (possibly data-dependent) slice distributions $\sigma$ on $\mathbb{S}^{d-1}$. The adaptive (mixed) Sliced Wasserstein (Mix-SW) distance is
$$\mathrm{SW}_2^2(\mu, \nu; \sigma) = \int_{\mathbb{S}^{d-1}} W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, d\sigma(\theta)$$
for probability measures $\mu, \nu$. This expressive framework is enabled by PAC-Bayesian generalization bounds that hold uniformly over all $\sigma$ (Ohana et al., 2022).
Optimizing the slice distribution $\sigma$ for maximal discrimination can be done via (i) parametric families such as von Mises–Fisher on $\mathbb{S}^{d-1}$, or (ii) push-forwarding angular distributions through neural networks. The learning procedure seeks
$$\max_{\sigma} \;\int_{\mathbb{S}^{d-1}} W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, d\sigma(\theta) \;-\; \lambda\, \mathrm{KL}(\sigma \,\|\, \sigma_0),$$
where $\mathrm{KL}(\sigma \| \sigma_0)$ is the Kullback–Leibler divergence from a reference (e.g., uniform) distribution $\sigma_0$. Regularization via $\mathrm{KL}$ ensures generalization and avoids overfitting to a particular slice.
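For a finite candidate set of directions, a KL-regularized objective of this form has a closed-form Gibbs solution, a standard fact from PAC-Bayesian analysis. A sketch under that discrete-distribution assumption (not the exact procedure of the cited works):

```python
import numpy as np

def gibbs_slice_weights(costs, lam=0.1):
    """Maximize <w, costs> - lam * KL(w || uniform) over the probability simplex.
    The optimum is the Gibbs distribution w_i proportional to exp(costs_i / lam)."""
    z = np.asarray(costs, dtype=float) / lam
    z -= z.max()                  # shift before exponentiating for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

Small `lam` concentrates the weights on the best-discriminating direction (recovering max-sliced behavior), while large `lam` recovers near-uniform averaging, so the temperature interpolates between the two regimes in the table above.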
Empirical results indicate that learned or data-adaptive slicing distributions yield more discriminative and generalizable distances compared to uniform or max-slice approaches, especially in high dimensions or low sample size regimes.
7. Significance and Future Directions
The sliced mixture Wasserstein framework provides a rigorous, scalable, and discriminative alternative to classical OT-based metrics for GMMs and empirical measures. Key advantages include computational tractability, provable metric properties, flexibility to adapt slice distributions, and empirical efficacy across clustering, generative modeling, and perceptual similarity tasks. Ongoing research focuses on improving the adaptivity and robustness of slice-distribution learning, exploring non-Gaussian mixtures, and extending PAC-Bayesian generalization analysis to broader classes of optimal transport-inspired metrics (Piening et al., 11 Apr 2025, Ohana et al., 2022).