Sliced Mixture Wasserstein (SMix-W)

Updated 26 January 2026
  • Sliced Mixture Wasserstein (SMix-W) is a distance metric that compares Gaussian mixture models via 1D projections, offering both efficiency and theoretical soundness.
  • It integrates mixture Wasserstein costs over random or adaptive slice distributions to reduce the computational complexity while preserving discriminative power.
  • Empirical studies show that SMix-W achieves significant speedups and reliable performance in clustering, generative modeling, and domain adaptation tasks.

The Sliced Mixture Wasserstein (SMix-W) distance is a computationally efficient metric for comparing Gaussian mixture models (GMMs) and, more generally, high-dimensional probability distributions. SMix-W, also known as SMW or integrated-slice mixture Wasserstein, serves as a lower-complexity alternative to the original Mixture Wasserstein (MW) distance, offering significant computational speedups while maintaining discriminative capacity and key metric properties. The SMix-W and its max-sliced variant (MixSW) have been established as practical and theoretically sound metrics for clustering, generative modeling, domain adaptation, and related large-scale machine learning tasks involving GMMs and empirical measures (Piening et al., 11 Apr 2025, Ohana et al., 2022).

1. Foundations: Mixture and Sliced Wasserstein Metrics

Given two GMMs in $\mathbb{R}^d$,

$$\mu = \sum_{i=1}^k w_i\,\mathcal{N}(m_i, \Sigma_i), \qquad \nu = \sum_{j=1}^\ell v_j\,\mathcal{N}(n_j, \Lambda_j),$$

with mixture weights $w, v$, means $m_i, n_j$, and covariances $\Sigma_i, \Lambda_j$, the Mixture Wasserstein distance is defined as

$$\mathrm{MW}(\mu,\nu) = \min_{\pi\in \Pi(w,v)} \left[ \sum_{i=1}^k \sum_{j=1}^\ell \pi_{ij}\, W_2^2\big(\mathcal{N}(m_i,\Sigma_i),\,\mathcal{N}(n_j,\Lambda_j)\big) \right]^{1/2}$$

with $W_2$ the closed-form 2-Wasserstein distance between two Gaussians and $\pi$ a coupling of the weight vectors $w$ and $v$.
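For concreteness, both ingredients above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, assuming uniform mixture weights and equal component counts, in which case the optimal coupling $\pi$ reduces to a linear assignment; the function names are illustrative, not from the cited papers.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linear_sum_assignment

def w2_gaussian_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between Gaussians N(m1,S1), N(m2,S2):
    ||m1-m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    S2h = np.real(sqrtm(S2))
    cross = np.real(sqrtm(S2h @ S1 @ S2h))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

def mw_uniform(means1, covs1, means2, covs2):
    """MW for two GMMs with uniform weights and equal component counts:
    the coupling over weights reduces to a linear assignment."""
    k = len(means1)
    C = np.array([[w2_gaussian_sq(means1[i], covs1[i], means2[j], covs2[j])
                   for j in range(k)] for i in range(k)])
    r, c = linear_sum_assignment(C)       # optimal pairing of components
    return np.sqrt(C[r, c].mean())        # uniform weights pi_ij = 1/k
```

For example, translating every component of a GMM with identity covariances by a vector $t$ shifts the MW distance to exactly $\|t\|$.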

The Sliced Wasserstein distance (SW) between measures $\mu, \nu$ is given by

$$\mathrm{SW}_p(\mu, \nu) = \left[ \int_{S^{d-1}} W_p^p(\pi_{\theta\#} \mu,\, \pi_{\theta\#} \nu)\, d\theta \right]^{1/p},$$

where $\pi_{\theta\#}\mu$ denotes the push-forward of $\mu$ under projection onto the line with direction $\theta \in S^{d-1}$.
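As a concrete illustration, $\mathrm{SW}_2$ between two empirical measures with equal sample sizes can be estimated by Monte Carlo, using the fact that the optimal 1D $W_2$ coupling is the sorted (quantile) matching. This is a hedged sketch with illustrative names, not code from the cited papers.

```python
import numpy as np

def sliced_w2(X, Y, n_proj=200, rng=None):
    """Monte Carlo estimate of SW_2 between two empirical measures
    X, Y of equal sample size: average 1D W_2^2 over random directions."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)        # uniform direction on S^{d-1}
        # 1D W_2: sort both projections (quantile coupling)
        xs, ys = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((xs - ys) ** 2)
    return np.sqrt(total / n_proj)
```

Translating a cloud by $t$ gives a per-slice cost $(\theta^\top t)^2$, so the estimate converges to $\|t\|/\sqrt{d}$ as the number of projections grows.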

The Sliced Mixture Wasserstein (SMix-W or SMW) interpolates between these constructions, integrating the mixture Wasserstein cost over random or uniform projections, yielding significant computational efficiencies.

2. Definitions: SMix-W, Mix-SW, and Variants

The SMix-W distance is defined for GMMs via integration over the unit sphere:

$$\mathrm{SMW}(\mu,\nu) = \left[ \int_{S^{d-1}} \mathrm{MW}^2\big(\pi_{\theta\#}\mu,\, \pi_{\theta\#}\nu\big)\, d\theta \right]^{1/2},$$

where $\mathrm{MW}(\pi_{\theta\#}\mu, \pi_{\theta\#}\nu)$ denotes the mixture Wasserstein distance between the projected (1D) GMMs, computable via the closed-form 2-Wasserstein distance between 1D Gaussians.

The max-sliced variant, MixSW, is given by

$$\mathrm{MixSW}(\mu,\nu) = \max_{\theta \in S^{d-1}} \mathrm{MW}\big(\pi_{\theta\#}\mu,\, \pi_{\theta\#}\nu\big),$$

obtained by concentrating the entire slice distribution $\sigma$ on the maximal direction.
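Under the same uniform-weight, equal-component assumption as before, both quantities can be estimated from the same set of random directions: projecting a GMM along $\theta$ yields a 1D GMM with means $\theta^\top m_i$ and variances $\theta^\top \Sigma_i \theta$, averaging the per-slice costs estimates SMW, and maximizing estimates MixSW. A hedged sketch with illustrative names:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_mw_sq(th, means1, covs1, means2, covs2):
    """Squared 1D mixture Wasserstein of the theta-projections
    (uniform weights, equal counts -> linear assignment).
    1D Gaussian W_2^2 is (mean diff)^2 + (std diff)^2."""
    mu, nu = means1 @ th, means2 @ th
    s1 = np.sqrt(np.einsum('i,kij,j->k', th, covs1, th))  # projected stds
    s2 = np.sqrt(np.einsum('i,kij,j->k', th, covs2, th))
    C = (mu[:, None] - nu[None, :]) ** 2 + (s1[:, None] - s2[None, :]) ** 2
    r, c = linear_sum_assignment(C)
    return C[r, c].mean()

def smw_and_mixsw(means1, covs1, means2, covs2, n_proj=300, rng=None):
    """Monte Carlo estimates: mean of slice costs -> SMW, max -> MixSW."""
    rng = np.random.default_rng(rng)
    d = means1.shape[1]
    costs = []
    for _ in range(n_proj):
        th = rng.normal(size=d)
        th /= np.linalg.norm(th)
        costs.append(projected_mw_sq(th, means1, covs1, means2, covs2))
    return np.sqrt(np.mean(costs)), np.sqrt(np.max(costs))
```

Since a mean never exceeds a max, the sampled SMW estimate is always at most the sampled MixSW estimate.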

Table 1 summarizes key variants and their measures of integration:

Name | Slice distribution $\sigma$ | Operation
SMW | Uniform on $S^{d-1}$ | Averaging/integral
MixSW | Dirac at best direction | Maximum
Random-slice MW | Empirical (randomized) | Sampled averaging or max

3. Theoretical Properties and Metric Relations

The SMW, MixSW, and related sliced mixture metrics are bona fide distances and satisfy a monotonicity chain (Piening et al., 11 Apr 2025) in which SMW is dominated by DSMW and MSW, each of which in turn lower-bounds the full MW; here DSMW refers to the integral of sliced Wasserstein costs over the sphere, and MSW is a minimization over couplings of SW costs.

Additional equivalence properties include:

  • Strong two-sided bounds between SMW and DSMW on compact sets of GMMs, i.e., there exist constants $0 < c \le C$ such that for all GMMs $\mu, \nu$ in a compact set, $c\,\mathrm{DSMW}(\mu,\nu) \le \mathrm{SMW}(\mu,\nu) \le C\,\mathrm{DSMW}(\mu,\nu)$.
  • No universal two-sided bound between MW and its max or averaged sliced versions, but the sliced versions always lower-bound it: $\mathrm{SMW}(\mu,\nu) \le \mathrm{MW}(\mu,\nu)$ and $\mathrm{MixSW}(\mu,\nu) \le \mathrm{MW}(\mu,\nu)$.

This structure ensures that SMW acts as a tight, tractable lower bound to MW for practical purposes.
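The lower-bound behavior can be checked numerically: since projection onto a line is 1-Lipschitz, every per-slice cost is at most the full MW cost, so a Monte Carlo SMW estimate never exceeds MW. The following self-contained sketch assumes uniform weights and equal component counts (so couplings reduce to assignments); it is an illustration, not the reference implementation.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
k, d = 3, 4
m1, m2 = rng.normal(size=(k, d)), rng.normal(size=(k, d))
# random symmetric positive-definite covariances
A = rng.normal(size=(k, d, d)); covs1 = A @ A.transpose(0, 2, 1) + 0.1 * np.eye(d)
B = rng.normal(size=(k, d, d)); covs2 = B @ B.transpose(0, 2, 1) + 0.1 * np.eye(d)

def assign_cost(C):
    """Mean cost of the optimal assignment (uniform-weight coupling)."""
    r, c = linear_sum_assignment(C)
    return C[r, c].mean()

# Full MW: closed-form Gaussian W_2^2 cost matrix + assignment.
C = np.zeros((k, k))
for i in range(k):
    for j in range(k):
        Sh = np.real(sqrtm(covs2[j]))
        cross = np.real(sqrtm(Sh @ covs1[i] @ Sh))
        C[i, j] = (np.sum((m1[i] - m2[j]) ** 2)
                   + np.trace(covs1[i] + covs2[j] - 2 * cross))
mw = np.sqrt(assign_cost(C))

# SMW: average squared 1D MW over random slices.
acc, n_proj = 0.0, 200
for _ in range(n_proj):
    th = rng.normal(size=d); th /= np.linalg.norm(th)
    s1 = np.sqrt(np.einsum('i,kij,j->k', th, covs1, th))
    s2 = np.sqrt(np.einsum('i,kij,j->k', th, covs2, th))
    D = (m1 @ th)[:, None] - (m2 @ th)[None, :]
    acc += assign_cost(D ** 2 + (s1[:, None] - s2[None, :]) ** 2)
smw = np.sqrt(acc / n_proj)

assert smw <= mw  # the sliced metric lower-bounds the full mixture Wasserstein
```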

4. Computational Complexity and Algorithms

The computational advantage of SMW and MixSW arises from two sources:

  1. Projection to 1D: Each projection reduces the comparison of two high-dimensional GMMs to that of their 1D projections, which admits closed-form computation using 1D Gaussian formulas and efficient assignment solvers.
  2. Sampling Efficiency: Methods operate by sampling $P$ directions $\theta_1, \dots, \theta_P$ uniformly from $S^{d-1}$ and computing the projected costs $\mathrm{MW}(\pi_{\theta_p\#}\mu, \pi_{\theta_p\#}\nu)$, approximating SMW via averaging and MixSW via maximization.

The resulting algorithm scales linearly in the number of projections $P$, with each slice requiring only closed-form 1D Gaussian distances and a small assignment problem, versus the $d \times d$ matrix square roots and high-dimensional linear assignment required to evaluate MW directly. In practice, a moderate number of projections suffices, yielding 10–100× speedups for typical mixture sizes and ambient dimensions.

5. Practical Considerations and Experimental Validation

Empirically, SMW and MixSW perform on par with MW across applications:

  • Clustering and cluster-number detection: The MixSW distance graph between fitted GMMs of varying component counts exhibits a pronounced drop at the true number of clusters.
  • Perceptual metrics: Replacing MW by MixSW in GMM-based perceptual distances, including adaptations of WaM (Wasserstein-means), closely reproduces FID curves at substantially reduced runtime.
  • GMM quantization and minimization: Auto-differentiation through MixSW enables direct optimization of GMM approximations, with rapid convergence and step costs well below a second per iteration for moderately sized mixtures.

Choice of hyperparameters impacts practical performance:

  • Number of projections $P$ (100–500 typical)
  • Uniform versus quasi–Monte Carlo slice sampling for variance reduction
  • Stopping criteria (e.g., stabilization of the distance estimate within a given tolerance)
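A batch-wise stopping rule of this kind can be sketched as follows. Here `slice_cost` is an illustrative stand-in for the per-direction squared projected MW cost; the batching scheme and names are assumptions, not taken from the cited papers.

```python
import numpy as np

def smw_until_stable(slice_cost, d, tol=1e-3, batch=50, max_proj=2000, rng=None):
    """Add batches of random directions until the running SMW estimate
    changes by less than `tol`, or the projection budget is exhausted.
    Returns (estimate, number of projections used)."""
    rng = np.random.default_rng(rng)
    costs = []
    prev = np.inf
    while len(costs) < max_proj:
        for _ in range(batch):
            th = rng.normal(size=d)
            th /= np.linalg.norm(th)
            costs.append(slice_cost(th))
        est = np.sqrt(np.mean(costs))      # running SMW estimate
        if abs(est - prev) < tol:          # stabilized within tolerance
            return est, len(costs)
        prev = est
    return np.sqrt(np.mean(costs)), len(costs)
```

With a constant per-slice cost the estimate stabilizes after the second batch, so the rule terminates at `2 * batch` projections.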

6. Adaptive Slicing and Learning the Slice Distribution

The SMix-W formalism generalizes to arbitrary (possibly data-dependent) slice distributions $\sigma$ on $S^{d-1}$. The adaptive (mixed) Sliced Wasserstein (Mix-SW) distance is

$$\mathrm{SW}_{p,\sigma}(\mu, \nu) = \left[ \int_{S^{d-1}} W_p^p(\pi_{\theta\#}\mu,\, \pi_{\theta\#}\nu)\, d\sigma(\theta) \right]^{1/p}$$

for probability measures $\mu, \nu$ on $\mathbb{R}^d$. This expressive framework is enabled by PAC-Bayesian generalization bounds that hold uniformly over all slice distributions $\sigma$ (Ohana et al., 2022).

Optimizing the slice distribution $\sigma$ for maximal discrimination can be done via (i) parametric families such as von Mises–Fisher on $S^{d-1}$, or (ii) push-forwarding angular distributions through neural networks. The learning procedure seeks

$$\max_{\sigma} \; \mathrm{SW}_{p,\sigma}(\mu, \nu) - \lambda\, \mathrm{KL}(\sigma \,\|\, \sigma_0),$$

where $\mathrm{KL}(\sigma \,\|\, \sigma_0)$ is the Kullback–Leibler divergence from a reference (e.g., uniform) distribution $\sigma_0$. Regularization via the KL term ensures generalization and avoids overfitting to a particular slice.
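On a finite candidate set of directions, a KL-regularized objective of this form has a closed-form maximizer: the Gibbs (softmax) reweighting of the uniform reference, $\sigma_i \propto \exp(c_i/\lambda)$ for per-direction costs $c_i$. The discrete setting is an assumption for illustration; the cited work's neural push-forward parameterization is more general.

```python
import numpy as np

def learn_slice_weights(costs, lam=0.1):
    """Maximizer of  <sigma, costs> - lam * KL(sigma || uniform)
    over distributions sigma on a finite set of candidate directions.
    Standard entropic-regularization result: sigma_i ∝ exp(costs_i / lam)."""
    z = costs / lam
    z -= z.max()            # shift for numerical stability (does not change sigma)
    w = np.exp(z)
    return w / w.sum()
```

Small $\lambda$ concentrates $\sigma$ on the most discriminative direction (recovering max-slicing in the limit), while large $\lambda$ keeps it close to uniform.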

Empirical results indicate that learned or data-adaptive slicing distributions yield more discriminative and generalizable distances compared to uniform or max-slice approaches, especially in high dimensions or low sample size regimes.

7. Significance and Future Directions

The sliced mixture Wasserstein framework provides a rigorous, scalable, and discriminative alternative to classical OT-based metrics for GMMs and empirical measures. Key advantages include computational tractability, provable metric properties, flexibility to adapt slice distributions, and empirical efficacy across clustering, generative modeling, and perceptual similarity tasks. Ongoing research focuses on improving the adaptivity and robustness of slice-distribution learning, exploring non-Gaussian mixtures, and extending PAC-Bayesian generalization analysis to broader classes of optimal transport-inspired metrics (Piening et al., 11 Apr 2025, Ohana et al., 2022).
