Sliced Mixture Wasserstein (SMix-W)
- Sliced Mixture Wasserstein (SMix-W) is a distance metric that compares Gaussian mixture models via 1D projections, offering both efficiency and theoretical soundness.
- It integrates mixture Wasserstein costs over random or adaptive slice distributions to reduce the computational complexity while preserving discriminative power.
- Empirical studies show that SMix-W achieves significant speedups and reliable performance in clustering, generative modeling, and domain adaptation tasks.
The Sliced Mixture Wasserstein (SMix-W) distance is a computationally efficient metric for comparing Gaussian mixture models (GMMs) and, more generally, high-dimensional probability distributions. SMix-W, also known as SMW or integrated-slice mixture Wasserstein, serves as a lower-complexity alternative to the original Mixture Wasserstein (MW) distance, offering significant computational speedups while maintaining discriminative capacity and key metric properties. The SMix-W and its max-sliced variant (MixSW) have been established as practical and theoretically sound metrics for clustering, generative modeling, domain adaptation, and related large-scale machine learning tasks involving GMMs and empirical measures (Piening et al., 11 Apr 2025, Ohana et al., 2022).
1. Foundations: Mixture and Sliced Wasserstein Metrics
Given two GMMs in $\mathbb{R}^d$,
$$\mu = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(m_k, \Sigma_k), \qquad \nu = \sum_{l=1}^{L} \beta_l\, \mathcal{N}(n_l, T_l),$$
with mixture weights $\alpha_k, \beta_l$, means $m_k, n_l$, and covariances $\Sigma_k, T_l$, the Mixture Wasserstein distance is defined as
$$\mathrm{MW}_2^2(\mu, \nu) = \min_{\pi \in \Pi(\alpha, \beta)} \sum_{k=1}^{K} \sum_{l=1}^{L} \pi_{kl}\, W_2^2\big(\mathcal{N}(m_k, \Sigma_k),\, \mathcal{N}(n_l, T_l)\big),$$
with $W_2$ the closed-form 2-Wasserstein distance between two Gaussians and $\pi \in \Pi(\alpha, \beta)$ a coupling of the mixture weights.
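This definition translates directly into code: compute the pairwise closed-form Gaussian $W_2^2$ costs, then solve a small linear program over couplings of the weights. A minimal sketch (function names are illustrative, not from the cited papers; assumes `numpy` and `scipy`):

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linprog

def gaussian_w2_sq(m1, S1, m2, S2):
    """Closed-form squared 2-Wasserstein distance between two Gaussians:
    ||m1 - m2||^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    rS2 = sqrtm(S2)
    cross = np.real(sqrtm(rS2 @ S1 @ rS2))  # discard tiny imaginary round-off
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * cross))

def mixture_wasserstein_sq(alphas, means1, covs1, betas, means2, covs2):
    """MW^2: optimal coupling of component weights under pairwise Gaussian W2^2 costs."""
    K, L = len(alphas), len(betas)
    cost = np.array([[gaussian_w2_sq(means1[k], covs1[k], means2[l], covs2[l])
                      for l in range(L)] for k in range(K)])
    # Linear program over couplings pi >= 0 with marginals alpha (rows) and beta (cols).
    A_eq = np.zeros((K + L, K * L))
    for k in range(K):
        A_eq[k, k * L:(k + 1) * L] = 1.0   # row sums equal alpha_k
    for l in range(L):
        A_eq[K + l, l::L] = 1.0            # column sums equal beta_l
    b_eq = np.concatenate([alphas, betas])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun
```

For two identical mixtures the optimal coupling is diagonal and the distance is zero; for two single Gaussians the LP is trivial and the value reduces to `gaussian_w2_sq`.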
The Sliced Wasserstein distance (SW) between measures $\mu, \nu$ is given by
$$\mathrm{SW}_2^2(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, du(\theta),$$
where $P_{\theta\#}\mu$ denotes the projection of $\mu$ onto the line with direction $\theta$, and $u$ is the uniform measure on the sphere $\mathbb{S}^{d-1}$.
The Sliced Mixture Wasserstein (SMix-W or SMW) interpolates between these constructions, integrating the mixture Wasserstein cost over random or uniform projections, yielding significant computational efficiencies.
2. Definitions: SMix-W, Mix-SW, and Variants
The SMix-W distance is defined for GMMs via integration over the unit sphere:
$$\mathrm{SMW}_2^2(\mu, \nu) = \int_{\mathbb{S}^{d-1}} \mathrm{MW}_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, du(\theta),$$
where $\mathrm{MW}_2(P_{\theta\#}\mu, P_{\theta\#}\nu)$ denotes the mixture Wasserstein distance between the projected (1D) GMMs.
The max-sliced variant, MixSW, is given by
$$\mathrm{MixSW}_2(\mu, \nu) = \max_{\theta \in \mathbb{S}^{d-1}} \mathrm{MW}_2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big),$$
obtained by concentrating the entire slice distribution $\sigma$ on the maximal direction.
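The two definitions above suggest a common Monte Carlo estimator: sample directions, project both mixtures to 1D, and either average (SMW) or take the maximum over sampled directions (approximating MixSW). A simplified sketch, assuming uniform weights and equal component counts so that the coupling LP reduces to a linear assignment (names are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projected_params(theta, means, covs):
    """1D parameters of projected components: theta . m_k and sqrt(theta^T Sigma_k theta)."""
    mu = means @ theta
    sd = np.sqrt(np.einsum('i,kij,j->k', theta, covs, theta))
    return mu, sd

def mw1d_sq_uniform(mu1, sd1, mu2, sd2):
    """1D MW^2 for uniform weights: the coupling LP reduces to a linear assignment,
    with the closed-form 1D Gaussian cost (dm)^2 + (dsd)^2 per pair."""
    cost = (mu1[:, None] - mu2[None, :]) ** 2 + (sd1[:, None] - sd2[None, :]) ** 2
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def smw_and_maxsw(means1, covs1, means2, covs2, n_proj=200, rng=None):
    """Monte Carlo SMW^2 (average over slices) and a sampled max-sliced surrogate."""
    rng = np.random.default_rng(rng)
    d = means1.shape[1]
    vals = []
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)       # uniform direction on the sphere
        vals.append(mw1d_sq_uniform(*projected_params(theta, means1, covs1),
                                    *projected_params(theta, means2, covs2)))
    vals = np.array(vals)
    return vals.mean(), vals.max()
```

Note that the true MixSW maximizes over all of $\mathbb{S}^{d-1}$; taking the maximum over sampled directions is only a lower-bounding surrogate.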
Table 1 summarizes key variants and their measures of integration:
| Name | Slice Distribution $\sigma$ | Operation |
|---|---|---|
| SMW | Uniform on $\mathbb{S}^{d-1}$ | Averaging/integral |
| MixSW | Dirac at best direction | Maximum |
| Random-slice MW | Empirical (randomized) | Sampled averaging or max |
3. Theoretical Properties and Metric Relations
The SMW, MixSW, and related sliced mixture metrics are bona fide distances and satisfy the following monotonicity chain (Piening et al., 11 Apr 2025):
$$\mathrm{DSMW}_2 \;\le\; \mathrm{SMW}_2 \;\le\; \mathrm{MSW}_2 \;\le\; \mathrm{MW}_2,$$
where DSMW refers to the integral over sliced SW costs, and MSW is a minimization over couplings of SW costs between components.
Additional equivalence properties include:
- Strong two-sided bounds between SMW and DSMW on compact sets of GMMs, i.e., there exist constants $c, C > 0$ such that for all GMMs $\mu, \nu$ in a compact set $\mathcal{K}$, $c\, \mathrm{DSMW}_2(\mu, \nu) \le \mathrm{SMW}_2(\mu, \nu) \le C\, \mathrm{DSMW}_2(\mu, \nu)$.
- No universal two-sided bound between MW and its max- or averaged-sliced versions, but always $\mathrm{SMW}_2 \le \mathrm{MixSW}_2 \le \mathrm{MW}_2$.
This structure ensures that SMW acts as a tight, tractable lower bound to MW for practical purposes.
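The lower-bound half of these relations follows from the fact that projection onto a line is a 1-Lipschitz map, a step worth making explicit:

```latex
% For unit theta and any coupling of mu and nu,
% |\langle \theta, x - y \rangle| \le \|x - y\|, hence projecting the coupling shows
W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big) \;\le\; W_2^2(\mu, \nu)
\quad \text{for every } \theta \in \mathbb{S}^{d-1}.
% Applying this to each Gaussian pair inside the mixture coupling gives
\mathrm{MW}_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big) \;\le\; \mathrm{MW}_2^2(\mu, \nu),
% and both averaging over theta (SMW) and maximizing over theta (MixSW)
% preserve the bound.
```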
4. Computational Complexity and Algorithms
The computational advantage of SMW and MixSW arises from two sources:
- Projection to 1D: Each projection reduces the comparison of two high-dimensional GMMs to that of their 1D projections, which admits closed-form computation using 1D Gaussian formulas and efficient assignment solvers.
- Sampling Efficiency: Methods operate by sampling $P$ directions $\theta_1, \dots, \theta_P$ uniformly on $\mathbb{S}^{d-1}$ and computing $\mathrm{MW}_2^2(P_{\theta_p\#}\mu, P_{\theta_p\#}\nu)$ for each, approximating SMW via averaging, and MixSW via maximization.
The resulting algorithm requires only $O(Kd)$ work per slice for the projections plus a closed-form 1D assignment over components, versus the $O(d^3)$ matrix square roots per component pair required by traditional MW evaluated via high-dimensional linear assignment. In practice this yields 10–1000× speedups for moderately sized mixtures in high ambient dimensions.
5. Practical Considerations and Experimental Validation
Empirically, SMW and MixSW perform on par with MW across applications:
- Clustering and cluster-number detection: The MixSW distance graph between GMMs of sizes $K$ and $K'$ exhibits a pronounced drop at the true number of clusters.
- Perceptual metrics: Replacing MW by MixSW in GMM-based perceptual distances, including adaptations of WaM (Wasserstein-means), closely reproduces FID curves with a substantial runtime reduction.
- GMM quantization and minimization: Auto-differentiation through MixSW enables direct optimization of GMM approximations, with rapid convergence and step costs well below a second per iteration for moderately sized mixtures.
Choice of hyperparameters impacts practical performance:
- Number of projections $P$ (100–500 typical)
- Uniform versus quasi–Monte Carlo slice sampling for variance reduction
- Stopping criteria (e.g., stabilization of the running Monte Carlo estimate within a given tolerance)
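A stabilization-based stopping rule of this kind can be sketched as follows. This is illustrative only; `slice_cost` stands for a user-supplied function returning the 1D mixture Wasserstein cost for a sampled direction:

```python
import numpy as np

def smw_running_estimate(slice_cost, d, tol=1e-3, max_proj=500, window=25, rng=None):
    """Draw directions until the running SMW^2 estimate changes by less than a
    relative `tol` over the last `window` draws, or `max_proj` is reached."""
    rng = np.random.default_rng(rng)
    total, history = 0.0, []
    for p in range(1, max_proj + 1):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)        # uniform direction on the sphere
        total += slice_cost(theta)
        history.append(total / p)             # running Monte Carlo mean
        if p > window and abs(history[-1] - history[-1 - window]) < tol * max(history[-1], 1e-12):
            break
    return history[-1], p
```

For well-behaved slice costs this terminates well before `max_proj`, so the effective number of projections adapts to the difficulty of the comparison.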
6. Adaptive Slicing and Learning the Slice Distribution
The SMix-W formalism generalizes to arbitrary (possibly data-dependent) slice distributions $\sigma$ on $\mathbb{S}^{d-1}$. The adaptive (mixed) Sliced Wasserstein (Mix-SW) distance is
$$\mathrm{SW}_2^2(\mu, \nu; \sigma) = \int_{\mathbb{S}^{d-1}} W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, d\sigma(\theta)$$
for probability measures $\mu, \nu$. This expressive framework is enabled by PAC-Bayesian generalization bounds that hold uniformly over all $\sigma$ (Ohana et al., 2022).
Optimizing the slice distribution $\sigma$ for maximal discrimination can be done via (i) parametric families such as von Mises–Fisher on $\mathbb{S}^{d-1}$, or (ii) push-forwarding angular distributions through neural networks. The learning procedure seeks
$$\max_{\sigma} \;\int_{\mathbb{S}^{d-1}} W_2^2\big(P_{\theta\#}\mu,\, P_{\theta\#}\nu\big)\, d\sigma(\theta) \;-\; \lambda\, \mathrm{KL}(\sigma \,\|\, \sigma_0),$$
where $\mathrm{KL}(\sigma \| \sigma_0)$ is the Kullback–Leibler divergence from a reference (e.g., uniform) distribution $\sigma_0$. Regularization via $\mathrm{KL}$ ensures generalization and avoids overfitting to a particular slice.
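For a finite candidate set of directions, a KL-regularized objective of this form has a closed-form Gibbs solution, a standard fact from PAC-Bayesian analysis. A sketch under that discrete-distribution assumption (not the exact procedure of the cited works):

```python
import numpy as np

def gibbs_slice_weights(costs, lam=0.1):
    """Maximize <w, costs> - lam * KL(w || uniform) over the probability simplex.
    The optimum is the Gibbs distribution w_i proportional to exp(costs_i / lam)."""
    z = np.asarray(costs, dtype=float) / lam
    z -= z.max()                  # shift before exponentiating for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

Small `lam` concentrates the weights on the best-discriminating direction (recovering max-sliced behavior), while large `lam` recovers near-uniform averaging, so the temperature interpolates between the two regimes in the table above.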
Empirical results indicate that learned or data-adaptive slicing distributions yield more discriminative and generalizable distances compared to uniform or max-slice approaches, especially in high dimensions or low sample size regimes.
7. Significance and Future Directions
The sliced mixture Wasserstein framework provides a rigorous, scalable, and discriminative alternative to classical OT-based metrics for GMMs and empirical measures. Key advantages include computational tractability, provable metric properties, flexibility to adapt slice distributions, and empirical efficacy across clustering, generative modeling, and perceptual similarity tasks. Ongoing research focuses on improving the adaptivity and robustness of slice-distribution learning, exploring non-Gaussian mixtures, and extending PAC-Bayesian generalization analysis to broader classes of optimal transport-inspired metrics (Piening et al., 11 Apr 2025, Ohana et al., 2022).