Cluster Maximum Mean Discrepancy (CMMD)

Updated 30 January 2026
  • Cluster Maximum Mean Discrepancy (CMMD) is a kernel-based statistical tool that partitions feature space into latent clusters to enable precise distribution alignment.
  • It extends the maximum mean discrepancy (MMD) by minimizing divergences between each latent cluster and a labeled anchor set, enhancing quantization and domain adaptation.
  • CMMD offers strong theoretical guarantees and empirical success in clustering, segmentation, and mixed-domain tasks by effectively managing high-dimensional, heterogeneous data.

Cluster Maximum Mean Discrepancy (CMMD) is a kernel-based statistical distance developed for robust distribution alignment in high-dimensional, multi-modal, or heterogeneous settings where standard domain labels are unavailable or insufficient. CMMD generalizes the maximum mean discrepancy (MMD) criterion through the identification of latent clusters in feature space and the subsequent minimization of MMD between each cluster and a labeled anchor set. This extension addresses the limitations of pointwise or global MMD alignment in mixed-domain, semi-supervised, and clustering tasks, yielding principled and empirically effective algorithms for quantization, clustering, and domain adaptation (Belhadji et al., 14 Feb 2025, Lam et al., 23 Jan 2026).

1. Mathematical Foundations: MMD and Its Extension to CMMD

Maximum Mean Discrepancy (MMD) assesses the divergence between two distributions $P$ and $Q$ on a space $\mathcal{X}$ as the squared distance between their reproducing kernel Hilbert space (RKHS) mean embeddings:

$$\mathrm{MMD}^2(P,Q) = \Big\|\mathbb{E}_{x\sim P}[\phi(x)] - \mathbb{E}_{y\sim Q}[\phi(y)]\Big\|_{\mathcal{H}}^2,$$

where $\phi$ is the feature map induced by a characteristic kernel $k(\cdot,\cdot)$. The empirical kernelized form is

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x,x'\sim P}[k(x,x')] - 2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x,y)] + \mathbb{E}_{y,y'\sim Q}[k(y,y')].$$

Typically, $k$ is a Gaussian kernel, $k(x, x') = \exp(-\|x-x'\|^2/2\sigma^2)$ (Lam et al., 23 Jan 2026).
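To make the empirical form concrete, the following minimal NumPy sketch (illustrative only, not code from either cited paper) computes the biased MMD$^2$ estimate between two samples with a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix with entries k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased empirical MMD^2 between samples X ~ P and Y ~ Q."""
    kxx = gaussian_kernel(X, X, sigma).mean()
    kxy = gaussian_kernel(X, Y, sigma).mean()
    kyy = gaussian_kernel(Y, Y, sigma).mean()
    return kxx - 2 * kxy + kyy

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# `same` is near zero (both samples drawn from N(0, I));
# `shifted` is substantially larger (means differ by 3 per coordinate).
```

Because $k$ is characteristic, the population MMD vanishes exactly when $P = Q$, which is why the statistic is usable as a training loss.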

Cluster Maximum Mean Discrepancy (CMMD) extends this framework by partitioning the unlabeled (or particle) feature space into $Q$ clusters via unsupervised clustering (e.g., HDBSCAN) and minimizing the average MMD between each cluster and a labeled "anchor" distribution. This hierarchical approach operationalizes alignment at the latent domain or mode level rather than globally, thus capturing complex, multi-modal structures and mitigating domain and modality biases (Lam et al., 23 Jan 2026).

2. CMMD in Distribution Quantization and Clustering

Weighted-particle quantization with MMD seeks to approximate a probability measure $\mu\in\mathcal{M}(\mathcal{X})$ by a sum of weighted Dirac measures

$$\nu = \sum_{i=1}^M w_i \delta_{x_i},$$

where $x_i\in\mathcal{X}$ and $(w_1,\ldots,w_M)\in\mathbb{R}^M$. The objective is

$$\min_{x \in \mathcal{X}^M,\, w \in \mathbb{R}^M} \frac{1}{2}\, \mathrm{MMD}_k^2\Big(\sum_i w_i \delta_{x_i},\ \mu\Big).$$

This induces emergent clustering behavior, as particles drift toward local density modes under repulsive/attractive kernel interactions. Variable weights $w_i$ encode cluster mass, and unlike $k$-means, clusters are not restricted to Voronoi cells or convex regions. The approach admits a gradient-flow interpretation in Wasserstein–Fisher–Rao geometry, discretized by a system of interacting-particle ODEs, or can be solved via the mean-shift interacting particles (MSIP) fixed-point map (Belhadji et al., 14 Feb 2025).
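A toy version of this quantization can be sketched with plain alternating optimization: a closed-form solve for the weights (the $\mathcal{O}(M^3)$ kernel-matrix solve discussed later) followed by a gradient step on positions. This is an illustrative stand-in under stated assumptions (samples approximating $\mu$, subsampled initialization), not the paper's MSIP or WFR-flow discretization:

```python
import numpy as np

rng = np.random.default_rng(1)

def kern(X, Y, sigma=0.5):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Target measure mu, approximated by samples from a two-mode mixture.
Y = np.concatenate([rng.normal(-2, 0.3, (300, 1)),
                    rng.normal(+2, 0.3, (300, 1))])
M, sigma, lr = 4, 0.5, 0.5
X = Y[rng.choice(len(Y), M, replace=False)]   # init particles on data points

def solve_weights(X):
    """Optimal weights at fixed positions: solve K_XX w = E_mu[k(x_i, .)]."""
    b = kern(X, Y, sigma).mean(1)
    return np.linalg.solve(kern(X, X, sigma) + 1e-6 * np.eye(M), b)

def quant_err(X, w):
    """1/2 * MMD_k^2(sum_i w_i delta_{x_i}, mu_hat)."""
    return 0.5 * (w @ kern(X, X, sigma) @ w
                  - 2 * w @ kern(X, Y, sigma).mean(1)
                  + kern(Y, Y, sigma).mean())

err0 = quant_err(X, np.full(M, 1.0 / M))      # baseline: uniform weights
for _ in range(200):
    w = solve_weights(X)
    # Gradient step on particle positions (Gaussian-kernel gradient).
    Kxx, Kxy = kern(X, X, sigma), kern(X, Y, sigma)
    dxx = X[:, None, :] - X[None, :, :]
    dxy = X[:, None, :] - Y[None, :, :]
    grad = w[:, None] * ((w[None, :, None] * Kxx[:, :, None] * -dxx).sum(1)
                         - (Kxy[:, :, None] * -dxy).mean(1)) / sigma**2
    X = X - lr * grad
w = solve_weights(X)
err_final = quant_err(X, w)                   # lower than err0
```

The two terms of the gradient exhibit exactly the repulsive (particle–particle) and attractive (particle–data) interactions described above.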

3. CMMD Algorithmic Pipeline in Mixed-domain Representation Alignment

For domain-invariant mixed-domain semi-supervised tasks, the CMMD block is operationalized as follows (Lam et al., 23 Jan 2026):

  • Feature Extraction and Clustering: Extract teacher-encoder features $f_u^T = E^T(x_u^w)$ for all unlabeled samples. Apply HDBSCAN with $\text{minClusterSize}=10$ and $\text{minSamples}=5$ to obtain $Q$ clusters $C_{u,1},\ldots,C_{u,Q}$ with pseudo-centroids $c_{u,i}$. Assign student-encoder features $f_u^S = E^S(x_u^w)$ to the nearest cluster centroid.
  • Anchor Selection: All labeled samples are mapped by the student encoder to form the anchor set $C_{\text{anchor}}$.
  • Loss Computation: For encoder layer $\ell$, compute MMD between each cluster $C_{u,i}^\ell$ (size $m_i$) and the anchor set (size $n$) as

$$\mathrm{MMD}_i^2 = \frac{1}{m_i(m_i-1)} \sum_{p\ne q}^{m_i} k(\mathbf{c}_p, \mathbf{c}_q) - \frac{2}{m_i n} \sum_{p=1}^{m_i}\sum_{q=1}^{n} k(\mathbf{c}_p, \mathbf{c}_q') + \frac{1}{n(n-1)} \sum_{p\ne q}^{n} k(\mathbf{c}_p', \mathbf{c}_q'),$$

with CMMD loss averaged over clusters and layers:

$$\mathcal{L}_{\mathrm{domain}} = \frac{1}{L}\sum_{\ell=1}^L \frac{1}{Q} \sum_{i=1}^Q \mathrm{MMD}_i^2.$$

  • Integration: The CMMD domain loss is added to the total objective of a teacher–student semi-supervised framework, alongside the supervised segmentation loss and a copy–paste consistency loss, using a weight $w(t)$ with Gaussian ramp-up.
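The per-cluster loss computation can be sketched as follows for a single encoder layer. Cluster labels are taken as given here (the paper obtains them with HDBSCAN; the random labels below are placeholders for shape only), and the function names are illustrative rather than the authors' code:

```python
import numpy as np

def gauss_k(X, Y, sigma):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_unbiased(A, B, sigma):
    """Unbiased MMD^2 with diagonal terms excluded, matching the per-cluster formula."""
    m, n = len(A), len(B)
    Kaa, Kbb, Kab = gauss_k(A, A, sigma), gauss_k(B, B, sigma), gauss_k(A, B, sigma)
    return ((Kaa.sum() - np.trace(Kaa)) / (m * (m - 1))
            - 2 * Kab.mean()
            + (Kbb.sum() - np.trace(Kbb)) / (n * (n - 1)))

def cmmd_loss(feats, cluster_ids, anchor, sigma=1.0):
    """Average MMD^2 between each cluster of unlabeled features and the anchor set."""
    losses = [mmd2_unbiased(feats[cluster_ids == q], anchor, sigma)
              for q in np.unique(cluster_ids) if (cluster_ids == q).sum() > 1]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
anchor = rng.normal(0.0, 1.0, (100, 4))   # labeled (anchor) features
feats = rng.normal(1.0, 1.0, (200, 4))    # unlabeled features, mean-shifted by 1
ids = rng.integers(0, 3, 200)             # placeholder labels (HDBSCAN in the paper)
loss = cmmd_loss(feats, ids, anchor)      # positive: clusters differ from the anchor
```

The full pipeline additionally averages this quantity over encoder layers $\ell = 1,\ldots,L$, as in the loss formula above.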

4. Theoretical Analysis and Algorithmic Properties

Key mathematical and algorithmic properties established for CMMD-based quantization and clustering (Belhadji et al., 14 Feb 2025):

  • Stationarity: Fixed points of the MSIP map are exactly the critical points of the quantization objective $F_M(X) = \min_w \frac{1}{2}\,\mathrm{MMD}_k^2\big(\sum_i w_i \delta_{x_i}, \mu\big)$.
  • Descent Guarantees: In the mean-field limit, Wasserstein–Fisher–Rao flows guarantee monotonic decay of $F(\rho_t)$, with exponential convergence under mild conditions.
  • Non-degeneracy: For standard kernels, the ODE and MSIP map preserve particle separation, preventing mode collapse.
  • Connection to Classical Methods: MSIP generalizes mean-shift (for $M=1$), acts as a preconditioned gradient (quasi-Newton) method, and reduces to Lloyd's $k$-means in the local-kernel limit.
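The $M=1$ special case can be illustrated with the classical Gaussian-kernel mean-shift iteration, where the fixed-point update moves a single particle to the kernel-weighted average of the data. This is a hedged sketch of the classical algorithm, not the paper's MSIP implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(2.0, 0.4, (500, 1))   # samples whose density mode is near 2.0

def mean_shift_step(x, Y, sigma=0.5):
    """Gaussian-kernel mean-shift: move x to the kernel-weighted data average."""
    w = np.exp(-((Y - x) ** 2).sum(-1) / (2 * sigma ** 2))
    return (w[:, None] * Y).sum(0) / w.sum()

x = np.array([0.0])                  # start away from the mode
for _ in range(50):
    x = mean_shift_step(x, Y)
# x has converged toward the kernel density mode (close to 2.0)
```

Fixed points of this update are exactly the stationary points of the kernel density estimate, mirroring the stationarity property stated above for general $M$.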

5. Practical Considerations and Hyperparameter Settings

Practical deployment of CMMD entails computational and algorithmic design choices:

  • Complexity: Per-iteration cost for the quantization algorithm is $\mathcal{O}(M^3 + NM + M^2 d)$, dominated by kernel-matrix inversion and feature evaluation (Belhadji et al., 14 Feb 2025).
  • Clustering: CMMD for feature alignment in mixed domains relies on HDBSCAN with $\text{minClusterSize}=10$ and $\text{minSamples}=5$ for robust cluster discovery (Lam et al., 23 Jan 2026).
  • Kernel Selection: The kernel bandwidth $\sigma$ is set via the median heuristic per encoder layer.
  • Loss Weighting: The domain-loss ramp-up follows $w(t)=\exp\big(-5(1 - t/t_{\max})^2\big)$, with $t_{\max}$ set for convergence (e.g., $30$k–$60$k iterations depending on the dataset).
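Both the ramp-up schedule and the median heuristic are straightforward to reproduce. In this sketch, the overall scale `w_max` and the behavior beyond $t_{\max}$ (holding the weight constant) are assumptions not specified in the text:

```python
import numpy as np

def rampup_weight(t, t_max, w_max=1.0):
    """Gaussian ramp-up w(t) = w_max * exp(-5 (1 - t/t_max)^2), held at w_max after t_max.
    (w_max and the post-t_max behavior are assumed, not stated in the source.)"""
    phase = 1.0 - min(t, t_max) / t_max
    return w_max * np.exp(-5.0 * phase ** 2)

def median_heuristic(X):
    """Bandwidth sigma = median pairwise Euclidean distance of the features X."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return float(np.median(d[np.triu_indices(len(X), k=1)]))

# rampup_weight starts near exp(-5) ~ 0.0067 at t = 0 and rises smoothly to w_max.
```

The slow initial ramp keeps the domain loss from dominating training before the encoder features (and hence the clusters) are meaningful.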

6. Empirical Performance and Comparative Evaluation

Empirical evaluation demonstrates the effectiveness of CMMD in several challenging settings:

  • Distribution Quantization and Clustering: On synthetic Gaussian mixtures ($d=2$ and $d=100$), MSIP achieves near-optimal MMD in fewer than 100 iterations from random initializations, outperforming competing kernel-based flows under adversarial starts. On MNIST ($d=784$), CMMD-based methods recover interpretable digit prototypes, surpassing Lloyd's $k$-means and plain mean-shift on high-dimensional data (Belhadji et al., 14 Feb 2025).
  • Mixed-domain Segmentation: In semi-supervised medical image segmentation under unknown domain shifts, augmenting a teacher–student pipeline with CMMD increases Dice and Jaccard scores (e.g., Fundus: Dice $86.99\%$ vs $86.89\%$, Jaccard $78.22\%$ vs $78.07\%$) and yields stronger domain invariance, as shown by UMAP diagnostics (Lam et al., 23 Jan 2026).
  • Ablation and Visualization: When compared to single-shot MMD alignment, CMMD’s clustering step produces more granular alignment, with clusters corresponding to latent domain or modality structure, confirmed by feature visualization and segmentation metrics.

7. Context: Relationship to Conditional MMD, Kernel Learning, and Open Problems

Conditional Maximum Mean Discrepancy (also abbreviated CMMD; the conditional moment-matching discrepancy) measures distances between conditional distributions via RKHS cross-covariance operators, and has been used to align conditional structure in discriminative tasks such as classification (Ren et al., 2020). Standard conditional MMD is sensitive to kernel choice; recent methods (e.g., KLN) learn compound kernels jointly with feature encodings to optimize intra-class and inter-class discrimination.

By contrast, Clustered Maximum Mean Discrepancy differs from operator-based CMMD in its operational focus: it clusters features in an unsupervised fashion and aligns each cluster marginal to an anchor, achieving distribution-level alignment suitable for unsupervised quantization, clustering, or domain adaptation, especially when conditional structure is unknown or unobserved.

A notable open direction, identified in (Belhadji et al., 14 Feb 2025), is the rigorous theoretical analysis of MSIP and CMMD convergence at finite $M$, as well as the generalization of these ideas to richer clustering and adaptation paradigms. The empirical evidence supports CMMD as a robust, domain-agnostic alignment and quantization tool for heterogeneous and weakly labeled datasets.
