Kernel Quantile Embeddings and Associated Probability Metrics

Published 26 May 2025 in stat.ML, cs.LG, math.ST, and stat.TH | (2505.20433v1)

Abstract: Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a statistical distance with strong theoretical and computational properties. At its core, the MMD relies on kernel mean embeddings to represent distributions as mean functions in RKHS. However, it remains unclear if the mean function is the only meaningful RKHS representation. Inspired by generalised quantiles, we introduce the notion of kernel quantile embeddings (KQEs). We then use KQEs to construct a family of distances that: (i) are probability metrics under weaker kernel conditions than MMD; (ii) recover a kernelised form of the sliced Wasserstein distance; and (iii) can be efficiently estimated with near-linear cost. Through hypothesis testing, we show that these distances offer a competitive alternative to MMD and its fast approximations.

Summary

  • The paper introduces kernel quantile embeddings (KQEs) to represent probability distributions in RKHS, proving injectivity under milder conditions than conventional methods.
  • It develops new probability metrics—e-KQD and sup-KQD—that recover sliced Wasserstein distances and interpolate between MMD and sliced Wasserstein frameworks.
  • The work presents near-linear time estimators with rigorous theoretical guarantees, demonstrating competitive performance in high-dimensional two-sample hypothesis testing.

This paper introduces Kernel Quantile Embeddings (KQEs) as a novel way to represent probability distributions in a Reproducing Kernel Hilbert Space (RKHS), offering an alternative to the widely used Kernel Mean Embeddings (KMEs). While KMEs represent a distribution as the mean function in an RKHS, KQEs leverage the concept of directional quantiles of the feature map $x \mapsto k(x, \cdot)$. This approach is motivated by the fact that, in one dimension, the set of all quantiles fully characterizes a probability distribution.

The core idea is to first map data points from the input space $X$ into the RKHS $H$ via the kernel feature map $\psi(x) = k(x, \cdot)$. This transforms the probability measure $P$ on $X$ into a pushforward measure $\psi \# P$ on $H$. A KQE of $P$ for a given quantile level $\alpha \in [0, 1]$ and a direction $u$ in the unit sphere $S_H$ of the RKHS is defined as the $\alpha$-quantile of the projected measure $\phi_u \# (\psi \# P)$ along the direction $u$, where $\phi_u(h) = \langle u, h \rangle_H$ is the projection operator on $H$. This yields an element $\rho_P^{\alpha,u} \in H$ (Equation 6), defined via its evaluation function $\rho_P^{\alpha,u}(x) = \rho^\alpha_{u \# P}\, u(x)$.
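For a fixed direction $u$, the empirical KQE along that direction is simply an order statistic of the projected sample. A minimal sketch, assuming a Gaussian kernel and representing the direction by its evaluation function (all helper names are ours, not the paper's):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)); returns the Gram matrix
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def empirical_directional_quantile(x, u_of_x, alpha):
    """alpha-quantile of the projected empirical measure phi_u # (psi # P_n).

    x:      (n, d) samples from P
    u_of_x: callable evaluating the direction, u(x) = <u, psi(x)>_H
    """
    return np.quantile(u_of_x(x), alpha)

# Toy direction: u = psi(z) / ||psi(z)||_H for a single reference point z,
# so u(x) = k(x, z) / sqrt(k(z, z)).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
z = np.zeros((1, 2))
norm_z = np.sqrt(gaussian_kernel(z, z)[0, 0])
u_of_x = lambda pts: gaussian_kernel(pts, z)[:, 0] / norm_z

median_proj = empirical_directional_quantile(x, u_of_x, alpha=0.5)
```

Because the Gaussian kernel takes values in $(0, 1]$ and $u$ has unit norm, the projected values (and hence any quantile of them) lie in $(0, 1]$ for this toy direction.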

A key theoretical contribution is the demonstration that a kernel $k$ is "quantile-characteristic" (meaning the mapping $P \mapsto \{\rho_P^{\alpha,u} : \alpha \in [0, 1],\, u \in S_H\}$ is injective) under weaker conditions (a Hausdorff, separable, $\sigma$-compact input space $X$ and a continuous, separating kernel $k$) than those required for a kernel to be mean-characteristic (Theorems 1 and 2) (2505.20433). This has practical implications: methods based on comparing KQEs can distinguish between a broader class of distributions than methods based on comparing KMEs, such as the Maximum Mean Discrepancy (MMD).

Based on KQEs, the paper proposes a family of probability metrics called Kernel Quantile Discrepancies (KQDs). Two primary types are introduced (Equation 9):

  1. Expected KQD (e-KQD): averages the difference between KQEs over directions $u \in S_H$ according to a measure $\gamma$: $\text{e-KQD}_p(P, Q; \nu, \gamma) = \left(\mathbb{E}_{u \sim \gamma}\left[\int_0^1 \big\| \rho_P^{\alpha,u} - \rho_Q^{\alpha,u} \big\|_H^p \, \nu(d\alpha)\right]\right)^{1/p}$.
  2. Supremum KQD (sup-KQD): takes the supremum of the difference between KQEs over directions $u \in S_H$: $\text{sup-KQD}_p(P, Q; \nu) = \left(\sup_{u \in S_H} \int_0^1 \big\| \rho_P^{\alpha,u} - \rho_Q^{\alpha,u} \big\|_H^p \, \nu(d\alpha)\right)^{1/p}$. Here, $\nu$ is a weighting measure on $[0, 1]$ over quantile levels $\alpha$. The paper shows that both e-KQD and sup-KQD are probability metrics under the same mild conditions as quantile-characteristic kernels (Theorem 4) (2505.20433).

The paper establishes connections between KQDs and existing probability metrics:

  • When using a linear kernel $k(x, y) = x^\top y$ and taking $\nu$ to be the Lebesgue measure, KQDs recover kernelized forms of the Sliced Wasserstein (SW) and Max-Sliced Wasserstein (max-SW) distances (Connections 1 and 2) (2505.20433).
  • Centered versions of KQDs relate to a sum of MMD and kernelized sliced Wasserstein distances, suggesting they can interpolate between MMD and SW (Connection 3) (2505.20433).
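The sliced Wasserstein connection is easy to verify numerically in the simplest setting: in one dimension, the quantile-coupling (order-statistics) formula for $W_1$ on equal-size samples matches a standard CDF-based implementation. An illustrative check, not taken from the paper:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, size=1000)
y = rng.normal(loc=0.5, size=1000)

# Quantile-coupling formula for equal-size samples:
# W_1(P_n, Q_n) = (1/n) * sum_i |x_(i) - y_(i)|, with x_(i), y_(i) order statistics.
# This is the p = 1, nu = Lebesgue instance of the quantile-difference integral.
w1_quantile = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Standard CDF-based implementation for comparison
w1_ref = wasserstein_distance(x, y)
```

The two quantities agree up to floating-point error, which is why comparing quantile functions along projection directions recovers sliced Wasserstein in this special case.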

A significant practical contribution is the development of an efficient estimator for e-KQD, particularly for $\gamma$ a Gaussian measure on $H$. Estimating the directional quantile $\rho_P^{\alpha,u}$ empirically involves computing the $\alpha$-quantile of $\{u(x_i)\}_{i=1}^n$ for samples $x_{1:n} \sim P$, which can be done efficiently using order statistics. The paper provides a consistency guarantee for this empirical KQE estimator (Theorem 3) (2505.20433), showing an $O(n^{-1/2})$ convergence rate under mild conditions.

The e-KQD estimator, presented in Algorithm 1, approximates the expectation over directions $u \sim \gamma$ using Monte Carlo sampling. To sample $u \in S_H$ from a Gaussian-induced measure $\gamma$, the paper leverages the fact that sampling from a Gaussian measure on $H$ with a specific integral covariance operator can be reduced to sampling a standard Gaussian in $\mathbb{R}^m$ and using samples $z_{1:m}$ from a reference measure $\xi$ on $X$ (Proposition 1) (2505.20433). The estimator then computes the quantile differences for each sampled direction and averages them.
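The direction-sampling step can be sketched as follows, assuming an RBF kernel (helper names are ours): draw $\lambda_{1:m} \sim N(0, \mathrm{Id}_m)$, form $f = m^{-1/2} \sum_j \lambda_j k(z_j, \cdot)$, and normalize by $\|f\|_H$ computed from the Gram matrix of the reference samples:

```python
import numpy as np

def rbf(a, b, bandwidth=1.0):
    # Gaussian RBF Gram matrix between point sets a and b
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def sample_direction_evaluator(kernel, z, rng):
    """Sample a random unit-norm RKHS direction u via the Proposition 1 reduction.

    z: (m, d) reference samples from xi. Returns a callable x -> u(x) = <u, psi(x)>_H.
    """
    m = z.shape[0]
    lam = rng.normal(size=m)                      # lambda_1:m ~ N(0, Id_m)
    # f = (1/sqrt(m)) sum_j lambda_j k(z_j, .);  ||f||_H^2 = lambda^T K_zz lambda / m
    norm_f = np.sqrt(lam @ kernel(z, z) @ lam / m)
    return lambda x: (lam @ kernel(z, x)) / (np.sqrt(m) * norm_f)

rng = np.random.default_rng(2)
z = rng.normal(size=(20, 3))
u = sample_direction_evaluator(rbf, z, rng)
vals = u(rng.normal(size=(5, 3)))   # evaluations u(x_j) of the sampled direction
```

Since $\|u\|_H = 1$ and $k(x, x) = 1$ for the RBF kernel, Cauchy-Schwarz gives $|u(x)| \le 1$, a cheap sanity check on the sampler.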

Algorithm 1 Gaussian e-KQD Estimator (Simplified)
Input: Data x_1:n ~ P, y_1:n ~ Q, reference samples z_1:m ~ xi, kernel k, density f_nu, number of projections l, power p.
Initialize e-KQD^p = 0
For i = 1 to l:
  Sample lambda_1:m ~ N(0, Id_m)
  Compute f_i_x = lambda_1:m^T * k(z_1:m, x_1:n) / sqrt(m)  # Vector of values u_i(x_j), up to scale
  Compute f_i_y = lambda_1:m^T * k(z_1:m, y_1:n) / sqrt(m)  # Vector of values u_i(y_j), up to scale
  Compute ||f_i||_H = sqrt(lambda_1:m^T * k(z_1:m, z_1:m) * lambda_1:m / m)  # Norm, up to scale
  Compute u_i_x = f_i_x / ||f_i||_H  # Projected values u_i(x_j)
  Compute u_i_y = f_i_y / ||f_i||_H  # Projected values u_i(y_j)
  Sort u_i_x and u_i_y to get order statistics [u_i(x_1:n)]_j and [u_i(y_1:n)]_j
  Initialize tau_i^p = 0
  For j = 1 to n:
    tau_i^p += (| [u_i(x_1:n)]_j - [u_i(y_1:n)]_j |)^p * f_nu(j/n) / n  # Riemann sum over quantile levels j/n
  e-KQD^p += tau_i^p / l
Return (e-KQD^p)^(1/p)
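The pseudocode above can be transcribed into a short NumPy sketch (our reading of the algorithm; function names and the uniform default for $f_\nu$ are ours, and the quantile weighting is applied as a Riemann sum over levels $j/n$):

```python
import numpy as np

def rbf(a, b, bandwidth=1.0):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def gaussian_ekqd(x, y, z, kernel=rbf, f_nu=lambda a: 1.0, l=10, p=2, seed=0):
    """Monte Carlo estimate of e-KQD_p with a Gaussian direction measure gamma.

    x, y: (n, d) samples from P and Q; z: (m, d) reference samples from xi;
    f_nu: density of the quantile-weighting measure nu on [0, 1].
    """
    rng = np.random.default_rng(seed)
    n, m = x.shape[0], z.shape[0]
    kzx, kzy, kzz = kernel(z, x), kernel(z, y), kernel(z, z)
    levels = np.arange(1, n + 1) / n                    # quantile levels j/n
    weights = np.array([f_nu(a) for a in levels])
    ekqd_p = 0.0
    for _ in range(l):
        lam = rng.normal(size=m)                        # lambda_1:m ~ N(0, Id_m)
        norm_f = np.sqrt(lam @ kzz @ lam / m)           # ||f_i||_H
        u_x = np.sort((lam @ kzx) / (np.sqrt(m) * norm_f))  # order statistics of u_i(x_j)
        u_y = np.sort((lam @ kzy) / (np.sqrt(m) * norm_f))
        # Riemann sum over quantile levels of |rho_P^{alpha,u} - rho_Q^{alpha,u}|^p nu(d alpha)
        ekqd_p += np.sum(np.abs(u_x - u_y) ** p * weights) / n
    return (ekqd_p / l) ** (1.0 / p)

rng = np.random.default_rng(3)
x = rng.normal(size=(400, 2))
y_same = rng.normal(size=(400, 2))
y_shift = rng.normal(loc=2.0, size=(400, 2))
z = np.vstack([x[:10], y_same[:10]])        # empirical reference samples for xi
d_same = gaussian_ekqd(x, y_same, z)
d_shift = gaussian_ekqd(x, y_shift, z)
```

As a quick sanity check, the estimated discrepancy between two samples of the same Gaussian should be much smaller than the discrepancy to a mean-shifted Gaussian.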

The computational complexity of this Gaussian e-KQD estimator is analyzed. With $l = O(\log n)$ projections and $m = O(\log n)$ reference samples, computing the projected values $u_i(x_{1:n})$ and $u_i(y_{1:n})$ takes $O(nm)$ time, computing the norm $\|f_i\|_H$ takes $O(m^2)$, and sorting takes $O(n \log n)$. Summing over $l$ projections gives a total complexity of $O(l \max(nm, m^2, n \log n))$. Setting $l = m = O(\log n)$ yields $O(n \log^2 n)$, which is near-linear in $n$. This is significantly more efficient than the $O(n^2)$ complexity of standard U-statistic MMD estimators or the $O(T n \log n)$ cost of optimizing max-SW/max-GSW, though generally slower than the $O(n)$ MMD-Linear estimator. The paper also provides a finite-sample consistency guarantee for the empirical e-KQD estimator, showing an $O(l^{-1/2} + n^{-1/2})$ rate (Theorem 5) (2505.20433).

The paper evaluates the proposed KQDs in the practical application of nonparametric two-sample hypothesis testing, comparing their performance (measured by rejection rate) against MMD and its fast approximations on synthetic and real-world datasets.

  • Power-decay experiment: e-KQD demonstrates better robustness to increasing dimensionality compared to MMD-Multi (a fast MMD approximation of similar complexity).
  • Laplace vs. Gaussian experiment: Using a polynomial kernel (which is not mean-characteristic but is quantile-characteristic), KQDs successfully distinguish between a Gaussian and a Laplace distribution with matching low-order moments, while MMD fails. This empirically verifies the theoretical finding on weaker characteristic conditions.
  • Real-world image data (Galaxy MNIST, CIFAR): On high-dimensional image data, the near-linear time e-KQD and sup-KQD estimators are competitive with or outperform fast MMD estimators of similar complexity. The quadratic-time centered e-KQD performs similarly to quadratic-time MMD.

The experimental results highlight that KQDs offer a compelling alternative to MMD for two-sample testing, providing competitive performance, particularly in high dimensions and in scenarios where the kernel is not mean-characteristic, while enabling efficient estimation. Future work could explore optimizing the choice of the weighting measure $\nu$ and reference measure $\xi$, developing improved estimators for KQEs, and extending the concepts to conditional settings.


Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps, limitations, and open questions that remain unresolved and could guide future research.

  • Necessity vs. sufficiency of assumptions: Theoretical results rely on Hausdorff, separable, σ-compact input spaces and continuous, separating kernels. It is unclear which conditions are truly necessary and which can be relaxed (e.g., non-separable or non-σ-compact spaces, discontinuous kernels, non-separating but structured kernels), and how results extend to Radon measures in more general topological settings.
  • Characterizing quantile-characteristic kernels: While every mean-characteristic kernel is quantile-characteristic, a full characterisation of kernels that are quantile-characteristic (but not mean-characteristic) is missing. Practical, verifiable criteria for common non-Euclidean kernels (e.g., graph, string, manifold kernels) are not provided.
  • Metric topology and convergence properties: Whether e-KQD and sup-KQD metrize weak convergence (as MMD does under characteristic kernels) is not established. The induced topologies, continuity in distribution under perturbations, and their relationship to standard modes of convergence remain open.
  • Conditions for consistency beyond densities: Consistency of empirical KQEs (Theorem 3) assumes a strictly positive density for each projected measure u#P. Extending rates and guarantees to discrete, mixed, heavy-tailed, or singular distributions—where densities may not exist or be bounded away from zero—is unresolved.
  • p>1 rate conditions and verification: The extension of finite-sample rates for KQDs when p>1 requires integrability of a nontrivial functional J_p involving level-set volumes. Practical criteria to verify these conditions for common kernels and data distributions are lacking.
  • Centering and equivariance: The impact of using uncentered KQEs (which violate location equivariance) on statistical procedures is only informally justified (cancellation “when comparing two distributions”). Formal results clarifying invariances, failure modes, and when centering is necessary for specific tasks are needed.
  • Supremum approximation quality: sup-KQD is computed by taking the maximum over finitely many sampled directions rather than solving a true supremum. There are no approximation guarantees (e.g., uniform convergence rates, covering number bounds on S_H, error bounds vs. the true sup), nor convergence analyses as l→∞.
  • Choice and effect of the direction measure γ: In infinite-dimensional RKHSs, uniform measures on S_H do not exist; Gaussian measures projected onto S_H are used instead. The statistical and geometric implications of this choice (bias in direction sampling, sensitivity to covariance operator C, and whether γ has full support in practice) are not analysed. With empirical ξ and finite m, γ may not have full support, and e-KQD may cease to be a true metric—quantifying this degradation and its impact on testing is open.
  • Role and selection of the weighting measure ν: The measure ν on quantile levels controls what parts of the distribution are emphasised, but there is no guidance on how to choose ν (uniform vs. targeted intervals, tails, robustness) or how ν affects metric properties, power, and consistency.
  • Reference measure ξ selection: The Gaussian sampling scheme uses an integral-operator covariance defined by ξ; in experiments ξ is the empirical mixture of P_n and Q_n. Theoretical and empirical effects of different choices of ξ (data-dependent vs. prior, continuous vs. empirical, support coverage issues) on bias, variance, metric validity, and test power are unstudied.
  • Computational trade-offs and parameter tuning: The near-linear estimator depends on l (number of directions) and m (reference samples) chosen as log n. Optimal l and m schedules (minimax or oracle rates), error–cost trade-offs, and principled adaptive selection strategies are not developed.
  • Norm computation bottleneck: Computing ∥f∥_H incurs O(m2) cost; scalable alternatives (e.g., fast low-rank updates, random feature norms, sketching) and their effect on estimator bias/variance are not explored.
  • Quantile computation scalability: The estimator sums across all order statistics, requiring sorting O(n log n) per direction. Alternatives (e.g., multi-quantile sketches, selection-based partial quantiles) to reduce sorting cost while preserving accuracy are not investigated.
  • Analytical null distributions: Tests rely on permutation thresholds. Asymptotic distributions (CLTs) of e-KQD/sup-KQD under H0 and H1, and bootstrap procedures enabling analytic p-values and confidence intervals are not provided.
  • Sensitivity to kernel hyperparameters: As with MMD, KQD performance depends on kernel choice and bandwidth (median heuristic used). Systematic kernel selection/tuning (e.g., power maximization, cross-validation, data-driven bandwidths) for KQDs is not developed.
  • Extensions to conditional and structured settings: The paper focuses on marginal distributions. Open questions include conditional KQEs/KQDs (for conditional independence testing, causal inference), operator-valued kernels, and embeddings for distributions on complex structured spaces (graphs, sequences, manifolds) with established computational and statistical guarantees.
  • Robustness properties: Quantiles are robust to outliers, but the robustness of KQDs (choice of ν, trimming, influence functions, breakdown points) is not theoretically or empirically analysed.
  • Connections to Wasserstein/GSW: While KQDs recover sliced and max-sliced Wasserstein in special cases, quantitative equivalence bounds, conditions for exact recovery with non-linear kernels, and error analysis versus SW/GSW as d or RKHS dimension grows remain to be derived.
  • Interpolation with Sinkhorn: The stated “mid-point interpolant” interpretation for centered KQDs (MMD + kernel-sliced Wasserstein) lacks a formal interpolation parameter and comparative theory versus entropic regularization (e.g., bias, convergence, sample complexity, and limiting regimes).
  • Graph and non-characteristic kernels: Although many graph kernels are not mean-characteristic, it remains untested whether they are quantile-characteristic in practice. Empirical and theoretical investigations on real graph/structured datasets are needed.
  • Practical guidance for ν, γ, and ξ: The method introduces three measures (quantile weights ν, direction measure γ, and reference ξ), but provides no principled recipes for their joint selection tailored to task/data, nor sensitivity analyses identifying their relative impact.
  • Finite-sample metric validity: e-KQD is a distance when ν, γ have full support; the finite-sample Monte Carlo estimator uses finite l and finite-rank γ_m. Formal results establishing when the estimator defines a pseudo-metric vs. a true metric, and how quickly metric validity is recovered as l, m increase, are missing.
  • Application breadth and benchmarks: Experiments mostly use Euclidean data and a small set of tasks; broader benchmarks (structured/non-Euclidean domains, generative modeling evaluation, domain adaptation, distribution regression) and comparisons against state-of-the-art SW/GSW/Sinkhorn/MMD variants under matched complexity remain to be performed.
