
Distributional Mean Embeddings

Updated 24 January 2026
  • Distributional mean embeddings are a framework that represents entire probability distributions in RKHS by capturing all moments and dependencies when using characteristic kernels.
  • They enable efficient estimation methods—including plug-in, inverse-probability weighting, and doubly robust strategies—alongside Nyström approximations for scalable computations.
  • Their application spans causal inference, off-policy evaluation, and distributional reinforcement learning by leveraging MMD-based hypothesis tests and full distribution mapping.

Distributional mean embeddings provide a rigorous framework for representing full probability distributions as elements in Hilbert spaces, especially Reproducing Kernel Hilbert Spaces (RKHS). This approach enables nonparametric, high-dimensional, and distribution-sensitive inference, learning, and hypothesis testing. Distinct from classical moment-based summaries, kernel mean embeddings capture all distributional information (including higher moments and dependence structures) provided the kernel is characteristic. Their applications span causal inference, off-policy evaluation, distributional reinforcement learning, density estimation, and efficient large-scale probabilistic computations.

1. Formal Definition and Properties

Let $(\mathcal{X}, k)$ denote a measurable space with a positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and associated RKHS $\mathcal{H}_k$. For any probability law $P$ on $\mathcal{X}$ such that $\int \sqrt{k(x,x)}\, dP(x) < \infty$, the (kernel) mean embedding of $P$ is

$$\mu_P := \mathbb{E}_{X \sim P}\left[ k(X, \cdot) \right] \in \mathcal{H}_k.$$

If $k$ is characteristic (e.g., Gaussian, Matérn), the mapping $P \mapsto \mu_P$ is injective: $\mu_P$ uniquely determines $P$ (Zenati et al., 3 Jun 2025; Fawkes et al., 2022). This embedding respects the geometry of probability distributions: the RKHS distance $\|\mu_P - \mu_Q\|_{\mathcal{H}_k}$ equals the maximum mean discrepancy (MMD) and metrizes weak convergence (Muzellec et al., 2021; Chatalic et al., 2022).

Empirical estimates from samples $\{x_i\}_{i=1}^n$ are given by

$$\hat{\mu}_P = \frac{1}{n} \sum_{i=1}^n k(x_i, \cdot),$$

which converges in RKHS norm at rate $O_p(n^{-1/2})$ (Zenati et al., 3 Jun 2025; Chatalic et al., 2022).
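
As a concrete illustration, the empirical embedding and the induced MMD can be computed directly from Gram matrices. The following sketch (a Gaussian kernel and synthetic data; all names are illustrative, not from the cited papers) estimates $\|\hat{\mu}_P - \hat{\mu}_Q\|_{\mathcal{H}_k}^2$ via the biased V-statistic:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix with entries k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of ||mu_P - mu_Q||^2 in the RKHS."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))   # samples from P
Y = rng.normal(0.5, 1.0, size=(500, 1))   # samples from a shifted Q
Z = rng.normal(0.0, 1.0, size=(500, 1))   # fresh samples from P

print(mmd2(X, Y))  # clearly positive: P and Q differ
print(mmd2(X, Z))  # near zero: same underlying distribution
```

Because the biased estimate is a squared RKHS norm of the difference of empirical embeddings, it is always nonnegative, and samples from the same law yield values close to zero.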

2. Methodologies and Algorithms

2.1 Estimation Approaches

Mean embeddings admit several estimation paradigms depending on the problem context:

  • Plug-in: average the feature maps of observed samples, $\hat{\mu}_P = \frac{1}{n}\sum_{i=1}^n k(x_i, \cdot)$.
  • Inverse-probability weighting (IPW): reweight feature maps by inverse (estimated) propensities when samples are observed under a shifted or logging distribution.
  • Doubly robust (DR): combine an outcome-regression embedding with an IPW correction; the estimator remains consistent if either nuisance model is correctly specified (Fawkes et al., 2022; Zenati et al., 3 Jun 2025).

2.2 Optimization Frameworks

Direct optimization of probability distributions in RKHS is enabled by sums-of-squares (SoS) parameterizations:

  • Given anchor points $Z = \{z_1, \dots, z_m\}$ and weights $w \in \mathbb{R}^m$, define

$$f_w(x) = \sum_{i,j=1}^m w_i w_j\, k(z_i, z_j)\, k(x, z_i)\, k(x, z_j),$$

which enforces nonnegativity by construction; normalizing $f_w$ then yields a valid density (Muzellec et al., 2021).

Optimization is performed by minimizing the empirical MMD between samples and SoS-parameterized densities, typically using projected gradient methods.
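
A minimal sketch of the SoS parameterization above, with anchors and weights chosen arbitrarily for illustration (the projected-gradient MMD fitting step is omitted). Since the anchor Gram matrix is positive semidefinite, $f_w$ is nonnegative pointwise and can be normalized numerically:

```python
import numpy as np

# hypothetical anchors and weights, chosen only for illustration
sigma = 0.5
Z = np.linspace(-2.0, 2.0, 5)
w = np.array([0.1, 0.5, 1.0, 0.5, 0.1])

def k(a, b):
    return np.exp(-(a - b)**2 / (2 * sigma**2))

Kzz = k(Z[:, None], Z[None, :])          # PSD Gram matrix on the anchors

def f_w(x):
    # f_w(x) = sum_{ij} w_i w_j k(z_i,z_j) k(x,z_i) k(x,z_j) = u^T Kzz u >= 0,
    # with u_i = w_i k(x, z_i)
    u = w * k(np.asarray(x)[..., None], Z)
    return np.einsum('...i,ij,...j->...', u, Kzz, u)

xs = np.linspace(-6.0, 6.0, 4001)
vals = f_w(xs)
mass = vals.sum() * (xs[1] - xs[0])      # Riemann-sum normalization on the grid
density = vals / mass

print(density.min() >= 0.0)              # nonnegative by construction
print(density.sum() * (xs[1] - xs[0]))   # 1.0 (by construction)
```

In the full method the weights $w$ (and possibly the anchors) would be fit by minimizing the empirical MMD to data under this normalization constraint, e.g. with projected gradient steps.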

3. Applications in Causal Inference and Off-Policy Evaluation

Distributional mean embeddings are foundational for nonparametric causal inference beyond average effects:

  • Counterfactual Mean Embedding (CME) & Counterfactual Policy Mean Embedding (CPME): Define global and treatment-/policy-specific mean embeddings $\mu_{Y(t)}$ in RKHS for potential outcomes (Fawkes et al., 2022; Zenati et al., 3 Jun 2025).
  • Doubly Robust Distributional Estimation: Embeddings admit DR estimators

$$\hat{\mu}_{Y(t)}^{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^n \left\{ \frac{\mathbb{1}\{T_i = t\}}{\hat{e}(X_i, t)} \left(k(Y_i, \cdot) - \hat{r}(X_i, t)\right) + \hat{r}(X_i, t) \right\}$$

with $O_p(n^{-1/2})$-consistency if either the outcome-regression model or the propensity model is correctly specified (Fawkes et al., 2022).
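
The DR embedding can be evaluated pointwise at a grid of query points. The sketch below uses simulated data and, purely for illustration, plugs in oracle nuisance functions (the true propensity and the closed-form outcome regression under a Gaussian model); it is not the estimator from the cited papers' experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 5000, 1.0
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))                # true propensity P(T=1 | X)
T = rng.binomial(1, e)
Y = X + T + rng.standard_normal(n)      # outcome model: Y = X + T + noise

y0 = np.linspace(-4.0, 6.0, 50)         # query points where the embedding is evaluated

def smooth(mean, var):
    # closed form of E[k(Y, y0)] for Y ~ N(mean, var), Gaussian kernel bandwidth sigma
    mean = np.atleast_1d(mean)
    return sigma / np.sqrt(sigma**2 + var) * np.exp(
        -(y0[None, :] - mean[:, None])**2 / (2 * (sigma**2 + var)))

kY = np.exp(-(y0[None, :] - Y[:, None])**2 / (2 * sigma**2))  # k(Y_i, y0)

# oracle nuisances for illustration: e_hat = e, r_hat(x) = E[k(Y, .) | X=x, T=1]
r_hat = smooth(X + 1.0, 1.0)            # shape (n, 50)
mu_dr = np.mean((T / e)[:, None] * (kY - r_hat) + r_hat, axis=0)

mu_true = smooth(1.0, 2.0)[0]           # Y(1) = X + 1 + noise ~ N(1, 2)
print(np.max(np.abs(mu_dr - mu_true)))  # small estimation error
```

Replacing either oracle nuisance with a consistent estimate (and the other with a misspecified one) would still leave the estimator consistent, which is the double-robustness property stated above.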

4. Distributional Mean Embeddings in Reinforcement Learning

Distributional RL extends classical value functions to capture entire return distributions:

  • Sketch Mean Embedding: States are associated with $U^\pi(x) := \mathbb{E}[\phi(G^\pi(x))] \in \mathbb{R}^m$, where $\phi$ is a feature map over returns. The policy's full return law is thus encoded nonparametrically (Wenliang et al., 2023).
  • Distributional Bellman Operators: Linear “sketch Bellman operators” are derived by approximating feature maps composed with Bellman transformations. If the linearization $\phi(r+\gamma g) = B_r\, \phi(g)$ holds, dynamic programming and TD learning admit efficient updates in embedding space (Wenliang et al., 2023).
  • Deep RL Extensions: Nonlinear (deep neural network) parameterizations of $U_\theta(s,a)$, with precomputed read-out vectors for Q-values, allow direct learning of mean embeddings, outperforming classical and quantile-based agents on the Arcade Learning Environment (Wenliang et al., 2023).
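
The linearization $\phi(r+\gamma g) = B_r\, \phi(g)$ holds exactly for the polynomial feature map $\phi(g) = (1, g, \dots, g^{m-1})$, by the binomial theorem. A toy sketch (a single-state chain with Bernoulli rewards, chosen only for illustration) runs sketch dynamic programming $U \leftarrow \mathbb{E}_r[B_r]\, U$ and recovers the moments of the return $G = r + \gamma G'$:

```python
import numpy as np
from math import comb

gamma, m = 0.9, 3                        # discount factor, sketch dimension
rewards = np.array([0.0, 1.0])           # r ~ Bernoulli(1/2), illustrative choice
probs = np.array([0.5, 0.5])

def B(r):
    # B_r for phi(g) = (1, g, ..., g^{m-1}):
    # (r + gamma*g)^k = sum_{j<=k} C(k,j) r^(k-j) gamma^j g^j
    return np.array([[comb(k, j) * r**(k - j) * gamma**j if j <= k else 0.0
                      for j in range(m)] for k in range(m)])

M = sum(p * B(r) for r, p in zip(rewards, probs))  # expected sketch operator

U = np.zeros(m)
U[0] = 1.0                               # first sketch coordinate is the constant 1
for _ in range(500):                     # sketch dynamic programming: U <- M U
    U = M @ U

print(U)  # [1, E[G], E[G^2]] for the return G = r + gamma * G'
```

Here the fixed point matches the closed-form moments $\mathbb{E}[G] = \mathbb{E}[r]/(1-\gamma)$ and $\mathbb{E}[G^2] = (\mathbb{E}[r^2] + 2\gamma\,\mathbb{E}[r]\,\mathbb{E}[G])/(1-\gamma^2)$; richer feature maps require the approximate linearizations discussed in the cited work.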

5. Efficient Computation and Large-Scale Techniques

The principal computational bottleneck is storage and manipulation of high-/infinite-dimensional embeddings:

  • Nyström Embedding (Low-Rank Compression): Subsampling $m$ landmarks, the Nyström mean embedding is

$$\tilde{\mu}_P = \sum_{p=1}^m a_p\, \phi(X_{i_p}), \qquad a = K_{mm}^{+}\, \frac{K_{mn} \mathbf{1}_n}{n},$$

with $K_{mm}$ and $K_{mn}$ the Gram matrices on landmarks and between landmarks and samples (Chatalic et al., 2022). Proper selection of $m$ (e.g., $m \sim \sqrt{n} \log n$) yields empirical error $O(n^{-1/2})$, and MMD computations among compressed embeddings likewise scale with $m$ rather than $n$.

  • Quadrature Applications: Nyström-approximated mean embeddings support kernel quadrature with explicit error bounds (Chatalic et al., 2022).
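
A sketch of the Nyström compression above (uniform landmark sampling, Gaussian kernel, synthetic data; all names illustrative): the compressed embedding $\tilde{\mu}_P$ agrees closely with the full empirical embedding $\hat{\mu}_P$ when both are evaluated at test points:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n, m = 5000, 100
X = rng.normal(size=(n, 2))
landmarks = X[rng.choice(n, m, replace=False)]     # uniform landmark sampling

K_mm = gaussian_gram(landmarks, landmarks)
K_mn = gaussian_gram(landmarks, X)
a = np.linalg.pinv(K_mm) @ (K_mn @ np.ones(n) / n)  # coefficients of mu_tilde

# compare the embeddings at test points: mu_hat(x) = (1/n) sum_i k(x_i, x)
Xtest = rng.normal(size=(20, 2))
mu_hat = gaussian_gram(Xtest, X).mean(axis=1)       # full, O(n) per evaluation
mu_tilde = gaussian_gram(Xtest, landmarks) @ a      # compressed, O(m)

print(np.max(np.abs(mu_hat - mu_tilde)))            # small approximation error
```

After the one-off $O(nm)$ compression, all downstream evaluations and MMD computations involve only the $m$ landmark coefficients.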

6. Limitations and Theoretical Nuances

  • Pre-Image/Representability: Not every element of $\mathcal{H}_k$ is the mean embedding of a probability law. The set of mean embeddings is the convex hull of the feature maps, and approximating general RKHS elements by mean embeddings is nontrivial (Muzellec et al., 2021).
  • Scalability: While the Nyström method compresses computation, uniform sampling may be statistically suboptimal compared to leverage-score or determinant-based sampling; such alternatives, however, can require challenging auxiliary computations (Chatalic et al., 2022).
  • Expressivity: Optimization in RKHS with low-rank or rank-one parameterizations may under-fit complex multimodal distributions unless sufficiently many anchors or higher-rank representations are used (Muzellec et al., 2021).
  • Large-Scale Bias/Approximation: Constants hidden in theoretical bounds, spectral decay rates, and implementation details can affect performance; slow decays or high intrinsic dimensionality may necessitate large mm in practice (Chatalic et al., 2022).

7. Broader Impact and Recent Innovations

Distributional mean embeddings enable nonparametric, distribution-sensitive inference in a variety of domains:

  • Causal Inference and Policy Evaluation: The CPME framework generalizes prior CME approaches to general discrete and continuous action spaces, delivers DR estimators with rigorous convergence rates and MMD-based distributional hypothesis tests, and supports counterfactual sampling (Zenati et al., 3 Jun 2025).
  • Distributional RL: Embedding-based sketching achieves state-of-the-art empirical performance and theoretical soundness in estimating full return distributions (Wenliang et al., 2023).
  • Language Representations: Bayesian skip-gram models instantiate distributional mean embeddings for words, capturing uncertainty and polysemy with empirically validated improvements over point estimates (Bražinskas et al., 2017).
  • Density Estimation and Generative Modeling: Optimization of SoS parameterized densities in RKHS enables robust density matching and generative modeling directly in MMD geometry (Muzellec et al., 2021).

In sum, distributional mean embeddings furnish a powerful framework for representing and manipulating entire distributions in function spaces, underpinning a broad spectrum of modern statistical, machine learning, and reinforcement learning methodology (Zenati et al., 3 Jun 2025, Fawkes et al., 2022, Wenliang et al., 2023, Chatalic et al., 2022, Muzellec et al., 2021, Bražinskas et al., 2017).
