Distributional Mean Embeddings
- Distributional mean embeddings are a framework that represents entire probability distributions in RKHS by capturing all moments and dependencies when using characteristic kernels.
- They enable efficient estimation methods—including plug-in, inverse-probability weighting, and doubly robust strategies—alongside Nyström approximations for scalable computations.
- Their application spans causal inference, off-policy evaluation, and distributional reinforcement learning by leveraging MMD-based hypothesis tests and full distribution mapping.
Distributional mean embeddings provide a rigorous framework for representing full probability distributions as elements in Hilbert spaces, especially Reproducing Kernel Hilbert Spaces (RKHS). This approach enables nonparametric, high-dimensional, and distribution-sensitive inference, learning, and hypothesis testing. Distinct from classical moment-based summaries, kernel mean embeddings capture all distributional information (including higher moments and dependence structures) provided the kernel is characteristic. Their applications span causal inference, off-policy evaluation, distributional reinforcement learning, density estimation, and efficient large-scale probabilistic computations.
1. Formal Definition and Properties
Let $(\mathcal{X}, \mathcal{B})$ denote a measurable space with a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and associated RKHS $\mathcal{H}$. For any probability law $P$ on $\mathcal{X}$ such that $\int_{\mathcal{X}} \sqrt{k(x, x)}\, dP(x) < \infty$, the (kernel) mean embedding of $P$ is
$$\mu_P = \int_{\mathcal{X}} k(\cdot, x)\, dP(x) \in \mathcal{H}.$$
If $k$ is characteristic (e.g., Gaussian, Matérn), the mapping $P \mapsto \mu_P$ is injective: $\mu_P$ uniquely determines $P$ (Zenati et al., 3 Jun 2025, Fawkes et al., 2022). This embedding respects the geometry of probability distributions, as distances in RKHS (notably, $\|\mu_P - \mu_Q\|_{\mathcal{H}}$) metrize weak convergence for suitable kernels and coincide with the maximum mean discrepancy (MMD) (Muzellec et al., 2021, Chatalic et al., 2022).
Empirical estimates from samples $x_1, \dots, x_n \sim P$ are given by
$$\hat{\mu}_P = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, x_i),$$
which converges to $\mu_P$ in RKHS norm at the rate $O_P(n^{-1/2})$ (Zenati et al., 3 Jun 2025, Chatalic et al., 2022).
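The empirical embedding never needs to be formed explicitly: inner products between embeddings reduce to Gram-matrix averages, which also yields the (biased, V-statistic) MMD estimator. A minimal sketch, assuming a Gaussian kernel; the function names and the bandwidth `sigma=1.0` are illustrative choices, not from the cited papers:

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix k(x_i, y_j) for the Gaussian (RBF) kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Squared MMD between the empirical mean embeddings of two
    samples (biased V-statistic estimator)."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2 * gaussian_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(500, 1))
B = rng.normal(0.0, 1.0, size=(500, 1))  # same law as A
C = rng.normal(3.0, 1.0, size=(500, 1))  # shifted law

# Same-law MMD should be near zero; the shifted law gives a large value.
print(mmd2(A, B), mmd2(A, C))
```

Because the estimator only touches Gram entries, its cost is $O(n^2)$ kernel evaluations, which motivates the Nyström compression discussed in Section 5.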
2. Methodologies and Algorithms
2.1 Estimation Approaches
Mean embeddings admit several estimation paradigms depending on the problem context:
- IPW (Inverse-Probability Weighting) & Regression-Only: In causal inference, counterfactual mean embeddings can be estimated by IPW, regression, or their combination in the doubly robust strategy below (Fawkes et al., 2022).
- Plug-in Estimator: For conditional mean embeddings $\mu_{Y \mid X = x}$, kernel ridge regression is used to estimate conditional operators, producing plug-in estimators for, e.g., counterfactual policy mean embeddings (CPME) (Zenati et al., 3 Jun 2025).
- Doubly Robust (DR) Estimation: By incorporating estimators for both outcome and propensity models, DR estimators maintain consistency if either model is correctly specified. In embedding space, they yield improved convergence rates and first-order bias correction (Zenati et al., 3 Jun 2025, Fawkes et al., 2022).
- Nyström Approximation: For large scale, empirical mean embeddings can be compressed via the Nyström method: projecting onto a subspace spanned by "landmark" points, with provable error rates under spectral conditions (Chatalic et al., 2022).
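The plug-in strategy above can be sketched with kernel ridge regression: the conditional mean embedding at a query point $x$ is a weighted sum of outcome features, with weights $\beta(x) = (K + n\lambda I)^{-1} k_X(x)$; applying the identity functional to the embedding recovers the conditional mean. A hedged toy example (function names, the bandwidth, and the ridge level $\lambda$ are illustrative, not from the cited papers):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def cme_weights(X, x_query, lam=1e-3, sigma=1.0):
    """Kernel-ridge weights beta(x) so that the plug-in conditional
    mean embedding is mu_{Y|X=x} ~= sum_i beta_i(x) k(., y_i)."""
    n = len(X)
    K = rbf(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), rbf(X, x_query, sigma))

# Toy data: Y = 2X + noise, so E[Y | X = 0.5] = 1.0.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 1))
Y = 2 * X + 0.05 * rng.normal(size=(300, 1))

beta = cme_weights(X, np.array([[0.5]]))
# Reading out the identity functional of the embedding recovers E[Y | x].
est = beta[:, 0] @ Y[:, 0]
print(est)  # close to 1.0
```

The same weights $\beta(x)$ applied to any feature of $Y$ (not just $Y$ itself) estimate the corresponding functional of the conditional law, which is what makes the embedding distributional rather than a point regression.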
2.2 Optimization Frameworks
Direct optimization of probability distributions in RKHS is enabled by sums-of-squares (SoS) parameterizations:
- Given anchor points $z_1, \dots, z_m$ and a positive semidefinite matrix $A \in \mathbb{R}^{m \times m}$, define
$$p(x) = \sum_{i, j = 1}^{m} A_{ij}\, k(x, z_i)\, k(x, z_j),$$
enforcing nonnegativity (via $A \succeq 0$) and normalization for densities (Muzellec et al., 2021).
Optimization is performed by minimizing the empirical MMD between samples and SoS-parameterized densities, typically using projected gradient methods.
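The SoS construction guarantees pointwise nonnegativity by design: writing $A = B B^\top$, the model is $p(x) = \phi(x)^\top A\, \phi(x) = \|B^\top \phi(x)\|^2 \ge 0$ for every $x$. A small numerical check, assuming Gaussian kernel features; the anchor placement and bandwidth are illustrative, and normalization is omitted (in the SoS framework it is enforced by an additional linear constraint on $A$):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.uniform(-2, 2, size=(8, 1))  # anchor points z_1..z_m (illustrative)
B = rng.normal(size=(8, 8))
A = B @ B.T                          # PSD coefficient matrix A = B B^T

def phi(x, sigma=0.5):
    """Feature vector (k(x, z_1), ..., k(x, z_m)) for a Gaussian kernel."""
    return np.exp(-((x - Z[:, 0]) ** 2) / (2 * sigma ** 2))

def p_unnormalized(x):
    """SoS model phi(x)^T A phi(x); nonnegative for every x since A >= 0."""
    v = phi(x)
    return v @ A @ v

vals = np.array([p_unnormalized(x) for x in np.linspace(-3, 3, 200)])
print(vals.min())  # nonnegative (up to float error) by construction
```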
3. Applications in Causal Inference and Off-Policy Evaluation
Distributional mean embeddings are foundational for nonparametric causal inference beyond average effects:
- Counterfactual Mean Embedding (CME) & Policy Mean Embedding (CPME): Defines global and treatment-/policy-specific mean embeddings in RKHS for potential outcomes (Fawkes et al., 2022, Zenati et al., 3 Jun 2025).
- Doubly Robust Distributional Estimation: Embeddings admit DR (augmented-IPW-style) estimators of the form
$$\hat{\mu}^{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \hat{\mu}_{Y \mid X}(X_i) + \frac{T_i}{\hat{e}(X_i)} \big( k(\cdot, Y_i) - \hat{\mu}_{Y \mid X}(X_i) \big) \Big],$$
combining an outcome model $\hat{\mu}_{Y \mid X}$ and a propensity model $\hat{e}$, with $\sqrt{n}$-consistency if either model is correctly specified (Fawkes et al., 2022).
- Permutation Tests: Testing for causal or policy effects on the entire distribution employs the MMD between counterfactual embeddings as a test statistic (Fawkes et al., 2022, Zenati et al., 3 Jun 2025). Asymptotically normal, DR kernel-based tests support confidence intervals for distributional quantities (Zenati et al., 3 Jun 2025).
- Generating Pseudo-Samples: Kernel herding allows generation of pseudo-samples from the embedded distribution, enabling visualization and downstream tasks (Zenati et al., 3 Jun 2025).
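The MMD-based tests above can be sketched with the generic two-sample permutation scheme: reshuffle the pooled sample to build the null distribution of the statistic. This is the standard permutation recipe, not the specific asymptotically normal DR test of the cited work; sample sizes, bandwidth, and permutation count are illustrative:

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    return (rbf(X, X, sigma).mean() + rbf(Y, Y, sigma).mean()
            - 2 * rbf(X, Y, sigma).mean())

def mmd_permutation_pvalue(X, Y, n_perm=200, sigma=1.0, seed=0):
    """Two-sample permutation test: reshuffle the pooled sample to
    build the null distribution of the MMD statistic."""
    rng = np.random.default_rng(seed)
    pooled, n = np.vstack([X, Y]), len(X)
    observed = mmd2(X, Y, sigma)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        exceed += mmd2(pooled[idx[:n]], pooled[idx[n:]], sigma) >= observed
    return (1 + exceed) / (1 + n_perm)  # add-one smoothing

rng = np.random.default_rng(5)
p_diff = mmd_permutation_pvalue(rng.normal(0.0, 1, (100, 1)),
                                rng.normal(1.5, 1, (100, 1)))
print(p_diff)  # small: the mean shift is detected
```

Because the test statistic compares whole embeddings, it is sensitive to any distributional difference a characteristic kernel can detect, not only mean shifts.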
4. Distributional Mean Embeddings in Reinforcement Learning
Distributional RL extends classical value functions to capture entire return distributions:
- Sketch Mean Embedding: Each state $s$ is associated with a sketch $\psi_\pi(s) = \mathbb{E}_\pi[\phi(G) \mid s]$, where $\phi$ is a feature map over returns $G$. The policy's full return law is thus encoded nonparametrically (Wenliang et al., 2023).
- Distributional Bellman Operators: Linear “sketch Bellman operators” are derived by approximating feature maps composed with Bellman transformations. If linearization holds (i.e., $\phi(r + \gamma g) \approx M_{r,\gamma}\, \phi(g)$ for some matrix $M_{r,\gamma}$), dynamic programming and TD learning admit efficient updates in embedding space (Wenliang et al., 2023).
- Deep RL Extensions: Nonlinear (deep neural network) parameterizations of the sketch $\psi$, with precomputed read-out vectors for Q-values, allow direct learning of mean embeddings, outperforming classical and quantile-based agents on the Arcade Learning Environment (Wenliang et al., 2023).
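The linearization requirement can be made concrete with moment features: for $\phi(g) = (1, g, g^2)$, the Bellman map $g \mapsto r + \gamma g$ acts exactly linearly on $\phi$ via a binomial-expansion matrix, so dynamic programming runs directly on the sketch. A toy single-state example constructed for illustration (the cited work uses richer feature maps than raw moments):

```python
import numpy as np

gamma, r = 0.5, 1.0
# Moment feature map phi(g) = (1, g, g^2). The Bellman map g -> r + gamma*g
# acts linearly on phi via binomial expansion: phi(r + gamma*g) = M phi(g).
M = np.array([
    [1.0,   0.0,           0.0],
    [r,     gamma,         0.0],
    [r * r, 2 * r * gamma, gamma ** 2],
])

psi = np.array([1.0, 0.0, 0.0])  # initial sketch of the return distribution
for _ in range(100):
    psi = M @ psi                # sketch Bellman (dynamic-programming) update

# Deterministic single-state chain: G = sum_t gamma^t r = 2,
# so the moment sketch converges to (1, E[G], E[G^2]) = (1, 2, 4).
print(psi)
```

The contraction comes from the powers of $\gamma$ on the diagonal of $M$, mirroring the usual Bellman contraction argument in embedding space.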
5. Efficient Computation and Large-Scale Techniques
The principal computational bottleneck is storage and manipulation of high-/infinite-dimensional embeddings:
- Nyström Embedding (Low-Rank Compression): Subsampling $m$ landmark points $\tilde{x}_1, \dots, \tilde{x}_m$, the Nyström mean embedding is
$$\hat{\mu}^{\mathrm{Nys}} = \sum_{j=1}^{m} \alpha_j\, k(\cdot, \tilde{x}_j), \qquad \alpha = K_{mm}^{+} K_{mn} \mathbf{1}_n / n,$$
with $K_{mm}$, $K_{mn}$ the Gram matrices on landmarks and samples (Chatalic et al., 2022). Proper selection of $m$ (e.g., $m = O(\sqrt{n} \log n)$ under suitable spectral decay) preserves the empirical error rate $O(n^{-1/2})$, and MMD computations among compressed embeddings similarly scale with $m$ rather than $n$.
- Quadrature Applications: Nyström-approximated mean embeddings support kernel quadrature with explicit error bounds (Chatalic et al., 2022).
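The Nyström compression amounts to orthogonally projecting the empirical embedding onto the span of the landmark features; the coefficients solve a small $m \times m$ system. A sketch with uniform landmark sampling (sample sizes and bandwidth are illustrative; the full Gram matrix is formed here only to measure the projection error):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
m = 100
landmarks = X[rng.choice(len(X), size=m, replace=False)]  # uniform subsampling

Kmm = rbf(landmarks, landmarks)
Kmn_mean = rbf(landmarks, X).mean(axis=1)  # K_mn 1_n / n
alpha = np.linalg.pinv(Kmm) @ Kmn_mean     # Nystrom coefficients

# Squared RKHS distance between the full empirical embedding and its
# Nystrom projection: ||mu_hat||^2 - 2 <mu_hat, mu_nys> + ||mu_nys||^2.
err_sq = rbf(X, X).mean() - 2 * alpha @ Kmn_mean + alpha @ Kmm @ alpha
print(err_sq)  # small: the landmarks capture the embedding almost exactly
```

After compression, downstream MMD and quadrature computations only touch the $m$-dimensional coefficient vector $\alpha$, never the full sample.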
6. Limitations and Theoretical Nuances
- Pre-Image/Representability: Not every element of $\mathcal{H}$ is the mean embedding of a probability law. The set of mean embeddings is the closed convex hull of the feature maps $\{k(\cdot, x)\}_{x \in \mathcal{X}}$, and approximation of general RKHS elements by mean embeddings is nontrivial (Muzellec et al., 2021).
- Scalability: While the Nyström method compresses computation, uniform sampling may be statistically suboptimal compared to leverage-score or determinant-based sampling; such alternatives, however, can require challenging auxiliary computations (Chatalic et al., 2022).
- Expressivity: Optimization in RKHS with low-rank or rank-one parameterizations may under-fit complex multimodal distributions unless sufficient anchors are chosen or higher-rank representations are utilized (Muzellec et al., 2021).
- Large-Scale Bias/Approximation: Constants hidden in theoretical bounds, spectral decay rates, and implementation details can affect performance; slow spectral decay or high intrinsic dimensionality may necessitate a large number of landmarks $m$ in practice (Chatalic et al., 2022).
7. Broader Impact and Recent Innovations
Distributional mean embeddings enable nonparametric, distribution-sensitive inference in a variety of domains:
- Causal Inference and Policy Evaluation: The CPME framework generalizes prior CME approaches to general discrete/continuous action spaces, delivers DR estimators with rigorous convergence rates and MMD-based distributional hypothesis tests, and supports counterfactual sampling (Zenati et al., 3 Jun 2025).
- Distributional RL: Embedding-based sketching achieves state-of-the-art empirical performance and theoretical soundness in estimating full return distributions (Wenliang et al., 2023).
- Language Representations: Bayesian skip-gram models instantiate distributional mean embeddings for words, capturing uncertainty and polysemy with empirically validated improvements over point estimates (Bražinskas et al., 2017).
- Density Estimation and Generative Modeling: Optimization of SoS parameterized densities in RKHS enables robust density matching and generative modeling directly in MMD geometry (Muzellec et al., 2021).
In sum, distributional mean embeddings furnish a powerful framework for representing and manipulating entire distributions in function spaces, underpinning a broad spectrum of modern statistical, machine learning, and reinforcement learning methodology (Zenati et al., 3 Jun 2025, Fawkes et al., 2022, Wenliang et al., 2023, Chatalic et al., 2022, Muzellec et al., 2021, Bražinskas et al., 2017).