Distributional Mean Embeddings
- Distributional mean embeddings are a framework that represents entire probability distributions in RKHS by capturing all moments and dependencies when using characteristic kernels.
- They enable efficient estimation methods—including plug-in, inverse-probability weighting, and doubly robust strategies—alongside Nyström approximations for scalable computations.
- Their application spans causal inference, off-policy evaluation, and distributional reinforcement learning by leveraging MMD-based hypothesis tests and full distribution mapping.
Distributional mean embeddings provide a rigorous framework for representing full probability distributions as elements in Hilbert spaces, especially Reproducing Kernel Hilbert Spaces (RKHS). This approach enables nonparametric, high-dimensional, and distribution-sensitive inference, learning, and hypothesis testing. Distinct from classical moment-based summaries, kernel mean embeddings capture all distributional information (including higher moments and dependence structures) provided the kernel is characteristic. Their applications span causal inference, off-policy evaluation, distributional reinforcement learning, density estimation, and efficient large-scale probabilistic computations.
1. Formal Definition and Properties
Let $(\mathcal{X}, \mathcal{B})$ denote a measurable space with a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and associated RKHS $\mathcal{H}$. For any probability law $P$ on $\mathcal{X}$ such that $\int_{\mathcal{X}} \sqrt{k(x, x)}\, dP(x) < \infty$, the (kernel) mean embedding of $P$ is
$$\mu_P = \int_{\mathcal{X}} k(\cdot, x)\, dP(x) \in \mathcal{H}.$$
If $k$ is characteristic (e.g., Gaussian, Matérn), the mapping $P \mapsto \mu_P$ is injective: $\mu_P$ uniquely determines $P$ (Zenati et al., 3 Jun 2025, Fawkes et al., 2022). This embedding respects the geometry of probability distributions, as distances in RKHS (notably, $\|\mu_P - \mu_Q\|_{\mathcal{H}}$) metrize weak convergence for suitable kernels and coincide with the maximum mean discrepancy (MMD) (Muzellec et al., 2021, Chatalic et al., 2022).
Empirical estimates from samples $x_1, \dots, x_n \sim P$ are given by
$$\hat{\mu}_P = \frac{1}{n} \sum_{i=1}^{n} k(\cdot, x_i),$$
which converges to $\mu_P$ in RKHS norm at the rate $O_P(n^{-1/2})$ (Zenati et al., 3 Jun 2025, Chatalic et al., 2022).
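The empirical embedding never needs to be formed explicitly: inner products between embeddings reduce to Gram-matrix averages, which also yields the (biased, V-statistic) MMD estimator. A minimal sketch, assuming a Gaussian kernel; the function names and the bandwidth `sigma=1.0` are illustrative choices, not from the cited papers:

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix k(x_i, y_j) for the Gaussian (RBF) kernel."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Squared MMD between the empirical mean embeddings of two
    samples (biased V-statistic estimator)."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2 * gaussian_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(500, 1))
B = rng.normal(0.0, 1.0, size=(500, 1))  # same law as A
C = rng.normal(3.0, 1.0, size=(500, 1))  # shifted law

# Same-law MMD should be near zero; the shifted law gives a large value.
print(mmd2(A, B), mmd2(A, C))
```

Because the estimator only touches Gram entries, its cost is $O(n^2)$ kernel evaluations, which motivates the Nyström compression discussed in Section 5.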
2. Methodologies and Algorithms
2.1 Estimation Approaches
Mean embeddings admit several estimation paradigms depending on the problem context:
- IPW (Inverse-Probability Weighting) & Regression-Only: In causal inference, counterfactual mean embeddings can be estimated by IPW, regression, or their combination in the doubly robust strategy below (Fawkes et al., 2022).
- Plug-in Estimator: For conditional mean embeddings $\mu_{Y \mid X = x}$, kernel ridge regression is used to estimate conditional operators, producing plug-in estimators for, e.g., counterfactual policy mean embeddings (CPME) (Zenati et al., 3 Jun 2025).
- Doubly Robust (DR) Estimation: By incorporating estimators for both outcome and propensity models, DR estimators maintain consistency if either model is correctly specified. In embedding space, they yield improved convergence rates and first-order bias correction (Zenati et al., 3 Jun 2025, Fawkes et al., 2022).
- Nyström Approximation: For large scale, empirical mean embeddings can be compressed via the Nyström method: projecting onto a subspace spanned by "landmark" points, with provable error rates under spectral conditions (Chatalic et al., 2022).
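The plug-in strategy above can be sketched with kernel ridge regression: the conditional mean embedding at a query point $x$ is a weighted sum of outcome features, with weights $\beta(x) = (K + n\lambda I)^{-1} k_X(x)$; applying the identity functional to the embedding recovers the conditional mean. A hedged toy example (function names, the bandwidth, and the ridge level $\lambda$ are illustrative, not from the cited papers):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def cme_weights(X, x_query, lam=1e-3, sigma=1.0):
    """Kernel-ridge weights beta(x) so that the plug-in conditional
    mean embedding is mu_{Y|X=x} ~= sum_i beta_i(x) k(., y_i)."""
    n = len(X)
    K = rbf(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), rbf(X, x_query, sigma))

# Toy data: Y = 2X + noise, so E[Y | X = 0.5] = 1.0.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 1))
Y = 2 * X + 0.05 * rng.normal(size=(300, 1))

beta = cme_weights(X, np.array([[0.5]]))
# Reading out the identity functional of the embedding recovers E[Y | x].
est = beta[:, 0] @ Y[:, 0]
print(est)  # close to 1.0
```

The same weights $\beta(x)$ applied to any feature of $Y$ (not just $Y$ itself) estimate the corresponding functional of the conditional law, which is what makes the embedding distributional rather than a point regression.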
2.2 Optimization Frameworks
Direct optimization of probability distributions in RKHS is enabled by sums-of-squares (SoS) parameterizations:
- Given anchor points $z_1, \dots, z_m$ and a positive semidefinite matrix $A \in \mathbb{R}^{m \times m}$, define
$$p(x) = \sum_{i, j = 1}^{m} A_{ij}\, k(x, z_i)\, k(x, z_j),$$
enforcing nonnegativity (via $A \succeq 0$) and normalization for densities (Muzellec et al., 2021).
Optimization is performed by minimizing the empirical MMD between samples and SoS-parameterized densities, typically using projected gradient methods.
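The SoS construction guarantees pointwise nonnegativity by design: writing $A = B B^\top$, the model is $p(x) = \phi(x)^\top A\, \phi(x) = \|B^\top \phi(x)\|^2 \ge 0$ for every $x$. A small numerical check, assuming Gaussian kernel features; the anchor placement and bandwidth are illustrative, and normalization is omitted (in the SoS framework it is enforced by an additional linear constraint on $A$):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.uniform(-2, 2, size=(8, 1))  # anchor points z_1..z_m (illustrative)
B = rng.normal(size=(8, 8))
A = B @ B.T                          # PSD coefficient matrix A = B B^T

def phi(x, sigma=0.5):
    """Feature vector (k(x, z_1), ..., k(x, z_m)) for a Gaussian kernel."""
    return np.exp(-((x - Z[:, 0]) ** 2) / (2 * sigma ** 2))

def p_unnormalized(x):
    """SoS model phi(x)^T A phi(x); nonnegative for every x since A >= 0."""
    v = phi(x)
    return v @ A @ v

vals = np.array([p_unnormalized(x) for x in np.linspace(-3, 3, 200)])
print(vals.min())  # nonnegative (up to float error) by construction
```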
3. Applications in Causal Inference and Off-Policy Evaluation
Distributional mean embeddings are foundational for nonparametric causal inference beyond average effects:
- Counterfactual Mean Embedding (CME) & Policy Mean Embedding (CPME): Defines global and treatment-/policy-specific mean embeddings in RKHS for potential outcomes (Fawkes et al., 2022, Zenati et al., 3 Jun 2025).
- Doubly Robust Distributional Estimation: Embeddings admit DR (augmented-IPW-style) estimators of the form
$$\hat{\mu}^{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \Big[ \hat{\mu}_{Y \mid X}(X_i) + \frac{T_i}{\hat{e}(X_i)} \big( k(\cdot, Y_i) - \hat{\mu}_{Y \mid X}(X_i) \big) \Big],$$
combining an outcome model $\hat{\mu}_{Y \mid X}$ and a propensity model $\hat{e}$, with $\sqrt{n}$-consistency if either model is correctly specified (Fawkes et al., 2022).
- Permutation Tests: Testing for causal or policy effects on the entire distribution employs the MMD between counterfactual embeddings as a test statistic (Fawkes et al., 2022, Zenati et al., 3 Jun 2025). Asymptotically normal, DR kernel-based tests support confidence intervals for distributional quantities (Zenati et al., 3 Jun 2025).
- Generating Pseudo-Samples: Kernel herding allows generation of pseudo-samples from the embedded distribution, enabling visualization and downstream tasks (Zenati et al., 3 Jun 2025).
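The MMD-based tests above can be sketched with the generic two-sample permutation scheme: reshuffle the pooled sample to build the null distribution of the statistic. This is the standard permutation recipe, not the specific asymptotically normal DR test of the cited work; sample sizes, bandwidth, and permutation count are illustrative:

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    return (rbf(X, X, sigma).mean() + rbf(Y, Y, sigma).mean()
            - 2 * rbf(X, Y, sigma).mean())

def mmd_permutation_pvalue(X, Y, n_perm=200, sigma=1.0, seed=0):
    """Two-sample permutation test: reshuffle the pooled sample to
    build the null distribution of the MMD statistic."""
    rng = np.random.default_rng(seed)
    pooled, n = np.vstack([X, Y]), len(X)
    observed = mmd2(X, Y, sigma)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        exceed += mmd2(pooled[idx[:n]], pooled[idx[n:]], sigma) >= observed
    return (1 + exceed) / (1 + n_perm)  # add-one smoothing

rng = np.random.default_rng(5)
p_diff = mmd_permutation_pvalue(rng.normal(0.0, 1, (100, 1)),
                                rng.normal(1.5, 1, (100, 1)))
print(p_diff)  # small: the mean shift is detected
```

Because the test statistic compares whole embeddings, it is sensitive to any distributional difference a characteristic kernel can detect, not only mean shifts.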
4. Distributional Mean Embeddings in Reinforcement Learning
Distributional RL extends classical value functions to capture entire return distributions:
- Sketch Mean Embedding: Each state $s$ is associated with a sketch $\psi_\pi(s) = \mathbb{E}_\pi[\phi(G) \mid s]$, where $\phi$ is a feature map over returns $G$. The policy's full return law is thus encoded nonparametrically (Wenliang et al., 2023).
- Distributional Bellman Operators: Linear “sketch Bellman operators” are derived by approximating feature maps composed with Bellman transformations. If linearization holds (i.e., $\phi(r + \gamma g) \approx M_{r,\gamma}\, \phi(g)$ for some matrix $M_{r,\gamma}$), dynamic programming and TD learning admit efficient updates in embedding space (Wenliang et al., 2023).
- Deep RL Extensions: Nonlinear (deep neural network) parameterizations of the sketch $\psi$, with precomputed read-out vectors for Q-values, allow direct learning of mean embeddings, outperforming classical and quantile-based agents on the Arcade Learning Environment (Wenliang et al., 2023).
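The linearization requirement can be made concrete with moment features: for $\phi(g) = (1, g, g^2)$, the Bellman map $g \mapsto r + \gamma g$ acts exactly linearly on $\phi$ via a binomial-expansion matrix, so dynamic programming runs directly on the sketch. A toy single-state example constructed for illustration (the cited work uses richer feature maps than raw moments):

```python
import numpy as np

gamma, r = 0.5, 1.0
# Moment feature map phi(g) = (1, g, g^2). The Bellman map g -> r + gamma*g
# acts linearly on phi via binomial expansion: phi(r + gamma*g) = M phi(g).
M = np.array([
    [1.0,   0.0,           0.0],
    [r,     gamma,         0.0],
    [r * r, 2 * r * gamma, gamma ** 2],
])

psi = np.array([1.0, 0.0, 0.0])  # initial sketch of the return distribution
for _ in range(100):
    psi = M @ psi                # sketch Bellman (dynamic-programming) update

# Deterministic single-state chain: G = sum_t gamma^t r = 2,
# so the moment sketch converges to (1, E[G], E[G^2]) = (1, 2, 4).
print(psi)
```

The contraction comes from the powers of $\gamma$ on the diagonal of $M$, mirroring the usual Bellman contraction argument in embedding space.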
5. Efficient Computation and Large-Scale Techniques
The principal computational bottleneck is storage and manipulation of high-/infinite-dimensional embeddings:
- Nyström Embedding (Low-Rank Compression): Subsampling $m$ landmark points $\tilde{x}_1, \dots, \tilde{x}_m$, the Nyström mean embedding is
$$\hat{\mu}^{\mathrm{Nys}} = \sum_{j=1}^{m} \alpha_j\, k(\cdot, \tilde{x}_j), \qquad \alpha = K_{mm}^{+} K_{mn} \mathbf{1}_n / n,$$
with $K_{mm}$, $K_{mn}$ the Gram matrices on landmarks and samples (Chatalic et al., 2022). Proper selection of $m$ (e.g., $m = O(\sqrt{n} \log n)$ under suitable spectral decay) preserves the empirical error rate $O(n^{-1/2})$, and MMD computations among compressed embeddings similarly scale with $m$ rather than $n$.
- Quadrature Applications: Nyström-approximated mean embeddings support kernel quadrature with explicit error bounds (Chatalic et al., 2022).
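The Nyström compression amounts to orthogonally projecting the empirical embedding onto the span of the landmark features; the coefficients solve a small $m \times m$ system. A sketch with uniform landmark sampling (sample sizes and bandwidth are illustrative; the full Gram matrix is formed here only to measure the projection error):

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 2))
m = 100
landmarks = X[rng.choice(len(X), size=m, replace=False)]  # uniform subsampling

Kmm = rbf(landmarks, landmarks)
Kmn_mean = rbf(landmarks, X).mean(axis=1)  # K_mn 1_n / n
alpha = np.linalg.pinv(Kmm) @ Kmn_mean     # Nystrom coefficients

# Squared RKHS distance between the full empirical embedding and its
# Nystrom projection: ||mu_hat||^2 - 2 <mu_hat, mu_nys> + ||mu_nys||^2.
err_sq = rbf(X, X).mean() - 2 * alpha @ Kmn_mean + alpha @ Kmm @ alpha
print(err_sq)  # small: the landmarks capture the embedding almost exactly
```

After compression, downstream MMD and quadrature computations only touch the $m$-dimensional coefficient vector $\alpha$, never the full sample.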
6. Limitations and Theoretical Nuances
- Pre-Image/Representability: Not every element of $\mathcal{H}$ is the mean embedding of a probability law. The set of mean embeddings is the closed convex hull of the feature maps $\{k(\cdot, x)\}_{x \in \mathcal{X}}$, and approximation of general RKHS elements by mean embeddings is nontrivial (Muzellec et al., 2021).
- Scalability: While the Nyström method compresses computation, uniform sampling may be statistically suboptimal compared to leverage-score or determinant-based sampling; such alternatives, however, can require challenging auxiliary computations (Chatalic et al., 2022).
- Expressivity: Optimization in RKHS with low-rank or rank-one parameterizations may under-fit complex multimodal distributions unless sufficient anchors are chosen or higher-rank representations are utilized (Muzellec et al., 2021).
- Large-Scale Bias/Approximation: Constants hidden in theoretical bounds, spectral decay rates, and implementation details can affect performance; slow spectral decay or high intrinsic dimensionality may necessitate a large number of landmarks $m$ in practice (Chatalic et al., 2022).
7. Broader Impact and Recent Innovations
Distributional mean embeddings enable nonparametric, distribution-sensitive inference in a variety of domains:
- Causal Inference and Policy Evaluation: The CPME framework generalizes prior CME approaches to general discrete/continuous action spaces, delivers DR estimators with rigorous convergence rates and MMD-based distributional hypothesis tests, and supports counterfactual sampling (Zenati et al., 3 Jun 2025).
- Distributional RL: Embedding-based sketching achieves state-of-the-art empirical performance and theoretical soundness in estimating full return distributions (Wenliang et al., 2023).
- Language Representations: Bayesian skip-gram models instantiate distributional mean embeddings for words, capturing uncertainty and polysemy with empirically validated improvements over point estimates (Bražinskas et al., 2017).
- Density Estimation and Generative Modeling: Optimization of SoS parameterized densities in RKHS enables robust density matching and generative modeling directly in MMD geometry (Muzellec et al., 2021).
In sum, distributional mean embeddings furnish a powerful framework for representing and manipulating entire distributions in function spaces, underpinning a broad spectrum of modern statistical, machine learning, and reinforcement learning methodology (Zenati et al., 3 Jun 2025, Fawkes et al., 2022, Wenliang et al., 2023, Chatalic et al., 2022, Muzellec et al., 2021, Bražinskas et al., 2017).