Generative Distribution Embeddings: A Framework for Modeling Population-Level Structure
The study introduces Generative Distribution Embeddings (GDEs), a framework that lifts autoencoders from the space of individual data points to the space of distributions. Where conventional machinery learns representations of single samples, GDEs learn representations of entire datasets, a shift that answers a real need for hierarchical data modeling in fields such as computational biology, genomics, and medicine. The framework couples distribution-invariant encoders with conditional generative models to capture the structure inherent to distribution-level data while abstracting away sampling noise, enabling robust inference in complex data environments.
Framework Description
GDEs lift the conventional autoencoder architecture to the level of distributions: the encoder processes a set of input samples, and the decoder acts as a conditional generative model that produces a distribution matching those samples. Strong distributional representations are obtained by coupling conditional generative models with encoders that satisfy distributional invariance, meaning the encoder depends only on the empirical distribution induced by the finite observations, not on their order or multiplicity.
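The invariance property described above can be made concrete with a minimal sketch. The snippet below uses mean pooling over per-sample features, one standard way to build a permutation-invariant set encoder; the feature map `phi` and the pooling choice are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def phi(x):
    # Hypothetical per-sample feature map (stands in for a learned network).
    return np.tanh(x)

def encode(samples):
    # Distribution-invariant encoder: mean-pool per-sample features.
    # The output depends only on the empirical distribution of the set,
    # not on the order in which samples appear.
    return phi(samples).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 3))        # a set of 128 samples in R^3
z = encode(X)
z_perm = encode(rng.permutation(X))  # same set, reordered
assert np.allclose(z, z_perm)        # permutation invariance holds
```

A conditional generative decoder would then take the embedding `z` as conditioning input and be trained so that its samples match the input set's distribution.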
Benchmarking and Applications
The authors systematically evaluate GDEs on both synthetic datasets and real-world biological domains, showing that GDEs retain more distributional information than existing methods. Notable performance improvements are reported in synthetic benchmarks on multivariate Gaussian distributions and in practical applications spanning cell population modeling, gene expression prediction, and sequence analysis in genomics.
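A synthetic benchmark of this kind treats each Gaussian as one data item, so the dataset is a collection of sample sets rather than a collection of points. The sketch below illustrates that setup; the sizes, dimensions, and covariance construction are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

def make_gaussian_set_dataset(n_dists=100, n_samples=64, dim=2, seed=0):
    """Toy distribution-level dataset: each item is a set of samples
    drawn from its own random multivariate Gaussian."""
    rng = np.random.default_rng(seed)
    sets, params = [], []
    for _ in range(n_dists):
        mean = rng.normal(size=dim)
        A = rng.normal(size=(dim, dim))
        cov = A @ A.T + np.eye(dim)  # random symmetric positive-definite covariance
        sets.append(rng.multivariate_normal(mean, cov, size=n_samples))
        params.append((mean, cov))
    return sets, params

sets, params = make_gaussian_set_dataset()
# 100 sets, each holding 64 two-dimensional samples
assert len(sets) == 100 and sets[0].shape == (64, 2)
```

An encoder trained on such data can then be probed for how well its embeddings recover the underlying means and covariances.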
Key Applications Include:
Lineage Tracing in Single-Cell RNA Sequencing Data: By modeling clonal populations rather than individual cells, GDEs facilitate prediction of future genomic states.
Single-Cell Transcriptome Responses to Genetic Perturbations: GDEs accurately predict the transcriptional effects of gene knockouts by modeling higher-order data distributions rather than individual cells.
DNA Sequence Analysis for Methylation Patterns: GDEs discern tissue-specific methylation directly from bisulfite sequencing reads without preprocessing, overcoming traditional barriers to comprehensive data analysis.
Optimization of Image-Based Genetic Screening: GDEs are effective in reconstructing cellular phenotypic features induced by perturbations, with demonstrated high fidelity across perturbation types.
Protein Sequence Analysis: By modeling SARS-CoV-2 spike protein sequences across time and location, GDEs reveal spatiotemporal patterns in viral sequence variation.
Theoretical Insights and Implications
The foundational theory rests on two main results. First, GDEs learn predictive sufficient statistics, abstracting away sampling noise while retaining the essential attributes of the underlying distribution. Second, they approximate smooth isometric embeddings of statistical manifolds into Wasserstein space, capturing intrinsic geometric relationships between distributions that support versatile generative modeling. The connections established between GDEs and core concepts in information geometry and empirical process theory are pivotal, opening extensions to multi-scale and hierarchical data structures.
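The Wasserstein-space geometry invoked above has a convenient concrete instance: between two Gaussians, the 2-Wasserstein distance admits a closed form (the Bures formula). The sketch below computes it as a worked example of the metric the embeddings are said to approximately preserve; it is a standard result, not code from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussian(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    ||m1 - m2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    r1 = sqrtm(S1)
    cross = sqrtm(r1 @ S2 @ r1)
    bures = np.trace(S1 + S2 - 2 * np.real(cross))
    return float(np.sum((m1 - m2) ** 2) + bures)

I = np.eye(2)
# Identical Gaussians are at distance zero; with equal covariances,
# the squared distance reduces to the squared mean shift.
assert abs(w2_squared_gaussian(np.zeros(2), I, np.zeros(2), I)) < 1e-8
assert abs(w2_squared_gaussian(np.zeros(2), I, np.ones(2), I) - 2.0) < 1e-8
```

An approximately isometric embedding would map distributions to vectors whose Euclidean distances track this quantity.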
Conclusion
Generative Distribution Embeddings offer a comprehensive, flexible, and scalable approach to distribution-level inference across diverse scientific domains. The framework is positioned to absorb future advances in generative modeling, with potential applications spanning a wide range of multiscale and multiplatform data analyses. These advances invite deeper exploration of the framework's theoretical properties and promise extended applications that emphasize statistical fidelity, precision, and interpretability in data-rich environments.