Modeling Real-World Data Distributions for Machine Learning Theory
Determine the probability distributions underlying real-world data modalities such as natural images and natural language that are suitable for high-dimensional machine learning theory, by developing a principled modeling framework (or identifying universality classes and their statistical descriptors) that captures the aspects of data most relevant to learning and generalization.
References
In contrast to NNs, which are by design mathematical constructs and can thus be formally modelled rather straightforwardly, it is to a large extent unclear how to model the distribution of real data. What indeed is the probability distribution of e.g. images of cats and dogs? Of natural language? Satisfyingly answering these interrogations is to a large extent an open question in ML theory.