Modeling Real-World Data Distributions for Machine Learning Theory

Determine the probability distributions underlying real-world data modalities, such as natural images and natural language, that are suitable for high-dimensional machine learning theory. In particular, develop a principled modeling framework (or identify universality classes and their statistical descriptors) that captures the aspects of data most relevant to learning and generalization.

Background

A central challenge highlighted in the paper is that many exact asymptotic analyses of learning rely on stylized data models (e.g., isotropic Gaussians, Gaussian mixtures), which may not capture the complexity of real datasets such as images or language. While Gaussian universality provides a path in some settings by reducing data modeling to matching low-order moments, it does not cover all scenarios and does not provide a general principle for selecting appropriate models.
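The moment-matching idea behind Gaussian universality can be illustrated numerically: for ridge regression, replacing non-Gaussian inputs with Gaussian ones that match the first two moments often leaves the test error nearly unchanged. The sketch below is illustrative only, and all specifics (Rademacher inputs, dimensions, teacher model, noise level, ridge penalty) are assumptions chosen for the demonstration, not taken from the paper.

```python
# Numerical sketch of Gaussian universality for ridge regression:
# compare test error on non-Gaussian (Rademacher) inputs against a
# Gaussian surrogate matching the first two moments (zero mean,
# identity covariance). All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 200, 400, 2000   # dimension, train size, test size
lam = 0.1                       # ridge penalty

# Hypothetical linear teacher generating noisy labels.
w_star = rng.standard_normal(d) / np.sqrt(d)

def labels(X):
    return X @ w_star + 0.1 * rng.standard_normal(len(X))

def ridge_test_error(sample):
    X, y = sample(n)
    X_test, y_test = sample(n_test)
    # Closed-form ridge estimator: (X^T X / n + lam I)^{-1} X^T y / n
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    return np.mean((X_test @ w - y_test) ** 2)

# Non-Gaussian data: i.i.d. Rademacher (+/-1) coordinates.
def rademacher(m):
    X = rng.choice([-1.0, 1.0], size=(m, d))
    return X, labels(X)

# Gaussian surrogate with the same first two moments.
def gaussian(m):
    X = rng.standard_normal((m, d))
    return X, labels(X)

err_rad = ridge_test_error(rademacher)
err_gau = ridge_test_error(gaussian)
print(f"Rademacher data: {err_rad:.3f}  Gaussian surrogate: {err_gau:.3f}")
```

In this linear setting the two test errors come out close, which is the universality phenomenon; the open problem described here is precisely that no general principle says when such a surrogate is valid for richer architectures or for structured data such as images and text.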

The authors argue for identifying the relevant statistical descriptors or universality classes that govern learning performance for given architectures and tasks, so that realistic surrogate distributions can be used in theory. They note that, at present, there is no principled way to determine which universality class applies to a given setting, underscoring the need for a foundational framework for real-data modeling in high-dimensional ML theory.

References

In contrast to NNs, which are by design mathematical constructs and can thus be formally modelled rather straightforwardly, it is to a large extent unclear how to model the distribution of real data. What, indeed, is the probability distribution of, e.g., images of cats and dogs? Of natural language? Satisfactorily answering these questions remains, to a large extent, an open problem in ML theory.

High-dimensional learning of narrow neural networks  (2409.13904 - Cui, 2024) in Subsubsection "Data structure", Section 6 (Perspectives)