- The paper introduces MADE, a masked autoencoder that enforces autoregressive constraints to enable robust density estimation.
- It employs strategic masking in the autoencoder layers to ensure valid probabilistic modeling while sharing parameters for efficiency.
- Empirical evaluations on binary benchmark datasets such as DNA and Binarized MNIST show competitive or improved negative log-likelihoods compared to existing models.
MADE: Masked Autoencoder for Distribution Estimation
Introduction
The paper "MADE: Masked Autoencoder for Distribution Estimation" by Germain et al. introduces the Masked Autoencoder for Distribution Estimation (MADE) framework, which is a novel approach to density estimation in machine learning. MADE provides a robust and efficient method for autoregressive modelling using the principle of masking to enforce autoregressive constraints within autoencoders. The authors represent a collaboration between Université de Sherbrooke, Google DeepMind, and the University of Edinburgh.
Methodology
The key innovation in MADE lies in its ability to serve as an autoregressive model while retaining the computational advantages of autoencoders. The authors modify a standard autoencoder by applying binary masks that zero out specific connections in the network. The masks ensure that the model adheres to the autoregressive property: the prediction for each variable depends only on the preceding variables in a chosen ordering.
Three main contributions are highlighted:
- Masked Connections: Masks applied to the autoencoder's connections enforce the autoregressive constraints, so that the product of the model's conditional outputs defines a valid joint distribution.
- Parameter Efficiency: MADE shares one set of weights across all of the conditionals (and, with mask sampling, across orderings), requiring far fewer parameters than models such as the fully visible sigmoid belief network (FVSBN).
- Scalability: The model retains the computational efficiency of autoencoders: a single forward pass produces all of the conditionals, so it scales readily to high-dimensional data.
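The masking idea above can be made concrete. Below is a minimal NumPy sketch of a one-hidden-layer masked network (toy sizes, untrained random weights; the variable names and dimensions are illustrative, not from the paper): each unit is assigned a "degree", and a connection is kept only when it cannot leak information about a variable into its own prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 6, 16  # toy input and hidden sizes (illustrative, not from the paper)

# Each input gets a degree 1..D (its position in the chosen ordering);
# each hidden unit gets a random degree in 1..D-1.
m_in = np.arange(1, D + 1)
m_hid = rng.integers(1, D, size=H)

# Keep input->hidden connection j->k only if m_hid[k] >= m_in[j]:
M_W = (m_hid[:, None] >= m_in[None, :]).astype(float)  # shape (H, D)
# Keep hidden->output connection k->d only if m_in[d] > m_hid[k]:
M_V = (m_in[:, None] > m_hid[None, :]).astype(float)   # shape (D, H)

W = 0.1 * rng.standard_normal((H, D))
V = 0.1 * rng.standard_normal((D, H))

def made_forward(x):
    """One masked pass: returns p(x_d = 1 | x_<d) for all d at once."""
    h = np.tanh((W * M_W) @ x)
    return 1.0 / (1.0 + np.exp(-(V * M_V) @ h))

# Autoregressive check: no output may depend on the last input x_D,
# so flipping it must leave every conditional unchanged.
x = rng.integers(0, 2, size=D).astype(float)
x_flip = x.copy()
x_flip[-1] = 1.0 - x_flip[-1]
assert np.allclose(made_forward(x), made_forward(x_flip))
```

Note that the first conditional, p(x_1), conditions on nothing: with no bias terms, this toy network outputs exactly 0.5 for it regardless of the input.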
Results
The performance of MADE is evaluated on several benchmark datasets, including Adult, Connect4, DNA, Mushrooms, NIPS-0-12, OCR-letters, RCV1, Web, and Binarized MNIST. The central evaluation metric is the negative log-likelihood (NLL) on held-out test data; lower is better.
Notable results include:
- On the DNA dataset, MADE with mask sampling achieved an NLL of 79.66 compared to EoNADE's 82.31.
- For Binarized MNIST, deeper configurations of MADE with mask sampling performed best: MADE with two hidden layers and 32 masks obtained an NLL of 86.64, outperforming configurations with fewer masks and hidden layers.
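For binary data, the NLL figures above are sums of per-dimension Bernoulli cross-entropies under the model's conditionals, averaged over the test set. A minimal sketch of the metric itself (the vectors here are made up for illustration, not outputs of a trained MADE):

```python
import numpy as np

def bernoulli_nll(x, p, eps=1e-12):
    """NLL (in nats) of a binary vector x given per-dimension
    conditionals p[d] = p(x_d = 1 | x_<d) from the model."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log1p(-p))

# Toy example (made-up conditionals):
x = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
nll = bernoulli_nll(x, p)  # = -(ln 0.9 + ln 0.8 + ln 0.7) ≈ 0.685 nats
```

A benchmark figure such as 86.64 on Binarized MNIST is this quantity averaged over test images, with D = 784 pixels per image.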
Implications
The practical implications of MADE are significant for density estimation in unsupervised learning tasks. It provides a tool for probabilistic modeling that is both parameter-efficient and computationally scalable. This makes it applicable to a wide array of real-world datasets, particularly those with high dimensionality.
Theoretically, MADE's approach of utilizing masks to enforce autoregressive properties within autoencoders is an innovative contribution to the field of generative models. It bridges the gap between the computational efficiency of autoencoders and the structural validity of autoregressive models.
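One concrete consequence of this trade-off: evaluating log p(x) for a given x takes a single forward pass, while drawing a sample requires D sequential passes, one per dimension. A hedged sketch of that ancestral sampling loop, using a dummy stand-in for a trained network (the constant conditionals below are a placeholder to keep the sketch runnable, not a real model):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 6  # toy dimensionality

def made_forward(x):
    """Stand-in for a trained MADE: returns p(x_d = 1 | x_<d) for all d.
    A real model would compute these with its masked network; the
    constant here is purely a placeholder."""
    return np.full(D, 0.5)

# Ancestral sampling: fill in one dimension per forward pass.
x = np.zeros(D)
for d in range(D):
    p = made_forward(x)                 # all D conditionals at once
    x[d] = float(rng.random() < p[d])   # sample x_d from its conditional
# By contrast, log p(x) for a fixed x needs only one call to made_forward.
```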
Future Work
Potential future developments inspired by MADE could include:
- Extending the framework to handle other types of data distributions and exploring its applications in different domains such as natural language processing and image generation.
- Investigating the integration of MADE with other generative modeling techniques, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), to further improve performance and scalability.
Conclusion
MADE represents an advanced method for distribution estimation, distinguishing itself by combining the autoregressive model framework with the computational efficiency of autoencoders through innovative use of masking. The compelling numerical results and theoretical contributions suggest that MADE is a valuable tool for the machine learning community, with promising avenues for future research and application.