Multi-Facet Clustering Variational Autoencoders

Published 9 Jun 2021 in stat.ML, cs.CV, cs.LG, and stat.ME | (2106.05241v2)

Abstract: Work in deep clustering focuses on finding a single partition of data. However, high-dimensional data, such as images, typically feature multiple interesting characteristics one could cluster over. For example, images of objects against a background could be clustered over the shape of the object and separately by the colour of the background. In this paper, we introduce Multi-Facet Clustering Variational Autoencoders (MFCVAE), a novel class of variational autoencoders with a hierarchy of latent variables, each with a Mixture-of-Gaussians prior, that learns multiple clusterings simultaneously, and is trained fully unsupervised and end-to-end. MFCVAE uses a progressively-trained ladder architecture which leads to highly stable performance. We provide novel theoretical results for optimising the ELBO analytically with respect to the categorical variational posterior distribution, correcting earlier influential theoretical work. On image benchmarks, we demonstrate that our approach separates out and clusters over different aspects of the data in a disentangled manner. We also show other advantages of our model: the compositionality of its latent space and that it provides controlled generation of samples.


Citations (37)


Summary

The paper introduces a deep clustering method that simultaneously learns multiple data facets using a hierarchical VAE with a Mixture-of-Gaussians prior.
It demonstrates competitive performance on datasets like MNIST, 3DShapes, and SVHN by effectively disentangling distinct data characteristics.
The model redefines the discrete posterior via Monte Carlo sampling to minimize bias and enhance training stability in the clustering process.

The paper presents Multi-Facet Clustering Variational Autoencoders (MFCVAE), an approach to deep clustering that learns multiple clusterings of the data simultaneously, rather than the single partition that earlier methods target. High-dimensional data such as images or speech have several abstract characteristics that a single partition cannot adequately capture. MFCVAE learns these facets through a hierarchical latent variable model in which each latent variable has a Mixture-of-Gaussians (MoG) prior.

This approach permits disentanglement: distinct abstract characteristics of the data are captured in separate latent facets. Each facet represents a different level or aspect of the data, and the prior is assumed to be independent across facets, so each facet can learn its own clustering (Figure 1).

Figure 1: Latent space of (a) a single-facet model and (b) a multi-facet model ($J=3$) with two dimensions ($z_1$, $z_2$) per facet. The multi-facet model disentangles data characteristics into three sensible partitions.

MFCVAE Model Architecture

The MFCVAE architecture builds upon Variational Autoencoders (VAEs) by integrating multiple facets into the clustering process. For each facet $j$, the model applies a Mixture-of-Gaussians prior: $c_j \sim \mathrm{Cat}(\boldsymbol{\pi}_j), \quad \mathbf{z}_j \mid c_j \sim \mathcal{N}(\boldsymbol{\mu}_{c_j}, \boldsymbol{\Sigma}_{c_j})$, where the discrete variable $c_j$ is the cluster assignment in facet $j$ and $\mathbf{z}_j$ is that facet's continuous latent variable. The generative process (depicted in Figure 2) relies on a ladder network architecture that is progressively trained. This architecture fosters stable convergence and disentangled representations by conditioning the decoder on progressively deeper layers, each capturing a different level of abstraction.
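The prior above can be illustrated with a short ancestral-sampling sketch. It is not the authors' implementation: it assumes diagonal covariances (stored as per-dimension standard deviations), omits the decoder entirely, and the names `sample_facet`, `sample_prior`, and the toy parameters in `facets` are hypothetical.

```python
import random

def sample_facet(pi, mu, sigma, rng):
    """Draw (c_j, z_j) for one facet.

    c_j ~ Cat(pi); z_j | c_j ~ N(mu[c_j], diag(sigma[c_j]^2)).
    Diagonal covariance is an assumption made for this sketch.
    """
    c = rng.choices(range(len(pi)), weights=pi, k=1)[0]
    z = [rng.gauss(m, s) for m, s in zip(mu[c], sigma[c])]
    return c, z

def sample_prior(facets, rng):
    """Sample every facet independently, reflecting prior independence across facets."""
    return [sample_facet(f["pi"], f["mu"], f["sigma"], rng) for f in facets]

# Toy parameters (hypothetical): two facets, two latent dimensions each.
facets = [
    {"pi": [0.5, 0.5],
     "mu": [[-2.0, 0.0], [2.0, 0.0]],
     "sigma": [[0.5, 0.5], [0.5, 0.5]]},
    {"pi": [0.2, 0.3, 0.5],
     "mu": [[0.0, -3.0], [0.0, 0.0], [0.0, 3.0]],
     "sigma": [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]},
]

rng = random.Random(0)
draws = sample_prior(facets, rng)
for j, (c, z) in enumerate(draws):
    print(f"facet {j}: cluster {c}, z = {z}")
```

In a full model, the sampled $\mathbf{z}_j$ from all facets would be passed to the decoder to generate an observation; fixing $c_j$ in one facet while resampling the others is what makes controlled generation possible.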

Figure 2: Graphical model of MFCVAE, showing the variational posterior $q_{\phi}(\vec{z}, c \mid x)$ and the generative model $p_{\theta}(x, \vec{z}, c)$.
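Written as a formula, the generative side of this graphical model would plausibly factorise as shown below. This is a sketch that assumes $J$ facets with priors independent across facets and a decoder $p_{\theta}(x \mid \vec{z})$ conditioned on all facets; the exact parameterisation is not given in this summary.

```latex
p_{\theta}(x, \vec{z}, c)
  = p_{\theta}(x \mid \vec{z}) \prod_{j=1}^{J} p(z_j \mid c_j)\, p(c_j),
\qquad
p(c_j) = \mathrm{Cat}(\boldsymbol{\pi}_j), \quad
p(z_j \mid c_j) = \mathcal{N}(\boldsymbol{\mu}_{c_j}, \boldsymbol{\Sigma}_{c_j})
```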

Theoretical Insights and VaDE Tricks

The paper presents new theoretical results on optimizing the Evidence Lower Bound (ELBO) for models with multiple clustering facets. The authors extend the VaDE trick, originally formulated for single-facet models, and optimize the ELBO analytically with respect to the categorical variational posterior, correcting the earlier theoretical result. Using this posterior for the discrete latent variables reduces bias during training.

Figure 3: Synthetic samples generated from MFCVAE with various latent facets for different datasets showcasing compositional generative capabilities.

The key step is to compute the posterior over the discrete variables from Monte Carlo samples of the continuous latent variables, which avoids the instability associated with the Gumbel-Softmax trick commonly used in hierarchical VAE architectures.
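A minimal sketch of such a Monte Carlo estimate for a single facet follows. It assumes a factorised posterior $q(z_j \mid x)\,q(c_j \mid x)$, for which the standard mean-field optimum is $q(c_j \mid x) \propto \exp\big(\mathbb{E}_{q(z_j \mid x)}[\log p(c_j \mid z_j)]\big)$; whether this matches the paper's theorem in every detail is an assumption here. The function names, the diagonal-covariance Gaussians, and the toy numbers are hypothetical.

```python
import math
import random

def log_gauss_diag(z, mu, sigma):
    """Log density of a diagonal Gaussian; sigma holds standard deviations."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mu, sigma)
    )

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def log_p_c_given_z(z, pi, mu, sigma):
    """log p(c | z) under the Mixture-of-Gaussians prior, via Bayes' rule."""
    joint = [math.log(p) + log_gauss_diag(z, m, s)
             for p, m, s in zip(pi, mu, sigma)]
    return log_softmax(joint)

def categorical_posterior(z_samples, pi, mu, sigma):
    """Estimate q(c | x) proportional to exp(mean over samples of log p(c | z))."""
    k = len(pi)
    avg = [0.0] * k
    for z in z_samples:
        lp = log_p_c_given_z(z, pi, mu, sigma)
        for c in range(k):
            avg[c] += lp[c] / len(z_samples)
    return [math.exp(v) for v in log_softmax(avg)]

# Toy prior with two clusters, and a hypothetical encoder output q(z | x).
pi = [0.5, 0.5]
mu = [[-2.0, 0.0], [2.0, 0.0]]
sigma = [[0.5, 0.5], [0.5, 0.5]]
enc_mean, enc_std = [1.8, 0.1], [0.2, 0.2]

rng = random.Random(0)
z_samples = [[rng.gauss(m, s) for m, s in zip(enc_mean, enc_std)]
             for _ in range(32)]
q_c = categorical_posterior(z_samples, pi, mu, sigma)
print(q_c)
```

Because $q(c_j \mid x)$ is computed in closed form from the samples instead of being sampled itself, no relaxation of the discrete variable is needed. Averaging log-probabilities and normalising afterwards differs from averaging the probabilities $p(c_j \mid z_j)$ directly, although the two often give similar values when the encoder is confident.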

Experimental Evaluation

MFCVAE is evaluated on MNIST, 3DShapes, and SVHN, where it performs competitively with both generative and non-generative clustering models. Its unsupervised clustering accuracy is comparable to that of models such as VaDE, and it disentangles facets well enough that each can be manipulated independently. On MNIST, for instance, the model separates digit class from stroke style, which supports targeted sample generation and classification.

Figure 4: Test accuracy over training epochs for models trained on MNIST illustrating robust performance with different architectural configurations.

Quantitative results reveal the model’s strong clustering performance, measured as unsupervised clustering accuracy, aligning closely with supervised class labels. The architecture’s robustness is supported by considerable stability across multiple experimental configurations, evidenced by narrow performance spread over multiple seed trials.
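Unsupervised clustering accuracy is commonly defined as the best accuracy over all one-to-one mappings between cluster indices and class labels. The sketch below follows that common definition; the paper's exact evaluation protocol is not stated in this summary, so treat this as an assumption. It brute-forces the mapping, which is only practical for a handful of clusters; larger problems are usually solved with the Hungarian algorithm instead. The function name and toy labels are hypothetical.

```python
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings from cluster ids to class labels.

    Labels and cluster ids are assumed to be non-negative integers.
    Brute force over permutations: fine for small label sets only.
    """
    n = max(max(y_true), max(y_pred)) + 1
    # counts[p][t] = number of points in cluster p whose true class is t
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        counts[p][t] += 1
    best = max(
        sum(counts[p][perm[p]] for p in range(n))
        for perm in permutations(range(n))
    )
    return best / len(y_true)

# Toy example: cluster ids are arbitrary, so accuracy must ignore their naming.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [2, 2, 0, 0, 1, 0]
acc = clustering_accuracy(y_true, y_pred)
print(f"clustering accuracy: {acc:.3f}")
```

For a multi-facet model, this metric would be computed per facet against whichever label set that facet is expected to capture, such as digit class on MNIST.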

Conclusion

MFCVAE advances deep clustering by modelling multiple clusterings of high-dimensional data within a single structured generative model. Its architecture and training scheme support end-to-end differentiable learning of separate data facets, addressing earlier problems with stability and facet disentanglement. Suggested directions for future research include automatic tuning of latent dimensions, applications beyond images, and further regularisation techniques for facet-specific representation learning.