Unity by Diversity: Improved Representation Learning in Multimodal VAEs

Published 8 Mar 2024 in cs.LG and cs.AI (arXiv:2403.05300v5)

Abstract: Variational Autoencoders for multimodal data hold promise for many tasks in data analysis, such as representation learning, conditional generation, and imputation. Current architectures either share the encoder output, decoder input, or both across modalities to learn a shared representation. Such architectures impose hard constraints on the model. In this work, we show that a better latent representation can be obtained by replacing these hard constraints with a soft constraint. We propose a new mixture-of-experts prior, softly guiding each modality's latent representation towards a shared aggregate posterior. This approach results in a superior latent representation and allows each encoding to preserve information better from its uncompressed original features. In extensive experiments on multiple benchmark datasets and two challenging real-world datasets, we show improved learned latent representations and imputation of missing data modalities compared to existing methods.


Summary

  • The paper introduces MM-VAMP VAE using a mixture-of-experts prior to flexibly aggregate modality information.
  • It demonstrates improved latent representations that boost conditional generation and data imputation on benchmark datasets.
  • The model balances shared global structure with modality-specific detail, extending to complex real-world data such as neuroscience recordings.

Introduction

The fusion of diverse modalities is pivotal for a nuanced understanding of complex phenomena. Multimodal Variational Autoencoders (VAEs) serve as a promising approach for synthesizing shared and modality-specific information. Traditional architectures often impose rigid constraints by sharing encoder outputs or decoder inputs, which can lead to sub-optimal latent representations.

Proposed Methodology

The paper introduces a novel approach using a mixture-of-experts prior for multimodal VAEs, named the Multimodal Variational Mixture-of-Experts (MM-VAMP) VAE. This method replaces hard constraints with soft constraints, offering a superior latent representation by guiding each modality’s latent space towards a shared aggregate posterior.

This approach hinges on a mixture-of-experts prior, which enables a more flexible representation by allowing each encoding to better preserve information from its uncompressed original features. The MM-VAMP VAE markedly improves performance in tasks such as conditional generation and data imputation.

Figure 1: Independent VAEs.
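To make the soft constraint concrete, here is a minimal NumPy sketch (an illustration, not the authors' implementation) of how each modality's KL term against a mixture-of-experts prior over the unimodal posteriors could be estimated. The KL divergence to a mixture has no closed form, so Monte Carlo sampling is used; the function names, diagonal-Gaussian posteriors, and uniform mixture weights are all assumptions made here for the sketch.

```python
import numpy as np

def gaussian_logpdf(z, mu, logvar):
    """Log density of a diagonal Gaussian N(mu, diag(exp(logvar))), summed over dims."""
    return -0.5 * np.sum(
        logvar + np.log(2 * np.pi) + (z - mu) ** 2 / np.exp(logvar), axis=-1
    )

def mm_vamp_kl(mus, logvars, n_samples=64, seed=0):
    """Monte Carlo estimate of KL(q_m || prior) for each modality m, where the
    prior is the uniform mixture of all M unimodal posteriors (the soft constraint).

    mus, logvars: (M, D) arrays of per-modality posterior parameters for one sample.
    Returns an (M,) array of KL estimates.
    """
    rng = np.random.default_rng(seed)
    M, D = mus.shape
    kls = []
    for m in range(M):
        # Draw samples from the m-th unimodal posterior q_m.
        z = mus[m] + np.exp(0.5 * logvars[m]) * rng.standard_normal((n_samples, D))
        log_q = gaussian_logpdf(z, mus[m], logvars[m])
        # Mixture-prior log density: log-sum-exp over all M components, minus log M.
        log_components = np.stack(
            [gaussian_logpdf(z, mus[k], logvars[k]) for k in range(M)]
        )
        shift = log_components.max(axis=0)
        log_prior = shift + np.log(np.exp(log_components - shift).sum(axis=0)) - np.log(M)
        kls.append(float(np.mean(log_q - log_prior)))
    return np.array(kls)
```

When all unimodal posteriors coincide, the mixture equals each q_m and every KL term vanishes; as the posteriors drift apart, the penalty grows. This is the soft pull toward a shared aggregate posterior, in contrast to the hard constraint of a single shared latent code.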

Experiments and Results

Benchmark Datasets

The paper utilizes various benchmark datasets including synthetic and complex real-world datasets for evaluation:

  1. PolyMNIST: Demonstrates that the MM-VAMP VAE provides superior latent representations and coherent conditional generation.

    Figure 2: PolyMNIST (translated, scale=75%): every column is a multimodal tuple X.

  2. Bimodal CelebA: Shows the efficacy of the MM-VAMP VAE on more complex multimodal data, with improved classification accuracy and generation coherence.

    Figure 3: Bimodal CelebA: three samples of image-text pairs.

Neuroscience Application

The approach is also applied to a challenging neuroscience problem involving hippocampal neural activities. By treating each subject as a unique modality, MM-VAMP VAE allows a detailed analysis of neural patterns shared across subjects while capturing individual differences—highlighting the model's potential to advance understanding in neuroscience applications.

Theoretical Insights and Implications

The MM-VAMP VAE’s objective admits a clear theoretical interpretation. Its regularization term has a contrastive flavor: it encourages similarity between the unimodal posterior approximations of the same multimodal sample, which amounts to minimizing a Jensen-Shannon divergence between them.

This method provides a crucial balance between capturing shared global structures and preserving modality-specific details, leading to enhanced generalization capabilities across unseen multimodal configurations.
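The link to the Jensen-Shannon divergence can be written out explicitly. With uniform mixture weights over $M$ modalities and unimodal posteriors $q_1,\dots,q_M$ (notation assumed here, not taken verbatim from the paper), the generalized JS divergence is exactly the average of the per-modality KL terms against the aggregate mixture:

```latex
\mathrm{JS}(q_1,\dots,q_M)
  \;=\; \frac{1}{M}\sum_{m=1}^{M}
        \mathrm{KL}\!\left( q_m \;\middle\|\; \frac{1}{M}\sum_{k=1}^{M} q_k \right)
```

So softly pulling each $q_m$ toward the shared aggregate posterior shrinks, up to the constant factor $1/M$, the JS divergence among the unimodal posteriors.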

Conclusion

The introduction of MM-VAMP VAE represents a significant step forward in the development of multimodal VAEs. By utilizing a mixture-of-experts prior, the model overcomes limitations associated with previous aggregation-based approaches, providing a robust framework for improved representation learning.

Future work may explore extending these ideas to more powerful generative models, potentially broadening the applicability and efficacy of multimodal learning paradigms in AI.

Figures

  • Figures 4–6: Latent representation classification.

These figures demonstrate the model's performance across different tasks and datasets, underscoring its versatility and robustness.
