
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation

Published 14 Aug 2023 in cs.CV, cs.AI, and cs.LG | arXiv:2308.10997v3

Abstract: Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at a significant computational cost: nearly all of these models are iterative, requiring multiple sampling passes with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt but also compatible with each other. In this work, we propose a lightweight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations, improving quality and significantly reducing the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be learned quickly through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses the proposed MRF to both speed up Muse by 1.5x and produce higher-quality images with fewer undesirable artifacts.
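The idea of refining per-position token predictions with pairwise compatibilities can be sketched with a mean-field-style update. This is a minimal illustration, not the paper's implementation: the function names, the 4-neighbour connectivity, and the single shared `(V, V)` compatibility matrix are all simplifying assumptions (the paper's MRF encodes richer spatial relationships and learns its parameters by back-propagating through the inference layer).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mrf_refine(unary_logits, compat, n_iters=2):
    """Toy mean-field refinement of token beliefs on an H x W token grid.

    unary_logits: (H, W, V) per-position logits from the base model
                  (e.g. a partially-sampled Muse-like model), over a
                  vocabulary of V latent image tokens.
    compat:       (V, V) pairwise compatibility between token values;
                  in practice this would be learned end-to-end.
    Returns refined beliefs of shape (H, W, V), normalized over V.
    """
    q = softmax(unary_logits)
    for _ in range(n_iters):
        # Aggregate neighbour beliefs over the 4-connected grid.
        msg = np.zeros_like(q)
        msg[1:, :] += q[:-1, :]   # from the row above
        msg[:-1, :] += q[1:, :]   # from the row below
        msg[:, 1:] += q[:, :-1]   # from the left column
        msg[:, :-1] += q[:, 1:]   # from the right column
        # Pass neighbour beliefs through the compatibility matrix and
        # combine with the unary evidence, as in a mean-field step.
        pairwise = msg @ compat   # (H, W, V)
        q = softmax(unary_logits + pairwise)
    return q
```

Because each update is a composition of slicing, matrix multiplication, and softmax, the same loop is differentiable when written in an autodiff framework, which is what allows the compatibility parameters to be trained with back-propagation.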

