MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
Abstract: Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt, but also compatible with each other. In this work, we propose a lightweight approach to achieving this compatibility between different regions of an image, using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper, and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model, MarkovGen, uses this proposed MRF model to both speed up Muse by 1.5× and produce higher-quality images by decreasing undesirable image artifacts.
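The abstract describes running MRF inference over discrete image-token labels as a differentiable layer, so that a learned compatibility model refines the per-location token predictions. The sketch below illustrates the general idea with mean-field inference on a 4-connected grid; the grid connectivity, the single `compat` label-compatibility matrix, the iteration count, and all function names are assumptions for illustration, not the paper's actual potentials or implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mean_field_mrf(unary_logits, compat, n_iters=3):
    """Illustrative mean-field inference over a grid of token labels.

    unary_logits: (H, W, K) per-location logits over K codebook tokens,
                  playing the role of unary potentials.
    compat:       (K, K) label-compatibility matrix (pairwise potentials);
                  in a differentiable setting this would be learned by
                  back-propagation through the loop below.
    Returns a (H, W, K) array of refined per-location token distributions.
    """
    q = softmax(unary_logits)
    for _ in range(n_iters):
        # Aggregate neighboring beliefs on a 4-connected grid (a stand-in
        # for whatever spatial connectivity the full model encodes).
        msg = np.zeros_like(q)
        msg[1:, :] += q[:-1, :]
        msg[:-1, :] += q[1:, :]
        msg[:, 1:] += q[:, :-1]
        msg[:, :-1] += q[:, 1:]
        # Mix neighbor beliefs through the compatibility matrix and
        # combine with the unaries, then renormalize.
        pairwise = msg @ compat
        q = softmax(unary_logits - pairwise)
    return q
```

Because every step is a dense tensor operation (a few shifted additions and a matrix product per iteration), the loop is cheap relative to a transformer sampling step and differentiable end to end, which is what makes training the MRF parameters by back-propagation straightforward.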