TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
Abstract: Diffusion models have proven to be powerful generative models in recent years, yet generating visual text remains a challenge for them. Several methods alleviate this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained layout-prediction capability, and restricted style diversity. In this paper, we present TextDiffuser-2, which aims to unleash the power of LLMs for text rendering. First, we fine-tune an LLM for layout planning. The LLM can automatically generate keywords for text rendering and also supports layout modification through chat. Second, we use the LLM within the diffusion model to encode the position and text at the line level. Unlike previous methods that employ tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and user studies involving both human participants and GPT-4V, which validate TextDiffuser-2's capacity to produce more rational text layouts and more diverse generations. The code and model will be available at https://aka.ms/textdiffuser-2.
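As a rough illustration of what line-level guidance could look like, the sketch below serializes each text line as four quantized position tokens followed by its characters. This is only a sketch under stated assumptions, not the paper's actual encoding: the `TextLine` class, the token spellings (`<l16>`, `<eol>`), the 512-pixel canvas, and the 128 position bins are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextLine:
    """One line of text to render, with a hypothetical pixel bounding box."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def quantize(value: int, image_size: int = 512, bins: int = 128) -> int:
    """Map a pixel coordinate to one of `bins` discrete position indices."""
    if not 0 <= value <= image_size:
        raise ValueError(f"coordinate {value} outside [0, {image_size}]")
    return min(value * bins // image_size, bins - 1)


def encode_line(line: TextLine) -> list[str]:
    """Encode one line as four position tokens followed by its characters.

    The box is given once per line, not once per character, which is the
    (assumed) sense of "line-level" guidance illustrated here.
    """
    position = [
        f"<l{quantize(line.left)}>",
        f"<t{quantize(line.top)}>",
        f"<r{quantize(line.right)}>",
        f"<b{quantize(line.bottom)}>",
    ]
    return position + list(line.text)


def encode_layout(lines: list[TextLine]) -> list[str]:
    """Concatenate all encoded lines, closing each with a hypothetical <eol>."""
    tokens: list[str] = []
    for line in lines:
        tokens.extend(encode_line(line))
        tokens.append("<eol>")
    return tokens
```

For instance, `encode_layout([TextLine("Hi", 64, 128, 448, 192)])` yields four position tokens, the characters `H` and `i`, and a closing `<eol>`. Because only one box is attached to each line, the placement and style of individual characters inside that box are left unconstrained, in contrast to the tight character-level guidance the abstract describes.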