AnyText: Multilingual Visual Text Generation And Editing
Abstract: Diffusion-based text-to-image synthesis has recently made impressive progress. Although current methods can generate images with high overall fidelity, the text regions of those images still give the game away, often appearing illegible or inaccurate. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in images. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former takes inputs such as text glyphs, positions, and a masked image to produce latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which are blended with image-caption embeddings from the tokenizer so that the rendered text integrates seamlessly with the background. To further improve writing accuracy, we train with a text-control diffusion loss and a text perceptual loss. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. Notably, AnyText can be plugged into existing diffusion models from the community to render or edit text accurately. In extensive evaluation experiments, our method outperforms all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text-image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M, we propose AnyText-benchmark for evaluating the accuracy and quality of visual text generation. Our project is open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
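The text embedding module described above replaces per-line text tokens in the caption with OCR-derived features. The paper does not specify implementation details here, so the following is a minimal, hedged sketch in NumPy: `ocr_encode`, `W_proj`, and the placeholder-substitution scheme are all hypothetical stand-ins (a real system would use a frozen OCR recognizer such as PP-OCR and a learned projection), shown only to illustrate how stroke embeddings could be blended into a caption embedding sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 768   # caption token embedding width (CLIP-like); assumed value
OCR_DIM = 256     # hypothetical OCR stroke-feature width

def ocr_encode(glyph_image: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen OCR recognizer's feature extractor.
    Here we just global-average-pool the glyph image; a real system
    would run the rendered text line through an OCR backbone."""
    return glyph_image.mean(axis=(0, 1)) * np.ones(OCR_DIM)

# Hypothetical linear projection mapping OCR features into the
# caption-embedding space (would be learned during training).
W_proj = rng.standard_normal((OCR_DIM, EMBED_DIM)) * 0.02

def embed_caption_with_glyphs(caption_embeds, placeholder_positions, glyph_images):
    """Replace placeholder-token embeddings in the caption sequence
    with projected OCR features of each rendered text line."""
    out = caption_embeds.copy()
    for pos, glyph in zip(placeholder_positions, glyph_images):
        out[pos] = ocr_encode(glyph) @ W_proj
    return out

# Example: a 10-token caption where token 3 marks a rendered text line.
caption_embeds = rng.standard_normal((10, EMBED_DIM))
glyph = rng.random((32, 128, 1))  # rendered glyph image of one text line
fused = embed_caption_with_glyphs(caption_embeds, [3], [glyph])
```

The key design point this illustrates is that text content enters the conditioning path as *visual* stroke features rather than as ordinary subword tokens, which is what lets the model generalize across scripts and languages.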