A Survey on Quality Metrics for Text-to-Image Generation

Published 18 Mar 2024 in cs.CV, cs.AI, and cs.GR (arXiv:2403.11821v5)

Abstract: AI-based text-to-image models not only excel at generating realistic images; they also give designers increasingly fine-grained control over image content. Consequently, these approaches have attracted growing attention within the computer graphics research community, which has historically been devoted to traditional rendering techniques that offer precise control over scene parameters (e.g., objects, materials, and lighting). While the quality of conventionally rendered images is assessed through well-established image quality metrics, such as SSIM or PSNR, the unique challenges of text-to-image generation require other, dedicated quality metrics. These metrics must measure not only overall image quality but also how well images reflect given text prompts, where the control of scene and rendering parameters is interwoven. In this survey, we provide a comprehensive overview of such text-to-image quality metrics and propose a taxonomy to categorize them. Our taxonomy is grounded in the assumption that two main quality criteria, namely compositional quality and general quality, contribute to the overall image quality. Besides the metrics, this survey covers dedicated text-to-image benchmark datasets, over which the metrics are frequently computed. Finally, we identify limitations and open challenges in the field of text-to-image generation, and derive guidelines for practitioners conducting text-to-image evaluation.
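To make concrete what the abstract means by a conventional full-reference metric, here is a minimal NumPy sketch of PSNR (peak signal-to-noise ratio); the images are synthetic stand-ins, and this is an illustrative implementation rather than the survey's own code.

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB: higher means the test image
    is closer to the reference; identical images yield infinity."""
    err = reference.astype(np.float64) - test.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Synthetic example: a random 8-bit "reference" and a noisy copy of it.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noise = rng.integers(-10, 11, size=ref.shape)
noisy = np.clip(ref.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(f"PSNR(ref, noisy) = {psnr(ref, noisy):.1f} dB")
```

Note that PSNR (like SSIM) requires a pixel-aligned reference image, which is exactly why such metrics do not transfer to text-to-image generation, where no ground-truth image exists for a given prompt.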

  133. H. Agrawal, K. Desai, Y. Wang, X. Chen, R. Jain, M. Johnson, D. Batra, D. Parikh, S. Lee, and P. Anderson, “nocaps: novel object captioning at scale,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  134. A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
  135. J. Mao, J. Xu, Y. Jing, and A. Yuille, “Training and evaluating multimodal word embeddings with large-scale web annotated images,” in Proceedings of the 30th International Conference on Neural Information Processing Systems, ser. NIPS’16.   Red Hook, NY, USA: Curran Associates Inc., 2016, p. 442–450.
  136. J. Kiros, W. Chan, and G. Hinton, “Illustrative language understanding: Large-scale visual grounding with image search,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 922–933.
  137. P. Jenkins, A. Farag, S. Wang, and Z. Li, “Unsupervised representation learning of spatial data via multimodal embedding,” in Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 1993–2002.
  138. K. Desai and J. Johnson, “Virtex: Learning visual representations from textual annotations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 11 162–11 173.
  139. P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
  140. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum, “Flumejava: easy, efficient data-parallel pipelines,” in Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’10.   New York, NY, USA: Association for Computing Machinery, 2010, p. 363–375. [Online]. Available: https://doi.org/10.1145/1806596.1806638
  141. S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021, pp. 3558–3568.
  142. S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
  143. R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, “From recognition to cognition: Visual commonsense reasoning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  144. W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert: Pre-training of generic visual-linguistic representations,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=SygXPaEYvH
  145. J. Cho, J. Lei, H. Tan, and M. Bansal, “Unifying vision-and-language tasks via text generation,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139.   PMLR, 18–24 Jul 2021, pp. 1931–1942. [Online]. Available: https://proceedings.mlr.press/v139/cho21a.html
  146. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International Journal of Computer Vision, vol. 123, pp. 32 – 73, 2016. [Online]. Available: https://api.semanticscholar.org/CorpusID:4492210
  147. G. A. Miller, “Wordnet: a lexical database for english,” Commun. ACM, vol. 38, no. 11, p. 39–41, nov 1995. [Online]. Available: https://doi.org/10.1145/219717.219748
  148. W. Feng, X. He, T.-J. Fu, V. Jampani, A. R. Akula, P. Narayana, S. Basu, X. E. Wang, and W. Y. Wang, “Training-free structured diffusion guidance for compositional text-to-image synthesis,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=PUIqjT4rzq7
  149. N. Xie, F. Lai, D. Doran, and A. Kadav, “Visual entailment task for visually-grounded language learning,” 2019.
  150. C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 479–36 494, 2022.
  151. A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele, “Movie description,” 2016.
  152. Z. Li, M. R. Min, K. Li, and C. Xu, “Stylet2i: Toward compositional and high-fidelity text-to-image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 18 197–18 207.
  153. T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  154. H. Dong, W. Xiong, D. Goyal, Y. Zhang, W. Chow, R. Pan, S. Diao, J. Zhang, K. SHUM, and T. Zhang, “RAFT: Reward ranked finetuning for generative foundation model alignment,” Transactions on Machine Learning Research, 2023. [Online]. Available: https://openreview.net/forum?id=m7p5O7zblY
  155. K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,” 2023.
  156. M. Prabhudesai, A. Goyal, D. Pathak, and K. Fragkiadaki, “Aligning text-to-image diffusion models with reward backpropagation,” 2023.
  157. Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee, “Reinforcement learning for fine-tuning text-to-image diffusion models,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=8OTPepXzeh
  158. K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=YCWjhGrJFD
  159. N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum, “Compositional visual generation with composable diffusion models,” in European Conference on Computer Vision.   Springer, 2022, pp. 423–439.
  160. H. Chefer, Y. Alaluf, Y. Vinker, L. Wolf, and D. Cohen-Or, “Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–10, 2023.
  161. Y. Li, M. Keuper, D. Zhang, and A. Khoreva, “Divide & bind your attention for improved generative semantic nursing,” arXiv preprint arXiv:2307.10864, 2023.
  162. M. Menéndez, J. Pardo, L. Pardo, and M. Pardo, “The jensen-shannon divergence,” Journal of the Franklin Institute, vol. 334, no. 2, pp. 307–318, 1997. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0016003296000634
  163. M. Chen, I. Laina, and A. Vedaldi, “Training-free layout control with cross-attention guidance,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2024, pp. 5343–5353.
  164. Z. Ma, J. Hong, M. O. Gul, M. Gandhi, I. Gao, and R. Krishna, “Crepe: Can vision-language foundation models reason compositionally?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 10 910–10 921.
  165. C.-Y. Hsieh, J. Zhang, Z. Ma, A. Kembhavi, and R. Krishna, “Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality,” in Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. [Online]. Available: https://openreview.net/forum?id=Jsc7WSCZd4
  166. A. Ray, F. Radenovic, A. Dubey, B. Plummer, R. Krishna, and K. Saenko, “cola: A benchmark for compositional text-to-image retrieval,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  167. N. Dehouche and K. Dehouche, “What’s in a text-to-image prompt? the potential of stable diffusion in visual arts education,” Heliyon, vol. 9, no. 6, 2023.
  168. L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.

Summary

  • The paper introduces a comprehensive taxonomy categorizing T2I quality metrics into image-based and text-conditioned approaches for robust evaluation.
  • It details practical implementations of metrics like FID and CLIPScore, emphasizing their roles in assessing image diversity and text-image alignment.
  • The study highlights challenges such as dataset bias and limited compositional reasoning, urging the development of more nuanced evaluation methods.

A Survey on Quality Metrics for Text-to-Image Generation

Recent advancements in text-to-image (T2I) generation have significantly increased interest in the evaluation of T2I models. These models combine language understanding with image generation capabilities, heightening the need for robust evaluation strategies that align with human judgment. The paper "A Survey on Quality Metrics for Text-to-Image Generation" introduces a comprehensive categorization and analysis of existing quality metrics tailored to this domain.

Introduction

T2I generation involves transforming a textual description into a corresponding image using dual-modality foundation models. As the field has evolved, several evaluation metrics have been developed to assess generated images in terms of their semantic and aesthetic alignment with the input text. The paper's core aim is to provide an extensive survey of these metrics, propose a new taxonomy, and offer guidelines for practitioners in model evaluation and selection.

Taxonomy of Quality Metrics

The proposed taxonomy classifies T2I quality metrics into two primary categories: pure image-based metrics and text-conditioned image metrics. The distinction is made based on whether the assessment relies solely on the visual content or includes alignment with textual content.

Image-Based Quality Metrics

  1. Distribution-based Metrics:
    • Inception Score (IS) and Fréchet Inception Distance (FID) are prominent examples that assess the distribution of features in generated images relative to a dataset of real images.
    • These metrics are critical for evaluating the diversity and general quality of generated images.
  2. Single Image Metrics:
    • These measure the aesthetic and perceptual quality of individual images using features like realism, artifact detection, and human preference alignment.
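The Fréchet Inception Distance mentioned above fits a Gaussian to each feature distribution and computes the closed-form Fréchet distance between the two fits. In practice the feature vectors come from an Inception-v3 pooling layer; the sketch below only assumes two arrays of feature vectors and is therefore a numerical illustration, not a full FID implementation. The matrix square-root term is evaluated via the eigenvalues of the covariance product, which are real and non-negative for positive semi-definite covariances:

```python
import numpy as np

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two feature sets.

    FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c_r = np.cov(feats_real, rowvar=False)
    c_g = np.cov(feats_gen, rowvar=False)
    # Tr((C_r C_g)^{1/2}) equals the sum of square roots of the
    # eigenvalues of C_r @ C_g; clip tiny negative values from
    # floating-point noise before taking the root.
    eigvals = np.linalg.eigvals(c_r @ c_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(c_r) + np.trace(c_g) - 2.0 * trace_sqrt)
```

Identical feature sets yield a distance near zero, while a mean shift between the two sets increases the score, which is why FID is read as "lower is better".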

Text-Conditioned Image Quality Metrics

  1. Embedding-Based Metrics:
    • Metrics such as CLIPScore and BLIP- or BLIP2-based scores compute the cosine similarity between text and image embeddings, leveraging large pre-trained vision-language models to measure alignment.
  2. Content-Based Metrics:
    • These metrics involve direct comparison of visual and textual content, often using object detection or visual question answering (VQA) models to evaluate specific elements like spatial and non-spatial relations or attribute binding.
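For the embedding-based metrics above, the core computation is a rescaled cosine similarity between an image embedding and a caption embedding. The sketch below assumes both embeddings have already been produced by a CLIP-style encoder (here they are plain NumPy vectors); the clamping to zero and the rescaling factor w = 2.5 follow the original CLIPScore formulation:

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore = w * max(cos(E_I, E_T), 0) on unit-normalized embeddings."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return w * max(float(i @ t), 0.0)
```

A perfectly aligned pair scores w (2.5 by default), and anti-correlated embeddings are clamped to 0 rather than going negative.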

Implementation of Metrics

Several examples are provided within the paper, illustrating practical implementations and theoretical underpinnings of these metrics. For instance, CLIPScore evaluates alignment by computing the cosine similarity of text and image embeddings derived from CLIP, which is trained on a diverse set of text-image pairs.

Figure 1: Box plot visualization of the value ranges for each of the normalized image quality scores.

Challenges with Current Metrics

The paper identifies challenges such as the bias introduced by specific training datasets, which may not cover uncommon scenarios; the sensitivity of metric scores to the underlying model architecture; and the lack of fine-grained evaluation for detailed compositional reasoning.

Future Directions

The paper advocates for the development of more nuanced metrics capable of capturing compositional reasoning and human-like understanding. It emphasizes the need for datasets aligning closely with real-world complexity to facilitate more robust assessments of generative models.

Conclusion

The survey offers a foundational framework for future research into T2I quality assessment. By establishing a clear taxonomy and detailing the strengths and weaknesses of current metrics, it sets the stage for advancements that keep pace with the growing ubiquity and complexity of generative AI applications in domains such as virtual reality and gaming, where T2I generation is becoming increasingly prevalent.
