GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Published 17 Oct 2023 in cs.CV and cs.LG (arXiv:2310.11513v1)

Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.


Summary

  • The paper introduces GenEval, a modular framework that uses object detection to evaluate text-to-image alignment at object-level granularity.
  • It demonstrates improved alignment with human judgments and benchmarks T2I model performance using tasks like object counting, positioning, and attribute binding.
  • The study highlights challenges in spatial relations and attribute misbindings, pointing to directions for future T2I model enhancements.

Introduction

The paper introduces GenEval, a framework for evaluating text-to-image (T2I) alignment with a focus on object-level properties. Traditional metrics such as FID and CLIPScore fall short of the fine-grained, instance-level analysis needed for compositional evaluation. GenEval instead leverages object detection models, using bounding boxes and segmentation masks to verify the elements specified in a text prompt: object presence, count, position, and attributes such as color. This modular approach permits a more interpretable and detailed analysis of T2I models (an illustrative task specification follows Figure 1).

Figure 1: Visualization of GenEval. Modern object detection models automatically verify text-to-image generations, using bounding boxes and segmentation masks to assess features like object presence, count, and color.
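Each benchmark prompt can be paired with a structured specification that the detector's output is checked against. The sketch below shows what such task specs might look like; the field names and tags ("tag", "include", "count", "color") are assumptions for illustration, not a confirmed schema from the GenEval repository.

```python
# Hypothetical task specifications for an object-focused evaluator.
# Each spec names the generation task type, the prompt, and the objects
# (with counts and attributes) that a detector must find in the image.
tasks = [
    {"tag": "single_object", "prompt": "a photo of a bench",
     "include": [{"class": "bench", "count": 1}]},
    {"tag": "counting", "prompt": "a photo of three cats",
     "include": [{"class": "cat", "count": 3}]},
    {"tag": "colors", "prompt": "a photo of a red car",
     "include": [{"class": "car", "count": 1, "color": "red"}]},
]
```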

Evaluation Framework

GenEval integrates object detection with other discriminative vision models to assess image properties. The benchmark decomposes evaluation into distinct task types, including single-object rendering, object co-occurrence, counting, position, and attribute binding. For each task, GenEval checks the detector's output against the prompt's specification, passing intermediate results (such as cropped bounding boxes) to additional models when needed, for example a color classifier for attribute checks. This layered evaluation yields a comprehensive assessment of T2I model capabilities.
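A minimal sketch of that verification step, assuming detections arrive as plain records of class, box, and score, and that `classify_color` is some downstream classifier (e.g., a CLIP-style zero-shot model run on the cropped box). None of these names come from the GenEval repository, and position checks are omitted for brevity.

```python
from collections import Counter

def verify(spec, detections, classify_color=None):
    """Check one task spec against object detector output.

    detections: list of dicts like {"class": "cat", "bbox": [...], "score": 0.9}
    classify_color: optional callable mapping a detection to a color name,
        e.g. a CLIP-style zero-shot classifier over the cropped box.
    """
    counts = Counter(d["class"] for d in detections)
    for req in spec["include"]:
        # Presence/count: the detector must find exactly the requested
        # number of instances of each named class.
        if counts[req["class"]] != req.get("count", 1):
            return False
        # Attribute binding: hand each matching box to a downstream
        # discriminative model (here, a color classifier).
        if "color" in req and classify_color is not None:
            matches = [d for d in detections if d["class"] == req["class"]]
            if any(classify_color(d) != req["color"] for d in matches):
                return False
    return True

# Toy usage with hand-written detections; no real detector is needed here.
spec = {"include": [{"class": "cat", "count": 2}]}
dets = [{"class": "cat", "bbox": [0, 0, 50, 50], "score": 0.9},
        {"class": "cat", "bbox": [60, 0, 110, 50], "score": 0.8}]
print(verify(spec, dets))  # True
```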

Human Evaluation Study

The framework's alignment with human judgment was verified through a study collecting fine-grained annotations of AI-generated images. GenEval achieved a high agreement rate with human evaluators, outperforming CLIPScore, particularly on complex tasks requiring spatial reasoning and attribute binding.

Figure 2: Human study agreement results. GenEval demonstrated higher agreement with human annotators on complex tasks compared to CLIPScore.
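Agreement here can be read as the fraction of images on which the automated pass/fail verdict matches the human annotation. A tiny sketch of that computation, using made-up labels rather than values from the paper:

```python
def agreement_rate(auto, human):
    # Fraction of images where the automated pass/fail verdict matches
    # the human annotation; both arguments are lists of booleans.
    assert len(auto) == len(human) and auto, "need equal-length, non-empty lists"
    return sum(a == h for a, h in zip(auto, human)) / len(auto)

# Illustrative values only, not results from the paper.
print(agreement_rate([True, False, True, True],
                     [True, False, False, True]))  # 0.75
```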

Model Benchmarking

GenEval was applied to assess the capabilities of several open-source T2I models. Notably, the IF model, which benefits from a larger text encoder and an improved diffusion design, outperformed earlier models such as Stable Diffusion. However, tasks involving spatial relations and attribute binding remain challenging, indicating areas that need further development (see the benchmarking sketch after Figure 3).

Figure 3: (Left) Change in model performance as the IF model scales. GenEval scores increased with model size, particularly for complex compositional tasks.
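A hedged sketch of how per-model, per-task scores could be aggregated when running such a benchmark. Here `generate` and `evaluate` are stand-ins for a T2I sampler and the detection-based checker sketched above; the function names and the samples-per-prompt setting are illustrative, not the GenEval repository's actual interface.

```python
from collections import defaultdict

def benchmark(models, task_specs, generate, evaluate, samples_per_prompt=4):
    """Aggregate pass rates per (model, task) pair.

    generate(model, prompt) -> image and evaluate(image, spec) -> bool
    are assumed callables; swap in a real sampler and the verify()
    sketch from the framework section.
    """
    raw = defaultdict(lambda: defaultdict(list))
    for model in models:
        for spec in task_specs:
            for _ in range(samples_per_prompt):
                image = generate(model, spec["prompt"])
                raw[model][spec["tag"]].append(evaluate(image, spec))
    # Mean pass rate per task, which is what a leaderboard would report.
    return {m: {t: sum(v) / len(v) for t, v in by_task.items()}
            for m, by_task in raw.items()}
```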

Limitations and Failure Modes

Despite its advantages, GenEval is bounded by the object detectors it relies on, particularly on artistic or stylized images and scenes with overlapping objects. The framework exposed several failure modes common to current T2I models, such as position biases and attribute misbinding. These insights are useful for informing future model development.

Figure 4: Failure modes of T2I models, illustrating biases and inaccuracies in spatial and attribute-binding tasks.

Conclusion

GenEval represents a significant step forward in evaluating T2I models, providing a robust framework that aligns well with human judgment and exposes concrete areas for improvement. By dissecting image generation into manageable component tasks, GenEval can help developers improve T2I performance on finer-grained and more complex capabilities in future iterations. Its public availability invites further exploration and innovation in the field.
