GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Published 17 Oct 2023 in cs.CV and cs.LG (arXiv:2310.11513v1)

Abstract: Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.


Summary

  • The paper introduces GenEval, a modular framework that uses object detection to evaluate text-to-image alignment at object-level granularity.
  • It demonstrates improved alignment with human judgments and benchmarks T2I model performance using tasks like object counting, positioning, and attribute binding.
  • The study highlights challenges in spatial relations and attribute misbindings, pointing to directions for future T2I model enhancements.

Introduction

The paper introduces GenEval, a framework for evaluating text-to-image (T2I) alignment with a focus on object-level properties. Traditional metrics such as FID and CLIPScore fall short of the fine-grained, instance-level analysis needed for compositional evaluation. GenEval instead leverages object detection models, using bounding boxes and segmentation masks to verify the elements specified in a text prompt: object presence, count, position, and attributes such as color. This modular approach permits a more interpretable and detailed analysis of T2I models (an illustrative task specification follows Figure 1).

Figure 1: Visualization of GenEval. Modern object detection models automatically verify text-to-image generations, using bounding boxes and segmentation masks to assess features like object presence, count, and color.
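Each benchmark prompt can be paired with a structured specification that the detector's output is checked against. The sketch below shows what such task specs might look like; the field names and tags ("tag", "include", "count", "color") are assumptions for illustration, not a confirmed schema from the GenEval repository.

```python
# Hypothetical task specifications for an object-focused evaluator.
# Each spec names the generation task type, the prompt, and the objects
# (with counts and attributes) that a detector must find in the image.
tasks = [
    {"tag": "single_object", "prompt": "a photo of a bench",
     "include": [{"class": "bench", "count": 1}]},
    {"tag": "counting", "prompt": "a photo of three cats",
     "include": [{"class": "cat", "count": 3}]},
    {"tag": "colors", "prompt": "a photo of a red car",
     "include": [{"class": "car", "count": 1, "color": "red"}]},
]
```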

Evaluation Framework

GenEval integrates object detection with other discriminative vision models to assess image properties. The benchmark decomposes evaluation into distinct task types, including single-object rendering, object co-occurrence, counting, position, and attribute binding. For each task, GenEval checks the detector's output against the prompt's specification, passing intermediate results (such as cropped bounding boxes) to additional models when needed, for example a color classifier for attribute checks. This layered evaluation yields a comprehensive assessment of T2I model capabilities.
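A minimal sketch of that verification step, assuming detections arrive as plain records of class, box, and score, and that `classify_color` is some downstream classifier (e.g., a CLIP-style zero-shot model run on the cropped box). None of these names come from the GenEval repository, and position checks are omitted for brevity.

```python
from collections import Counter

def verify(spec, detections, classify_color=None):
    """Check one task spec against object detector output.

    detections: list of dicts like {"class": "cat", "bbox": [...], "score": 0.9}
    classify_color: optional callable mapping a detection to a color name,
        e.g. a CLIP-style zero-shot classifier over the cropped box.
    """
    counts = Counter(d["class"] for d in detections)
    for req in spec["include"]:
        # Presence/count: the detector must find exactly the requested
        # number of instances of each named class.
        if counts[req["class"]] != req.get("count", 1):
            return False
        # Attribute binding: hand each matching box to a downstream
        # discriminative model (here, a color classifier).
        if "color" in req and classify_color is not None:
            matches = [d for d in detections if d["class"] == req["class"]]
            if any(classify_color(d) != req["color"] for d in matches):
                return False
    return True

# Toy usage with hand-written detections; no real detector is needed here.
spec = {"include": [{"class": "cat", "count": 2}]}
dets = [{"class": "cat", "bbox": [0, 0, 50, 50], "score": 0.9},
        {"class": "cat", "bbox": [60, 0, 110, 50], "score": 0.8}]
print(verify(spec, dets))  # True
```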

Human Evaluation Study

The framework's alignment with human judgment was verified through a study collecting fine-grained annotations of AI-generated images. GenEval achieved a high agreement rate with human evaluators, outperforming CLIPScore, particularly on complex tasks requiring spatial reasoning and attribute binding.

Figure 2: Human study agreement results. GenEval demonstrated higher agreement with human annotators on complex tasks compared to CLIPScore.
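Agreement here can be read as the fraction of images on which the automated pass/fail verdict matches the human annotation. A tiny sketch of that computation, using made-up labels rather than values from the paper:

```python
def agreement_rate(auto, human):
    # Fraction of images where the automated pass/fail verdict matches
    # the human annotation; both arguments are lists of booleans.
    assert len(auto) == len(human) and auto, "need equal-length, non-empty lists"
    return sum(a == h for a, h in zip(auto, human)) / len(auto)

# Illustrative values only, not results from the paper.
print(agreement_rate([True, False, True, True],
                     [True, False, False, True]))  # 0.75
```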

Model Benchmarking

GenEval was applied to assess the capabilities of several open-source T2I models. Notably, the IF model, which benefits from a larger text encoder and an improved diffusion design, outperformed earlier models such as Stable Diffusion. However, tasks involving spatial relations and attribute binding remain challenging, indicating areas that need further development (see the benchmarking sketch after Figure 3).

Figure 3: (Left) Change in model performance as the IF model scales. GenEval scores increased with model size, particularly for complex compositional tasks.
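A hedged sketch of how per-model, per-task scores could be aggregated when running such a benchmark. Here `generate` and `evaluate` are stand-ins for a T2I sampler and the detection-based checker sketched above; the function names and the samples-per-prompt setting are illustrative, not the GenEval repository's actual interface.

```python
from collections import defaultdict

def benchmark(models, task_specs, generate, evaluate, samples_per_prompt=4):
    """Aggregate pass rates per (model, task) pair.

    generate(model, prompt) -> image and evaluate(image, spec) -> bool
    are assumed callables; swap in a real sampler and the verify()
    sketch from the framework section.
    """
    raw = defaultdict(lambda: defaultdict(list))
    for model in models:
        for spec in task_specs:
            for _ in range(samples_per_prompt):
                image = generate(model, spec["prompt"])
                raw[model][spec["tag"]].append(evaluate(image, spec))
    # Mean pass rate per task, which is what a leaderboard would report.
    return {m: {t: sum(v) / len(v) for t, v in by_task.items()}
            for m, by_task in raw.items()}
```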

Limitations and Failure Modes

Despite its advantages, GenEval is bounded by the object detectors it relies on, particularly on artistic or stylized images and scenes with overlapping objects. The framework exposed several failure modes common to current T2I models, such as position biases and attribute misbinding. These insights are useful for informing future model development.

Figure 4: Failure modes of T2I models, illustrating biases and inaccuracies in spatial and attribute-binding tasks.

Conclusion

GenEval represents a significant step forward in evaluating T2I models, providing a robust framework that aligns well with human judgment and exposes concrete areas for improvement. By dissecting image generation into manageable component tasks, GenEval can help developers improve T2I performance on finer-grained and more complex capabilities in future iterations. Its public availability invites further exploration and innovation in the field.
