A Survey on Quality Metrics for Text-to-Image Generation

Published 18 Mar 2024 in cs.CV, cs.AI, and cs.GR | (2403.11821v5)

Abstract: AI-based text-to-image models not only excel at generating realistic images, they also give designers increasingly fine-grained control over the image content. Consequently, these approaches have gathered increased attention within the computer graphics research community, which has historically been devoted to traditional rendering techniques that offer precise control over scene parameters (e.g., objects, materials, and lighting). While the quality of conventionally rendered images is assessed through well-established image quality metrics, such as SSIM or PSNR, the unique challenges of text-to-image generation require other, dedicated quality metrics. These metrics must be able to measure not only overall image quality, but also how well images reflect given text prompts, whereby the control of scene and rendering parameters is interwoven. Within this survey, we provide a comprehensive overview of such text-to-image quality metrics and propose a taxonomy to categorize them. Our taxonomy is grounded in the assumption that there are two main quality criteria, namely compositional quality and general quality, that contribute to the overall image quality. Besides the metrics, this survey covers dedicated text-to-image benchmark datasets, over which the metrics are frequently computed. Finally, we identify limitations and open challenges in the field of text-to-image generation, and derive guidelines for practitioners conducting text-to-image evaluation.

Summary

- The paper introduces a comprehensive taxonomy categorizing T2I quality metrics into image-based and text-conditioned approaches for robust evaluation.
- It details practical implementations of metrics like FID and CLIPScore, emphasizing their roles in assessing image diversity and text-image alignment.
- The study highlights challenges such as dataset bias and limited compositional reasoning, urging the development of more nuanced evaluation methods.

Recent advances in text-to-image (T2I) generation have significantly increased interest in how T2I models are evaluated. These models combine language understanding with image generation, which creates a need for robust evaluation strategies that align with human judgment. The paper "A Survey on Quality Metrics for Text-to-Image Generation" introduces a comprehensive categorization and analysis of existing quality metrics tailored to this domain.

Introduction

T2I generation transforms a textual description into a corresponding image using dual-modality foundation models. As the field has evolved, several evaluation metrics have been developed to assess generated images in terms of their semantic alignment with the input text and their aesthetic quality. The paper's core aim is to provide an extensive survey of these metrics, propose a new taxonomy, and offer guidelines for practitioners in model evaluation and selection.

Taxonomy of Quality Metrics

The proposed taxonomy classifies T2I quality metrics into two primary categories: pure image-based metrics and text-conditioned image metrics. The distinction is made based on whether the assessment relies solely on the visual content or includes alignment with textual content.

Image-Based Quality Metrics

Distribution-based Metrics:
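- Inception Score (IS) and Fréchet Inception Distance (FID) are prominent examples. IS is computed from the generated images alone, using the class predictions of a pretrained Inception network, whereas FID compares the distribution of Inception features in generated images against that of a dataset of real images.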
- These metrics are critical for evaluating the diversity and general quality of generated images.
Single Image Metrics:
- These measure the aesthetic and perceptual quality of individual images using features like realism, artifact detection, and human preference alignment.
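
For reference, the two distribution-based metrics named above are commonly defined as follows. These are the standard formulations from the original IS and FID papers, not formulas reproduced from this summary, and the symbols are notational assumptions: p(y | x) is the Inception class posterior for a generated image x, p_g is the distribution of generated images, and (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of Inception features for real and generated images.

```latex
\mathrm{IS} = \exp\!\Big( \mathbb{E}_{x \sim p_g}
    \big[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big),
\qquad p(y) = \mathbb{E}_{x \sim p_g}\big[ p(y \mid x) \big]

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
    + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
```

A higher IS and a lower FID are read as better. Only FID depends on a reference set of real images.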

Text-Conditioned Image Quality Metrics

Embedding-Based Metrics:
- Metrics such as CLIPScore, along with similar scores built on BLIP and BLIP2, calculate the cosine similarity between text and image embeddings, leveraging large pre-trained vision-language models to measure alignment.
Content-Based Metrics:
- These metrics involve direct comparison of visual and textual content, often using object detection or visual question answering (VQA) models to evaluate specific elements like spatial and non-spatial relations or attribute binding.
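
The content-based idea can be sketched as a question-answering accuracy: a set of questions with expected answers is derived from the prompt, a VQA model answers them for the generated image, and the score is the fraction answered as expected. This is a minimal illustration, not the procedure of any specific metric in the survey. The names `vqa_alignment_score` and `toy_vqa`, and the dictionary standing in for an image, are hypothetical; a real setup would pass an actual image to an actual VQA model.

```python
from typing import Any, Callable, Sequence, Tuple


def vqa_alignment_score(
    image: Any,
    qa_pairs: Sequence[Tuple[str, str]],
    answer_fn: Callable[[Any, str], str],
) -> float:
    """Fraction of questions whose VQA answer matches the expected answer.

    `answer_fn` is a stand-in for a VQA model (assumption): it takes an
    image and a question and returns a short answer string.
    """
    if not qa_pairs:
        raise ValueError("qa_pairs must contain at least one question")
    correct = 0
    for question, expected in qa_pairs:
        predicted = answer_fn(image, question)
        # Compare case-insensitively and ignore surrounding whitespace.
        if predicted.strip().lower() == expected.strip().lower():
            correct += 1
    return correct / len(qa_pairs)


def toy_vqa(image: dict, question: str) -> str:
    """Hypothetical stub: looks the answer up in a dict that plays the image."""
    return image.get(question, "unknown")


# Prompt (illustrative): "a dog to the right of a red ball"
fake_image = {
    "What animal is shown?": "dog",
    "What color is the ball?": "blue",  # the attribute binding failed
    "Is the dog left of the ball?": "no",
}
questions = [
    ("What animal is shown?", "dog"),
    ("What color is the ball?", "red"),
    ("Is the dog left of the ball?", "no"),
]
score = vqa_alignment_score(fake_image, questions, toy_vqa)
```

Because each question targets one object, attribute, or relation, a per-question breakdown shows which part of the prompt the image failed to satisfy, which a single embedding similarity cannot do.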

Implementation of Metrics

Several examples are provided within the paper, illustrating practical implementations and theoretical underpinnings of these metrics. For instance, CLIPScore evaluates alignment by computing the cosine similarity of text and image embeddings derived from CLIP, which is trained on a diverse set of text-image pairs.
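
The CLIPScore computation described above can be sketched as follows. The rescaling weight w = 2.5 and the clipping of negative similarities at zero follow the original CLIPScore formulation rather than anything stated in this summary. The embeddings here are plain Python lists standing in for the outputs of CLIP's image and text encoders (an assumption made to keep the sketch dependency-free), and the function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_score(
    image_embedding: Sequence[float],
    text_embedding: Sequence[float],
    w: float = 2.5,
) -> float:
    """CLIPScore-style alignment: w * max(cos(image, text), 0).

    The inputs are assumed to come from CLIP's image and text encoders;
    any vectors work for the arithmetic itself.
    """
    return w * max(cosine_similarity(image_embedding, text_embedding), 0.0)


# Toy 2-D embeddings (hypothetical values, for illustration only).
aligned = clip_score([1.0, 0.0], [1.0, 0.0])    # identical direction
opposed = clip_score([1.0, 0.0], [-1.0, 0.0])   # negative cosine is clipped to 0
```

In practice the two embeddings come from the same pretrained CLIP checkpoint, so the score inherits whatever biases that checkpoint has.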

Figure 1: Box plot visualization of the value ranges for each of the normalized image quality scores.

Challenges with Current Metrics

The paper identifies several challenges: bias introduced by specific training datasets, which may not cover uncommon scenarios; the dependence of metrics on large model architectures; and a lack of fine-grained evaluation for detailed compositional reasoning.

Future Directions

The paper advocates for the development of more nuanced metrics capable of capturing compositional reasoning and human-like understanding. It emphasizes the need for datasets aligning closely with real-world complexity to facilitate more robust assessments of generative models.

Conclusion

The survey offers a foundational framework for future research into T2I quality assessment. By establishing a clear taxonomy and detailing the strengths and weaknesses of current metrics, it sets the stage for advances that keep pace with the growing ubiquity and complexity of generative AI applications, including domains such as virtual reality and gaming, where T2I generation is becoming increasingly prevalent.