Evaluating and Training Text-to-Image Metrics with CROC Framework
The paper advances the robustness evaluation of metrics for Text-to-Image (T2I) generation, focusing on systematic, contrastive methods that improve the accuracy and scalability of existing approaches. It introduces CROC, a framework that automates Contrastive Robustness Checks for T2I evaluation metrics. CROC systematically probes and measures metric resilience by generating contrastive test cases across a detailed taxonomy of image properties, thereby filling a gap in automated meta-evaluation methods.
Overview and Methodology
CROC generates a pseudo-labeled dataset, termed CROCsyn, comprising over one million text-image pairs. This expansive dataset enables granular comparisons among evaluation metrics, facilitating a nuanced understanding of metric capabilities and shortcomings. Additionally, the framework uses CROCsyn data to train CROCScore, a new metric that achieves superior performance among open-source methods.
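The pseudo-labeling idea behind CROCsyn can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pairing scheme and all names (`PseudoLabeledPair`, `build_contrastive_pairs`, the stub generator) are assumptions. The key point is that labels come for free from the generation process, since each image is known to match the prompt that produced it and to mismatch its contrastive counterpart.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PseudoLabeledPair:
    """One text-image test case with an automatically assigned label."""
    prompt: str
    image_id: str  # identifier of the generated image
    label: int     # 1 = matching pair, 0 = contrastive (mismatching) pair

def build_contrastive_pairs(prompt: str, contrast_prompt: str,
                            generate: Callable[[str], str]) -> List[PseudoLabeledPair]:
    """Generate one image per prompt, then pair each image with both
    prompts: the generating prompt gets label 1, the other label 0."""
    img_a = generate(prompt)
    img_b = generate(contrast_prompt)
    return [
        PseudoLabeledPair(prompt, img_a, 1),           # matching
        PseudoLabeledPair(contrast_prompt, img_a, 0),  # contrastive
        PseudoLabeledPair(contrast_prompt, img_b, 1),  # matching
        PseudoLabeledPair(prompt, img_b, 0),           # contrastive
    ]

# Stand-in for a real T2I model: returns a deterministic image id.
fake_generate = lambda p: f"img_{abs(hash(p)) % 10_000}"
pairs = build_contrastive_pairs("a red cube", "a blue cube", fake_generate)
```

Scaled across a taxonomy of contrastive prompt pairs, this construction yields the kind of million-pair dataset the paper describes.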
The paper also introduces a human-supervised dataset, CROChum, which complements CROCsyn by focusing on challenging categories. This dataset targets test cases that are hard to cover automatically and where existing metrics fall short, such as handling negation or correctly identifying body parts in generated images.
Central to CROC is the innovative contrastive evaluation approach. It measures how well T2I metrics differentiate between matching and non-matching text-image pairs across various evaluation directions: text-based versus image-based evaluations, and forward versus inverse evaluations. These evaluations challenge metrics to rate the match quality in a controlled setup, providing a comprehensive analysis of their robustness.
Results and Implications
The results presented in the paper highlight weaknesses in current evaluation metrics. For instance, existing metrics often fail on prompts involving negation, and all tested open-source metrics misidentify body parts in at least 25% of cases. This underscores the need for more robust evaluation frameworks like CROC.
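Per-category failure rates like the ones reported above can be aggregated from individual pass/fail records with a simple tally. The sketch below uses hypothetical category names and made-up records purely to show the arithmetic; the figures are not the paper's data:

```python
from collections import defaultdict

def failure_rates_by_category(results):
    """Aggregate per-category failure rates from (category, passed) records."""
    totals, fails = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        fails[category] += (not passed)
    return {c: fails[c] / totals[c] for c in totals}

# Illustrative records only: 1 failure out of 4 body-part checks -> 0.25.
records = [("negation", False), ("negation", False), ("negation", True),
           ("body_parts", False), ("body_parts", True), ("body_parts", True),
           ("body_parts", True)]
print(failure_rates_by_category(records))
```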
Given the scalability and sensitivity of the CROC framework, these findings have several practical implications. Improvement in T2I metrics can enhance applications ranging from art creation to industrial design, where precise assessment of generative outputs is critical. The customizable nature of CROC allows adaptation to specific domains requiring tailored metric comparisons, offering potential advancements in industry-specific T2I applications.
The theoretical implications suggest a need for reevaluating training paradigms for T2I metrics, potentially leveraging pseudo-labeled datasets like CROCsyn to refine metric performance further. The strong performance of CROCScore indicates promising avenues for metric enhancement, especially through the use of contrastive robustness checks.
Future Directions
Looking ahead, the paper opens the door for continuing research into optimizing configuration parameters for even more efficient T2I metric training. Additionally, the framework's robustness analysis could be extended to other generative modalities in AI, offering a pathway for broadening the applicability of contrastive evaluations beyond image synthesis.
Furthermore, generating pseudo-labeled datasets with higher-quality T2I models could bring metrics closer to human judgment, reducing discrepancies and improving evaluations overall. Together, these advances in automated generation and meta-evaluation point toward more reliable T2I evaluation metrics and, in turn, more trustworthy generative AI applications.