
Towards Scalable Human-aligned Benchmark for Text-guided Image Editing

Published 1 May 2025 in cs.CV (arXiv:2505.00502v1)

Abstract: A variety of text-guided image editing models have been proposed recently. However, there is no widely accepted standard evaluation method, mainly due to the subjective nature of the task, leaving researchers to rely on manual user studies. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation that is not limited to specific easy-to-evaluate cases. HATIE also provides a fully automated and omnidirectional evaluation pipeline. In particular, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and provide benchmark results on several state-of-the-art models to offer deeper insights into their performance.

Summary

Overview of a Benchmark for Text-guided Image Editing Evaluation

The paper entitled "Towards Scalable Human-aligned Benchmark for Text-guided Image Editing" addresses the crucial need for a standardized evaluation method within the domain of text-guided image editing. Existing methodologies largely rely on subjective user studies, which are limited in scalability and reproducibility. The paper proposes a Human-Aligned Benchmark for Text-guided Image Editing (HATIE), which facilitates comprehensive and reliable evaluation of models across a diverse range of editing tasks.

Introduction to the Problem

Text-guided image editing builds on recent advances in diffusion-based image generation models. However, the subjectivity inherent in the task makes it difficult to establish a widely accepted evaluation standard. The main difficulty lies in the absence of a single correct output for any given editing instruction. Measuring deviations between an edited image and an undetermined "ground truth" therefore requires a more nuanced approach than traditional pixel-level metrics.

Proposed Solution

HATIE provides a novel benchmark set that spans a broad range of editing tasks and includes an omnidirectional evaluation pipeline. The evaluation measures multiple aspects of an edit that align closely with human perception, reducing the reliance on manual user studies. This claim is empirically verified by comparing the benchmark's evaluations against results from user studies, showing strong agreement with human judgments of edit quality.

Evaluation Criteria

HATIE evaluates edited images based on three main criteria: fidelity, consistency, and image quality. These are further refined into five specific metrics: Object Fidelity, Background Fidelity, Object Consistency, Background Consistency, and Image Quality. Object Fidelity assesses whether the edit was accurately applied to the intended object while preserving its identity when required. Background Fidelity ensures that the desired background change aligns with the instruction. Object and Background Consistency measure the preservation of the original scene's integrity, both in terms of object and background features. Finally, Image Quality evaluates the overall realism and feasibility of the edited output.
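The combination of per-aspect scores into a single human-aligned score can be sketched as a weighted average. This is an illustrative sketch only, not the authors' implementation: the metric names, the assumption that each score lies in [0, 1], and the uniform default weights are all hypothetical.

```python
# Hypothetical sketch of aggregating HATIE-style per-aspect scores.
# Weights and score ranges are assumptions, not the paper's actual formula.

AXES = ["object_fidelity", "background_fidelity",
        "object_consistency", "background_consistency", "image_quality"]

def overall_score(scores: dict, weights: dict = None) -> float:
    """Weighted average of per-aspect scores, each assumed to lie in [0, 1]."""
    if weights is None:
        weights = {axis: 1.0 for axis in AXES}  # uniform weighting by default
    total_weight = sum(weights[a] for a in AXES)
    return sum(weights[a] * scores[a] for a in AXES) / total_weight

# Example: a model that preserves the scene well but edits less faithfully.
example = {"object_fidelity": 0.82, "background_fidelity": 0.74,
           "object_consistency": 0.91, "background_consistency": 0.88,
           "image_quality": 0.79}
print(round(overall_score(example), 3))  # 0.828
```

In practice, the relative weights would be the component fitted against human-preference data, which is what makes the combined score "human-aligned" rather than an arbitrary average.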

Benchmark Dataset and Queries

The HATIE framework is complemented by a robust dataset, sourced from the foundational GQA dataset for VQA tasks and filtered to remove objects that are indistinct or partially obstructed. The dataset includes detailed annotations for objects and their relations, enabling the generation of diverse and contextually relevant edit queries. Queries are structured as either description-based or instruction-based, ensuring compatibility across different model types. They cover object-centric tasks such as addition, removal, resizing, and attribute change, as well as non-object-centric tasks such as background or style changes.
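The dual query phrasings can be sketched as a small data structure that renders the same edit in both forms. This is a hypothetical illustration: the field names, edit-type vocabulary, and phrasing templates below are assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch of a dual-phrasing edit query; not HATIE's real schema.
from dataclasses import dataclass

@dataclass
class EditQuery:
    edit_type: str   # e.g. "attribute", "add", "remove" (vocabulary assumed)
    target: str      # object the edit applies to
    detail: str      # new attribute or object to introduce

    def as_instruction(self) -> str:
        """Instruction-based phrasing, for instruction-following editors."""
        templates = {
            "attribute": f"change the {self.target} to be {self.detail}",
            "add": f"add {self.detail} next to the {self.target}",
            "remove": f"remove the {self.target}",
        }
        return templates[self.edit_type]

    def as_description(self) -> str:
        """Description-based phrasing, for caption-conditioned editors."""
        templates = {
            "attribute": f"a photo with a {self.detail} {self.target}",
            "add": f"a photo with {self.detail} next to the {self.target}",
            "remove": f"a photo without the {self.target}",
        }
        return templates[self.edit_type]

q = EditQuery(edit_type="attribute", target="car", detail="red")
print(q.as_instruction())   # change the car to be red
print(q.as_description())   # a photo with a red car
```

Emitting both phrasings for every query is what lets a single benchmark set evaluate models that expect instructions and models that expect target captions.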

Results

The HATIE benchmark results reveal nuanced insights into competing models' performance under varied edit intensities. Notably, the benchmark is sensitive enough to distinguish fine-grained differences in model output, identifying optimal parameter settings for various models and edit types.

Conclusion and Implications

HATIE significantly advances the field of text-guided image editing by offering a scalable, human-aligned benchmark that overcomes the limitations of subjective user studies. Its structured evaluation framework highlights areas of strength and weakness across models, paving the way for future research and development. While some tasks are currently excluded due to dataset limitations, the framework is designed to accommodate a wider spectrum of editing types in future iterations.
