Develop more accurate evaluation metrics for image editing that capture subject consistency

Develop and validate objective evaluation metrics for image editing that accurately measure subject consistency (i.e., preservation of a subject’s identity and appearance across edits), to address deficiencies in current benchmark assessments of region-based editing tasks.

Background

In constructing the image editing dataset, the authors observed that although training with GPT-Image-Edit-1.5M can improve benchmark scores, it severely disrupts character (subject) consistency, which led them to exclude this dataset from training.

This highlighted a broader limitation in existing evaluation practices for image editing: current metrics insufficiently capture subject consistency, especially for region-based edits, motivating the need for more accurate metrics.
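One plausible instantiation of such a metric, not specified by the paper, is to compare deep features of the subject region before and after an edit: embed the subject crop from the source image and from the edited image (e.g., with a CLIP or DINO image encoder) and report their cosine similarity. The sketch below assumes the embeddings have already been extracted; the function names and the choice of encoder are illustrative, not the authors' method.

```python
import numpy as np

def subject_consistency(emb_before: np.ndarray, emb_after: np.ndarray) -> float:
    """Cosine similarity between subject-region embeddings from the source
    and edited images. Values near 1.0 suggest the subject's identity and
    appearance were preserved; lower values indicate identity drift.
    The embedding extractor (e.g., CLIP/DINO) is an assumption, not
    prescribed by the source."""
    a = emb_before / np.linalg.norm(emb_before)
    b = emb_after / np.linalg.norm(emb_after)
    return float(np.dot(a, b))

def mean_subject_consistency(pairs) -> float:
    """Average consistency over multiple (before, after) embedding pairs,
    e.g., one pair per annotated subject in a region-based editing benchmark."""
    return float(np.mean([subject_consistency(a, b) for a, b in pairs]))
```

A benchmark could report this score alongside edit-compliance metrics, so that an edit which satisfies the instruction but replaces the subject's identity is penalized rather than rewarded.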

References

This observation also highlights the need for more accurate evaluation metrics in image editing (such as subject consistency), which is left for future work.

EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture  (2512.04810 - He et al., 4 Dec 2025) in Section 3, Image Editing Data (IT2I)