I2E-Bench: IIE Evaluation Suite
- I2E-Bench is a comprehensive evaluation suite for instruction-based image editing that merges high-level semantic and low-level quality metrics.
- It integrates 16 evaluation dimensions with automated scoring and rigorous human alignment to ensure robust benchmarking.
- The suite enables scalable, reproducible assessments that address legacy single-metric limitations and drive nuanced performance analysis.
I2E-Bench is a comprehensive multi-dimensional evaluation suite for instruction-based image editing (IIE) that addresses longstanding challenges in accurately benchmarking models across diverse editing tasks and instruction modalities. Designed to supersede legacy evaluation protocols reliant on single metrics or limited test suites, I2E-Bench integrates high-level and low-level editing dimensions, rigorous human alignment, and a scalable automated scoring pipeline (Ma et al., 2024).
1. Motivation and Benchmark Objectives
Instruction-based image editing has evolved rapidly through methods such as InstructPix2Pix, MagicBrush, MGIE, and InstructDiffusion. Existing evaluation strategies are fragmented: single-metric use (PSNR, SSIM, CLIP) fails to generalize, small datasets like TedBench lack coverage, and human studies are costly with poor reproducibility. I2E-Bench is engineered to fill this void by:
- Providing a large, statistically robust, multi-dimensional suite encompassing 2,000+ source images and 4,000+ editing instructions (covering original and diverse phrasing).
- Automating fine-grained evaluation over 16 distinct dimensions—eight high-level (semantic and region-manipulation) and eight low-level (signal restoration)—enabling both granular and holistic assessment.
- Aligning automated metric outputs to human perceptual judgments via an extensive, dimension-wise user study with high inter-annotator reliability (Cohen’s κ≈0.85).
2. Dataset Composition and Instruction Taxonomy
The benchmark draws source images from diverse public datasets (MS COCO, GoPro, LOL, Dense-Haze, CBSD68, etc.), ensuring a broad representation of content types. Each image is paired with two instruction forms:
- "Original Instruction": Direct editing request.
- "Diverse Instruction": Functionally equivalent rewrite to probe instruction robustness.
Instructions are classified into six categories: Animal, Object, Scenery, Plant, Human, and Global. For each of the 16 dimensions, roughly 140 images are selected with both original and diverse instructions, generating ~2,240 image-instruction pairs per instruction type.
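A single benchmark entry pairs one source image with both instruction forms. The sketch below illustrates what such an entry could look like; the field names and JSON schema are hypothetical, chosen for illustration rather than taken from the released dataset:

```python
import json

# Hypothetical illustration of one I2E-Bench entry; the real schema may differ.
entry = {
    "image_id": "coco_000123",
    "source_dataset": "MS COCO",
    "category": "Animal",            # one of the six instruction categories
    "dimension": "Object Removal",   # one of the 16 evaluation dimensions
    "original_instruction": "Remove the dog from the photo.",
    "diverse_instruction": "Make the picture look as if the dog were never there.",
}

print(json.dumps(entry, indent=2))
```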
Edited outputs are sourced from eight open-source IIE models, ensuring cross-method comparability.
3. Evaluation Dimensions and Automated Scoring Methodology
I2E-Bench splits its design into high-level semantic/region-based edits and low-level global/detail corrections.
High-Level Editing (Evaluated via GPT-4V, CLIP, and annotated masks):
| Dimension | Metric/Protocol |
|---|---|
| Counting | GPT-4V count query, compared to gold count |
| Direction Perception | GPT-4V quadrant query, binary evaluation |
| Object Removal | GPT-4V object presence query (absence scored 1) |
| Object Replacement | GPT-4V new object confirmation |
| Background Replacement | GPT-4V background description match |
| Color Alteration | GPT-4V color query versus instruction |
| Style Alteration | CLIP similarity to style corpus |
| Region Accuracy | Mask-based edit isolation, SSIM over the whited-out region |
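Most of the GPT-4V protocols above reduce to binary checks on the model's textual answer. A minimal sketch of two such scorers follows; the helper names and answer-parsing convention are assumptions for illustration, not the benchmark's actual code:

```python
def score_object_removal(vlm_answer: str) -> int:
    """Score 1 if the VLM reports the target object is absent, else 0.

    `vlm_answer` is assumed to be a yes/no reply to a presence query
    such as "Is there a dog in the image?".
    """
    return 1 if vlm_answer.strip().lower().startswith("no") else 0


def score_counting(vlm_count: int, gold_count: int) -> int:
    """Binary counting score: exact match against the gold count."""
    return int(vlm_count == gold_count)
```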
Low-Level Editing (Quantitative via SSIM on global image):
| Dimension | Protocol |
|---|---|
| Deblurring | SSIM with clean reference |
| Haze Removal | SSIM with clean reference |
| Low-light Enhancement | SSIM with clean reference |
| Noise Removal | SSIM with clean reference |
| Rain Removal | SSIM with clean reference |
| Shadow Removal | SSIM with clean reference |
| Snow Removal | SSIM with clean reference |
| Watermark Removal | SSIM with clean reference |
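For the low-level dimensions, the SSIM comparison against a clean reference can be sketched as below. This is a deliberately simplified, single-window global SSIM (library implementations such as scikit-image use local sliding windows and Gaussian weighting); it only illustrates the structure of the metric:

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """Simplified SSIM computed over the whole image as a single window.

    A sketch of the metric's structure, not a drop-in replacement for
    windowed implementations like skimage's structural_similarity.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

Identical images score 1.0, and any deviation from the clean reference pulls the score below 1.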
For each dimension $d$, the score is:

$$S_d = \frac{1}{N_d} \sum_{i=1}^{N_d} s_d(\hat{x}_i, c_i)$$

where $\hat{x}_i$ is the edited result for the $i$-th sample, $c_i$ the corresponding instruction, $s_d(\cdot,\cdot)$ the dimension-specific scoring protocol from the tables above, and $N_d$ the number of evaluated pairs.
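In code, this per-dimension aggregation is just a mean over per-sample scores; the scorer argument stands in for whichever protocol the tables above assign (a sketch, not the benchmark's actual implementation):

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def dimension_score(samples: Sequence[Tuple], scorer: Callable) -> float:
    """Average the per-sample scores s_d(edited, instruction) for one dimension."""
    return mean(scorer(edited, instruction) for edited, instruction in samples)
```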
4. Human Alignment Protocol and Statistical Validation
Human perceptual alignment is a core pillar of I2E-Bench. For every high-level dimension:
- A subset of model outputs is judged by trained human annotators, who answer the same queries as GPT-4V.
- Rankings for model outputs are averaged per dimension, yielding a human score $H_m$ for each model $m$.
- Correlation between automated dimension scores and human scores is computed, yielding Pearson’s ρ > 0.7 across all dimensions (all p < 0.01), confirming statistical alignment.
- Inter-annotator agreement is high (Cohen’s κ≈0.85).
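The alignment check boils down to a Pearson correlation between each dimension's automated scores and the averaged human scores across models. A self-contained implementation (the data it would run on is the benchmark's; nothing here is specific to I2E-Bench):

```python
from math import sqrt


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation between automated and human per-model scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A dimension passes the alignment criterion when this correlation against human scores exceeds 0.7 with p < 0.01.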
5. Aggregation, Overall Model Scoring, and Robustness Analysis
Scores are normalized per dimension to [0,1] by empirical min-max rescaling. The aggregate benchmark score for a model is the uniform mean over its 16 normalized dimension scores:

$$S = \frac{1}{16} \sum_{d=1}^{16} \tilde{S}_d$$

where $\tilde{S}_d$ denotes the normalized score on dimension $d$. Custom weighted aggregation is supported:

$$S_w = \sum_{d=1}^{16} w_d \, \tilde{S}_d, \quad w_d \ge 0, \ \sum_{d} w_d = 1$$
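The normalization and aggregation steps can be sketched as follows; note that min-max rescaling operates per dimension across models, while aggregation operates per model across dimensions (a sketch of the stated procedure, not the benchmark's released code):

```python
from typing import Dict, Optional


def normalize_dimension(raw: Dict[str, float]) -> Dict[str, float]:
    """Min-max rescale one dimension's raw scores (keyed by model) to [0, 1]."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # degenerate case: all models tied on this dimension
        return {m: 0.0 for m in raw}
    return {m: (s - lo) / (hi - lo) for m, s in raw.items()}


def aggregate(dim_scores: Dict[str, float],
              weights: Optional[Dict[str, float]] = None) -> float:
    """Aggregate one model's normalized per-dimension scores.

    Uniform mean by default; custom non-negative weights summing to 1
    give the weighted variant.
    """
    if weights is None:
        return sum(dim_scores.values()) / len(dim_scores)
    return sum(weights[d] * s for d, s in dim_scores.items())
```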
I2E-Bench also assesses robustness to instruction phrasing:
Models using LLM-based editing show markedly smaller score variation across phrasings; conventional models often exceed 30% variation, most notably in Object Removal.
Category sensitivity reveals highest performance in Scenery and Global edits (mean S ≈ 0.55) and lower in local object edits (Animal, Human, mean S ≈ 0.40).
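One plausible way to quantify the phrasing-robustness figures above is the relative score change between the two instruction forms; the exact formula is not specified in this summary, so the definition below is an assumption:

```python
def phrasing_variation(score_original: float, score_diverse: float) -> float:
    """Relative score variation between original and diverse phrasings, in percent.

    Assumed definition: |S_orig - S_div| / S_orig * 100.
    """
    return abs(score_original - score_diverse) / score_original * 100.0
```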
6. Comparative Results and Actionable Insights
Test results reveal differentiated strengths among state-of-the-art systems:
- InstructDiffusion excels on four low-level dimensions (Rain, Haze, Watermark, Snow; e.g., vs $0.606$).
- MGIE leads MagicBrush in Deblurring.
- MagicBrush dominates Region Accuracy, Background Replacement, and Color Alteration.
- InstructAny2Pix leads in Counting and Style Alteration on diverse instructions.
- InstructEdit lags in Object Replacement and Color Alteration.
Qualitative analysis (paper Fig. 2) shows consistently reduced edit spill-over in MagicBrush and artifact-prone outputs from InstructPix2Pix.
7. Practical Workflow and Benchmark Extension
Users integrate their model outputs into the standardized evaluation pipeline by:
- Cloning the repository;
- Organizing outputs per instruction type/dimension;
- Ensuring source images and instruction JSON presence;
- Running the provided Python evaluation script, which leverages OpenAI API for GPT-4V queries, computes SSIM/CLIP scores, and outputs results in CSV format.
This process enables rigorous, reproducible, and fully automated human-aligned evaluation. All dataset components, annotations, and code are open-source (Ma et al., 2024).
8. Significance and Future Directions
I2E-Bench sets a new standard for instruction-based image editing evaluation, integrating multi-dimensional, automatic, and human-aligned protocols at scale. The paradigm supports nuanced analysis across edit types, model architectures, and instruction phrasings. Its actionable insights direct method and dataset design to areas of persistent weakness (e.g., local object edits, instruction sensitivity). The release enables further extension, custom weighting schemes, and cross-benchmark comparability.
Ongoing research avenues include refining perceptual metrics, expanding dataset coverage, integrating more powerful LLMs, and exploring the limits of instruction generalization in visual editing contexts.