
I2E-Bench: IIE Evaluation Suite

Updated 14 January 2026
  • I2E-Bench is a comprehensive evaluation suite for instruction-based image editing that merges high-level semantic and low-level quality metrics.
  • It integrates 16 evaluation dimensions with automated scoring and rigorous human alignment to ensure robust benchmarking.
  • The suite enables scalable, reproducible assessments that address legacy single-metric limitations and drive nuanced performance analysis.

I2E-Bench is a comprehensive multi-dimensional evaluation suite for instruction-based image editing (IIE) that addresses longstanding challenges in accurately benchmarking models across diverse editing tasks and instruction modalities. Designed to supersede legacy evaluation protocols reliant on single metrics or limited test suites, I2E-Bench integrates high-level and low-level editing dimensions, rigorous human alignment, and a scalable automated scoring pipeline (Ma et al., 2024).

1. Motivation and Benchmark Objectives

Instruction-based image editing has evolved rapidly through methods such as InstructPix2Pix, MagicBrush, MGIE, and InstructDiffusion. Existing evaluation strategies are fragmented: single-metric use (PSNR, SSIM, CLIP) fails to generalize, small datasets like TedBench lack coverage, and human studies are costly with poor reproducibility. I2E-Bench is engineered to fill this void by:

  • Providing a large, statistically robust, multi-dimensional suite encompassing 2,000+ source images and 4,000+ editing instructions (covering original and diverse phrasing).
  • Automating fine-grained evaluation over 16 distinct dimensions—eight high-level (semantic and region-manipulation) and eight low-level (signal restoration)—enabling both granular and holistic assessment.
  • Aligning automated metric outputs to human perceptual judgments via an extensive, dimension-wise user study with high inter-annotator reliability (Cohen’s κ≈0.85).

2. Dataset Composition and Instruction Taxonomy

The benchmark draws source images from diverse public datasets (MS COCO, GoPro, LOL, Dense-Haze, CBSD68, etc.), ensuring a broad representation of content types. Each image is paired with two instruction forms:

  • "Original Instruction": Direct editing request.
  • "Diverse Instruction": Functionally equivalent rewrite to probe instruction robustness.

Instructions are classified into six categories: Animal, Object, Scenery, Plant, Human, and Global. For each of the 16 dimensions, roughly 140 images are selected with both original and diverse instructions, generating ~2,240 image-instruction pairs per instruction type.

Edited outputs are sourced from eight open-source IIE models, ensuring cross-method comparability.
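For concreteness, one benchmark sample can be pictured as a small record pairing a source image with its two instruction phrasings. The field names below are illustrative, not the benchmark's actual schema:

```python
# Hypothetical record schema for one I2E-Bench image-instruction pair.
# Field and identifier names are illustrative, not the benchmark's JSON keys.
from dataclasses import dataclass

@dataclass
class EditSample:
    image_id: str             # source image identifier (e.g., from MS COCO)
    category: str             # Animal, Object, Scenery, Plant, Human, or Global
    dimension: str            # one of the 16 evaluation dimensions
    original_instruction: str # direct editing request
    diverse_instruction: str  # functionally equivalent rewrite

sample = EditSample(
    image_id="coco_000042",
    category="Animal",
    dimension="Object Removal",
    original_instruction="Remove the dog from the photo.",
    diverse_instruction="Please make the dog disappear from the image.",
)
```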

3. Evaluation Dimensions and Automated Scoring Methodology

I2E-Bench splits its design into high-level semantic/region-based edits and low-level global/detail corrections.

High-Level Editing (Evaluated via GPT-4V, CLIP, and annotated masks):

| Dimension | Metric/Protocol |
| --- | --- |
| Counting | GPT-4V count query, compared to gold count |
| Direction Perception | GPT-4V quadrant query, binary evaluation |
| Object Removal | GPT-4V object presence query (absence scored 1) |
| Object Replacement | GPT-4V new-object confirmation |
| Background Replacement | GPT-4V background description match |
| Color Alteration | GPT-4V color query versus instruction |
| Style Alteration | CLIP similarity to style corpus |
| Region Accuracy | Mask-based edit isolation; SSIM over the whited-out region |
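The GPT-4V-based protocols above reduce to mapping a visual question's answer onto a numeric score. A minimal sketch for the Object Removal dimension (absence scored 1), with a plain string answer standing in for the actual GPT-4V API call:

```python
# Illustrative scoring of a binary GPT-4V query for Object Removal.
# The answers below are stand-ins for real GPT-4V responses.

def score_object_removal(vlm_answer: str) -> float:
    """Map a yes/no answer about object presence to a removal score."""
    answer = vlm_answer.strip().lower()
    return 1.0 if answer.startswith("no") else 0.0  # "no" = object absent

# Example: three edited images, two successful removals.
answers = ["No", "Yes, the dog is still visible.", "no"]
scores = [score_object_removal(a) for a in answers]
mean_score = sum(scores) / len(scores)
```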

Low-Level Editing (Quantitative via SSIM on global image):

| Dimension | Protocol |
| --- | --- |
| Deblurring | SSIM with clean reference |
| Haze Removal | SSIM with clean reference |
| Low-light Enhancement | SSIM with clean reference |
| Noise Removal | SSIM with clean reference |
| Rain Removal | SSIM with clean reference |
| Shadow Removal | SSIM with clean reference |
| Snow Removal | SSIM with clean reference |
| Watermark Removal | SSIM with clean reference |
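All low-level dimensions score against a clean reference with SSIM. The benchmark presumably uses a standard windowed implementation (e.g., scikit-image's); the sketch below computes a simplified single-window SSIM from global image statistics just to show the core formula:

```python
# Simplified single-window SSIM (global statistics, no sliding window).
# A standard implementation would average SSIM over local windows.
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM computed once over the whole image, for images in [0, data_range]."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return float(
        (2 * mu_x * mu_y + c1) * (2 * cov + c2)
        / ((mu_x**2 + mu_y**2 + c1) * (x.var() + y.var() + c2))
    )

rng = np.random.default_rng(0)
clean = rng.random((64, 64))                                    # synthetic "reference"
noisy = np.clip(clean + 0.2 * rng.standard_normal(clean.shape), 0.0, 1.0)
score = global_ssim(clean, noisy)  # below the 1.0 of a perfect restoration
```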

For each dimension $i$, the score is

$$s_i = \frac{1}{N}\sum_{j=1}^{N} f_i(x_j, y_j)$$

where $x_j$ is the edited result, $y_j$ the instruction, and $N \approx 140$.

4. Human Alignment Protocol and Statistical Validation

Human perceptual alignment is a core pillar of I2E-Bench. For every high-level dimension:

  • A subset ($N = 140$) is judged by trained human annotators, answering the same queries as GPT-4V.
  • Rankings of the $M = 8$ model outputs are averaged per dimension, yielding a human score $H_i(m)$ for model $m$.
  • Correlation between automated dimension scores $s_i(m)$ and human scores $H_i(m)$ is computed, yielding Pearson’s $\rho > 0.7$ across all dimensions (all $p < 0.01$), confirming statistical alignment.
  • Inter-annotator agreement is high (Cohen’s $\kappa \approx 0.85$).
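The alignment check reduces to a Pearson correlation between the automated and human scores across the eight models. A sketch with illustrative (not published) values:

```python
# Pearson correlation between automated scores s_i(m) and human scores H_i(m)
# over M = 8 models. The score lists below are illustrative, not from the paper.
import math

def pearson(a: list, b: list) -> float:
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

auto = [0.21, 0.35, 0.40, 0.55, 0.60, 0.62, 0.66, 0.79]   # s_i(m), illustrative
human = [0.25, 0.30, 0.45, 0.50, 0.58, 0.65, 0.70, 0.75]  # H_i(m), illustrative
rho = pearson(auto, human)  # alignment requires rho > 0.7
```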

5. Aggregation, Overall Model Scoring, and Robustness Analysis

Scores are normalized per dimension to $[0, 1]$ by empirical min-max rescaling. The aggregate benchmark score for a model is:

$$S_{\mathrm{total}} = \frac{1}{16}\sum_{i=1}^{16} s_i$$

Custom weighted aggregation is supported:

$$S_{\mathrm{total}} = \sum_{i=1}^{16} w_i s_i \quad \left(\sum_{i=1}^{16} w_i = 1\right)$$
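Min-max normalization and the two aggregation schemes can be sketched as follows; the score values are illustrative:

```python
# Min-max rescaling of one dimension's raw scores across models, then
# uniform or custom-weighted aggregation across dimensions.

def min_max(values: list) -> list:
    """Rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def aggregate(scores: list, weights: list = None) -> float:
    """S_total = sum_i w_i * s_i; uniform weights reproduce the plain mean."""
    if weights is None:
        weights = [1 / len(scores)] * len(scores)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, scores))

per_model = min_max([0.40, 0.55, 0.70, 0.85])   # one dimension, four models
uniform = aggregate([0.5, 1.0])                 # uniform weights -> 0.75
weighted = aggregate([0.5, 1.0], [0.25, 0.75])  # custom weights -> 0.875
```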

I2E-Bench also assesses robustness to instruction phrasing:

$$S^i = \frac{|s_o^i - s_d^i|}{\min(s_o^i, s_d^i)}$$

where $s_o^i$ and $s_d^i$ are the dimension-$i$ scores under original and diverse instruction phrasing, respectively.

Models using LLM-based editing show less than 10% score variation across phrasings; conventional models often exceed 30% variation, most notably in Object Removal.
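The robustness score above can be computed directly; the numbers below are illustrative:

```python
# Instruction-robustness score: relative gap between a dimension's score
# under original (s_o) and diverse (s_d) phrasing. Inputs are illustrative.

def phrasing_variation(s_o: float, s_d: float) -> float:
    """S^i = |s_o - s_d| / min(s_o, s_d)."""
    return abs(s_o - s_d) / min(s_o, s_d)

v = phrasing_variation(0.60, 0.57)  # small gap, below the 10% threshold
```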

Category sensitivity reveals highest performance in Scenery and Global edits (mean S ≈ 0.55) and lower in local object edits (Animal, Human, mean S ≈ 0.40).

6. Comparative Results and Actionable Insights

Test results reveal differentiated strengths among state-of-the-art systems:

  • InstructDiffusion excels on four low-level dimensions (Rain, Haze, Watermark, Snow; e.g., $s_{\text{rain}} = 0.672$ vs. $0.606$).
  • MGIE leads in Deblurring ($s_{\text{deblur}} = 0.603$) over MagicBrush.
  • MagicBrush dominates Region Accuracy ($s_{\text{region}} = 0.663$), Background Replacement ($s_{\text{bkg}} = 0.785$), and Color Alteration ($s_{\text{col}} = 0.557$).
  • InstructAny2Pix leads Counting ($s_{\text{count}} = 0.207$) and Style Alteration ($s_{\text{style}} = 0.268$) on diverse instructions.
  • InstructEdit lags in Object Replacement and Color Alteration.

Qualitative analysis (Fig. 2 of the paper) shows consistently reduced edit spill-over in MagicBrush and artifact-prone outputs from InstructPix2Pix.

7. Practical Workflow and Benchmark Extension

Users integrate their model outputs into the standardized evaluation pipeline by:

  1. Cloning the repository;
  2. Organizing outputs per instruction type/dimension;
  3. Ensuring source images and instruction JSON presence;
  4. Running the provided Python evaluation script, which leverages the OpenAI API for GPT-4V queries, computes SSIM/CLIP scores, and outputs results in CSV format.
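A hypothetical pre-flight check of the expected output layout before running the evaluation script (directory names are illustrative; consult the repository for the actual structure):

```python
# Hypothetical layout check: verify that model outputs exist per instruction
# type and dimension. Directory names are illustrative, not the repo's own.
import tempfile
from pathlib import Path

DIMENSIONS = ["counting", "object_removal", "deblurring"]  # illustrative subset

def validate_outputs(root: Path, instruction_type: str = "original") -> list:
    """Return the dimensions missing an output folder for this instruction type."""
    return [d for d in DIMENSIONS if not (root / instruction_type / d).is_dir()]

# Demo: build a complete layout in a temp directory, then validate it.
root = Path(tempfile.mkdtemp())
for dim in DIMENSIONS:
    (root / "original" / dim).mkdir(parents=True)
missing = validate_outputs(root)  # empty list: layout is complete
```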

This process enables rigorous, reproducible, and fully automated human-aligned evaluation. All dataset components, annotations, and code are open-source (Ma et al., 2024).

8. Significance and Future Directions

I2E-Bench sets a new standard for instruction-based image editing evaluation, integrating multi-dimensional, automatic, and human-aligned protocols at scale. The paradigm supports nuanced analysis across edit types, model architectures, and instruction phrasings. Its actionable insights direct method and dataset design to areas of persistent weakness (e.g., local object edits, instruction sensitivity). The release enables further extension, custom weighting schemes, and cross-benchmark comparability.

Ongoing research avenues include refining perceptual metrics, expanding dataset coverage, integrating more powerful LLMs, and exploring the limits of instruction generalization in visual editing contexts.
