I2E-Bench: IIE Evaluation Suite
- I2E-Bench is a comprehensive evaluation suite for instruction-based image editing that merges high-level semantic and low-level quality metrics.
- It integrates 16 evaluation dimensions with automated scoring and rigorous human alignment to ensure robust benchmarking.
- The suite enables scalable, reproducible assessments that address legacy single-metric limitations and drive nuanced performance analysis.
I2E-Bench is a comprehensive multi-dimensional evaluation suite for instruction-based image editing (IIE) that addresses longstanding challenges in accurately benchmarking models across diverse editing tasks and instruction modalities. Designed to supersede legacy evaluation protocols reliant on single metrics or limited test suites, I2E-Bench integrates high-level and low-level editing dimensions, rigorous human alignment, and a scalable automated scoring pipeline (Ma et al., 2024).
1. Motivation and Benchmark Objectives
Instruction-based image editing has evolved rapidly through methods such as InstructPix2Pix, MagicBrush, MGIE, and InstructDiffusion. Existing evaluation strategies are fragmented: single-metric use (PSNR, SSIM, CLIP) fails to generalize, small datasets like TedBench lack coverage, and human studies are costly with poor reproducibility. I2E-Bench is engineered to fill this void by:
- Providing a large, statistically robust, multi-dimensional suite encompassing 2,000+ source images and 4,000+ editing instructions (covering original and diverse phrasing).
- Automating fine-grained evaluation over 16 distinct dimensions—eight high-level (semantic and region-manipulation) and eight low-level (signal restoration)—enabling both granular and holistic assessment.
- Aligning automated metric outputs to human perceptual judgments via an extensive, dimension-wise user study with high inter-annotator reliability (Cohen’s κ≈0.85).
2. Dataset Composition and Instruction Taxonomy
The benchmark draws source images from diverse public datasets (MS COCO, GoPro, LOL, Dense-Haze, CBSD68, etc.), ensuring a broad representation of content types. Each image is paired with two instruction forms:
- "Original Instruction": Direct editing request.
- "Diverse Instruction": Functionally equivalent rewrite to probe instruction robustness.
Instructions are classified into six categories: Animal, Object, Scenery, Plant, Human, and Global. For each of the 16 dimensions, roughly 140 images are selected with both original and diverse instructions, generating ~2,240 image-instruction pairs per instruction type.
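A single benchmark entry pairs one source image with both instruction forms. The sketch below illustrates what such an entry could look like; the field names and JSON schema are hypothetical, chosen for illustration rather than taken from the released dataset:

```python
import json

# Hypothetical illustration of one I2E-Bench entry; the real schema may differ.
entry = {
    "image_id": "coco_000123",
    "source_dataset": "MS COCO",
    "category": "Animal",            # one of the six instruction categories
    "dimension": "Object Removal",   # one of the 16 evaluation dimensions
    "original_instruction": "Remove the dog from the photo.",
    "diverse_instruction": "Make the picture look as if the dog were never there.",
}

print(json.dumps(entry, indent=2))
```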
Edited outputs are sourced from eight open-source IIE models, ensuring cross-method comparability.
3. Evaluation Dimensions and Automated Scoring Methodology
I2E-Bench splits its design into high-level semantic/region-based edits and low-level global/detail corrections.
High-Level Editing (Evaluated via GPT-4V, CLIP, and annotated masks):
| Dimension | Metric/Protocol |
|---|---|
| Counting | GPT-4V count query, compared to gold count |
| Direction Perception | GPT-4V quadrant query, binary evaluation |
| Object Removal | GPT-4V object presence query (absence scored 1) |
| Object Replacement | GPT-4V new object confirmation |
| Background Replacement | GPT-4V background description match |
| Color Alteration | GPT-4V color query versus instruction |
| Style Alteration | CLIP similarity to style corpus |
| Region Accuracy | Mask-based edit isolation, SSIM over the whited-out region |
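Most of the GPT-4V protocols above reduce to binary checks on the model's textual answer. A minimal sketch of two such scorers follows; the helper names and answer-parsing convention are assumptions for illustration, not the benchmark's actual code:

```python
def score_object_removal(vlm_answer: str) -> int:
    """Score 1 if the VLM reports the target object is absent, else 0.

    `vlm_answer` is assumed to be a yes/no reply to a presence query
    such as "Is there a dog in the image?".
    """
    return 1 if vlm_answer.strip().lower().startswith("no") else 0


def score_counting(vlm_count: int, gold_count: int) -> int:
    """Binary counting score: exact match against the gold count."""
    return int(vlm_count == gold_count)
```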
Low-Level Editing (Quantitative via SSIM on global image):
| Dimension | Protocol |
|---|---|
| Deblurring | SSIM with clean reference |
| Haze Removal | SSIM with clean reference |
| Low-light Enhancement | SSIM with clean reference |
| Noise Removal | SSIM with clean reference |
| Rain Removal | SSIM with clean reference |
| Shadow Removal | SSIM with clean reference |
| Snow Removal | SSIM with clean reference |
| Watermark Removal | SSIM with clean reference |
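For the low-level dimensions, the SSIM comparison against a clean reference can be sketched as below. This is a deliberately simplified, single-window global SSIM (library implementations such as scikit-image use local sliding windows and Gaussian weighting); it only illustrates the structure of the metric:

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """Simplified SSIM computed over the whole image as a single window.

    A sketch of the metric's structure, not a drop-in replacement for
    windowed implementations like skimage's structural_similarity.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

Identical images score 1.0, and any deviation from the clean reference pulls the score below 1.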
For each dimension $d$, the score is:

$$S_d = \frac{1}{N_d} \sum_{i=1}^{N_d} s_d(\hat{x}_i, c_i)$$

where $\hat{x}_i$ is the edited result for the $i$-th sample, $c_i$ the corresponding instruction, $s_d(\cdot,\cdot)$ the dimension-specific scoring protocol from the tables above, and $N_d$ the number of evaluated pairs.
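In code, this per-dimension aggregation is just a mean over per-sample scores; the scorer argument stands in for whichever protocol the tables above assign (a sketch, not the benchmark's actual implementation):

```python
from statistics import mean
from typing import Callable, Sequence, Tuple


def dimension_score(samples: Sequence[Tuple], scorer: Callable) -> float:
    """Average the per-sample scores s_d(edited, instruction) for one dimension."""
    return mean(scorer(edited, instruction) for edited, instruction in samples)
```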
4. Human Alignment Protocol and Statistical Validation
Human perceptual alignment is a core pillar of I2E-Bench. For every high-level dimension:
- A subset of model outputs is judged by trained human annotators, who answer the same queries as GPT-4V.
- Rankings for model outputs are averaged per dimension, yielding a human score $H_m$ for each model $m$.
- Correlation between automated dimension scores and human scores is computed, yielding Pearson’s ρ > 0.7 across all dimensions (all p < 0.01), confirming statistical alignment.
- Inter-annotator agreement is high (Cohen’s κ≈0.85).
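The alignment check boils down to a Pearson correlation between each dimension's automated scores and the averaged human scores across models. A self-contained implementation (the data it would run on is the benchmark's; nothing here is specific to I2E-Bench):

```python
from math import sqrt


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation between automated and human per-model scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A dimension passes the alignment criterion when this correlation against human scores exceeds 0.7 with p < 0.01.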
5. Aggregation, Overall Model Scoring, and Robustness Analysis
Scores are normalized per dimension to [0,1] by empirical min-max rescaling. The aggregate benchmark score for a model is the uniform mean over its 16 normalized dimension scores:

$$S = \frac{1}{16} \sum_{d=1}^{16} \tilde{S}_d$$

where $\tilde{S}_d$ denotes the normalized score on dimension $d$. Custom weighted aggregation is supported:

$$S_w = \sum_{d=1}^{16} w_d \, \tilde{S}_d, \quad w_d \ge 0, \ \sum_{d} w_d = 1$$
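The normalization and aggregation steps can be sketched as follows; note that min-max rescaling operates per dimension across models, while aggregation operates per model across dimensions (a sketch of the stated procedure, not the benchmark's released code):

```python
from typing import Dict, Optional


def normalize_dimension(raw: Dict[str, float]) -> Dict[str, float]:
    """Min-max rescale one dimension's raw scores (keyed by model) to [0, 1]."""
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:  # degenerate case: all models tied on this dimension
        return {m: 0.0 for m in raw}
    return {m: (s - lo) / (hi - lo) for m, s in raw.items()}


def aggregate(dim_scores: Dict[str, float],
              weights: Optional[Dict[str, float]] = None) -> float:
    """Aggregate one model's normalized per-dimension scores.

    Uniform mean by default; custom non-negative weights summing to 1
    give the weighted variant.
    """
    if weights is None:
        return sum(dim_scores.values()) / len(dim_scores)
    return sum(weights[d] * s for d, s in dim_scores.items())
```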
I2E-Bench also assesses robustness to instruction phrasing:
Models using LLM-based editing show markedly smaller score variation across phrasings; conventional models often exceed 30% variation, most notably in Object Removal.
Category sensitivity reveals highest performance in Scenery and Global edits (mean S ≈ 0.55) and lower in local object edits (Animal, Human, mean S ≈ 0.40).
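One plausible way to quantify the phrasing-robustness figures above is the relative score change between the two instruction forms; the exact formula is not specified in this summary, so the definition below is an assumption:

```python
def phrasing_variation(score_original: float, score_diverse: float) -> float:
    """Relative score variation between original and diverse phrasings, in percent.

    Assumed definition: |S_orig - S_div| / S_orig * 100.
    """
    return abs(score_original - score_diverse) / score_original * 100.0
```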
6. Comparative Results and Actionable Insights
Test results reveal differentiated strengths among state-of-the-art systems:
- InstructDiffusion excels on four low-level dimensions (Rain, Haze, Watermark, Snow; e.g., vs $0.606$).
- MGIE leads MagicBrush in Deblurring.
- MagicBrush dominates Region Accuracy, Background Replacement, and Color Alteration.
- InstructAny2Pix leads in Counting and Style Alteration on diverse instructions.
- InstructEdit lags in Object Replacement and Color Alteration.
Qualitative analysis (paper Fig. 2) shows consistently reduced edit spill-over in MagicBrush and artifact-prone outputs from InstructPix2Pix.
7. Practical Workflow and Benchmark Extension
Users integrate their model outputs into the standardized evaluation pipeline by:
- Cloning the repository;
- Organizing outputs per instruction type/dimension;
- Ensuring source images and instruction JSON presence;
- Running the provided Python evaluation script, which leverages OpenAI API for GPT-4V queries, computes SSIM/CLIP scores, and outputs results in CSV format.
This process enables rigorous, reproducible, and fully automated human-aligned evaluation. All dataset components, annotations, and code are open-source (Ma et al., 2024).
8. Significance and Future Directions
I2E-Bench sets a new standard for instruction-based image editing evaluation, integrating multi-dimensional, automatic, and human-aligned protocols at scale. The paradigm supports nuanced analysis across edit types, model architectures, and instruction phrasings. Its actionable insights direct method and dataset design to areas of persistent weakness (e.g., local object edits, instruction sensitivity). The release enables further extension, custom weighting schemes, and cross-benchmark comparability.
Ongoing research avenues include refining perceptual metrics, expanding dataset coverage, integrating more powerful LLMs, and exploring the limits of instruction generalization in visual editing contexts.