EditCanvas Benchmark for VLM UI Design

Updated 7 January 2026
  • EditCanvas Benchmark is a large-scale evaluation suite for VLMs, assessing iterative UI replication and modification tasks using Figma workflows.
  • It comprises 3,327 mobile UI screens and 598 design tasks, employing dual-tier metrics for perceptual and component-wise similarity.
  • The benchmark offers insights into tool selection, error analysis, and iterative design, guiding improvements in agentic VLM performance.

The EditCanvas benchmark, also referred to as CANVAS, is a large-scale, task-oriented evaluation suite for measuring the proficiency of vision-language models (VLMs) in tool-based user interface (UI) design. The benchmark focuses on models' abilities to interact with professional design software (specifically Figma) through iterative, context-based tool invocations aimed at replicating or modifying mobile UI screens. EditCanvas establishes quantitative standards for the evaluation of agentic VLMs in practical design workflows, encompassing data, task design, tool-interfacing protocols, formalized metrics, and error analyses (Jeong et al., 25 Nov 2025).

1. Dataset Composition and Structure

EditCanvas is constructed from a collection of 3,327 raw mobile-app UI screens sourced from the Figma Community under a CC BY 4.0 license. These source screens span 30 function-based categories, covering common user-facing scenarios such as onboarding and authentication, home and feed layouts, messaging, search, cart, and checkout screens. The dataset is stratified into 598 tool-based design tasks, subdivided into:

  • 298 design-replication tasks (full-screen reconstruction from an empty canvas).
  • 300 design-modification tasks (targeted updates or insertions to existing screens).

Each screen is annotated with a category label, generated by GPT-4.1-Mini and refined via manual review. For modification tasks, specific features (e.g., rounded corners, insertable buttons, mode-switching between light and dark themes) are explicitly targeted through careful manual sampling.

Ground-truth references consist of parallel pairs of Figma states, each exported as both PNG rasterizations and JSON trees. Task instructions, describing the required modifications or replications, are generated via GPT-4.1-Mini from visual “diff” procedures and manually refined by expert designers. For every design-modification case, annotators specify the minimal set of Figma tools necessary for a valid solution (Jeong et al., 25 Nov 2025).

2. Task Definitions and Tool-Based Workflow

EditCanvas operationalizes the UI design process into two primary task classes:

  1. Design Replication: Given a target screenshot and canvas size, the model sequentially issues tool-invocation commands to reconstruct the target UI screen from an empty starting state ($s_0 = \varnothing$). The episode terminates when the sequence of operations produces a state $s_T$ approximating the provided ground truth $s_{GT}$.
  2. Design Modification: The model is presented with an initial state $s_{old}$ and modifies it into a target state $s_{GT}$ by executing an edit script $\Delta$. This script may consist of attribute updates, component insertions, or color mode changes:
    • Attribute Update: $\Delta_{\mathrm{attr}} = \{ (c_k, a_k, v'_k) \}$, where $c_k$ is a component, $a_k$ the attribute, and $v'_k$ the updated value.
    • Component Insertion: $\Delta_{\mathrm{add}} = \{ (c_k, a_k, v_k) \}$, for new components.
    • Mode Change: $\Delta_{\mathrm{col}} = \{ ( \mathrm{rgb}^{old}_k, \mathrm{rgb}^{new}_k ) \}$, for color-theme conversion.
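
The three edit-script types can be made concrete in code. The following is an illustrative sketch only; the class names, field names, and flat-dictionary state representation are assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical in-memory encoding of the edit-script types Δ_attr, Δ_add,
# and Δ_col described above; the benchmark's own schema may differ.

@dataclass
class AttrUpdate:
    component_id: str   # c_k: the component to modify
    attribute: str      # a_k: e.g. "cornerRadius"
    new_value: object   # v'_k: the updated value

@dataclass
class ComponentInsert:
    component_id: str   # c_k: id of the newly created component
    attribute: str      # a_k
    value: object       # v_k

@dataclass
class ModeChange:
    old_rgb: tuple      # rgb_k^old
    new_rgb: tuple      # rgb_k^new

EditOp = Union[AttrUpdate, ComponentInsert, ModeChange]

def apply_edit_script(state: dict, script: list) -> dict:
    """Apply an edit script Δ to a JSON-tree-like state (flat dict of components)."""
    new_state = {k: dict(v) for k, v in state.items()}  # shallow per-component copy
    for op in script:
        if isinstance(op, AttrUpdate):
            new_state[op.component_id][op.attribute] = op.new_value
        elif isinstance(op, ComponentInsert):
            new_state.setdefault(op.component_id, {})[op.attribute] = op.value
        elif isinstance(op, ModeChange):
            # Theme conversion: remap every matching fill color.
            for comp in new_state.values():
                if comp.get("fill") == op.old_rgb:
                    comp["fill"] = op.new_rgb
    return new_state
```

A modification task's ground truth then corresponds to `apply_edit_script(s_old, Δ)` producing a state equivalent to $s_{GT}$.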

All task execution occurs through a “ReAct” style workflow:

  • The model is prompted with the current screen context.
  • It generates a “Thought” (internal reasoning).
  • Outputs an “Action” as a JSON-encoded tool call (e.g., create_rectangle, set_fill_color).
  • The action is executed via a Figma plugin, yielding a new observation.
  • The process is iterated until model termination.
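
The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `model_call` and `figma_execute` are hypothetical interfaces standing in for the VLM API and the Figma plugin bridge, and the `"terminate"` convention is an assumption; only the tool names follow the examples in the text:

```python
import json

def react_loop(model_call, figma_execute, task_prompt, max_steps=50):
    """Iterate Thought/Action turns until the model terminates or the budget runs out.

    model_call(history) -> dict with "thought" and "action" keys, where "action"
    is a JSON tool call such as {"tool": "create_rectangle", "args": {...}}.
    figma_execute(action) -> observation string from the Figma plugin.
    Both callables are assumptions; the benchmark defines its own protocol.
    """
    history = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        turn = model_call(history)                       # Thought + Action
        history.append({"role": "assistant", "content": json.dumps(turn)})
        if turn["action"].get("tool") == "terminate":    # model ends the episode
            break
        observation = figma_execute(turn["action"])      # run tool via the plugin
        history.append({"role": "user", "content": observation})
    return history
```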

The toolset comprises 50 Figma operations, divided among creation, inspection, layout, operation, style, and text manipulation categories. This setup enables fine-grained evaluation of design reasoning, multi-step planning, and precise tool selection in a realistic software environment (Jeong et al., 25 Nov 2025).

3. Evaluation Metrics

EditCanvas employs a dual-level approach for quantitative assessment: perceptual similarity and component-wise structural matching.

Perceptual Similarity:

  • Feature-level (SSIM): $\mathrm{SIM}_{\mathrm{feat}}(s_t, s_{GT}) = \mathrm{SSIM}(I_t, I_{GT})$, comparing structural similarity between generated and ground-truth images.
  • Pattern-level (saliency correlation): $\mathrm{SIM}_{\mathrm{pat}}(s_t, s_{GT}) = \mathrm{CC}( \Phi(I_t), \Phi(I_{GT}) )$, where $\Phi(\cdot)$ denotes UMSI++ saliency maps, evaluated via Pearson correlation.
  • Object-level (BLIPScore):

$$\mathrm{SIM}_{\mathrm{obj}}(s_t, s_{GT}) = \frac{\psi(I_t) \cdot \psi(I_{GT})}{\|\psi(I_t)\| \, \|\psi(I_{GT})\|},$$

with $\psi(\cdot)$ being BLIP-2 and SentenceTransformer-derived caption embeddings.
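
The object-level score is a plain cosine similarity between caption embeddings. A minimal pure-Python sketch, with toy vectors standing in for the BLIP-2 / SentenceTransformer embeddings $\psi(\cdot)$:

```python
import math

def cosine_similarity(a, b):
    """SIM_obj = (a · b) / (||a|| ||b||), as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for caption embeddings of the generated and ground-truth screens.
psi_gen = [0.2, 0.8, 0.1]
psi_gt  = [0.25, 0.75, 0.05]
score = cosine_similarity(psi_gen, psi_gt)  # near 1.0 for semantically similar captions
```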

Component-wise Similarity:

  • Block matching via bipartite Hungarian assignment: $S_{\mathrm{match}} = \frac{|\mathcal{M}^*|}{|\mathcal{C}_{GT}|}$, where the matched pairs $\mathcal{M}^*$ are defined over IoU (non-text components) or string-plus-position proximity (text).
  • Attribute-wise similarity:
    • Position: $\mathrm{SIM}_{\mathrm{pos}} = \frac{1}{|\mathcal{M}^*|} \sum_{(i,j)\in\mathcal{M}^*} \left( 1 - \frac{\|c_i - c_j\|_2}{\max(d_i, d_j)} \right)$,
    • Color: $\mathrm{SIM}_{\mathrm{col}}$, a Laplacian distance on RGB vectors (for solid fills),
    • Text: $\mathrm{SIM}_{\mathrm{text}}$, the F1 score over matched text content.
  • Aggregate:

$$\mathrm{SIM}_{\mathrm{comp}} = \tfrac{1}{4} \left( S_{\mathrm{match}} + \mathrm{SIM}_{\mathrm{pos}} + \mathrm{SIM}_{\mathrm{col}} + \mathrm{SIM}_{\mathrm{text}} \right)$$
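
The matching and position terms can be sketched as follows. This is an illustration under stated assumptions: brute-force enumeration stands in for the Hungarian algorithm (which solves the same assignment problem efficiently), and the component records with `center` and `diag` (bounding-box diagonal $d_i$) fields are hypothetical:

```python
import itertools
import math

def best_matching(cost):
    """Minimum-cost one-to-one assignment over a square cost matrix.

    Brute force over permutations for illustration only; the benchmark
    uses the Hungarian algorithm for the same bipartite problem.
    """
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best:
            best, best_perm = c, perm
    return [(i, best_perm[i]) for i in range(n)]

def position_similarity(matches, comps_gen, comps_gt):
    """SIM_pos: mean of 1 - ||c_i - c_j|| / max(d_i, d_j) over matched pairs,
    where c is a component's center and d its bounding-box diagonal."""
    total = 0.0
    for i, j in matches:
        (xi, yi), d_i = comps_gen[i]["center"], comps_gen[i]["diag"]
        (xj, yj), d_j = comps_gt[j]["center"], comps_gt[j]["diag"]
        total += 1.0 - math.hypot(xi - xj, yi - yj) / max(d_i, d_j)
    return total / len(matches) if matches else 0.0

def component_similarity(s_match, sim_pos, sim_col, sim_text):
    """SIM_comp: unweighted mean of the four component-level scores."""
    return (s_match + sim_pos + sim_col + sim_text) / 4.0
```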

Metric aggregation is performed across tasks, either as an average over all instances (replication: $\frac{1}{N} \sum_i \mathrm{SIM}_m(s_t^{(i)}, s_{GT}^{(i)})$) or as the similarity change for modification tasks: $\Delta = \mathrm{SIM}(s_{new}, s_{GT}) - \mathrm{SIM}(s_{old}, s_{GT})$.

Human alignment is analyzed through pairwise preference judgments (n=363), confirming that saliency, BLIPScore, and component matching are statistically significant predictors of human-perceived quality (Jeong et al., 25 Nov 2025).

4. Model Performance and Comparative Analysis

EditCanvas benchmarks a spectrum of leading VLMs, including GPT-4o, GPT-4.1, Gemini-2.5-Flash, Gemini-2.5-Pro, Claude-3.5-Sonnet, and others. Key findings are consolidated as follows:

Model               Replication (Comp-wise)   Modification (Comp-wise, Δ)
GPT-4o              0.671 ± 0.087             0.943 (+0.015)
GPT-4.1             0.716 ± 0.075 ★           0.951 (+0.024) ★
Gemini-2.5-Pro      0.694 ± 0.094             0.935 (+0.007)
Gemini-2.5-Flash    0.702 ± 0.100             0.948 (+0.020)
Claude-3.5-Sonnet   0.666 ± 0.089             0.946 (+0.018)

  • Replication: Gemini-2.5-Pro achieves the highest SSIM and saliency; GPT-4.1 is best on semantic (BLIPScore) and component-wise similarity.
  • Modification: GPT-4.1 leads across all metrics; some models exhibit negative $\Delta$ (over-editing that reduces fidelity).

A plausible implication is that different models excel at different signals: Gemini-2.5-Pro at low-level visual accuracy, GPT-4.1 at semantic and logical manipulation. Tool-invocation diversity correlates positively with success in replication tasks (ρ ≈ +0.42), while precision and focused tool selection matter more for targeted modification tasks (ρ ≈ +0.15 for precision, ρ ≈ −0.37 for diversity) (Jeong et al., 25 Nov 2025).

5. Error Modes and Diagnostic Insights

Failure case analysis in EditCanvas reveals the following taxonomy and frequencies:

  • Geometric errors (~34%): Element count mismatches (e.g., missing repeated icons), incorrect path geometry, spatial misalignment.
  • Layout errors (~28%): Misuse of auto-layout (e.g., group disintegration), unintended reflows where child elements exceed parent bounds.
  • Text errors (~22%): undersized text boxes (overflow), erroneous line breaks, and spurious or omitted text nodes.
  • Other errors (~16%): Color mismatches and miscellaneous attribute assignment failures.

Remediation strategies include integrating constraints for explicit element counts (aggregate-primitive tools), enriching the training corpus with auto-layout semantics, and leveraging text-measurement APIs to resolve text box sizing prior to node resizing actions. This suggests that hybrid approaches combining symbolic constraints, model-based planning, and task-informed affordance engineering may offer fruitful directions for mitigating observed deficiencies (Jeong et al., 25 Nov 2025).

6. Significance, Limitations, and Future Directions

EditCanvas represents a rigorous, end-to-end benchmark for evaluating the capabilities of VLMs in agentic, tool-based UI design sequences. Its strengths include:

  • Realistic, professionally curated dataset spanning significant UI diversity and complexity.
  • Comprehensive multi-level metrics reflecting visual, semantic, and structural fidelity.
  • Alignment of automated metrics with human preferences.

Identified limitations include persistent geometric and layout manipulation errors, the challenge of strategic tool selection, and the gap between visual similarity metrics and true user intent. Future work is suggested in several directions:

  • Incorporate explicit reasoning over tool chains (e.g., plan-based or retrieval-augmented policies).
  • Expand domain coverage (e.g., desktop UIs, design systems) and support for complex attribute relations.
  • Advance in-the-loop learning with designer feedback and few-shot scenario adaptation.

EditCanvas stands as a representative protocol for the evaluation of VLMs in practical, iterative design collaborations facilitated by modern tool interfaces (Jeong et al., 25 Nov 2025).
