Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

Published 5 Apr 2026 in cs.CV, cs.AI, and cs.LG | (2604.04192v2)

Abstract: We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces a comprehensive benchmark, Graphic-Design-Bench, that evaluates AI across 49 design tasks including layout, typography, SVG graphics, template semantics, and animation.
The paper employs detailed component-level annotations from the LICA dataset, using metrics such as mIoU, SSIM, and CLIPScore to reveal critical performance deficits in spatial reasoning and typographic fidelity.
The paper demonstrates that current multimodal models perform poorly on structured design challenges, underscoring the need for specialized pretraining and domain-calibrated evaluation frameworks.

Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

Introduction and Motivation

The "Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks" systematically addresses the deficit of structured, design-native evaluation frameworks for AI in the professional graphic design domain. Unlike prior benchmarks focused on natural-image understanding or generic text-to-image generation, GraphicDesignBench (GDB) targets the high-dimensional setting of real-world layered design, accounting for composite layouts, typographic fidelity, structured vector graphics, and animation—all of which embody non-trivial, multiscale constraints that conventional metrics and datasets fail to capture.

Benchmark Design and Structure

GDB is constructed on the LICA layered-composition dataset, which provides component-level annotations, including bounding boxes, type, z-order, typography specs, vector metadata, and animation parameters. This enables explicit evaluation of both perception and generation tasks in a manner that is fundamentally inaccessible with flat raster datasets, thereby facilitating tasks such as partial layout completion, typography extraction, style-consistent template variation, and temporally structured video synthesis.

Figure 1: LICA samples from the core benchmark set illustrate the diversity of layouts, component hierarchies, and task-relevant annotations present in the corpus.

GDB consists of 49 tasks across five axes: layout, typography, SVG vector (infographics), template/semantics, and animation, each assessed under understanding and generation paradigms, yielding both domain breadth and task granularity. Metrics are selected from a suite of design-specific quantitative measures including mIoU, SSIM, CLIPScore, OCR, JSON/SVG validity, and human-aligned preference models (NIMA, HPSv3, M-Judge), as well as new diagnostics for compositional fidelity.

Figure 2: Example design templates illustrating the spectrum of layout properties—aspect ratio diversity, frame/crop configuration, spatial complexity, and decorative treatments.

Comparative Evaluation of Frontier Models

A spectrum of multimodal LLMs and generative backbones (Gemini-3.1, GPT-5.4/1.5, Claude-Opus-4.6, Sora, Veo) are used to baseline GDB. Each model is evaluated for both perception (e.g., type/class/position recognition, intent classification) and synthesis (e.g., intent-to-layout generation, style completion, Lottie/SVG code generation) under standardized prompt and API-driven settings.

Key findings:

Substantial performance deficits: Across nearly all fine-grained design tasks, state-of-the-art models produce outputs far from usable performance. For example, in layout component detection, the leading model achieves only $6.4\%$ [email protected]—orders of magnitude below natural-image detection baselines. Typography tasks, such as font family classification, reach only $23.7\%$ top-1 accuracy across 167 classes, with highly skewed macro-F1, while text color $\Delta E$ ranges up to 52 units on weak models.
Task stratification: High-level semantic tasks (e.g., coarse template categorization with constrained vocabulary) are in the "partially solved" regime (>70% accuracy). However, as tasks require compositional or structure-dependent outputs (multi-element completion, layered inpainting, animation keyframe ordering), performance quickly degrades.
Intersectional limitations: Tasks requiring precise spatial grounding, typographic recovery, SVG code synthesis, or temporal structure (animation) universally expose bottlenecks. LLMs display systematic errors: element overcounting by orders of magnitude, consistent confusion in geometric reasoning, and failure to confine edits to prescribed regions.

Figure 3: Representative failure cases for layout understanding—including aspect ratio misclassification, severe overcounting, type collapse.

Figure 4: Layer order prediction failure—incorrect z-ordering that disrupts layout usability.

Figure 5: Partial layout completion—models fail at multiple-object placement, ignoring contextual cropping and introducing unnatural object positioning.

For generation tasks, such as intent-to-layout, high-level metrics (CLIP, PickScore, ImageReward) can mask critical deficiencies in text fidelity, hierarchy, and usability, emphasizing the necessity for domain-calibrated ground truth and human preference circuits. Similarly, in aspect-ratio retargeting, models diverge markedly in asset recall, hallucination rates, and text preservation, demonstrating complementary but incomplete strengths.

Domain-Specific Results

Layout and Spatial Reasoning

Despite advances in multi-modal perception, current AI models cannot reliably parse or synthesize structured layouts. Tasks probing component localization, stacking, frame/crop inference, and multi-aspect adaptation reveal unresolved gaps, especially in multi-element and structurally entangled contexts.

Typography

Extraction and rendering of typographically faithful text remain unsolved in the majority of sub-tasks. Fine-grained fonts, colors, weights, alignments, and styled spans are not comprehensively recoverable. Styled text generation in layout-constrained regions is particularly fraught: models frequently spill beyond mask boundaries, hallucinate or reflow text content, or modify nearby non-target elements, making the required precision for real-world editing unattainable.

Figure 6: Visualization of typography attributes—font family, size, weight, color, alignment, spacing, and rotation.

Figure 7: Typography task failures—misclassification of font category, color inversion, and non-detection of curved text.

Figure 8: Text parameter misprediction: incorrect sizing and placement lead to visual misalignments and composition defects.

Infographics (SVG/Lottie) and Structured Code Generation

SVG understanding and editing tasks show that models are more consistent in perceptual and semantic Q/A from code (up to 93.7% accuracy on semantic questions), but bug fixing and multi-operation edits encounter non-trivial error rates and output invalidity. In SVG and Lottie generation, compositional fidelity is strongly input-modality dependent; text-to-SVG yields schematic but not pixel-faithful results, while image-to-SVG improves up to 0.918 SSIM (GPT-5.4), but with persistent geometric/gradient inaccuracies. Lottie generation exposes severe breakdowns in animation structure, compositional layering, and sequence control.

Figure 9: Text-to-SVG and image-to-SVG generation examples exhibit failures in background assignment and artifact introduction, even in the presence of explicit textual or visual specification.

Figure 10: Image-to-SVG across element types; even top-performing models miss central features (holes, labels) and have difficulty with gradient mapping.

Figure 11: Lottie generation incapacity—none of the evaluated models reproduce animation structure or layer choreography correctly.

Template Semantics

Model performance in free-text user intent assignment saturates near the practical ceiling of current annotation granularity. However, template variant understanding exposes a preference for superficial (font similarity, palette) over deeper structural alignment—non-LLM feature-based baselines perform competitively in clustering and retrieval, exceeding LLMs in some cases. Category classification is highly sensitive to label constraint; unconstrained, models fail due to label aliasing rather than core perceptual deficits.

Figure 12: Open-vocabulary classification exposes label aliasing challenges.

Figure 13: Task structure for template variant understanding—matching, ranking, clustering across groups.

Figure 14: Failure cases where shared palettes or surface content dominate over core structure.

Animation and Video

Temporal tasks (keyframe ordering, motion type classification, interval estimation, specification-grounded animation synthesis) are uniformly in the "unsolved" regime. Models outperform random guessing, but exact matches and per-component accuracy remain weak, especially in multi-element scenes. Even with explicit parameterization and a unique ground-truth association for each component, grounding and motion individuality remain unachievable.

Figure 15: Sample overview of canonical animation motion types.

Implications and Forward Directions

GDB’s results delineate the boundary between current AI capabilities and the precise, context-dependent structure required for design collaboration. The research makes several claims grounded in large-scale empirical analysis:

Current multimodal LLMs exhibit critical limitations in spatial reasoning, typographic fidelity, structure-aware synthesis, and compositional animation control that are not apparent on standard vision-language or image-generation benchmarks.
Performance in high-level semantic tasks (template/intent recognition, open-vocab tasks) saturate under label-constrained regimes but do not translate into precise, actionable generation or editing competence.
Effective, extensible design benchmarks must incorporate structured, domain-aligned metrics rather than solely relying on FID/CLIPScore or pixelwise proxies, and must explicitly report failure under regime shifts (e.g., composition complexity, style diversification, temporal structure).

GDB is constructed as a reproducible, extensible evaluation framework. This structure is designed to facilitate robust assessment across both closed- and open-source models, stimulate the creation of specialized design task training curricula, and prioritize directionality in capability transfer (e.g., advancing from typography understanding to multi-element layout generation). Substantial progress requires design-specialized pre-training, high-fidelity context encoding, hierarchical structural supervision, and richer input/output interface design (structured tokens, functional APIs, etc.).

Conclusion

GDB establishes a comprehensive, multidimensional testbed that rigorously exposes the structural, perceptual, and compositional requirements for AI collaboration in professional graphic design. Only 2 out of 49 tasks are characterized as mostly solved; the majority expose broad, critical capability gaps unobservable in existing image or VQA benchmarks. Future directions must address both modeling (structural pretraining, compositional reasoning architectures) and evaluation (domain-expert human judgment, actionable error taxonomy, open-source baselines) dimensions. Bridging these gaps is essential for reliable, designer-facing AI integration in practical visual communication workflows.

Markdown Report Issue