
Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks

Published 5 Apr 2026 in cs.CV, cs.AI, and cs.LG | (2604.04192v2)

Abstract: We introduce GraphicDesignBench (GDB), the first comprehensive benchmark suite designed specifically to evaluate AI models on the full breadth of professional graphic design tasks. Unlike existing benchmarks that focus on natural-image understanding or generic text-to-image synthesis, GDB targets the unique challenges of professional design work: translating communicative intent into structured layouts, rendering typographically faithful text, manipulating layered compositions, producing valid vector graphics, and reasoning about animation. The suite comprises 50 tasks organized along five axes: layout, typography, infographics, template & design semantics and animation, each evaluated under both understanding and generation settings, and grounded in real-world design templates drawn from the LICA layered-composition dataset. We evaluate a set of frontier closed-source models using a standardized metric taxonomy covering spatial accuracy, perceptual quality, text fidelity, semantic alignment, and structural validity. Our results reveal that current models fall short on the core challenges of professional design: spatial reasoning over complex layouts, faithful vector code generation, fine-grained typographic perception, and temporal decomposition of animations remain largely unsolved. While high-level semantic understanding is within reach, the gap widens sharply as tasks demand precision, structure, and compositional awareness. GDB provides a rigorous, reproducible testbed for tracking progress toward AI systems that can function as capable design collaborators. The full evaluation framework is publicly available.

Summary

  • The paper introduces a comprehensive benchmark, Graphic-Design-Bench, that evaluates AI across 49 design tasks including layout, typography, SVG graphics, template semantics, and animation.
  • The paper employs detailed component-level annotations from the LICA dataset, using metrics such as mIoU, SSIM, and CLIPScore to reveal critical performance deficits in spatial reasoning and typographic fidelity.
  • The paper demonstrates that current multimodal models perform poorly on structured design challenges, underscoring the need for specialized pretraining and domain-calibrated evaluation frameworks.

Introduction and Motivation

The paper "Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks" addresses the lack of structured, design-native evaluation frameworks for AI in professional graphic design. Unlike prior benchmarks focused on natural-image understanding or generic text-to-image generation, GraphicDesignBench (GDB) targets real-world layered design: composite layouts, typographic fidelity, structured vector graphics, and animation, all of which impose non-trivial, multiscale constraints that conventional metrics and datasets fail to capture.

Benchmark Design and Structure

GDB is constructed on the LICA layered-composition dataset, which provides component-level annotations, including bounding boxes, type, z-order, typography specs, vector metadata, and animation parameters. This enables explicit evaluation of both perception and generation tasks in a way that flat raster datasets cannot support, and it makes possible tasks such as partial layout completion, typography extraction, style-consistent template variation, and temporally structured video synthesis.

Figure 1: LICA samples from the core benchmark set illustrate the diversity of layouts, component hierarchies, and task-relevant annotations present in the corpus.

GDB consists of 49 tasks across five axes: layout, typography, SVG vector (infographics), template/semantics, and animation, each assessed under understanding and generation paradigms, yielding both domain breadth and task granularity. Metrics are selected from a suite of design-specific quantitative measures including mIoU, SSIM, CLIPScore, OCR, JSON/SVG validity, and human-aligned preference models (NIMA, HPSv3, M-Judge), as well as new diagnostics for compositional fidelity.
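
To make the spatial-accuracy metric concrete, the following is a minimal sketch of a mean IoU computation over bounding boxes. It assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form and index-aligned prediction/ground-truth pairs; the summary does not describe the benchmark's actual matching procedure, so `box_iou` and `mean_iou` are hypothetical helper names rather than GDB's implementation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_boxes, gt_boxes):
    """Average IoU over index-aligned pairs (assumption: pairs are pre-matched)."""
    if len(pred_boxes) != len(gt_boxes):
        raise ValueError("pred_boxes and gt_boxes must have the same length")
    if not gt_boxes:
        return 0.0
    total = sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes))
    return total / len(gt_boxes)
```

A detection-style protocol would additionally need a matching step (for example, greedy or Hungarian assignment) before averaging; that step is omitted here.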

Figure 2: Example design templates illustrating the spectrum of layout properties—aspect ratio diversity, frame/crop configuration, spatial complexity, and decorative treatments.

Comparative Evaluation of Frontier Models

A range of multimodal LLMs and generative backbones (Gemini-3.1, GPT-5.4/1.5, Claude-Opus-4.6, Sora, Veo) is used to baseline GDB. Each model is evaluated on both perception (e.g., type/class/position recognition, intent classification) and synthesis (e.g., intent-to-layout generation, style completion, Lottie/SVG code generation) under standardized prompts and API-driven settings.

Key findings:

  • Substantial performance deficits: Across nearly all fine-grained design tasks, state-of-the-art models produce outputs that are far from usable. For example, in layout component detection, the leading model achieves only 6.4% [email protected], orders of magnitude below natural-image detection baselines. Typography tasks such as font family classification reach only 23.7% top-1 accuracy across 167 classes, with highly skewed macro-F1, while text color ΔE ranges up to 52 units on weak models.
  • Task stratification: High-level semantic tasks (e.g., coarse template categorization with constrained vocabulary) are in the "partially solved" regime (>70% accuracy). However, as tasks require compositional or structure-dependent outputs (multi-element completion, layered inpainting, animation keyframe ordering), performance quickly degrades.
  • Intersectional limitations: Tasks requiring precise spatial grounding, typographic recovery, SVG code synthesis, or temporal structure (animation) universally expose bottlenecks. LLMs display systematic errors: element overcounting by orders of magnitude, consistent confusion in geometric reasoning, and failure to confine edits to prescribed regions.
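
For a sense of scale on the text color ΔE figure above, here is a minimal sketch of a color-difference computation. It assumes the CIE76 formula (Euclidean distance in CIELAB) and 8-bit sRGB inputs under a D65 white point; the summary does not say which ΔE variant the benchmark uses, and `srgb_to_lab` and `delta_e76` are hypothetical names.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (assumes a D65 white point)."""
    def to_linear(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (to_linear(c) for c in rgb)
    # Linear sRGB to CIE XYZ (D65).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb_a, rgb_b):
    """CIE76 color difference: Euclidean distance between two colors in CIELAB."""
    return math.dist(srgb_to_lab(rgb_a), srgb_to_lab(rgb_b))
```

Under CIE76, black versus white is a difference of about 100, so an error of 52 units means the predicted text color is very far from the target rather than a near miss.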

Figure 3: Representative failure cases for layout understanding, including aspect ratio misclassification, severe overcounting, and type collapse.

Figure 4: Layer order prediction failure—incorrect z-ordering that disrupts layout usability.

Figure 5: Partial layout completion—models fail at multiple-object placement, ignoring contextual cropping and introducing unnatural object positioning.

For generation tasks, such as intent-to-layout, high-level metrics (CLIP, PickScore, ImageReward) can mask critical deficiencies in text fidelity, hierarchy, and usability, emphasizing the necessity for domain-calibrated ground truth and human preference circuits. Similarly, in aspect-ratio retargeting, models diverge markedly in asset recall, hallucination rates, and text preservation, demonstrating complementary but incomplete strengths.
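
One way to surface the text-fidelity failures that image-level scores can hide is a character error rate between the text read back from a generated design and the intended copy. The sketch below assumes the OCR output is already available as a string and uses plain Levenshtein distance; it illustrates the idea and is not the benchmark's own OCR metric. `edit_distance` and `char_error_rate` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                   # delete ch_a
                cur[j - 1] + 1,                # insert ch_b
                prev[j - 1] + (ch_a != ch_b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text, target_text):
    """Edit distance normalized by the target length; 0.0 means a perfect match."""
    if not target_text:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, target_text) / len(target_text)
```

A single misrendered glyph in a short headline, such as "SALE 5O% OFF" for "SALE 50% OFF", yields a nonzero error rate even though an embedding-based score may barely change.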

Domain-Specific Results

Layout and Spatial Reasoning

Despite advances in multi-modal perception, current AI models cannot reliably parse or synthesize structured layouts. Tasks probing component localization, stacking, frame/crop inference, and multi-aspect adaptation reveal unresolved gaps, especially in multi-element and structurally entangled contexts.

Typography

Extraction and rendering of typographically faithful text remain unsolved in the majority of sub-tasks. Fine-grained fonts, colors, weights, alignments, and styled spans are not comprehensively recoverable. Styled text generation in layout-constrained regions is particularly fraught: models frequently spill beyond mask boundaries, hallucinate or reflow text content, or modify nearby non-target elements, making the required precision for real-world editing unattainable.

Figure 6: Visualization of typography attributes—font family, size, weight, color, alignment, spacing, and rotation.

Figure 7: Typography task failures—misclassification of font category, color inversion, and non-detection of curved text.

Figure 8: Text parameter misprediction: incorrect sizing and placement lead to visual misalignments and composition defects.

Infographics (SVG/Lottie) and Structured Code Generation

SVG understanding and editing tasks show that models are more consistent in perceptual and semantic Q/A from code (up to 93.7% accuracy on semantic questions), but bug fixing and multi-operation edits encounter non-trivial error rates and output invalidity. In SVG and Lottie generation, compositional fidelity is strongly input-modality dependent; text-to-SVG yields schematic but not pixel-faithful results, while image-to-SVG improves up to 0.918 SSIM (GPT-5.4), but with persistent geometric/gradient inaccuracies. Lottie generation exposes severe breakdowns in animation structure, compositional layering, and sequence control.
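
As an illustration of what a structural-validity check on generated SVG might look like, here is a minimal sketch that tests only XML well-formedness and the root element. This is an assumed, deliberately weak definition of validity; the benchmark's actual validity criteria are not described in this summary, and `is_well_formed_svg` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SVG_NAMESPACE = "http://www.w3.org/2000/svg"


def is_well_formed_svg(svg_text):
    """Return True if the text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError:
        return False
    # Accept both a namespaced and a bare root tag.
    return root.tag in ("svg", f"{{{SVG_NAMESPACE}}}svg")


good = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="4" height="4"/></svg>'
broken = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="4"></svg>'
print(is_well_formed_svg(good), is_well_formed_svg(broken))
```

A check like this catches unclosed tags and truncated model output, but it says nothing about whether the drawing renders as intended, which is why perceptual measures such as SSIM are reported alongside it.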

Figure 9: Text-to-SVG and image-to-SVG generation examples exhibit failures in background assignment and artifact introduction, even in the presence of explicit textual or visual specification.

Figure 10: Image-to-SVG across element types; even top-performing models miss central features (holes, labels) and have difficulty with gradient mapping.

Figure 11: Lottie generation incapacity—none of the evaluated models reproduce animation structure or layer choreography correctly.

Template Semantics

Model performance in free-text user intent assignment saturates near the practical ceiling of current annotation granularity. However, template variant understanding exposes a preference for superficial cues (font similarity, palette) over deeper structural alignment: non-LLM feature-based baselines perform competitively in clustering and retrieval, exceeding LLMs in some cases. Category classification is highly sensitive to label constraint; unconstrained, models fail due to label aliasing rather than core perceptual deficits.

Figure 12: Open-vocabulary classification exposes label aliasing challenges.

Figure 13: Task structure for template variant understanding—matching, ranking, clustering across groups.

Figure 14: Failure cases where shared palettes or surface content dominate over core structure.

Animation and Video

Temporal tasks (keyframe ordering, motion type classification, interval estimation, specification-grounded animation synthesis) are uniformly in the "unsolved" regime. Models outperform random guessing, but exact matches and per-component accuracy remain weak, especially in multi-element scenes. Even with explicit parameterization and a unique ground-truth association for each component, grounding and motion individuality remain unachievable.
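
The gap between beating random guessing and achieving exact matches can be illustrated with two simple scores for a keyframe-ordering prediction. The sketch below assumes each keyframe has a unique identifier and that the model returns a full ordering; `exact_match` and `pairwise_order_accuracy` are hypothetical names, and the benchmark's own scoring may differ.

```python
from itertools import combinations


def exact_match(predicted, ground_truth):
    """True only if the whole predicted sequence equals the reference sequence."""
    return list(predicted) == list(ground_truth)


def pairwise_order_accuracy(predicted, ground_truth):
    """Fraction of reference pairs (a before b) that the prediction keeps in order."""
    position = {item: index for index, item in enumerate(predicted)}
    pairs = list(combinations(ground_truth, 2))
    if not pairs:
        return 1.0
    correct = sum(
        1
        for a, b in pairs
        # A pair with a missing item counts as incorrect.
        if a in position and b in position and position[a] < position[b]
    )
    return correct / len(pairs)
```

With four keyframes, swapping the last two gives a pairwise accuracy of 5/6 while the exact-match score is already zero, which is consistent with models that look better than chance yet rarely produce a fully correct sequence.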

Figure 15: Sample overview of canonical animation motion types.

Implications and Forward Directions

GDB’s results delineate the boundary between current AI capabilities and the precise, context-dependent structure required for design collaboration. The research makes several claims grounded in large-scale empirical analysis:

  • Current multimodal LLMs exhibit critical limitations in spatial reasoning, typographic fidelity, structure-aware synthesis, and compositional animation control that are not apparent on standard vision-language or image-generation benchmarks.
  • Performance in high-level semantic tasks (template/intent recognition, open-vocab tasks) saturates under label-constrained regimes but does not translate into precise, actionable generation or editing competence.
  • Effective, extensible design benchmarks must incorporate structured, domain-aligned metrics rather than solely relying on FID/CLIPScore or pixelwise proxies, and must explicitly report failure under regime shifts (e.g., composition complexity, style diversification, temporal structure).

GDB is constructed as a reproducible, extensible evaluation framework. This structure is designed to facilitate robust assessment across both closed- and open-source models, stimulate the creation of specialized design task training curricula, and prioritize directionality in capability transfer (e.g., advancing from typography understanding to multi-element layout generation). Substantial progress requires design-specialized pre-training, high-fidelity context encoding, hierarchical structural supervision, and richer input/output interface design (structured tokens, functional APIs, etc.).

Conclusion

GDB establishes a comprehensive, multidimensional testbed that rigorously exposes the structural, perceptual, and compositional requirements for AI collaboration in professional graphic design. Only 2 out of 49 tasks are characterized as mostly solved; the majority expose broad, critical capability gaps unobservable in existing image or VQA benchmarks. Future directions must address both modeling (structural pretraining, compositional reasoning architectures) and evaluation (domain-expert human judgment, actionable error taxonomy, open-source baselines) dimensions. Bridging these gaps is essential for reliable, designer-facing AI integration in practical visual communication workflows.
