Measuring and evaluating design AI systems

Develop principled methodologies for measuring and evaluating graphic design AI systems, addressing open questions about how design capabilities should be quantified and compared across tasks and models.

Background

In the Discussion section, the authors reflect on the limitations of the current benchmark, arguing that these limitations point to open problems in measurement and evaluation for design AI.

They list concrete gaps: the insufficiency of pixel-level metrics, the lack of large-scale human (designer) evaluation, the absence of open-source baselines, the difficulty of evaluating near-zero performance regimes, and incomplete coverage of diverse design contexts. Together these gaps motivate the need for better methodologies.
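
To make the first gap concrete, here is a minimal, self-contained sketch of why raw pixel-level metrics are a poor proxy for design quality. Everything in it is illustrative; the canvas size, pixel values, and the mse helper are assumptions for this sketch, not details from the benchmark. It shows that a 3-pixel shift of a design element, which a viewer would barely notice, scores roughly three times worse under mean squared error than a clearly visible color change.

    import numpy as np

    def mse(a: np.ndarray, b: np.ndarray) -> float:
        """Mean squared error between two same-sized RGB renders."""
        return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

    # Synthetic 200x200 "design": white canvas with a dark text-like block.
    reference = np.full((200, 200, 3), 255, dtype=np.uint8)
    reference[80:120, 40:160] = 30

    # Variant A: the identical design shifted right by 3 px (visually negligible).
    shifted = np.roll(reference, shift=3, axis=1)

    # Variant B: same layout, but the block lightened from 30 to 60 (clearly visible).
    recolored = reference.copy()
    recolored[80:120, 40:160] = 60

    print(f"MSE vs. 3 px shift (imperceptible): {mse(reference, shifted):.1f}")
    print(f"MSE vs. recolor (visible):          {mse(reference, recolored):.1f}")
    # The imperceptible shift scores ~304, about 3x worse than the clearly
    # visible recolor (~108): the pixel metric inverts human judgment.

A metric aligned with design quality would rank these two variants the other way around, which is why the gaps above point toward perceptual, structural, or human-grounded evaluation rather than pixel comparison alone.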

References

Despite its breadth, GDB surfaces several limitations that point to open problems in how design AI should be measured and evaluated.

Graphic-Design-Bench: A Comprehensive Benchmark for Evaluating AI on Graphic Design Tasks (2604.04192 - Deganutti et al., 5 Apr 2026) in Discussion, Section 7 (Evaluation Gaps)