GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

Published 5 Apr 2026 in cs.CV and cs.AI | (2604.04172v1)

Abstract: In many science papers, "Figure 1" serves as the primary visual summary of the core research idea. These figures are visually simple yet conceptually rich, often requiring significant effort and iteration by human authors to get right, highlighting the difficulty of science visual communication. With this intuition, we introduce GENFIG1, a benchmark for generative AI models (e.g., Vision-LLMs). GENFIG1 evaluates models for their ability to produce figures that clearly express and motivate the central idea of a paper (title, abstract, introduction, and figure caption) as input. Solving GENFIG1 requires more than producing visually appealing graphics: the task entails reasoning for text-to-image generation that couples scientific understanding with visual synthesis. Specifically, models must (i) comprehend and grasp the technical concepts of the paper, (ii) identify the most salient ones, and (iii) design a coherent and aesthetically effective graphic that conveys those concepts visually and is faithful to the input. We curate the benchmark from papers published at top deep-learning conferences, apply stringent quality control, and introduce an automatic evaluation metric that correlates well with expert human judgments. We evaluate a suite of representative models on GENFIG1 and demonstrate that the task presents significant challenges, even for the best-performing systems. We hope this benchmark serves as a foundation for future progress in multimodal AI.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces a benchmark (GenFig1) for visual summarization in scientific figure generation using a dataset of nearly 10,000 scholarly figure-text tuples.
The paper evaluates multiple generation paradigms including zero-shot prompting, Chain-of-Thought, and SVG synthesis, demonstrating significant discrepancies between human and model outputs.
The paper highlights that current vision-language models fail to capture detailed compositional and spatial nuances, urging the need for enhanced multimodal reasoning capabilities.

GenFig1: A Benchmark for Visual Summarization in Scientific Figure Generation

Introduction

The GenFig1 benchmark (2604.04172) introduces a rigorous, large-scale dataset and evaluation protocol for assessing the scientific reasoning and visual synthesis capabilities of vision-LLMs (VLMs). Unlike traditional text-to-image platforms designed for general visual domains, the GenFig1 task targets a high-value, unconstrained subproblem: generating the archetypal "Figure 1" from a scholarly manuscript, given the title, abstract, introduction, and the figure caption. These Figure 1s function as distilled graphical abstracts, concise overviews, or exemplars of the central technical contribution. The generation of such figures requires models to perform genuine scientific abstraction and selective visual rendering, a task that demands both multimodal and domain-specific intelligence.

Figure 1: Representative instances from GenFig1, including original human-created figures, VLM zero-shot generations, Chain-of-Thought, SVG pipelines, and multi-image assembly; task performance is highly dependent on conceptual understanding and layout fidelity.

Dataset Construction and Taxonomy

GenFig1 comprises nearly 10,000 tuples from peer-reviewed publications in the top AI, ML, CV, and NLP venues (2020–2025). Each tuple consists of the full Figure 1, the title, abstract, introduction, and figure caption, allowing the evaluation of full-document summarization to diagrammatic synthesis. The dataset creation pipeline includes extensive source collection, canonical mapping to arXiv LaTeX sources, abstract/introduction/caption/figure extraction, and stringent post-hoc filtering. In particular, quantitative result figures (charts, metrics, etc.) are excluded in favor of visual blueprints or technical exemplars—tasks that require abstract technical reasoning rather than straightforward mapping from results.

Taxonomic analysis, derived from manual annotation over a representative sample, reveals three main categories: Overview (with Model Architecture and Method as subtypes), Example (Background, Method, Dataset), and Experimental Results (filtered out for the generative task). Distributional statistics indicate that Example—Method and Example—Background dominate, while Overview diagrams (pipelines, architectures) are also prevalent.

Figure 2: Extracted taxonomy of Figure 1s, showing the predominance of Method and Background exemplars, with architectural and conceptual overviews forming a secondary cluster.

Figure 3: Visual exemplars illustrating the defined taxonomy classes, highlighting diversity across domains.

The dataset analysis shows significant style overlap between venues, with computer vision conferences forming a marginally distinct cluster in embedding space, but overall substantial commonality in figure design conventions.

Figure 4: CLIP-based visual embeddings of figures, projected by UMAP, with minor clustering by venue/field but large shared latent structure.

Baselines and Evaluation

The study rigorously benchmarks multiple generation paradigms:

Zero-shot image generation: Direct prompt-to-image mapping using a VLM (GPT-Image-1).
Chain-of-Thought (CoT): Decomposition into taxonomy identification, layout reasoning, and region specification before final rendering.
Chain-of-Images (CoI): Region-wise sequential generation, with assembly after individual region rendering.
Text-to-SVG and CoT-SVG: SVG code synthesis (GPT-4o mini), both direct and hierarchical.

Diffusion and TikZ code-based pipelines are also considered but found nonfunctional/unsupported at scale.

Automatic evaluation leverages both reference-based (DINOv2 visual embedding similarity) and reference-free (VLM-as-a-Judge, text-rich catastrophic neglect) metrics. The VLM-as-a-Judge rubric scores clarity, faithfulness, information density, interestingness, aesthetics, and legibility using GPT-4.1, with strong correlation to expert human rankings.

Empirical Analysis

Strong and consistent results emerge:

Human vs. Model Gap: All automatic and VLM-based pipelines underperform humans across all metrics—e.g., faithfulness (humans: 99.5, best model: 85.2), clarity, and legibility. The mean reciprocal rank (MRR) for human figures (0.787) substantially outpaces VLM generations (e.g., zero-shot: 0.453).
Image-based vs. SVG-based: Text-to-image approaches significantly outperform SVG code synthesis in information density, faithfulness, and overall human preference.
CoT/CoI vs. Zero-shot: Contrary to typical LLM reasoning, the zero-shot baseline yields higher scores than decomposition pipelines; the added complexity does not improve alignment, likely due to propagation of layout errors and compounding visual dissonance.
Metric Validity: Among automatic metrics, VLM-as-a-Judge shows the highest correlation with human judgments (Spearman’s $\rho$ up to 0.79 for info density), whereas embedding similarity is only moderately correlated.

Failure analysis reveals that VLMs, while sometimes capturing broad technical themes, often mishandle spatial composition, truncate key content, or misalign visual semantics with the context.

Figure 5: Examples of human and VLM-generated figures, illustrating superior compositionality and technical faithfulness by humans even under similar input constraints.

Implications and Future Directions

GenFig1 provides evidence that controllable, abstract, and scientifically faithful figure generation lies outside the capabilities of current state-of-the-art VLMs and hybrid architectures. The inability of multi-step reasoning (CoT, region-wise assembly) to outperform direct zero-shot prompting indicates intrinsic limitations—not simply insufficient prompt engineering or inadequate sample diversity, but lack of deep text-to-diagram translation and abstraction capabilities.

This result grounds the hypothesis that scientific visual communication demands a higher-order multimodal reasoning than general photorealistic image synthesis. Furthermore, the dataset and evaluation protocol expose weaknesses in current evaluation metrics for multimodal scientific reasoning, underscoring the importance of VLM-as-a-Judge approaches and emphasizing the need for robust, human-aligned metrics.

From a practical perspective, GenFig1 establishes a precise, reproducible benchmark and a canonical, high-quality testbed for generative AI methods targeting scientific communication. The explicit disaggregation of the figure taxonomy can seed future development of conditionally controllable or domain-adaptive figure generation models. The formalization of automatic metrics and their correlation with expert preference enable scalable auditing and comparison of future advances.

Theoretically, the documented gap between human and machine performance invites investigation into new architectures for multimodal abstraction, compositional spatial parsing, and the explicit representation of mathematical/algorithmic relationships. The failure of SVG and code-based synthesis relative to pixel-based image generation suggests that domain-specific priors or hybrid symbolic-connectionist pipelines may be necessary for progress.

Conclusion

GenFig1 positions itself as the de facto benchmark for visual scientific summarization. By systematizing the generation and evaluation of Figure 1 as a challenge for vision-LLMs, it reveals both the current limitations and the fundamental requirements for future scientific AI. The benchmark, the associated metrics, and the empirical findings guide the development of next-generation models—those that aspire not only to produce plausible images but to genuinely translate technical complexity into visual clarity and conceptual fidelity.

Markdown Report Issue