PaperBananaBench Benchmark
- The paper introduces a large-scale benchmark with 584 curated cases from NeurIPS 2025 to evaluate automated generation of publication-ready academic illustrations.
- It employs a reference-driven protocol using VLMs to assess faithfulness, conciseness, readability, and aesthetics in varied methodology diagram styles.
- Experimental results show significant gains for the PaperBanana framework paired with Nano-Banana-Pro, and highlight tradeoffs between technical precision and visual appeal.
PaperBananaBench is a large-scale, curated benchmark designed to rigorously evaluate the automated generation of publication-ready academic illustrations, primarily methodology diagrams, by state-of-the-art generative frameworks. Introduced alongside the PaperBanana agentic framework, PaperBananaBench consists of 584 cases sampled from NeurIPS 2025 papers, split into 292 reference and 292 test cases. It provides comprehensive coverage of research domains and illustration styles, with a structured evaluation protocol covering faithfulness, conciseness, readability, and aesthetics, using VLMs as adjudicators (Zhu et al., 30 Jan 2026).
1. Benchmark Composition and Curation
PaperBananaBench was constructed via a systematic pipeline applied to 2,000 randomly sampled NeurIPS 2025 PDFs. The process included parsing for methodology text, figures, and captions using MinerU; filtering out papers lacking methodology diagrams (reducing candidates to 1,359); aspect-ratio filtering to keep only diagrams with a width:height ratio in $[1.5, 2.5]$; and categorizing the remainder into four domain categories via a VLM:
- Agent Reasoning
- Vision Perception
- Generative Learning
- Science Applications
Manual human curation ensured extraction accuracy, domain label validation, and exclusion of diagrams deemed overly simplistic, cluttered, or abstract. The final pool of 584 methodology diagrams was equally split into reference and test sets (292 each).
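As a minimal sketch of the curation mechanics (the data class, thresholds, and seeded split are illustrative; the real pipeline relies on MinerU parsing, VLM classification, and manual review), the aspect-ratio filter and the even reference/test split might look like:

```python
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    paper_id: str
    width: int
    height: int
    domain: str  # assigned by a VLM in the actual pipeline

def aspect_ratio_ok(c: Candidate, lo: float = 1.5, hi: float = 2.5) -> bool:
    """Keep only diagrams whose width:height ratio lies in [lo, hi]."""
    return lo <= c.width / c.height <= hi

def split_reference_test(pool, seed=0):
    """Shuffle the curated pool and split it evenly into reference/test halves."""
    pool = list(pool)
    random.Random(seed).shuffle(pool)
    mid = len(pool) // 2
    return pool[:mid], pool[mid:]

candidates = [
    Candidate("p1", 1200, 600, "Agent Reasoning"),      # ratio 2.0 -> kept
    Candidate("p2", 800, 800, "Vision Perception"),     # ratio 1.0 -> dropped
    Candidate("p3", 1000, 500, "Generative Learning"),  # ratio 2.0 -> kept
    Candidate("p4", 900, 300, "Science Applications"),  # ratio 3.0 -> dropped
]
kept = [c for c in candidates if aspect_ratio_ok(c)]
ref, test = split_reference_test(kept)
```

The seeded shuffle keeps the split reproducible, mirroring the benchmark's fixed 292/292 division.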
Illustration style diversity spans schematic pipeline diagrams (rounded boxes, pastel fills) to detailed agent “cartoon” motifs and volumetric 3D-stack icon diagrams. The benchmark’s appendix introduces a synthesized “Method Diagram Aesthetics Guide,” summarizing acceptable color palettes, shape and typography rules, and domain-specific iconography requirements.
2. Evaluation Protocol and Metrics
The primary formulation is expressed as:
- $I = \mathcal{G}(T, c)$ for baseline illustration synthesis, where $T$ denotes the methodology text and $c$ the caption;
- $I = \mathcal{G}(T, c, \{R_k\})$, with $\{R_k\}$ the reference examples, for reference-driven illustration.
Scoring spans four dimensions:
- Faithfulness: Module alignment with the method section/caption, absence of hallucinated or contradictory elements.
- Conciseness: Maximized visual signal-to-noise ratio; fewer than 15 words per box and no raw equation dumps.
- Readability: Clear navigation flow (left-right/top-down), legible text, no overlap or excessive arrows.
- Aesthetics: Professional visual polish in line with top AI conference standards; avoidance of artifact-laden rendering, non-standard colors, or black backgrounds.
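The Conciseness constraint lends itself to a mechanical pre-check before any VLM judging; a minimal sketch, assuming box texts have already been extracted from the diagram:

```python
def violates_conciseness(box_texts, max_words=15):
    """Return the texts of diagram boxes that reach or exceed max_words,
    i.e. break the fewer-than-15-words-per-box rule."""
    return [t for t in box_texts if len(t.split()) >= max_words]

boxes = [
    "Multi-head cross-attention fusion",
    "This box spells out every single implementation detail of the encoder, "
    "the decoder, and the training schedule in one long run-on sentence",
]
offenders = violates_conciseness(boxes)  # only the second box is flagged
```

Rules like legibility and arrow-overlap (Readability) have no such cheap proxy, which is why the protocol delegates them to a VLM judge.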
Each test case is judged side-by-side against the human reference by Gemini-3-Pro (a VLM). Scores are assigned per dimension (model wins: 100, human wins: 0, tie: 50), and outcomes are aggregated with a hierarchical protocol that prioritizes Faithfulness and Readability. Inter-model agreement (Kendall's $\tau$) and human-alignment statistics confirm moderate concordance (Faithfulness up to $0.51$, Conciseness up to $0.60$).
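The exact hierarchical aggregation is not fully specified here; one plausible reading, sketched below under that assumption, lets the first non-tied priority dimension decide and falls back to the mean of the remaining dimensions:

```python
from typing import Dict, Optional

PRIORITY = ("Faithfulness", "Readability")   # prioritized per the protocol
SECONDARY = ("Conciseness", "Aesthetics")

def verdict_score(model_wins: Optional[bool]) -> int:
    """Map a side-by-side VLM verdict to the scoring convention:
    model wins -> 100, human reference wins -> 0, tie (None) -> 50."""
    if model_wins is None:
        return 50
    return 100 if model_wins else 0

def aggregate(scores: Dict[str, int]) -> int:
    """Assumed hierarchy: the first non-tied priority dimension decides;
    otherwise average the secondary dimensions."""
    for dim in PRIORITY:
        if scores[dim] != 50:
            return scores[dim]
    rest = [scores[d] for d in SECONDARY]
    return round(sum(rest) / len(rest))
```

Under this convention a per-case score of 50 means parity with the human reference, which is why the human row in the results table sits at 50.0 across the board.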
3. Domain Distribution and Performance Analysis
Performance-weighted overall scores for the principal domain categories illustrate a clear gradient:
| Domain | Overall Score (%) |
|---|---|
| Agent Reasoning | 69.9 |
| Science Applications | 58.8 |
| Generative Learning | 57.0 |
| Vision Perception | 52.1 |
Main experiment results in Table 1 (means over the 292 test cases) show that PaperBanana with Nano-Banana-Pro achieves the highest overall score (60.2), surpassing the human reference (50.0) and all baseline frameworks, with especially large improvements over zero-shot Nano-Banana-Pro in Conciseness and Aesthetics.
| Method | Faithfulness | Conciseness | Readability | Aesthetics | Overall |
|---|---|---|---|---|---|
| Nano-Banana-Pro (zero-shot) | 43.0 | 43.5 | 38.5 | 65.5 | 43.2 |
| PaperBanana + NBP | 45.8 | 80.7 | 51.4 | 72.1 | 60.2 |
| PaperBanana + GPT-Image | 16.0 | 65.0 | 33.0 | 56.0 | 19.0 |
| Human Reference | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
Score variance is not reported.
4. Extension to Statistical Plots
PaperBananaBench was extended to evaluate statistical plot generation using a curated ChartMimic subset (914 plots, 480 sampled as test cases). Category distribution: Bar, Line, Tree/Pie, Scatter, Heatmap, Radar, Miscellaneous. The same reference-based protocol applies, with prompts adjusted for numeric fidelity.
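Once numeric values are extracted from a generated chart, numeric fidelity can be scored mechanically; a minimal sketch (the relative tolerance and the extraction step are assumptions for illustration, not the benchmark's published metric):

```python
def numeric_fidelity(predicted, reference, rel_tol=0.02):
    """Fraction of reference data points that the generated plot
    reproduces within a relative tolerance (illustrative metric)."""
    if len(predicted) != len(reference):
        return 0.0  # missing or extra data points count as total failure
    hits = sum(
        1 for p, r in zip(predicted, reference)
        if abs(p - r) <= rel_tol * max(abs(r), 1e-9)
    )
    return hits / len(reference)

# Two of three bar heights land within 2% of the ground truth.
score = numeric_fidelity([1.0, 2.01, 2.9], [1.0, 2.0, 3.0])
```

A check of this shape rewards code-based synthesis, which renders exact values, over image generators that approximate them, consistent with the tradeoff analysis below.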
Quantitative results show PaperBanana outperforming vanilla Gemini-3-Pro across all four dimensions (Faithfulness, Conciseness, Readability, Aesthetics) and in overall score.
A tradeoff analysis (Figure 1) notes that code-based synthesis (Matplotlib) ranks highest for Faithfulness/Conciseness but produces visually sparse plots, while image-gen systems yield better Readability/Aesthetics but less accuracy on dense data.
5. Strengths, Limitations, and Key Insights
Key findings from comparative experiments:
- Reference-driven planning coupled with style transfer and iterative Visualizer–Critic feedback loops substantially improve outcomes over zero/few-shot baselines.
- The Stylist Agent boosts Conciseness and Aesthetics but may omit technical detail, mitigated by Critic Agent refinements.
- Even a random Retriever markedly improves diagram layout structure.
Identified limitations:
- Raster output limits post-generation editing compared with vector formats.
- Style-guide standardization reduces visual diversity.
- Faithfulness at fine granularity (e.g., arrow direction, endpoint accuracy) remains error-prone, with Critic Agent sometimes failing to detect errors.
- Evaluation bounds: VLM-based reference scoring may overlook subtle topological errors due to lack of explicit graph-based metrics.
6. Prospects and Future Directions
Recommended improvements entail:
- Editable outputs: Integration of vector backend or GUI agents for platform-specific postprocessing.
- Dynamic style adaptation: End-user preference specification or multi-style candidate rendering at inference.
- Advanced VLM perception: Training of connection-/structure-aware vision models; explicit graph-structure comparison for error detection.
- Enriched evaluation: Inclusion of learned metrics for aesthetic or structural fidelity, or hybrid human-AI scoring on challenging cases.
- Broader applicability: Generalization of reference-driven, style-parsing pipelines to domains such as UI/UX, patent schematics, industrial diagrams.
PaperBananaBench establishes a reproducible and diverse yardstick for evaluating illustration generation in autonomous research workflows, with clear diagnostic power across faithfulness, conciseness, readability, and aesthetics. It enables fine-grained comparison and analysis of generative agentic frameworks under realistic, high-stakes conditions and points to algorithmic frontiers in automated academic visualization (Zhu et al., 30 Jan 2026).