PaperBananaBench Benchmark
- The paper introduces a large-scale benchmark with 584 curated cases from NeurIPS 2025 to evaluate automated generation of publication-ready academic illustrations.
- It employs a reference-driven protocol using VLMs to assess faithfulness, conciseness, readability, and aesthetics in varied methodology diagram styles.
- Experimental results show significant gains for the PaperBanana framework paired with Nano-Banana-Pro, and highlight tradeoffs between technical precision and visual appeal.
PaperBananaBench is a large-scale, curated benchmark designed to rigorously evaluate the automated generation of publication-ready academic illustrations, primarily methodology diagrams, by state-of-the-art generative frameworks. Introduced alongside the PaperBanana agentic framework, PaperBananaBench consists of 584 cases sampled from NeurIPS 2025 papers, split into 292 reference and 292 test cases. It provides comprehensive coverage of research domains and illustration styles, with a structured evaluation protocol covering faithfulness, conciseness, readability, and aesthetics, using VLMs as adjudicators (Zhu et al., 30 Jan 2026).
1. Benchmark Composition and Curation
PaperBananaBench was constructed via a systematic pipeline applied to 2,000 randomly sampled NeurIPS 2025 PDFs. The process included parsing for methodology text, figures, and captions using MinerU; filtering out papers lacking methodology diagrams (reducing candidates to 1,359); aspect-ratio filtering to keep only diagrams with a width:height ratio in $[1.5, 2.5]$; and categorizing the remainder into four domain categories via a VLM:
- Agent Reasoning
- Vision Perception
- Generative Learning
- Science Applications
Manual human curation ensured extraction accuracy, domain label validation, and exclusion of diagrams deemed overly simplistic, cluttered, or abstract. The final pool of 584 methodology diagrams was equally split into reference and test sets (292 each).
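As a minimal sketch of the curation mechanics (the data class, thresholds, and seeded split are illustrative; the real pipeline relies on MinerU parsing, VLM classification, and manual review), the aspect-ratio filter and the even reference/test split might look like:

```python
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    paper_id: str
    width: int
    height: int
    domain: str  # assigned by a VLM in the actual pipeline

def aspect_ratio_ok(c: Candidate, lo: float = 1.5, hi: float = 2.5) -> bool:
    """Keep only diagrams whose width:height ratio lies in [lo, hi]."""
    return lo <= c.width / c.height <= hi

def split_reference_test(pool, seed=0):
    """Shuffle the curated pool and split it evenly into reference/test halves."""
    pool = list(pool)
    random.Random(seed).shuffle(pool)
    mid = len(pool) // 2
    return pool[:mid], pool[mid:]

candidates = [
    Candidate("p1", 1200, 600, "Agent Reasoning"),      # ratio 2.0 -> kept
    Candidate("p2", 800, 800, "Vision Perception"),     # ratio 1.0 -> dropped
    Candidate("p3", 1000, 500, "Generative Learning"),  # ratio 2.0 -> kept
    Candidate("p4", 900, 300, "Science Applications"),  # ratio 3.0 -> dropped
]
kept = [c for c in candidates if aspect_ratio_ok(c)]
ref, test = split_reference_test(kept)
```

The seeded shuffle keeps the split reproducible, mirroring the benchmark's fixed 292/292 division.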
Illustration style diversity spans schematic pipeline diagrams (rounded boxes, pastel fills) to detailed agent “cartoon” motifs and volumetric 3D-stack icon diagrams. The benchmark’s appendix introduces a synthesized “Method Diagram Aesthetics Guide,” summarizing acceptable color palettes, shape and typography rules, and domain-specific iconography requirements.
2. Evaluation Protocol and Metrics
The primary formulation is expressed as:
- $I = \mathcal{G}(T, c)$ for baseline illustration synthesis, where $T$ denotes the methodology text and $c$ the caption;
- $I = \mathcal{G}(T, c, \{R_k\})$, with $\{R_k\}$ the reference examples, for reference-driven illustration.
Scoring spans four dimensions:
- Faithfulness: Module alignment with the method section/caption, absence of hallucinated or contradictory elements.
- Conciseness: Maximized visual signal-to-noise ratio; fewer than 15 words per box and no raw equation dumps.
- Readability: Clear navigation flow (left-right/top-down), legible text, no overlap or excessive arrows.
- Aesthetics: Professional visual polish in line with top AI conference standards; avoidance of artifact-laden rendering, non-standard colors, or black backgrounds.
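The Conciseness constraint lends itself to a mechanical pre-check before any VLM judging; a minimal sketch, assuming box texts have already been extracted from the diagram:

```python
def violates_conciseness(box_texts, max_words=15):
    """Return the texts of diagram boxes that reach or exceed max_words,
    i.e. break the fewer-than-15-words-per-box rule."""
    return [t for t in box_texts if len(t.split()) >= max_words]

boxes = [
    "Multi-head cross-attention fusion",
    "This box spells out every single implementation detail of the encoder, "
    "the decoder, and the training schedule in one long run-on sentence",
]
offenders = violates_conciseness(boxes)  # only the second box is flagged
```

Rules like legibility and arrow-overlap (Readability) have no such cheap proxy, which is why the protocol delegates them to a VLM judge.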
Each test case is judged side-by-side against the human reference by Gemini-3-Pro (a VLM). Scores are assigned per dimension (model wins: 100, human wins: 0, tie: 50), and outcomes are aggregated with a hierarchical protocol that prioritizes Faithfulness and Readability. Inter-model agreement (Kendall's $\tau$) and human-alignment statistics confirm moderate concordance (Faithfulness up to $0.51$, Conciseness up to $0.60$).
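The exact hierarchical aggregation is not fully specified here; one plausible reading, sketched below under that assumption, lets the first non-tied priority dimension decide and falls back to the mean of the remaining dimensions:

```python
from typing import Dict, Optional

PRIORITY = ("Faithfulness", "Readability")   # prioritized per the protocol
SECONDARY = ("Conciseness", "Aesthetics")

def verdict_score(model_wins: Optional[bool]) -> int:
    """Map a side-by-side VLM verdict to the scoring convention:
    model wins -> 100, human reference wins -> 0, tie (None) -> 50."""
    if model_wins is None:
        return 50
    return 100 if model_wins else 0

def aggregate(scores: Dict[str, int]) -> int:
    """Assumed hierarchy: the first non-tied priority dimension decides;
    otherwise average the secondary dimensions."""
    for dim in PRIORITY:
        if scores[dim] != 50:
            return scores[dim]
    rest = [scores[d] for d in SECONDARY]
    return round(sum(rest) / len(rest))
```

Under this convention a per-case score of 50 means parity with the human reference, which is why the human row in the results table sits at 50.0 across the board.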
3. Domain Distribution and Performance Analysis
Performance-weighted overall scores for the principal domain categories illustrate a clear gradient:
| Domain | Overall Score (%) |
|---|---|
| Agent Reasoning | 69.9 |
| Science Applications | 58.8 |
| Generative Learning | 57.0 |
| Vision Perception | 52.1 |
Main experiment results in Table 1 (means over the 292 test cases) show that PaperBanana with Nano-Banana-Pro achieves the highest overall score (60.2), surpassing the human reference (50.0) and all baseline frameworks, with especially large improvements over zero-shot Nano-Banana-Pro in Conciseness and Aesthetics.
| Method | Faithfulness | Conciseness | Readability | Aesthetics | Overall |
|---|---|---|---|---|---|
| Nano-Banana-Pro (zero-shot) | 43.0 | 43.5 | 38.5 | 65.5 | 43.2 |
| PaperBanana + NBP | 45.8 | 80.7 | 51.4 | 72.1 | 60.2 |
| PaperBanana + GPT-Image | 16.0 | 65.0 | 33.0 | 56.0 | 19.0 |
| Human Reference | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
Score variance is not reported.
4. Extension to Statistical Plots
PaperBananaBench was extended to evaluate statistical plot generation using a curated ChartMimic subset (914 plots, 480 sampled as test cases). Category distribution: Bar, Line, Tree/Pie, Scatter, Heatmap, Radar, Miscellaneous. The same reference-based protocol applies, with prompts adjusted for numeric fidelity.
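Once numeric values are extracted from a generated chart, numeric fidelity can be scored mechanically; a minimal sketch (the relative tolerance and the extraction step are assumptions for illustration, not the benchmark's published metric):

```python
def numeric_fidelity(predicted, reference, rel_tol=0.02):
    """Fraction of reference data points that the generated plot
    reproduces within a relative tolerance (illustrative metric)."""
    if len(predicted) != len(reference):
        return 0.0  # missing or extra data points count as total failure
    hits = sum(
        1 for p, r in zip(predicted, reference)
        if abs(p - r) <= rel_tol * max(abs(r), 1e-9)
    )
    return hits / len(reference)

# Two of three bar heights land within 2% of the ground truth.
score = numeric_fidelity([1.0, 2.01, 2.9], [1.0, 2.0, 3.0])
```

A check of this shape rewards code-based synthesis, which renders exact values, over image generators that approximate them, consistent with the tradeoff analysis below.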
Quantitative results show PaperBanana outperforming vanilla Gemini-3-Pro across all four dimensions (Faithfulness, Conciseness, Readability, Aesthetics) and in overall score.
A tradeoff analysis (Figure 1) notes that code-based synthesis (Matplotlib) ranks highest for Faithfulness/Conciseness but produces visually sparse plots, while image-gen systems yield better Readability/Aesthetics but less accuracy on dense data.
5. Strengths, Limitations, and Key Insights
Key findings from comparative experiments:
- Reference-driven planning coupled with style transfer and iterative Visualizer–Critic feedback loops substantially improve outcomes over zero/few-shot baselines.
- The Stylist Agent boosts Conciseness and Aesthetics but may omit technical detail, mitigated by Critic Agent refinements.
- Even a random Retriever markedly improves diagram layout structure.
Identified limitations:
- Raster output limits post-generation editing compared with vector formats.
- Style-guide standardization reduces visual diversity.
- Faithfulness at fine granularity (e.g., arrow direction, endpoint accuracy) remains error-prone, with Critic Agent sometimes failing to detect errors.
- Evaluation bounds: VLM-based reference scoring may overlook subtle topological errors due to lack of explicit graph-based metrics.
6. Prospects and Future Directions
Recommended improvements entail:
- Editable outputs: Integration of vector backend or GUI agents for platform-specific postprocessing.
- Dynamic style adaptation: End-user preference specification or multi-style candidate rendering at inference.
- Advanced VLM perception: Training of connection-/structure-aware vision models; explicit graph-structure comparison for error detection.
- Enriched evaluation: Inclusion of learned metrics for aesthetic or structural fidelity, or hybrid human-AI scoring on challenging cases.
- Broader applicability: Generalization of reference-driven, style-parsing pipelines to domains such as UI/UX, patent schematics, industrial diagrams.
PaperBananaBench establishes a reproducible and diverse yardstick for evaluating illustration generation in autonomous research workflows, with clear diagnostic power across faithfulness, conciseness, readability, and aesthetics. It enables fine-grained comparison and analysis of generative agentic frameworks under realistic, high-stakes conditions and points to algorithmic frontiers in automated academic visualization (Zhu et al., 30 Jan 2026).