PaperBanana: Automated Scientific Illustrations

Updated 2 February 2026

The paper introduces an agentic framework for automated academic illustrations by integrating retrieval, content planning, style guidance, and iterative refinement.
It outperforms baseline models on faithfulness, conciseness, readability, and aesthetics.
The approach also extends to statistical plot generation. This entry additionally covers unrelated uses of the “banana” label in energy devices and geometric analysis.

PaperBanana refers to several distinct but technically significant constructs spanning automated scientific illustration (PaperBanana framework), sustainable flexible electronics (banana-derived supercapacitors), and advanced geometric or analytic structures (banana integrals, banana-shaped actuators). While the term may arise independently in each context, the following exposition prioritizes the most prominent recent usage as an agentic illustration-generation framework for AI scientists, followed by an overview of related concepts in scientific energy storage, soft material mechanics, and mathematical physics.

1. Automated Academic Illustration: The PaperBanana Framework

The PaperBanana framework is an agentic, reference-driven system for automated generation of publication-ready academic illustrations, introduced to address the bottleneck of manual figure drafting in AI-augmented research environments (Zhu et al., 30 Jan 2026). The architecture decomposes the illustration pipeline into specialized agents run in a modular sequence, built on state-of-the-art vision-language models (VLMs) and diffusion-based image generators.

System Architecture and Key Agents:

Retriever Agent performs generative retrieval of the top-$N$ reference triplets $\mathcal{E} = \{(S_i, C_i, I_i)\}_{i=1}^N$ (where $S_i$ is a methodology segment, $C_i$ a caption, and $I_i$ the corresponding figure) from a reference repository, conditioning on the source context $S$ and intent $C$.
Planner (Content Planner) uses a VLM (e.g., Gemini-3-Pro) to transform $(S, C, \mathcal{E})$ into a structured content plan $P$, an explicit specification of diagram entities and relations:

$P = \mathrm{VLM}_{\mathrm{plan}}(S,\,C,\,(S_i, C_i, I_i)_{i=1}^N)$

Stylist (Style Planner) creates an “Aesthetic Guideline” $\mathcal{G}$ by summarizing style features across the reference set, then refines $P$ into a fully specified, style-conformant plan $P^*$:

$P^* = \mathrm{VLM}_{\mathrm{style}}(P,\,\mathcal{G})$

Visualizer (Image Renderer) maps the textual plan $P_t$ at each iterative step $t$ into a raster diagram $I_t$ using fine-tuned scientific figure diffusion models (Nano-Banana-Pro) or generalist generators (GPT-Image-1.5).
Critic (Self-Critic Agent) performs multimodal analysis of $I_t$ against $(S, C)$, issuing a suggested revision $P_{t+1}$ for further refinement.

Iterative Refinement Loop: starting from $P_0 = P^*$, for $t = 0, \ldots, T-1$:

$I_t = \mathrm{ImageGen}(P_t), \qquad P_{t+1} = \mathrm{VLM}_{\mathrm{critic}}(I_t,\,S,\,C,\,P_t)$

with final output $I_T$. This loop operationalizes model-based self-critique and is loosely analogous to a discrete “gradient” update in description space:

$P_{t+1} = P_t + \alpha\,\nabla_{P}\,\mathcal{C}\bigl(\mathrm{ImageGen}(P_t);\,S,\,C\bigr)$

where $\mathcal{C}$ scores faithfulness, conciseness, and aesthetics. The analogy is informal: $P_t$ is text, and the update is written by the critic rather than obtained by differentiation.
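The control flow of the five agents can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Agents` container, its field names, and `generate_figure` are hypothetical, and each callable stands in for a model call (VLM or image generator). The sketch also assumes that the final output $I_T$ is obtained by rendering the last revised plan once more.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical aliases: plans and guidelines are text, images are opaque.
Plan = str
Guideline = str
Image = object
Reference = Tuple[str, str, Image]  # (S_i, C_i, I_i)


@dataclass
class Agents:
    """Bundle of agent callables; every field is a stand-in for a model call."""
    retrieve: Callable[[str, str, int], List[Reference]]
    plan: Callable[[str, str, List[Reference]], Plan]
    guideline: Callable[[List[Reference]], Guideline]
    style: Callable[[Plan, Guideline], Plan]
    render: Callable[[Plan], Image]
    critique: Callable[[Image, str, str, Plan], Plan]


def generate_figure(agents: Agents, S: str, C: str, N: int = 3, T: int = 3) -> Image:
    """Run retrieval, planning, styling, then T critique rounds."""
    refs = agents.retrieve(S, C, N)      # top-N reference triplets
    P = agents.plan(S, C, refs)          # content plan P
    G = agents.guideline(refs)           # aesthetic guideline G
    P_t = agents.style(P, G)             # P_0 = P*
    for _ in range(T):
        I_t = agents.render(P_t)                 # I_t = ImageGen(P_t)
        P_t = agents.critique(I_t, S, C, P_t)    # P_{t+1}
    # Assumption: I_T is the rendering of the final plan P_T.
    return agents.render(P_t)
```

Passing the agents as plain callables keeps the loop independent of any particular model provider, so each stage can be swapped or stubbed out when testing.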

2. Content and Style Planning Mechanisms

The content planner’s output $P$ encodes diagram semantics as a set of nodes (modules, data artifacts) and directed edges (flow relations), e.g., $(n_j, \mathrm{shape}_j, \mathrm{label}_j)$ and $(e_{jk}, \mathrm{arrow\text{-}style}_{jk})$. This specification supports unambiguous mapping to schematic illustration backends. The style planner constructs the aesthetic guideline $\mathcal{G}$ by aggregating statistics on color palettes, shape motifs (e.g., rounded versus sharp corners), line conventions (solid, dashed, orthogonal, or curved trajectories), and typographic rules (mathematics in serif, labels in sans-serif). The plan $P^*$ is then enriched with explicit HEX codes, font metrics, and parameterized visual instructions enforced through prompt engineering.
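To make the node-and-edge description concrete, the following Python sketch shows one way such a plan could be represented and then enriched with HEX colors. It is illustrative only: the class names, fields, and the shape-keyed palette are assumptions, since the framework itself exchanges plans as text between VLM calls.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    name: str
    shape: str        # e.g. "rounded-box" or "cylinder"
    label: str
    fill: str = ""    # HEX color, set by the style pass


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    arrow_style: str = "solid"


@dataclass
class ContentPlan:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    def validate(self) -> None:
        """Reject edges that point at nodes missing from the plan."""
        names = {n.name for n in self.nodes}
        for e in self.edges:
            if e.src not in names or e.dst not in names:
                raise ValueError(f"dangling edge {e.src} -> {e.dst}")


def apply_style(plan: ContentPlan, palette: Dict[str, str],
                default_fill: str = "#FFFFFF") -> ContentPlan:
    """Return a styled copy of the plan with a HEX fill chosen per shape."""
    styled = [replace(n, fill=palette.get(n.shape, default_fill))
              for n in plan.nodes]
    return ContentPlan(nodes=styled, edges=list(plan.edges))
```

Keeping the content plan immutable and returning a styled copy mirrors the separation between $P$ and $P^*$: the semantic structure is fixed first, and styling only adds attributes.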

3. Evaluation Using PaperBananaBench

Benchmarking uses PaperBananaBench, which contains 292 curated methodology-diagram test cases from NeurIPS 2025 and 292 held-out reference examples for agentic retrieval. Domain categories include Agent Reasoning, Vision Perception, Generative Learning, and Science Applications. Evaluation employs a VLM-based Judge (Gemini-3-Pro) that compares generated diagrams $I$ to human references $I^{\mathrm{ref}}$ across:

Faithfulness: semantic preservation relative to $S,C$
Conciseness: lack of superfluous or redundant elements
Readability: clarity and legibility of diagram components
Aesthetics: adherence to prevailing visual norms

The scoring is categorical: 100 (“Model win”), 50 (“Tie”), 0 (“Human win”). Aggregation prioritizes faithfulness and readability.
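The categorical scoring can be illustrated with a short Python sketch. Per-dimension scores are computed as the mean of the 100/50/0 values over all cases. The text above does not spell out the exact aggregation rule, so `overall_verdict` is a hypothetical reading of “prioritizes faithfulness and readability”: the primary dimensions decide the winner, and the secondary dimensions are consulted only when the primary ones are balanced.

```python
from statistics import mean
from typing import Dict, Iterable, Sequence

SCORE = {"model": 100.0, "tie": 50.0, "human": 0.0}
DIMENSIONS = ("faithfulness", "conciseness", "readability", "aesthetics")


def dimension_scores(verdicts: Iterable[Dict[str, str]]) -> Dict[str, float]:
    """Average the categorical scores per dimension across all cases."""
    cases = list(verdicts)
    return {d: mean(SCORE[v[d]] for v in cases) for d in DIMENSIONS}


def _net_wins(verdict: Dict[str, str], dims: Sequence[str]) -> int:
    """Model wins minus human wins over the given dimensions."""
    return sum((verdict[d] == "model") - (verdict[d] == "human") for d in dims)


def overall_verdict(
    verdict: Dict[str, str],
    primary: Sequence[str] = ("faithfulness", "readability"),
    secondary: Sequence[str] = ("conciseness", "aesthetics"),
) -> str:
    """Assumed rule: primary dimensions first, secondary only as tie-break."""
    for dims in (primary, secondary):
        net = _net_wins(verdict, dims)
        if net > 0:
            return "model"
        if net < 0:
            return "human"
    return "tie"
```

Under this assumed rule, a diagram that wins on faithfulness but loses on both conciseness and aesthetics still counts as an overall model win, which is one way to encode the stated priority.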

Method	Faithfulness	Conciseness	Readability	Aesthetics	Overall
Nano-Banana-Pro	43.0	43.5	38.5	65.5	43.2
PaperBanana (ours)	45.8	80.7	51.4	72.1	60.2

Ablation results indicate that each agent matters: removing the Retriever (-16 points overall), the Stylist (-17.5% conciseness), or the Critic (-15.8% faithfulness) substantially degrades performance.

4. Extension to Statistical Plot Generation

PaperBanana generalizes to the automatic production of statistical plots by extending the Visualizer agent with a code generator, which translates plan $P_t$ into executable scripts (e.g., Python/Matplotlib). The Critic then analyzes generated plots in conjunction with raw data, updating $P_{t+1}$ to correct mis-specifications. On the ChartMimic direct mimic suite (240 cases, 7 plot types), PaperBanana surpasses a Gemini-3-Pro code-generation baseline by +1.4% (faithfulness), +5.0% (conciseness), +3.1% (readability), and +4.0% (aesthetics), while matching human plot faithfulness and marginally exceeding human performance in other dimensions.
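The plan-to-code step can be sketched as a small template-based generator. This is a simplified stand-in, assuming a hypothetical dictionary plan with the keys shown below; in the framework the script is written by a model rather than filled into a fixed template. The emitted script uses only standard Matplotlib calls (`plt.subplots`, `Axes.bar`/`plot`/`scatter`, `Figure.savefig`).

```python
import textwrap
from typing import Dict

SUPPORTED_KINDS = {"bar", "plot", "scatter"}


def plan_to_matplotlib(plan: Dict[str, object]) -> str:
    """Turn a tiny plot plan (hypothetical schema) into Matplotlib source."""
    kind = plan["kind"]
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported plot kind: {kind!r}")
    x, y = plan["x"], plan["y"]
    xlabel, ylabel = plan["xlabel"], plan["ylabel"]
    title, outfile = plan["title"], plan["outfile"]
    # repr() keeps lists and strings valid as Python literals in the script.
    return textwrap.dedent(f"""\
        import matplotlib
        matplotlib.use("Agg")  # headless backend, no display needed
        import matplotlib.pyplot as plt

        x = {x!r}
        y = {y!r}
        fig, ax = plt.subplots()
        ax.{kind}(x, y)
        ax.set_xlabel({xlabel!r})
        ax.set_ylabel({ylabel!r})
        ax.set_title({title!r})
        fig.savefig({outfile!r}, dpi=300)
        """)
```

Emitting source text rather than drawing directly means the Critic can inspect both the rendered plot and the script, and a revised plan simply regenerates the script.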

5. Underlying Models, Optimization Objectives, and Systemic Significance

The VLM backbone (Gemini-3-Pro) handles all high-level planning and critique subroutines. Image generators are diffusion models trained on scientific illustration corpora; Nano-Banana-Pro, in particular, offers high structural fidelity for diagrams. The generators are trained with a mean squared denoising loss:

$\mathcal{L}_{\mathrm{gen}} = \mathbb{E}_{x_0,\,\epsilon\sim\mathcal{N}(0,I),\,t} \left\|\epsilon - \epsilon_\theta(x_t,t)\right\|^2$

with $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$, where $\alpha_t$ is the cumulative noise-schedule coefficient at step $t$.
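A single-sample estimate of this loss can be written in pure Python. The sketch below is illustrative rather than a training recipe: the function names, the list-based vectors, and the toy schedule `alphas` are assumptions, and `eps_model` stands in for the network $\epsilon_\theta$.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def forward_noise(x0: Sequence[float], eps: Sequence[float], alpha_t: float) -> Vector:
    """x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * eps."""
    a, b = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def denoising_loss(
    x0: Sequence[float],
    alphas: Sequence[float],
    eps_model: Callable[[Vector, int], Vector],
    rng: random.Random,
) -> float:
    """One Monte Carlo sample of ||eps - eps_theta(x_t, t)||^2."""
    t = rng.randrange(len(alphas))              # uniformly sampled step
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # eps ~ N(0, I)
    x_t = forward_noise(x0, eps, alphas[t])
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))
```

A predictor that recovers the injected noise exactly drives this quantity to zero, which is the sense in which minimizing the loss teaches the model to denoise.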

The framework is deployed without additional fine-tuning; all style adaptation is handled by in-context prompting.

6. Related Concepts in Energy Storage, Soft Material Mechanics, and Mathematical Physics

a. Supercapacitor from Banana-Peel Biomass (“PaperBanana” Device, Editor’s term)

Activated carbon derived from banana peel by KOH activation (BET surface area $62\,\mathrm{m}^2/\mathrm{g}$) is used in a flexible, interdigitated supercapacitor fabricated on PET with screen-printed Ag electrodes and a drop-cast PVA/H$_3$PO$_4$ gel electrolyte (Singh et al., 2019). The device delivers areal capacitances up to $33.18\,\mathrm{mF/cm}^2$, an energy density of $5.87\,\mu\mathrm{Wh/cm}^2$, and $\sim 90\%$ retention over 5000 cycles and under mechanical bending. Scalability is enabled by low-cost, waste-derived carbon and roll-to-roll compatible screen printing.
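As a rough consistency check on these figures, the textbook capacitor relation $E = \tfrac{1}{2} C V^2$ links areal capacitance and areal energy density. The sketch below assumes that this relation applies and that both reported values refer to the same operating point; neither assumption is confirmed by the summary above, and the function names are hypothetical. Under those assumptions, the two figures imply a voltage window of roughly 1.1 V.

```python
import math


def areal_energy_uWh_per_cm2(c_mF_per_cm2: float, voltage_V: float) -> float:
    """E = 0.5 * C * V^2, converted from J/cm^2 to microwatt-hours/cm^2."""
    joules = 0.5 * (c_mF_per_cm2 * 1e-3) * voltage_V ** 2
    return joules / 3600.0 * 1e6


def implied_voltage_V(c_mF_per_cm2: float, e_uWh_per_cm2: float) -> float:
    """Invert E = 0.5 * C * V^2 for V, given areal C and areal E."""
    joules = e_uWh_per_cm2 * 1e-6 * 3600.0
    return math.sqrt(2.0 * joules / (c_mF_per_cm2 * 1e-3))


if __name__ == "__main__":
    v = implied_voltage_V(33.18, 5.87)
    print(f"implied voltage window: {v:.2f} V")
```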

b. Isometric Deformations in Soft Matter: Banana-Shaped Seedpod

The “folded Goursat” family analytically characterizes isometric deformations with folds in thin shells, inspired by banana-shaped seedpods (Couturier, 2016). These geometric constructs allow for controllable actuation (closing or opening) determined by fold placement, with optimization favoring elongated morphologies for ease of opening and minimized mechanical cost.

c. Banana Integrals in Mathematical Physics

Multi-loop “banana” Feynman integrals admit descriptions in terms of periods of K3 surfaces, with modular/automorphic properties determined by mass configuration. Maximal cuts of three-loop banana integrals lead to explicit orthogonal modular forms, Hilbert/Siegel/Hermitian modular forms, and factorized elliptic expressions, structured by the transcendental lattice and associated monodromy groups (Duhr, 21 Feb 2025).

7. Concluding Synthesis and Implications

The PaperBanana framework combines retrieval-augmented VLMs, content and style planning, automated aesthetic induction, iterative self-critique, and a dedicated benchmark into an agentic pipeline for generating and refining publication-grade scientific illustrations. Its extension to statistical plotting supports its generality. The related developments in energy storage, soft-matter geometry, and Feynman integrals, where “banana” refers to a biomass source, an engineered morphology, or a class of integrals, show the breadth of technical uses of the term, spanning agentic automation in research workflows, scalable waste-derived devices, and analytic structures in mathematical physics (Zhu et al., 30 Jan 2026, Singh et al., 2019, Couturier, 2016, Duhr, 21 Feb 2025).