- The paper introduces SciFigDetect, the first systematic benchmark for detecting AI-generated scientific figures, built with an agent-based pipeline.
- It outlines evaluation protocols including zero-shot, cross-generator, and robustness tests, revealing substantial detection challenges.
- Results show significant detector overfitting to generator-specific artifacts and vulnerabilities to image degradation.
Motivation and Context
SciFigDetect addresses a critical gap in visual forensics, specifically the detection of AI-generated scientific figures. While generative models, particularly multimodal systems such as Nano Banana Pro and GPT-image-1.5, now synthesize publication-grade scientific illustrations, conventional detection benchmarks focus almost exclusively on open-domain imagery (faces, natural scenes, generic objects). Scientific figures pose distinct challenges: they are structurally constrained, text-dense, and tightly aligned with academic semantics. This creates an urgent need for benchmarks and detection methods that account for the nuanced design and annotation conventions unique to scientific figures, especially as publishers increasingly restrict or prohibit AI-generated visuals due to integrity concerns.
Dataset Construction Pipeline
The paper introduces a comprehensive agent-based pipeline for constructing the SciFigDetect benchmark. Only papers released under commercially permissible licenses (e.g., CC BY) are used as source material. The pipeline comprises:
- Understanding and Prompt Planning: Specialized agents segment papers, extract figure-relevant semantics from text and figures, and merge multimodal signals into structured prompts, capturing visual conventions and scientific content.
- Generation-Review Refinement Loop: Candidate figures are synthesized via Nano Banana Pro and GPT-image-1.5; a review agent scores outputs based on academic fidelity, aesthetic consistency, and logical coherence. Candidates below threshold are iteratively revised or regenerated.
- Dataset Curation: Accepted samples include the paper context, real figure, synthetic counterpart, figure type, prompt, generator identity, and full provenance.
The resulting benchmark spans three figure types (Illustration, Overview, Experimental Figure) and four scientific domains. It comprises 72,965 real and 150,807 synthetic images, with aligned real-synthetic pairs for direct comparison.
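The generation-review loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Review`, `Sample`, and `refine`, the 0.8 threshold, the three-round cap, the rule that every criterion must clear the threshold, and the stub generator and reviewer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    # The three criteria named in the text; the 0-1 score scale is an assumption.
    academic_fidelity: float
    aesthetic_consistency: float
    logical_coherence: float
    feedback: str = ""

    def passes(self, threshold: float) -> bool:
        # Assumption: accept only if every criterion clears the threshold.
        return min(self.academic_fidelity,
                   self.aesthetic_consistency,
                   self.logical_coherence) >= threshold


@dataclass
class Sample:
    # A reduced version of the curated record (context and real figure omitted).
    paper_id: str
    figure_type: str
    prompt: str
    generator: str
    image: bytes
    rounds: int


def refine(paper_id: str, figure_type: str, prompt: str, generator: str,
           generate: Callable[[str], bytes],
           review: Callable[[bytes, str], Review],
           threshold: float = 0.8, max_rounds: int = 3) -> Optional[Sample]:
    """Generate, review, and revise until a candidate is accepted or rounds run out."""
    for round_idx in range(1, max_rounds + 1):
        image = generate(prompt)
        verdict = review(image, prompt)
        if verdict.passes(threshold):
            return Sample(paper_id, figure_type, prompt, generator, image, round_idx)
        # Fold the reviewer's feedback into the prompt before regenerating.
        prompt = f"{prompt}\nReviewer feedback: {verdict.feedback}"
    return None  # no candidate reached the threshold


# Stub generator and reviewer (hypothetical) so the sketch runs without any model.
def fake_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_review(image: bytes, prompt: str) -> Review:
    score = 0.9 if b"Reviewer feedback" in image else 0.5
    return Review(score, score, score, feedback="label the axes")


sample = refine("paper-001", "Overview", "Overview of the method",
                "stub-generator", fake_generate, fake_review)
```

With these stubs the first candidate is rejected and the revised prompt is accepted on the second round, so `sample.rounds` is 2.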
Experimental Protocol and Metrics
All images are normalized (PNG conversion, resolution alignment), color quantization/snap is applied to reduce nuisance variation, and splits are made at the paper level, so that figures from the same paper never land in different splits and leak information. Evaluation metrics are accuracy and average precision (AP), reported under the following protocols:
- Zero-shot: Detectors pretrained on generic AI-generated image (AIGI) datasets are transferred directly to SciFigDetect without fine-tuning.
- Cross-generator: Single-generator or joint-generator training, tested on mixed generator distributions.
- Degradation Robustness: Detectors are evaluated on images corrupted by compression (JPEG/WebP), blur, and noise, reflecting practical scenarios (re-rendering, screenshots, format conversion).
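As a rough illustration of the paper-level split and the two metrics, the sketch below assigns each paper to a split by hashing its identifier and computes accuracy and AP in plain Python. The function names, the 20% test fraction, the hash-based assignment, and the 0.5 decision threshold are assumptions for illustration; the paper's actual split procedure and metric code are not described in this summary.

```python
import hashlib


def split_of(paper_id: str, test_fraction: float = 0.2) -> str:
    """Map a paper ID to 'train' or 'test' deterministically.

    Every figure inherits the split of its source paper, so no paper
    contributes figures to both sides.
    """
    digest = hashlib.sha256(paper_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct decisions; label 1 = synthetic, 0 = real."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def average_precision(labels, scores):
    """Mean of precision@k over the ranks k where a synthetic image appears.

    Simplified: tied scores are ranked in input order rather than grouped.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy labels and detector scores (illustrative values, not from the paper).
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
acc = accuracy(labels, scores)          # 3 of 5 decisions are correct
ap = average_precision(labels, scores)  # (1/1 + 2/3 + 3/4) / 3
```

Hashing the paper ID rather than shuffling figures keeps the assignment stable when new papers are added later, which is one reason to prefer it in a growing benchmark.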
Zero-shot Transfer
Strong generalization failures are observed across all evaluated detectors. Even top-performing models such as LGrad achieve only ~53.68% mean accuracy, and most detectors are biased toward predicting scientific figures as real (near-perfect accuracy on real figures, below 3% on synthetic ones). This indicates that current detection paradigms, rooted in open-domain artifacts (frequency, texture, spatial structures, CLIP-vision embeddings), are ineffective for scientific diagrams, reflecting a substantial distributional gap.
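A per-class breakdown makes the "predict everything as real" bias easy to see. The sketch below uses made-up predictions (99% correct on real, 2% on synthetic) and assumes that mean accuracy is the unweighted average of the two per-class accuracies; the paper may define its mean differently, for example by averaging over generators or subsets.

```python
def per_class_accuracy(labels, preds):
    """Return (accuracy on real, accuracy on synthetic); 0 = real, 1 = synthetic."""
    def acc(cls):
        idx = [i for i, y in enumerate(labels) if y == cls]
        return sum(preds[i] == cls for i in idx) / len(idx)
    return acc(0), acc(1)


# Illustrative data only: 100 real and 100 synthetic figures.
labels = [0] * 100 + [1] * 100
# The detector flags 1 real figure and catches only 2 synthetic ones.
preds = [0] * 99 + [1] + [1] * 2 + [0] * 98

real_acc, fake_acc = per_class_accuracy(labels, preds)
mean_acc = (real_acc + fake_acc) / 2
print(f"real={real_acc:.2%} synthetic={fake_acc:.2%} mean={mean_acc:.2%}")
```

Under this assumption, a detector that almost never predicts "synthetic" lands close to 50% mean accuracy regardless of how well it handles real figures, which matches the pattern reported for the zero-shot setting.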
Cross-Generator Generalization
Single-generator training results in pronounced generator-specific overfitting. Detectors trained on Nano Banana Pro outputs, for example, perform well on figures from that generator but very poorly on GPT-image-1.5 figures, and vice versa. This failure mode persists even across aligned real-synthetic pairs, which indicates that each generator leaves distinct distributional artifacts. Joint training improves performance across both generators (Effort reaches 95.58% average accuracy), but this does not demonstrate true generalization: detectors may still struggle on unseen generators with novel figure conventions.
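The cross-generator protocol amounts to filling a train-by-test matrix. The toy below is a deliberate caricature, not a reproduction of any result: the 2-D "features", the nearest-centroid detector, and the generator names `gen_a`, `gen_b`, and the held-out `gen_c` are all hypothetical, chosen so that each generator's artifacts point in a different direction.

```python
import math

# Hypothetical 2-D features: real figures cluster near the origin and each
# generator shifts its outputs along a different direction.
REAL = [(0.0, 0.1), (0.1, 0.0), (-0.1, 0.0)]
FAKE = {
    "gen_a": [(1.0, 0.0), (0.9, 0.1), (1.1, -0.1)],
    "gen_b": [(0.0, 1.0), (0.1, 0.9), (-0.1, 1.1)],
    "gen_c": [(-1.0, 0.0), (-0.9, -0.1), (-1.1, 0.1)],  # never used for training
}


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def train(fake_points):
    """Nearest-centroid 'detector': one centroid for real, one for synthetic."""
    return centroid(REAL), centroid(fake_points)


def predict(model, point):
    real_c, fake_c = model
    return 1 if math.dist(point, fake_c) < math.dist(point, real_c) else 0


def fake_accuracy(model, points):
    """Share of synthetic points that the model flags as synthetic."""
    return sum(predict(model, p) for p in points) / len(points)


results = {}
for train_name in ["gen_a", "gen_b", "joint"]:
    if train_name == "joint":
        train_points = FAKE["gen_a"] + FAKE["gen_b"]
    else:
        train_points = FAKE[train_name]
    model = train(train_points)
    results[train_name] = {g: fake_accuracy(model, pts) for g, pts in FAKE.items()}

for train_name, row in results.items():
    print(train_name, row)
```

In this toy, each single-generator model catches its own generator and misses the other, the joint model catches both seen generators, and all three miss the held-out `gen_c`. Real detectors fail far less cleanly, but the matrix layout is the same.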
Robustness to Image Degradation
Detectors degrade substantially under realistic post-processing corruptions. NPR drops from 93.96% accuracy on clean images to 75.94% under JPEG compression (q=30), and 50.49% under Gaussian noise (σ=20). Other models exhibit similar sensitivities to compression, blur, and noise, suggesting fragility in deployment scenarios where figures are routinely resaved, rendered, or transferred via lossy channels.
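One of the corruptions, additive Gaussian noise with σ=20, can be sketched with the standard library alone. The helper names, the grayscale list-of-lists image layout, and the fixed seed are assumptions for illustration; JPEG/WebP compression and blur would normally go through an imaging library and are not shown. The second helper only restates the NPR figures quoted above as drops in percentage points.

```python
import random


def add_gaussian_noise(pixels, sigma=20.0, seed=0):
    """Add zero-mean Gaussian noise to 8-bit grayscale pixels.

    `pixels` is a list of rows of ints in [0, 255]; results are rounded
    and clipped back into that range.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    return [
        [min(255, max(0, round(v + rng.gauss(0.0, sigma)))) for v in row]
        for row in pixels
    ]


def accuracy_drop(clean_acc, degraded_acc):
    """Drop in percentage points between clean and degraded accuracy."""
    return round(clean_acc - degraded_acc, 2)


# A flat mid-gray 64x64 image stands in for a figure.
image = [[128] * 64 for _ in range(64)]
noisy = add_gaussian_noise(image, sigma=20.0)

# NPR accuracies quoted in the text: clean vs. JPEG (q=30) and Gaussian noise.
drops = {
    "jpeg_q30": accuracy_drop(93.96, 75.94),
    "gaussian_sigma20": accuracy_drop(93.96, 50.49),
}
print(drops)
```

Expressed this way, JPEG compression costs NPR about 18 points and Gaussian noise about 43 points, leaving it near 50% accuracy.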
Implications and Future Directions
SciFigDetect establishes that detecting publication-grade AI-generated scientific figures remains largely unsolved. The benchmark reveals significant weaknesses in current detection architectures: lack of generalization to structured, text-heavy diagrams; strong overfitting to seen generators; and fragility under post-processing such as compression, blur, and noise. These findings imply that robust scientific figure forensics will require new detection paradigms, likely leveraging figure-specific semantic, structural, and textual cues rather than generic open-domain image artifacts. The dataset and evaluation protocols provide a foundational platform for advancing detection research in this space.
Future research directions include:
- Architectures optimized for structured multimodal content (e.g., joint vision-text embedding, figure context modeling).
- Generator-agnostic detection strategies, potentially grounded in figure provenance, structural analysis, and annotation density metrics.
- Adversarial robustness to common document-level image manipulations.
- Integration of detection modules into automated publishing workflows and editorial platforms.
Conclusion
SciFigDetect represents the first systematic benchmark for AI-generated scientific figure detection, targeting an increasingly relevant integrity problem in academic publishing. Its agentic pipeline, realistic figure pairs, and comprehensive evaluations expose substantial gaps in current AIGI detection models. The benchmark is positioned to serve as a foundation for developing robust, context-aware detectors capable of addressing the evolving landscape of scientific illustration generation and visual forensics (2604.08211).