SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

Published 9 Apr 2026 in cs.CV | (2604.08211v1)

Abstract: Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources and aligned real–synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.

Summary

  • The paper introduces SciFigDetect, the first systematic benchmark for detecting AI-generated scientific figures using an agent-based pipeline.
  • It outlines evaluation protocols including zero-shot, cross-generator, and robustness tests, revealing substantial detection challenges.
  • Results show significant detector overfitting to generator-specific artifacts and vulnerabilities to image degradation.

Motivation and Context

SciFigDetect addresses a critical gap in visual forensics, specifically the detection of AI-generated scientific figures. While generative models, particularly multimodal systems such as Nano Banana Pro and GPT-image-1.5, now synthesize publication-grade scientific illustrations, conventional detection benchmarks focus almost exclusively on open-domain imagery (faces, natural scenes, generic objects). Scientific figures pose distinct challenges: they are structurally constrained, text-dense, and tightly aligned with academic semantics. This creates an urgent need for benchmarks and detection methods that account for the nuanced design and annotation conventions unique to scientific figures, especially as publishers increasingly restrict or prohibit AI-generated visuals due to integrity concerns.

Dataset Construction Pipeline

The paper introduces a comprehensive agent-based pipeline for constructing the SciFigDetect benchmark. Only papers released under commercially permissible licenses (e.g., CC BY) are used as source material. The pipeline comprises:

  • Understanding and Prompt Planning: Specialized agents segment papers, extract figure-relevant semantics from text and figures, and merge multimodal signals into structured prompts, capturing visual conventions and scientific content.
  • Generation–Review Refinement Loop: Candidate figures are synthesized via Nano Banana Pro and GPT-image-1.5; a review agent scores outputs based on academic fidelity, aesthetic consistency, and logical coherence. Candidates below threshold are iteratively revised or regenerated.
  • Dataset Curation: Accepted samples include the paper context, real figure, synthetic counterpart, figure type, prompt, generator identity, and full provenance.
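The generation–review loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Review` fields mirror the three criteria named in the text, but the unweighted-mean aggregation, the `threshold` and `max_rounds` values, and the `toy_generate`/`toy_review` stand-ins are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    fidelity: float     # academic fidelity, assumed to lie in [0, 1]
    aesthetics: float   # aesthetic consistency, assumed to lie in [0, 1]
    coherence: float    # logical coherence, assumed to lie in [0, 1]
    feedback: str = ""

    def score(self) -> float:
        # Assumption: unweighted mean; the paper's aggregation may differ.
        return (self.fidelity + self.aesthetics + self.coherence) / 3


def refine(prompt: str,
           generate: Callable[[str], str],
           review: Callable[[str], Review],
           threshold: float = 0.8,
           max_rounds: int = 3) -> Optional[str]:
    """Return the first candidate whose score meets the threshold, else None."""
    for _ in range(max_rounds):
        candidate = generate(prompt)
        result = review(candidate)
        if result.score() >= threshold:
            return candidate
        # Fold the reviewer's feedback into the prompt before regenerating.
        prompt = f"{prompt}\nRevise: {result.feedback}"
    return None


# Hypothetical stand-ins for the image generator and the review agent.
def toy_generate(prompt: str) -> str:
    return f"figure<{prompt.count('Revise')}>"


def toy_review(candidate: str) -> Review:
    revisions = int(candidate[7:-1])           # number of revisions so far
    quality = min(1.0, 0.5 + 0.2 * revisions)  # quality improves per revision
    return Review(quality, quality, quality, feedback="align axis labels")


accepted = refine("overview figure for method X", toy_generate, toy_review)
```

In this toy setup the candidate is accepted on the third round; with a smaller `max_rounds` the function returns `None`, which corresponds to discarding a sample that never passes review.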

The resulting benchmark spans three figure types (Illustration, Overview, Experimental Figure) and four scientific domains. It comprises 72,965 real and 150,807 synthetic images, with aligned real–synthetic pairs for direct comparison.

Experimental Protocol and Metrics

All images are normalized (PNG conversion, resolution alignment), color quantization/snap is applied to reduce nuisance variation, and splits occur at the paper level to prevent data leakage. Evaluation metrics are accuracy and average precision (AP), reported under the following protocols:

  • Zero-shot: Detectors pretrained on generic AIGI datasets, directly transferred to SciFigDetect.
  • Cross-generator: Single-generator or joint-generator training, tested on mixed generator distributions.
  • Degradation Robustness: Detectors are evaluated on images corrupted by compression (JPEG/WebP), blur, and noise, reflecting practical scenarios (re-rendering, screenshots, format conversion).
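Two parts of this protocol lend themselves to a short sketch: the paper-level split and the two reported metrics. The code below is illustrative only. The hash-based assignment, the `test_fraction` value, the 0.5 decision threshold, and the label convention (1 = synthetic) are assumptions for the example, not details taken from the paper.

```python
import hashlib


def split_of(paper_id: str, test_fraction: float = 0.2) -> str:
    """Assign a paper, and therefore all of its figures, to a single split."""
    digest = hashlib.sha256(paper_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives_seen, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            positives_seen += 1
            precision_sum += positives_seen / rank
    return precision_sum / positives_seen if positives_seen else 0.0


# Toy scores: higher means "more likely synthetic".
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
acc = accuracy(labels, scores)          # 0.5
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

Because `split_of` depends only on the paper identifier, a real figure and its synthetic counterparts from the same paper always land in the same split, which is the property that prevents leakage. Tied scores are ordered arbitrarily here; a library implementation may handle ties differently.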

Detection Performance and Analysis

Zero-shot Transfer

Strong generalization failures are observed across all evaluated detectors. Even the best-performing model, LGrad, reaches only 53.68% mean accuracy, and most detectors are biased toward predicting that scientific figures are real (near-perfect accuracy on real figures, below 3% on synthetic ones). This indicates that current detection paradigms, which rely on open-domain artifacts (frequency, texture, spatial structures, CLIP-vision embeddings), are ineffective for scientific diagrams, reflecting a substantial distributional gap.
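A small worked example shows why a mean accuracy close to 50% is consistent with a detector that labels nearly everything as real. It assumes that mean accuracy is the unweighted average of the real and synthetic accuracies, which the summary does not state explicitly; the label lists are made up for illustration.

```python
def per_class_accuracy(labels, preds):
    """Return (accuracy on real, accuracy on synthetic); 0 = real, 1 = synthetic."""
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for label, pred in zip(labels, preds):
        total[label] += 1
        correct[label] += int(label == pred)
    return correct[0] / total[0], correct[1] / total[1]


# Hypothetical detector that calls almost every figure "real".
labels = [0] * 100 + [1] * 100
preds = [0] * 99 + [1] + [1] * 3 + [0] * 97

real_acc, syn_acc = per_class_accuracy(labels, preds)
mean_acc = (real_acc + syn_acc) / 2
print(f"real={real_acc:.2f} synthetic={syn_acc:.2f} mean={mean_acc:.2f}")
# prints: real=0.99 synthetic=0.03 mean=0.51
```

Under this assumption, a mean in the low 50s carries almost no information about synthetic figures, so the per-class numbers are the ones to inspect.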

Cross-Generator Generalization

Single-generator training results in pronounced generator-specific overfitting. Detectors trained on Nano Banana Pro figures, for example, perform well on that generator's outputs but very poorly on GPT-image-1.5 outputs, and vice versa. This failure mode persists even across aligned real–synthetic pairs, highlighting distinct distributional artifacts. Joint training improves performance across both generators (Effort reaches 95.58% average accuracy), but this is not indicative of true generalization: detectors may still struggle on unseen generators with novel figure conventions.

Robustness to Image Degradation

Detectors degrade substantially under realistic post-processing corruptions. NPR drops from 93.96% accuracy on clean images to 75.94% under JPEG compression (q=30), and 50.49% under Gaussian noise (σ=20). Other models exhibit similar sensitivities to compression, blur, and noise, suggesting fragility in deployment scenarios where figures are routinely resaved, rendered, or transferred via lossy channels.
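As a rough illustration of the noise corruption, the sketch below adds zero-mean Gaussian noise with σ=20 to an 8-bit grayscale image stored as nested lists. The image representation, the fixed `seed`, and the clipping to [0, 255] are assumptions for the example; the benchmark's exact corruption code is not described in this summary, and the JPEG/WebP and blur corruptions would need an imaging library and are not shown.

```python
import random


def add_gaussian_noise(image, sigma=20.0, seed=0):
    """Return a noisy copy of an 8-bit grayscale image, clipped to [0, 255]."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


# Toy 4x4 image: a flat mid-gray patch.
clean = [[128] * 4 for _ in range(4)]
noisy = add_gaussian_noise(clean, sigma=20.0)
```

A robustness evaluation would then run the same detector on `clean` and `noisy` versions of every test figure and compare the resulting accuracy and AP.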

Implications and Future Directions

SciFigDetect establishes that detecting publication-grade AI-generated scientific figures remains a largely unsolved problem. The benchmark reveals significant weaknesses in current detection architectures: lack of generalization to structured, text-heavy diagrams; strong overfitting to seen generators; and fragility under post-processing corruptions such as compression, blur, and noise. These findings imply that robust scientific figure forensics will require new detection paradigms, likely leveraging figure-specific semantic, structural, and textual cues rather than generic open-domain image artifacts. The dataset and evaluation protocols provide a foundational platform for advancing detection research in this space.

Future research directions include:

  • Architectures optimized for structured multimodal content (e.g., joint vision–text embedding, figure context modeling).
  • Generator-agnostic detection strategies, potentially grounded in figure provenance, structural analysis, and annotation density metrics.
  • Adversarial robustness to common document-level image manipulations.
  • Integration of detection modules into automated publishing workflows and editorial platforms.

Conclusion

SciFigDetect is the first systematic benchmark for AI-generated scientific figure detection, targeting an increasingly relevant integrity problem in academic publishing. Its agentic pipeline, aligned real–synthetic figure pairs, and evaluations under zero-shot, cross-generator, and degraded-image settings expose substantial gaps in current AIGI detection models. The benchmark is intended to serve as a foundation for developing robust, context-aware detectors capable of addressing the evolving landscape of scientific illustration generation and visual forensics (2604.08211).
