BadScientist: LLM Paper Fabrication Analysis

Updated 6 December 2025
  • BadScientist Framework is an evaluation system that analyzes vulnerabilities in LLM-driven research by fabricating and reviewing AI-generated papers.
  • It employs a modular pipeline combining paper generation, multi-model review, and statistically calibrated score aggregation with formal error guarantees.
  • The framework exposes a concern–acceptance conflict where reviewers flag issues yet accept unsound papers, urging the need for robust integrity safeguards.

The BadScientist framework is an evaluation system designed to analyze the vulnerability of LLM-based research agents and automated peer review systems to paper fabrication attacks. It provides both a modular pipeline for the end-to-end generation and review of research papers composed entirely without authentic experiments, and a rigorous statistical evaluation methodology to assess whether AI-generated, unsound papers can successfully pass through contemporary multi-model LLM review workflows. The framework exposes structural weaknesses in automated academic publishing processes and motivates the development of more robust integrity defense mechanisms (Jiang et al., 20 Oct 2025).

1. System Architecture and Pipeline

BadScientist is architected around three core modules: (1) a Paper Generation Agent $\mathcal{G}$, (2) a Review Agent $\mathcal{R}$ comprising multiple LLM reviewers, and (3) an Analysis/Aggregation module $\mathcal{A}$ for calibration, aggregation, and statistical error guarantee computation.

The framework pipeline is represented as:

  • Seed Prompt $(t,s)$: Specifies topic $t$ and attack strategy $s$.
  • Data Synthesizer $q(D \mid t,s)$: Generates pseudo-experimental data $D$ conditioned on $(t,s)$.
  • Visualization Module $\mathrm{viz}(D)$: Renders plots/tables from $D$.
  • Manuscript Composer: Assembles a fully structured LaTeX manuscript.
  • Review Agent $\mathcal{R}$: Queries multiple LLMs per paper using a fixed rubric, and aggregates the individual rubric vectors and textual comments.
  • Calibration/Aggregation $\mathcal{A}$: Uses real paper data for threshold calibration and computes statistical concentration bounds.

Manuscript output is required to compile and to be structurally correct, a well-formedness constraint on the generated LaTeX.

High-level agent pseudocode is provided in the paper (not reproduced here).
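The three-module flow described above can be sketched as follows. This is a minimal illustration only: all function and class names are assumptions for exposition, and the synthesizer, composer, and reviewers are stubbed with random values rather than LLM calls.

```python
import random
from dataclasses import dataclass

@dataclass
class Manuscript:
    topic: str
    strategy: str
    latex: str

def synthesize_data(topic, strategy, rng):
    """Data synthesizer q(D | t, s): fake 'experimental' numbers."""
    return [rng.uniform(0.7, 0.99) for _ in range(5)]

def compose_manuscript(topic, strategy, data):
    """Manuscript composer: assemble a structured LaTeX document."""
    rows = " \\\\ ".join(f"{x:.3f}" for x in data)
    return Manuscript(topic, strategy,
                      latex=f"\\begin{{tabular}}{{c}}{rows}\\end{{tabular}}")

def review(manuscript, reviewers, rng):
    """Review agent R: one rubric score per LLM reviewer (stubbed)."""
    return {name: rng.uniform(1, 10) for name in reviewers}

def aggregate(scores, threshold=5.5):
    """Analysis module A: mean aggregation against a calibrated threshold."""
    mean = sum(scores.values()) / len(scores)
    return mean, mean >= threshold

rng = random.Random(0)
paper = compose_manuscript("topic", "TooGoodGains",
                           synthesize_data("topic", "TooGoodGains", rng))
scores = review(paper, ["o3", "o4-mini", "GPT-4.1"], rng)
mean_score, accepted = aggregate(scores)
```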

2. Presentation-Manipulation Strategies

BadScientist operationalizes five atomic attack strategies for paper fabrication, as well as their joint application:

  • TooGoodGains: Artificially amplifies performance improvements over the state of the art (SOTA).
  • BaselineSelect: Selectively reports weaker baselines and omits confidence intervals.
  • StatTheater: Constructs sophisticated statistical tables and $p$-values that create an illusion of validity.
  • CoherencePolish: Focuses on producing flawless document structure, consistent notation, and professional typography.
  • ProofGap: Inserts "rigorous" proofs concealing subtle logical gaps.
  • All: Applies all five strategies simultaneously.

The paper illustrates these strategies with a fabricated results table generated under TooGoodGains and a misleading fabricated loss curve (figures not reproduced here).
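To make the TooGoodGains idea concrete, here is a hypothetical sketch of the arithmetic such a strategy embodies: fabricated scores that uniformly beat the baseline by a fixed margin. The benchmark names and numbers are illustrative, not from the paper, and the paper's actual generator is an LLM prompt, not this function.

```python
def too_good_gains(sota_scores, boost=0.05):
    """Fabricate 'our method' results that uniformly beat SOTA by a
    fixed margin, clipped to stay just under a plausible ceiling."""
    return [min(s + boost, 0.999) for s in sota_scores]

# Hypothetical SOTA baselines on three benchmarks.
sota = {"MNLI": 0.902, "QQP": 0.915, "SST-2": 0.948}
ours = dict(zip(sota, too_good_gains(list(sota.values()))))
# Every fabricated score strictly exceeds its SOTA baseline.
```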

3. Formal Evaluation and Error Guarantees

The framework adopts a mathematically rigorous approach to aggregate review scores, calibrate thresholds, and provide formal error bounds.

Notation:

  • Paper space: $\mathcal{P}$.
  • Strategies: $\mathcal{S}$; Topics: $\mathcal{T}$.
  • Generator distribution: fabricated papers are drawn as $P \sim \mathcal{G}(\cdot \mid t, s)$ for topic $t$ and strategy $s$.
  • Review models: a set $M$ of LLM reviewers; each model $m \in M$ returns a rubric output $r_m(P)$.
  • Aggregate score: $\bar{S}(P) = f\big(r_1(P), \ldots, r_{|M|}(P)\big)$, where $f$ aggregates the per-reviewer rubric outputs.
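Under one common reading of this setup (mean aggregation over reviewer rubric vectors; an assumption, since the paper's exact map $f$ is not reproduced here), the aggregate score can be computed as:

```python
def aggregate_score(rubric_vectors, weights=None):
    """Aggregate per-reviewer rubric vectors into one scalar score.
    Sketch: f = (optionally weighted) mean over reviewers of each
    reviewer's mean rubric item."""
    per_reviewer = [sum(v) / len(v) for v in rubric_vectors]
    if weights is None:
        return sum(per_reviewer) / len(per_reviewer)
    return sum(w * s for w, s in zip(weights, per_reviewer)) / sum(weights)

# Three reviewers, four rubric items each
# (e.g. soundness, novelty, clarity, impact).
r = [[6, 5, 7, 6], [4, 4, 5, 5], [7, 6, 6, 7]]
score = aggregate_score(r)
```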

Concentration Bound (Theorem 1 — Bernstein–McDiarmid):

Given centered, vector sub-Gaussian rubric outputs and an $L$-Lipschitz aggregation map $f$, the aggregate score concentrates around its expectation: for any deviation $\epsilon > 0$,

$$\Pr\Big( \big| \bar{S}(P) - \mathbb{E}\,\bar{S}(P) \big| \ge \epsilon \Big) \le 2\exp\!\left( -\frac{\epsilon^2}{2\big(\sigma^2 + L\epsilon/3\big)} \right),$$

where the variance term $\sigma^2$ controls typical deviations and the Lipschitz term controls worst-case deviations among LLM reviewers.

For binary acceptance predictors with margin $\gamma$ (aggregate score at least $\gamma$ from the acceptance threshold), applying the bound with $\epsilon = \gamma$ controls the probability that reviewer noise flips the accept/reject decision.
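A quick numerical sanity check of this kind of concentration, using synthetic bounded (hence sub-Gaussian) reviewer scores rather than the paper's data: as the number of reviewers grows, the aggregate's deviation from its mean becomes rarer.

```python
import random

def mean_score(rng, n_reviewers):
    """Mean of n bounded synthetic rubric scores on [1, 10]."""
    return sum(rng.uniform(1, 10) for _ in range(n_reviewers)) / n_reviewers

rng = random.Random(42)
true_mean = 5.5  # expectation of uniform(1, 10)
trials = 2000

# Empirical probability that the aggregate deviates from its mean by >= 1.0.
dev_small = sum(abs(mean_score(rng, 3) - true_mean) >= 1.0
                for _ in range(trials)) / trials
dev_large = sum(abs(mean_score(rng, 30) - true_mean) >= 1.0
                for _ in range(trials)) / trials
# More reviewers -> tighter concentration around the mean.
```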

Calibration Error Bounds (Propositions 1 & 2):

  • Use the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality to bound deviations between empirical and true accept rates.
  • Isotonic regression estimates the threshold where estimated acceptance probability crosses 0.5, with explicit error bounds in terms of the regression slope and estimation error.
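The DKW inequality is distribution-free: with probability at least $1-\delta$, the sup-norm gap between the empirical and true CDFs is at most $\sqrt{\ln(2/\delta)/(2n)}$. A small helper makes the calibration budget concrete (the sample size and confidence level below are illustrative, not the paper's):

```python
import math

def dkw_epsilon(n, delta):
    """DKW bound: sup-norm deviation of the empirical CDF from the
    true CDF is at most this epsilon with probability >= 1 - delta."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

# Calibrating acceptance thresholds on 200 real papers at 95% confidence:
eps = dkw_epsilon(200, 0.05)  # ~0.096
```

The $1/\sqrt{n}$ rate means quadrupling the calibration set only halves the worst-case error on estimated accept rates.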

4. Experimental Results and Empirical Vulnerabilities

BadScientist reveals systematic weaknesses in LLM-based review:

Strategy          ACPT₁    ACPT₂    ICR₁    ICR₂    ICR₃    ICR@M
TooGoodGains      67.0%    82.0%    38.4%   4.7%    2.3%    39.5%
BaselineSelect    32.0%    49.0%    35.2%   4.5%    2.3%    35.2%
StatTheater       53.5%    69.7%    29.4%   2.4%    4.7%    31.8%
CoherencePolish   44.0%    59.0%    28.2%   5.9%    1.2%    30.6%
ProofGap          35.4%    53.5%    25.9%   8.2%    7.1%    34.1%
All               52.0%    69.0%    50.6%   5.7%    8.0%    51.7%

(ACPT: acceptance rate; ICR: integrity concern rate.)
  • Acceptance rates for fabricated papers reach up to 82% for specific strategies.
  • Integrity Concern Rate (ICR) varies by reviewer model and attack method; many papers flagged for integrity still receive acceptance-level scores.
  • Score distributions demonstrate that reviewer models differ in their score tendencies (e.g., o4-mini is more lenient, o3 is more variable, GPT-4.1 is stricter near the threshold).

5. Concern–Acceptance Conflict

BadScientist identifies the phenomenon of concern–acceptance conflict: LLM reviewers often flag papers for integrity concerns in their free-form comments, yet still assign scores above the acceptance threshold.

Conflict rate is defined as the fraction of papers where a reviewer both raises an integrity concern and recommends acceptance. For example:

Model     TooGoodGains  BaselineSelect  StatTheater  CoherencePolish  ProofGap  All
o3        33.3%         25.8%           52.0%        30.0%            40.9%     29.5%
o4-mini   100.0%        50.0%           100.0%       80.0%            71.9%     100.0%
GPT-4.1   50.0%         50.0%           75.0%        0.0%             33.3%     57.1%

Despite recognizing the unsound nature of fabricated submissions, LLM reviewers systematically fail to translate these concerns into actual rejections.
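The conflict rate as defined above can be computed directly from per-review records. A minimal sketch (the record fields and example data are assumptions for illustration):

```python
def conflict_rate(reviews):
    """Fraction of reviews that both raise an integrity concern and
    score at acceptance level (per the definition above)."""
    conflicted = sum(1 for r in reviews if r["concern"] and r["accept"])
    return conflicted / len(reviews)

reviews = [
    {"concern": True,  "accept": True},   # flags issues, accepts anyway
    {"concern": True,  "accept": False},  # flags issues, rejects
    {"concern": False, "accept": True},   # no concern, accepts
    {"concern": True,  "accept": True},   # flags issues, accepts anyway
]
rate = conflict_rate(reviews)  # 2 of 4 reviews are in conflict
```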

6. Detection and Mitigation Attempts

Two principal mitigation strategies are evaluated:

  • Review-with-Detection (ReD): LLM reviewers return their scores, a binary "AI-Fabricated/Non-Fabricated" label, and quoted evidence.
  • Detection-Only (DetOnly): Reviewers provide only the binary detection label with supporting evidence.

Performance metrics include true positive rate (TPR), false positive rate (FPR), accuracy (Acc), and $F_1$ score. Even with explicit prompts for fabrication detection, detection rates barely surpass random guessing. ReD sometimes yields higher acceptance rates than review with no explicit detection prompt at all.
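These metrics follow the standard confusion-matrix definitions; a compact implementation over hypothetical labels (1 = fabricated, 0 = authentic):

```python
def detection_metrics(y_true, y_pred):
    """TPR, FPR, accuracy, and F1 for binary fabrication detection."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr else 0.0
    return tpr, fpr, acc, f1

# A detector at chance level: it catches half the fabricated papers
# but also flags half the authentic ones.
tpr, fpr, acc, f1 = detection_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```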

Detection results summary:

Method o3 Acc o4-mini Acc GPT-4.1 Acc
Random 50.0% 50.0% 50.0%
ReD 67.0% 46.0% 50.0%
DetOnly 57.0% 45.0% 56.0%

A plausible implication is that prompting LLMs for integrity checking in isolation is not effective in reliably identifying fabricated submissions (Jiang et al., 20 Oct 2025).

7. Implications and Safeguards

Key findings are:

  • Automated AI-driven publication pipelines are fundamentally susceptible to fabrication: with certain attack strategies, up to 82% acceptance is observed.
  • Concern–acceptance conflict undermines integrity signals, as LLM reviewers both flag and accept unsound work.
  • Statistical aggregation and calibration, even with sound mathematical guarantees, fail to provide adequate defense against these attacks.

Recommended safeguards include:

  • Defense-in-depth such as provenance verification (e.g., artifact timestamping, data checkpoints).
  • Integrating integrity-weighted scoring, where acceptance is conditioned on the absence of flagged concerns.
  • Mandatory human review for papers near or above the concern threshold.
  • Audit logging of reviewer model actions and transparent post-publication review.
  • Pursuing future research in adversarial reviewer training, multimodal credible-interval reasoning, and rigorous cross-validation with authentic experiments.
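The integrity-weighted scoring and mandatory-human-review safeguards could combine into a gating rule like the following. This is a sketch under assumed semantics; the gating policy and threshold are not specified in the source.

```python
def gated_decision(score, concerns, threshold=5.5):
    """Accept only when the aggregate score clears the threshold AND no
    reviewer raised an integrity concern; any flagged paper is escalated
    to mandatory human review instead of being auto-accepted."""
    if concerns:
        return "human-review"
    return "accept" if score >= threshold else "reject"
```

The key design choice is that integrity flags override the score entirely, which directly removes the concern-acceptance conflict: a flagged paper can never be auto-accepted, no matter how well it scores.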

These results highlight urgent limitations of current AI-driven scientific publishing and the need for robust, multi-layered integrity verification systems (Jiang et al., 20 Oct 2025).
