- The paper introduces a benchmark, PeerPrism, that analyzes 20,690 peer reviews to disentangle human evaluation (idea) from AI-generated text.
- It benchmarks seven detection methods across six review generation regimes, revealing inconsistencies in binary human-versus-AI attribution.
- The study highlights stylistic and semantic shifts, such as reduced first-person pronoun use and high cosine similarity, underscoring the need for nuanced attribution frameworks.
Disentangling Intellectual and Stylistic Provenance in Peer Review: An Analysis of PeerPrism
Introduction
The deployment of LLMs in scientific peer review has augmented the reviewing process through automated drafting, expansion, and stylistic refinement. However, prevailing detection systems treat authorship as a binary human-versus-AI problem, disregarding the continuum of collaboration in which evaluative reasoning (the "idea") and surface realization (the "text") may originate from different agents. The paper "PeerPrism: Peer Evaluation Expertise vs Review-writing AI" (2604.14513) introduces the PeerPrism benchmark, a systematic, large-scale dataset comprising 20,690 peer reviews, designed to disentangle idea provenance from text provenance and to rigorously evaluate current LLM detection methodologies under realistic hybrid authorship regimes.
Benchmark Construction and Generation Regimes
PeerPrism was constructed by mining reviews from OpenReview for ICLR and NeurIPS across 2021โ2024, capturing both pre- and post-LLM usage. Human reviews were selected to ensure temporal, venue, and acceptance status diversity. The benchmark incorporates six frontier LLMsโincluding OpenAI GPT-5, Gemini-2.5-Flash, and Llama-4-Scoutโto generate reviews under diverse regimes, recognizing the necessity for high-fidelity, long-context, and domain-specific critique unmatched by smaller models.
The authors operationalize six review generation regimes:
- Fully Synthetic (AIย Ideas/AIย Text): LLMs generate reviews independently, with no human input.
- Rewritten (Humanย Idea/AIย Text): The LLM rewrites a human review, altering only style, not content.
- Extract Regenerate (Humanย Idea/AIย Text): An LLM extracts structured evaluative content from a human review and manuscript, then regenerates the review solely from this semantic core.
- Expanded (Mixedย Idea/AIย Text): An LLM expands a human review, maintaining its core but introducing new machine-generated arguments.
- Hybrid Augmentation (Mixedย Idea/AIย Text): An LLM synthesizes a human review and multiple LLM-generated critiques, yielding a genuinely mixed idea provenance.
- Original Human Reviews (Humanย Idea/Humanย Text): Serve as the gold standard.
Each entry is labeled by both idea and text origin, permitting multidimensional attribution analysis.
Detector Benchmarking: Binary Versus Provenance-Aware Evaluation
The study benchmarks seven state-of-the-art detection methodologies including GLTR, DetectGPT, Fast-DetectGPT, Lastde++, Binoculars, RADAR, and the context-aware Anchor system. These methods span likelihood-based, perturbation-based, embedding-based, and supervised paradigms. Detector thresholds were calibrated without fine-tuning, ensuring generalizability.
On the standard binary human-versus-synthetic split, Anchor and Fast-DetectGPT achieve high overall accuracy (90โ95%) on fully synthetic reviews, confirming that contemporary detectors can identify purely LLM-authored reviews under controlled settings. However, detector performance is heavily generator-dependent, particularly for likelihood-based and supervised classifiers, with significant degradation observed on long-context, high-fluency reviews generated by models such as Gemini-2.5 or Claude-Haiku-4.5.
Figure 1: Detector prediction breakdown by review generation regimes, displaying divergence across hybrid review classes.
When evaluated on hybrid provenance regimes, detector predictions fragment sharply (Figure 1). For example, in the rewritten regime, GLTR predicts 64.9% of reviews as AI whereas Fast-DetectGPT classifies 74.2% as Human, and Anchor outputs yet another distribution. In the expanded and hybrid regimes, contradictions intensify; detectors alternate between surface-style detection (text origin) and latent semantic detection (idea origin). The study asserts that surface statistical signals confound deeper attribution of evaluative reasoning, and thus binary attribution collapses under realistic mixed-authorship scenarios.
Stylometric and Semantic Provenance Analysis
PeerPrism enables quantification of stylometric and semantic shifts across review regimes. The following findings are emphasized:
- Lexical and Rhetorical Shift: Machine-generated reviews exhibit increased lexical diversity and syntactic complexity, with a marked reduction in first-person pronouns (0.37 per review for synthetic versus 5.04 for human), signaling a shift towards impersonal, formal rhetoric ("objectivity bias").
- Semantic Inheritance: Transformed reviews (rewritten, extract regenerate, hybrid) exhibit high cosine similarity (0.92) to their human seeds, substantiating that LLMs, when constrained to preserve content, maintain evaluative logic and argument structure even as stylistic features are overwritten.
- Citation and Engagement Markers: Transformed reviews often increase manuscript references and explicit section pointers relative to both synthetic and human reviews, reflecting both inherited and amplified scholarly engagement.
These analyses elucidate a multidimensional provenance space: stylistic realization can be machine-like while intellectual critique remains human-aligned, and vice versa.
Theoretical and Practical Implications
The principal findingโthat detector predictions diverge or contradict when faced with hybrid reviewsโundermines the assumption that binary LLM attribution is tractable or meaningful in high-stakes, collaborative scientific workflows. The study argues that extant detectors implicitly solve distinct tasks (surface style versus semantic reasoning detection) without explicit modeling, leading to misleading or incommensurate results under real-world LLM usage patterns.
From a practical perspective, reliance on these systems in peer review governance risks misclassifying valid human expertise rendered through AI-aided writing, or, conversely, failing to identify reviews with superficial human polish but hollow machine-derived reasoning. The theoretical implication is that authorship in contemporary knowledge work must be modeled as a continuum, demanding provenance quantification frameworks capable of resolving the interplay between semantic source and linguistic surface.
Prospects for Future Development
PeerPrism lays the groundwork for future research directions, including:
- Quantitative Provenance Estimation: Moving beyond binary classification to probabilistically estimate degrees of semantic and stylistic contribution per review segment.
- Instance-Level Attribution: Finer-grained, phrase- or sentence-level attribution of human and AI influence.
- Robustness to Model Progression: Continuous benchmarking across the evolving LLM landscape, accommodating increases in fluency and human-likeness.
- Ethical and Policy Considerations: Development of standards and disclosure policies that reflect multidimensional authorship, mitigating institutional, reputational, and governance risks.
Conclusion
PeerPrism (2604.14513) demonstrates that LLM detection in scientific peer review is a multidimensional attribution problem in which authorship cannot be reliably reduced to a binary human-versus-AI distinction. Detector predictions under hybrid authorship regimes are inconsistent and often contradictory, reflecting the conflation of surface stylistic signals with deeper intellectual provenance. These findings necessitate a paradigm shift: future detection and evaluation mechanisms must quantify and disentangle semantic and textual provenance, enabling both rigorous measurement and principled stewardship of human-AI collaboration in high-impact evaluative settings.