
PeerPrism: Peer Evaluation Expertise vs Review-writing AI

Published 16 Apr 2026 in cs.CL | (2604.14513v1)

Abstract: LLMs are increasingly used in scientific peer review, assisting with drafting, rewriting, expansion, and refinement. However, existing peer-review LLM detection methods largely treat authorship as a binary problem (human vs. AI) without accounting for the hybrid nature of modern review workflows. In practice, evaluative ideas and surface realization may originate from different sources, creating a spectrum of human-AI collaboration. In this work, we introduce PeerPrism, a large-scale benchmark of 20,690 peer reviews explicitly designed to disentangle idea provenance from text provenance. We construct controlled generation regimes spanning fully human, fully synthetic, and multiple hybrid transformations. This design enables systematic evaluation of whether detectors identify the origin of the surface text or the origin of the evaluative reasoning. We benchmark state-of-the-art LLM text detection methods on PeerPrism. While several methods achieve high accuracy on the standard binary task (human vs. fully synthetic), their predictions diverge sharply under hybrid regimes. In particular, when ideas originate from humans but the surface text is AI-generated, detectors frequently disagree and produce contradictory classifications. Accompanied by stylometric and semantic analyses, our results show that current detection methods conflate surface realization with intellectual contribution. Overall, we demonstrate that LLM detection in peer review cannot be reduced to a binary attribution problem. Instead, authorship must be modeled as a multidimensional construct spanning semantic reasoning and stylistic realization. PeerPrism is the first benchmark evaluating human-AI collaboration in these settings. We release all code, data, prompts, and evaluation scripts to facilitate reproducible research at https://github.com/Reviewerly-Inc/PeerPrism.

Summary

  • The paper introduces a benchmark, PeerPrism, that analyzes 20,690 peer reviews to disentangle human evaluation (idea) from AI-generated text.
  • It benchmarks seven detection methods across six review generation regimes, revealing inconsistencies in binary human-versus-AI attribution.
  • The study highlights stylistic and semantic shifts, such as reduced first-person pronoun use and high cosine similarity, underscoring the need for nuanced attribution frameworks.

Disentangling Intellectual and Stylistic Provenance in Peer Review: An Analysis of PeerPrism

Introduction

The deployment of LLMs in scientific peer review has augmented the reviewing process through automated drafting, expansion, and stylistic refinement. However, prevailing detection systems treat authorship as a binary human-versus-AI problem, disregarding the continuum of collaboration in which evaluative reasoning (the "idea") and surface realization (the "text") may originate from different agents. The paper "PeerPrism: Peer Evaluation Expertise vs Review-writing AI" (2604.14513) introduces the PeerPrism benchmark, a systematic, large-scale dataset comprising 20,690 peer reviews, designed to disentangle idea provenance from text provenance and to rigorously evaluate current LLM detection methodologies under realistic hybrid authorship regimes.

Benchmark Construction and Generation Regimes

PeerPrism was constructed by mining reviews from OpenReview for ICLR and NeurIPS across 2021-2024, capturing both pre- and post-LLM usage. Human reviews were selected to ensure temporal, venue, and acceptance status diversity. The benchmark incorporates six frontier LLMs (including OpenAI GPT-5, Gemini-2.5-Flash, and Llama-4-Scout) to generate reviews under diverse regimes, recognizing the necessity for high-fidelity, long-context, and domain-specific critique unmatched by smaller models.

The authors operationalize six review generation regimes:

  • Fully Synthetic (AI Ideas / AI Text): LLMs generate reviews independently, with no human input.
  • Rewritten (Human Idea / AI Text): The LLM rewrites a human review, altering only style, not content.
  • Extract Regenerate (Human Idea / AI Text): An LLM extracts structured evaluative content from a human review and manuscript, then regenerates the review solely from this semantic core.
  • Expanded (Mixed Idea / AI Text): An LLM expands a human review, maintaining its core but introducing new machine-generated arguments.
  • Hybrid Augmentation (Mixed Idea / AI Text): An LLM synthesizes a human review and multiple LLM-generated critiques, yielding a genuinely mixed idea provenance.
  • Original Human Reviews (Human Idea / Human Text): Serve as the gold standard.

Each entry is labeled by both idea and text origin, permitting multidimensional attribution analysis.
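The two-axis labeling can be pictured as a small lookup from regime to an (idea origin, text origin) pair. The sketch below is illustrative only: the names `Origin`, `REGIME_PROVENANCE`, `ReviewRecord`, and `binary_label`, as well as the regime keys, are hypothetical and are not taken from the PeerPrism release.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    HUMAN = "human"
    AI = "ai"
    MIXED = "mixed"


# Hypothetical regime keys mapped to (idea origin, text origin),
# following the six regimes listed above.
REGIME_PROVENANCE = {
    "original_human": (Origin.HUMAN, Origin.HUMAN),
    "fully_synthetic": (Origin.AI, Origin.AI),
    "rewritten": (Origin.HUMAN, Origin.AI),
    "extract_regenerate": (Origin.HUMAN, Origin.AI),
    "expanded": (Origin.MIXED, Origin.AI),
    "hybrid_augmentation": (Origin.MIXED, Origin.AI),
}


@dataclass(frozen=True)
class ReviewRecord:
    """One benchmark entry; the schema is an assumption for illustration."""
    review_id: str
    regime: str
    text: str

    @property
    def idea_origin(self) -> Origin:
        return REGIME_PROVENANCE[self.regime][0]

    @property
    def text_origin(self) -> Origin:
        return REGIME_PROVENANCE[self.regime][1]


def binary_label(record: ReviewRecord) -> str:
    """Collapse both axes into a single label based on text origin only."""
    return "ai" if record.text_origin is Origin.AI else "human"


record = ReviewRecord("r1", "rewritten", "The paper is clearly written.")
print(record.idea_origin.value, record.text_origin.value, binary_label(record))
```

Under this collapse, a rewritten review carries human ideas yet receives the label "ai", which is the kind of information loss the benchmark is designed to expose.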

Detector Benchmarking: Binary Versus Provenance-Aware Evaluation

The study benchmarks seven state-of-the-art detection methods: GLTR, DetectGPT, Fast-DetectGPT, Lastde++, Binoculars, RADAR, and the context-aware Anchor system. These methods span likelihood-based, perturbation-based, embedding-based, and supervised paradigms. Detector thresholds were calibrated without fine-tuning the detectors themselves, so the results reflect off-the-shelf behavior.
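This summary does not say how the thresholds were calibrated. One common approach, shown here purely as an assumption, is to fix a target false-positive rate on scores from known human reviews and leave the detector itself untouched. The functions `calibrate_threshold` and `predict` are hypothetical helpers, and the convention that higher scores mean "more AI-like" is also assumed.

```python
import math


def calibrate_threshold(human_scores, target_fpr=0.05):
    """Pick a threshold so that at most target_fpr of human scores exceed it.

    Assumes higher scores mean "more AI-like". The detector is not retrained.
    """
    if not human_scores:
        raise ValueError("need at least one human score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(human_scores)
    n = len(ranked)
    # Number of human reviews we tolerate being flagged as AI.
    allowed = math.floor(target_fpr * n)
    # Scores strictly above this value are flagged; ties stay "human".
    return ranked[n - allowed - 1]


def predict(score, threshold):
    """Apply the calibrated threshold to a single detector score."""
    return "ai" if score > threshold else "human"


# Toy scores, invented for illustration only.
toy_human_scores = list(range(1, 11))
threshold = calibrate_threshold(toy_human_scores, target_fpr=0.2)
print(threshold, predict(9, threshold), predict(8, threshold))
```

Because ties at the threshold are kept as "human", the realized false-positive rate on the calibration set never exceeds the target.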

On the standard binary human-versus-synthetic split, Anchor and Fast-DetectGPT achieve high overall accuracy (90 to 95%) on fully synthetic reviews, confirming that contemporary detectors can identify purely LLM-authored reviews under controlled settings. However, detector performance is heavily generator-dependent, particularly for likelihood-based and supervised classifiers, with significant degradation observed on long-context, high-fluency reviews generated by models such as Gemini-2.5 or Claude-Haiku-4.5.

Figure 1: Detector prediction breakdown by review generation regimes, displaying divergence across hybrid review classes.

When evaluated on hybrid provenance regimes, detector predictions fragment sharply (Figure 1). For example, in the rewritten regime, GLTR predicts 64.9% of reviews as AI whereas Fast-DetectGPT classifies 74.2% as Human, and Anchor outputs yet another distribution. In the expanded and hybrid regimes, contradictions intensify; detectors alternate between surface-style detection (text origin) and latent semantic detection (idea origin). The study asserts that surface statistical signals confound deeper attribution of evaluative reasoning, and thus binary attribution collapses under realistic mixed-authorship scenarios.
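The fragmentation described above can be quantified with two simple statistics: each detector's AI-prediction rate within a regime, and the pairwise disagreement rate between detectors on the same reviews. The following is a minimal sketch with hypothetical helper names (`ai_rate`, `pairwise_disagreement`, `min_disagreement`) and invented toy labels; it is not the paper's evaluation script.

```python
from itertools import combinations


def ai_rate(labels):
    """Fraction of reviews a detector labels as "ai"."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(1 for label in labels if label == "ai") / len(labels)


def pairwise_disagreement(predictions):
    """predictions maps detector name -> labels aligned by review index."""
    rates = {}
    for a, b in combinations(sorted(predictions), 2):
        labels_a, labels_b = predictions[a], predictions[b]
        if len(labels_a) != len(labels_b):
            raise ValueError("label lists must be aligned")
        differing = sum(x != y for x, y in zip(labels_a, labels_b))
        rates[(a, b)] = differing / len(labels_a)
    return rates


def min_disagreement(ai_rate_a, ai_rate_b):
    """Lower bound on disagreement implied by two marginal AI rates."""
    return abs(ai_rate_a - ai_rate_b)


# Toy labels, invented for illustration only.
toy = {
    "det_a": ["ai", "ai", "human", "ai"],
    "det_b": ["human", "human", "human", "ai"],
}
print(ai_rate(toy["det_a"]), pairwise_disagreement(toy))

# Using the rewritten-regime figures quoted above, and assuming both were
# computed on the same set of reviews: GLTR flags 64.9% as AI, while
# Fast-DetectGPT labels 74.2% as Human, i.e. 25.8% as AI.
print(round(min_disagreement(0.649, 1 - 0.742), 3))
```

The last line illustrates why marginal rates alone already imply conflict: if two detectors flag different fractions of the same reviews, they must disagree on at least the absolute difference of those fractions.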

Stylometric and Semantic Provenance Analysis

PeerPrism enables quantification of stylometric and semantic shifts across review regimes. The following findings are emphasized:

  • Lexical and Rhetorical Shift: Machine-generated reviews exhibit increased lexical diversity and syntactic complexity, with a marked reduction in first-person pronouns (0.37 per review for synthetic versus 5.04 for human), signaling a shift towards impersonal, formal rhetoric ("objectivity bias").
  • Semantic Inheritance: Transformed reviews (rewritten, extract regenerate, hybrid) exhibit high cosine similarity (0.92) to their human seeds, substantiating that LLMs, when constrained to preserve content, maintain evaluative logic and argument structure even as stylistic features are overwritten.
  • Citation and Engagement Markers: Transformed reviews often increase manuscript references and explicit section pointers relative to both synthetic and human reviews, reflecting both inherited and amplified scholarly engagement.

These analyses elucidate a multidimensional provenance space: stylistic realization can be machine-like while intellectual critique remains human-aligned, and vice versa.
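Two of the signals above are easy to approximate with the standard library. The sketch below counts first-person pronouns and computes a cosine similarity over bag-of-words counts. Both are assumptions for illustration: the pronoun list is hand-picked, and the word-count cosine is only a stand-in for the similarity measure used in the study, whose representation is not specified in this summary.

```python
import math
import re
from collections import Counter

# Hand-picked pronoun list; an assumption, not the paper's lexicon.
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase and split into alphabetic tokens."""
    return TOKEN.findall(text.lower())


def first_person_count(text):
    """Number of first-person pronoun tokens in a review."""
    return sum(1 for token in tokenize(text) if token in FIRST_PERSON)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy texts, invented for illustration only.
human = "I think our baseline comparison is weak, and we need more ablations."
rewritten = "The baseline comparison is weak, and more ablations are needed."
print(first_person_count(human), first_person_count(rewritten))
print(cosine_similarity(human, rewritten))
```

In this toy pair the rewrite drops every first-person pronoun while keeping most content words, mirroring the pattern of a stylistic shift alongside preserved evaluative content.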

Theoretical and Practical Implications

The principal finding, that detector predictions diverge or contradict when faced with hybrid reviews, undermines the assumption that binary LLM attribution is tractable or meaningful in high-stakes, collaborative scientific workflows. The study argues that extant detectors implicitly solve distinct tasks (surface style versus semantic reasoning detection) without explicit modeling, leading to misleading or incommensurate results under real-world LLM usage patterns.

From a practical perspective, reliance on these systems in peer review governance risks misclassifying valid human expertise rendered through AI-aided writing, or, conversely, failing to identify reviews with superficial human polish but hollow machine-derived reasoning. The theoretical implication is that authorship in contemporary knowledge work must be modeled as a continuum, demanding provenance quantification frameworks capable of resolving the interplay between semantic source and linguistic surface.

Prospects for Future Development

PeerPrism lays the groundwork for future research directions, including:

  • Quantitative Provenance Estimation: Moving beyond binary classification to probabilistically estimate degrees of semantic and stylistic contribution per review segment.
  • Instance-Level Attribution: Finer-grained, phrase- or sentence-level attribution of human and AI influence.
  • Robustness to Model Progression: Continuous benchmarking across the evolving LLM landscape, accommodating increases in fluency and human-likeness.
  • Ethical and Policy Considerations: Development of standards and disclosure policies that reflect multidimensional authorship, mitigating institutional, reputational, and governance risks.

Conclusion

PeerPrism (2604.14513) demonstrates that LLM detection in scientific peer review is a multidimensional attribution problem in which authorship cannot be reliably reduced to a binary human-versus-AI distinction. Detector predictions under hybrid authorship regimes are inconsistent and often contradictory, reflecting the conflation of surface stylistic signals with deeper intellectual provenance. These findings necessitate a paradigm shift: future detection and evaluation mechanisms must quantify and disentangle semantic and textual provenance, enabling both rigorous measurement and principled stewardship of human-AI collaboration in high-impact evaluative settings.
