- The paper demonstrates that LLMs can serve as effective quality checkers by detecting critical errors in scientific papers with high hit rates and cost efficiency.
- It details baseline approaches comparing PDF and LaTeX inputs and highlights performance variations among models like OpenAI's o3 and Gemini.
- The study proposes an automatic evaluation framework using multiple LLM judges to mitigate bias and benchmark scientific error detection.
Automated Evaluation of LLMs for Scientific Paper Quality Checking
Motivation and Context
The acceleration of scientific publication rates has imposed unprecedented strain on traditional peer review mechanisms, culminating in what is described as the "peer review crisis." LLMs have been proposed as scalable solutions for review generation, but prior studies highlight concerns over superficiality and lack of critical reasoning in LLM-generated reviews. This paper pioneers a shift from full review generation toward manuscript quality checking as a targeted application for LLMs, emphasizing identification of critical errors that threaten the validity of scientific conclusions.
Dataset Construction and Task Definition
A central component is the WithdrarXiv-Check dataset: curated from the WithdrarXiv corpus of withdrawn arXiv papers, filtered for retraction reasons that are clear and identifiable from the manuscript itself. Post-processing combined LLM-assisted and manual screening to remove ambiguous or templated withdrawal comments, non-English texts, version confounds, and error categories not detectable through document inspection. The resulting dataset consists of 1,225 cases, predominantly from mathematics and physics domains, with structured retraction reasons amenable to automated evaluation.
The task is defined as detection of "critical errors and unsoundness problems" by LLMs acting as quality checkers, with performance measured by their ability to identify these issues in papers as presented.
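One way to picture a dataset record, as a minimal sketch: the field names below are illustrative assumptions, since the paper describes the attributes of each case but not a concrete schema.

```python
from dataclasses import dataclass

@dataclass
class WithdrawnCase:
    """One WithdrarXiv-Check entry (hypothetical schema, for illustration)."""
    arxiv_id: str           # identifier of the withdrawn paper
    domain: str             # e.g. "math" or "physics"
    retraction_reason: str  # structured gold reason, identifiable in-manuscript
    latex_available: bool   # whether source exists for the LaTeX baseline
```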
Three baselines were explored:
- PDF Attachment: Papers provided as PDFs, leveraging vendor-specific document ingestion pipelines.
- OCR-based Prompt: Not directly evaluated; proposed for future work, mindful of transcription errors.
- LaTeX Script Prompt: Papers supplied as raw LaTeX source, ensuring fidelity of mathematical content but risking noise from markup and loss of compiled content.
Both PDF and LaTeX approaches were evaluated; the latter restricted to papers with available source scripts. Images were excluded due to technical constraints.
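A minimal sketch of how the two evaluated input formats might be fed to a checker model. The request structure, helper names, and instruction wording are assumptions for illustration, not the paper's actual prompts or any vendor's API.

```python
from pathlib import Path

# Illustrative instruction wording; the paper's actual prompt is not reproduced here.
INSTRUCTIONS = (
    "You are a scientific quality checker. List up to {k} critical "
    "errors or unsoundness problems you find in this paper."
)

def build_latex_request(tex_path: str, k: int = 5) -> dict:
    """LaTeX baseline: the raw source is inlined into the prompt text."""
    source = Path(tex_path).read_text(encoding="utf-8", errors="replace")
    return {"prompt": INSTRUCTIONS.format(k=k) + "\n\n" + source}

def build_pdf_request(pdf_path: str, k: int = 5) -> dict:
    """PDF baseline: the file is attached, leaving ingestion to the
    vendor's own document pipeline."""
    return {"prompt": INSTRUCTIONS.format(k=k),
            "attachments": [Path(pdf_path)]}
```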
Automatic Evaluation Framework
The framework operationalizes LLMs-as-judges: multiple top-performing reasoning LLMs independently assess the outputs of LLM checkers. Evaluation metrics are:
- Hit Rate at k (HR@k): Proportion of papers for which at least one of the checker's top-k identified problems matches the gold retraction reason, as confirmed by a majority of judges.
- Average Precision at k (AP@k): Fraction of the checker's reported problems (up to k) that the judges accept as true positives.
To mitigate potential false positives from single-model biases, majority voting among judges from different vendors is imposed. API usage, token counts, and estimated costs are monitored, informing practical scalability.
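The metric definitions above can be sketched in a few lines. This is a minimal interpretation under stated assumptions: each paper is represented as a list of per-problem judge vote lists, and "accepted" means a strict majority of judges voted positive; the paper's exact aggregation code may differ.

```python
def majority(votes):
    """True if a strict majority of the judges voted positive."""
    return sum(votes) > len(votes) / 2

def hit_rate_at_k(cases, k):
    """HR@k: fraction of papers where any of the checker's top-k reported
    problems is judged to match the gold retraction reason.

    `cases[i][j]` is the list of judge votes (bools) for the checker's
    j-th reported problem on paper i.
    """
    hits = sum(
        1 for problems in cases
        if any(majority(votes) for votes in problems[:k])
    )
    return hits / len(cases)

def precision_at_k(cases, k):
    """Precision over the top-k slots: share of reported problems that the
    judges accept as true positives."""
    accepted = reported = 0
    for problems in cases:
        for votes in problems[:k]:
            reported += 1
            accepted += majority(votes)
    return accepted / reported if reported else 0.0
```

Majority voting across judges from different vendors is what keeps a single lenient judge (e.g. one prone to affirmative votes) from inflating either metric.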
Experimental Results
Experiments utilized prominent LLMs: Google's Gemini 2.5 Pro and Flash, OpenAI's o3 and o4-mini, Anthropic's Claude 3.7 Sonnet. Key findings include:
- Problem Identification: OpenAI's o3 consistently achieved the highest hit rates (PDF: 48.2%; LaTeX: 50.6%) and made full use of allowed problem slots. Gemini models exhibited slightly lower hit rates and were more conservative in reporting issues.
- Format Robustness: Gemini models showed reduced performance when switching to LaTeX, suggesting sensitivity to format changes. o-series models were largely unaffected, indicating format invariance or superior training on LaTeX.
- Precision: Gemini 2.5 Pro delivered higher precision (35.2%) compared to o3 (29.5%), reflecting a trade-off between caution and coverage. Claude Sonnet's precision was relatively high (36.4%) despite significant deficits in hit rate and problem reporting.
- Token and Cost Dynamics: Substantial differences in token consumption highlight divergent vendor pipelines for PDF processing. o3 offered superior cost-efficiency after recent API pricing revisions.
- Judging Behavior: Dual-LLM judging reduced susceptibility to hallucinated or lenient verdicts (e.g., Gemini Pro's tendency to vote affirmatively without strict evidence).
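The cost comparison reduces to simple token arithmetic; a minimal sketch, where the per-million-token prices are placeholders rather than actual vendor rates (which vary by model and change over time, as the o3 repricing illustrates):

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimated API cost of one checking run, given prices in USD per
    million tokens (placeholder values; real pricing varies by vendor)."""
    return (input_tokens * price_in_per_m
            + output_tokens * price_out_per_m) / 1_000_000
```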
Numerical results validate the capability of leading LLMs to autonomously detect critical flaws in scientific manuscripts, with o3 establishing itself as the most competent checker in both coverage and cost metrics.
Implications and Limitations
Practical implications include:
- Benchmarking: The framework and dataset constitute a reusable benchmark for document-based scientific reasoning and error detection, enabling systematic assessment and improvement of LLMs in this domain.
- Workflow Integration: LLM quality checkers are positioned as augmentative tools for preliminary manuscript screening, reducing reviewer burden without displacing domain expertise.
Methodological limitations are acknowledged:
- Exclusive reliance on closed-source LLMs without comparison to open-source alternatives.
- Automatic evaluation introduces circularity; absence of domain expert calibration risks overestimation or bias.
- Restriction to math and physics domains, limiting generalizability.
- Potential contamination of training data, although empirical evidence suggests minimal memorization impact.
Future directions include expanding to broader scientific domains, including supplementary material and references, customizing prompts by scientific field, exploring multi-agent collaborative workflows, and engaging human experts for benchmark curation and validation.
Comparison and Complementarity
The work contrasts with concurrent studies such as "When AI Co-Scientists Fail: SPOT, a Benchmark for Automated Verification of Scientific Research" (Son et al., 17 May 2025), which focuses more on benchmarking and annotation quality. Key differentiators include dataset scale, input normalization, evaluation flexibility, and judge plurality. Both lines of research converge on the superior performance of OpenAI's o3 model, reinforcing its position in scientific error detection tasks.
Conclusion
This paper delineates a formal framework and extensible baselines for leveraging LLMs in automated detection of critical errors in scientific papers. Through rigorous evaluation on a large filtered dataset of withdrawn arXiv papers, it demonstrates that state-of-the-art reasoning LLMs—particularly OpenAI's o3—can achieve strong hit rates while maintaining cost efficiency. The methodology and results set a foundation for integrating LLMs as auxiliary quality checkers in peer review workflows, pending ethical, legal, and sociotechnical safeguards, thereby supporting the integrity of the scientific publication process as submission volumes continue to escalate.