MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

Published 19 Aug 2025 in cs.CL | (2508.14146v4)

Abstract: With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. LLMs have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose \textbf{MMReview}, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.

Abstract PDF Upgrade to Chat

Summary

The paper introduces MMReview, a benchmark designed for LLM-based peer review automation, covering 13 tasks across 4 disciplines.
The paper demonstrates that larger models and chain-of-thought prompting yield more structured, accurate reviews in extensive experiments.
The paper reveals that integrating multimodal data significantly enhances model robustness against prompt injection and improves evaluative performance.

MMReview: A Comprehensive Benchmark for Peer Review Automation

The paper "MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation" (2508.14146) addresses the limitations of existing LLM-based review systems in the peer review process, particularly their inadequacy in handling multimodal content effectively. To fill this gap, MMReview is introduced as a benchmark designed to evaluate LLMs and Multimodal LLMs (MLLMs) across diversified tasks and scientific disciplines. MMReview aims to establish a standardized evaluation framework that can comprehensively assess model capabilities in generating detailed reviews that align with human standards.

Benchmark Design and Methodology

MMReview comprises an elaborate construction pipeline divided into three main stages: data collection, processing, and task construction.

Figure 1: The construction pipeline of MMReview. The construction pipeline is divided into three stages: data collection, data processing, and task construction.

Data Collection and Processing

The benchmark is built on an extensive collection of 51,881 papers sourced from open review platforms, focusing on four major disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. The data is curated to ensure balanced representation across accepted and rejected papers. The papers undergo a multistage processing pipeline that includes filtering for quality and ensuring a balanced distribution. This exhaustive process results in a curated set of 240 samples representing 17 research domains.

Task Construction

MMReview incorporates 13 diverse tasks grouped into four thematic categories: step-based, outcome-based, preference-based, and attack-based tasks. These tasks aim to evaluate various aspects of peer review, such as the generation of review content, alignment with human preference, and robustness to adversarial inputs. Each task is designed to emulate specific aspects of the peer review process, enabling detailed assessment of model capabilities in summarization, strengths and weaknesses evaluation, scoring, decision-making, and preference ranking.

Experimental Evaluation

The experiments involve extensive evaluations of 18 open-source models and 3 closed-source models, assessing their performance across various tasks under different input modalities, including text-only, multimodal, and PDF-as-image inputs.

Figure 2: The average scores under text-only input setting, with context length measured in tokens.

Figure 3: The average scores under pdf-as-image input setting, with context length measured in the number of images.

Key Findings

Model Scale Matters: Larger models outperform others in generating structured and reliable review comments, indicating a correlation between model size and review quality.
Reflective Reasoning Benefits: Step-by-step reasoning enhances review quality, with Chain-of-Thought prompting resulting in lower Mean Absolute Error compared to direct judgment tasks.
Multimodal Robustness: Inclusion of multimodal inputs significantly improves model robustness against prompt injection, suggesting superior evaluative capabilities when image data complements textual analysis.

Future Directions

The benchmark's implementation demonstrates potential for significant improvements in automating aspects of peer review using LLMs. Future work may focus on expanding MMReview to include more comprehensive datasets and refining tasks to capture nuanced aspects of peer review processes. Exploration into strengthening model resistance to input perturbations and biases also remains critical.

Conclusion

MMReview establishes an essential foundation for evaluating LLM-based systems in the peer review domain, paving the way for developing robust automated systems that can assist human reviewers. The integration of multimodal inputs and diverse tasks provides a comprehensive framework that supports nuanced assessments of LLM capabilities, advocating for their potential role in scholarly publishing workflows.