AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context

Published 27 Jan 2026 in cs.SE and cs.AI | (2601.19494v2)

Abstract: High-quality evaluation benchmarks are pivotal for deploying LLMs in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal that previous assessments may have either misjudged or only partially captured model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights on LLM based ACR, i.e., the granularity/level of context and the choice of retrieval methods significantly impact ACR performance, and this influence varies depending on the LLM, programming language, and the LLM usage paradigm e.g., whether an Agent architecture is employed. The code, data, and other artifacts of our evaluation set are available at https://github.com/alibaba/aacr-bench .

Abstract PDF Upgrade to Chat

Summary

The paper introduces AACR-Bench, a benchmark that improves defect coverage by 285% through AI-assisted and expert-verified annotations.
It demonstrates the value of repository-level context and multilingual evaluation for enhancing the performance of LLM-based code review systems.
The benchmark reveals significant language-specific performance variances, emphasizing the need for precise context retrieval and adaptation.

Evaluating Automatic Code Review with AACR-Bench

AACR-Bench represents a novel approach to assessing automatic code review systems, emphasizing the importance of repository-level context and multilingual support. By addressing the limitations of existing benchmarks, AACR-Bench provides an enriched dataset that enhances defect coverage and context awareness, ultimately improving the evaluation of LLMs in Automated Code Review (ACR) tasks.

Introduction

The increasing adoption of LLMs for ACR has highlighted the need for robust evaluation benchmarks capable of accurately assessing their performance. Previous benchmarks have been limited by their lack of support for multiple programming languages and repository-level context, as well as by their reliance on noisy, incomplete ground truth derived from real-world Pull Requests (PRs). AACR-Bench addresses these deficiencies through an "AI-assisted, Expert-verified" annotation pipeline that significantly increases defect coverage, enabling more rigorous evaluations of LLMs.

Construction and Features of AACR-Bench

Multilingual and Context-Aware Benchmark

AACR-Bench is characterized by its support for ten mainstream programming languages and its repository-level context awareness, offering a more realistic and comprehensive environment for evaluating ACR systems.

Figure 1: Overview of AACR-Bench. AACR-Bench contains 200 PRs and 1,505 fine-grained review comments extracted and curated from 50 popular repositories, covering 10 mainstream programming languages.

AACR-Bench encompasses 200 PRs and 1,505 review comments, sourced from 50 popular repositories. This benchmark features cross-file and cross-language context, addressing the limitations of previous datasets that have been restricted to specific languages or isolated code fragments.

Advanced Annotation Pipeline

The dataset has been curated using a hybrid approach that involves both AI-assisted and expert-verified annotations, yielding a 285% increase in defect coverage compared to traditional datasets. This process involves the meticulous review and annotation of comments generated by two ACR systems powered by six different LLMs.

Figure 2: Distribution of Review Comments in AACR-Bench. TS, JS, and Py stand for TypeScript, JavaScript, and Python, respectively. "Aug" denotes comments augmented from original PR reviews, while "Gen" denotes comments generated by the 6 LLMs.

Comparative Analysis with Existing Benchmarks

AACR-Bench offers significant improvements over existing benchmarks by providing repository-level context in a multi-language evaluation environment. Unlike previous benchmarks, AACR-Bench incorporates hybrid model-augmented human reviews, verified through human expert annotation, ensuring comprehensive defect coverage and reliable evaluation.

Experimental Insights

The evaluation of mainstream LLMs and ACR methods on AACR-Bench has revealed important insights:

Contextual Impact: The effectiveness of ACR methods is significantly influenced by the granularity of context provided and the choice of retrieval methods. Agent-based approaches generally exhibit higher precision, while traditional retrieval methods may introduce noise, affecting overall performance.
Language-Specific Performance: Performance varies markedly across programming languages, with substantial language-specific biases observed. This highlights the necessity of a multilingual benchmark like AACR-Bench to capture these variances.
Figure 3: Language-wise Code Review Performance.

Conclusion

AACR-Bench sets a new standard for evaluating ACR systems by emphasizing the importance of comprehensive defect coverage and repository-level context. This benchmark provides insights that are crucial for the development of next-generation LLM-based ACR systems, identifying key challenges such as precision-recall trade-offs, context adaptation, and language-specific biases. Future work will focus on expanding the dataset and improving ground truth accuracy through advanced semi-automated methods. AACR-Bench is a vital resource for researchers and developers aiming to optimize the application of LLMs in code review scenarios, driving advancements in AI-assisted software engineering.