
RovoDev Code Reviewer: A Large-Scale Online Evaluation of LLM-based Code Review Automation at Atlassian

Published 3 Jan 2026 in cs.SE, cs.AI, cs.CL, and cs.LG | (2601.01129v2)

Abstract: LLM-powered code review automation has the potential to transform code review workflows. Despite advances in LLM-powered code review comment generation, several practical challenges remain in designing enterprise-grade code review automation tools. In particular, this paper aims to answer a practical question: how can we design a review-guided, context-aware, quality-checked code review comment generation approach without fine-tuning? We present RovoDev Code Reviewer, an enterprise-grade LLM-based code review automation tool designed and deployed at scale within Atlassian's development ecosystem, with seamless integration into Atlassian's Bitbucket. Through offline, online, and user-feedback evaluations over a one-year period, we conclude that RovoDev Code Reviewer is effective at generating code review comments: 38.70% led to code resolution (i.e., triggered code changes in subsequent commits), and the tool shows promise for accelerating feedback cycles (decreasing PR cycle time by 30.8%), alleviating reviewer workload (reducing the number of human-written comments by 35.6%), and improving overall software quality (finding errors with actionable suggestions).

Summary

  • The paper presents a large-scale evaluation of RovoDev, an LLM-based tool that automates context-aware and actionable code review comments.
  • The evaluation across 1,900+ repositories shows a 38.7% code resolution rate and a 30.8% reduction in median pull request cycle time.
  • The study underscores the significance of structured prompting and dual quality checks, prioritizing actionable feedback over mere factual correctness.

RovoDev Code Reviewer: Transforming Code Review Automation

The paper "RovoDev Code Reviewer: A Large-Scale Online Evaluation of LLM-based Code Review Automation at Atlassian" (2601.01129) provides an in-depth analysis of the RovoDev Code Reviewer, a large-language-model (LLM)-based tool implemented at Atlassian for automating code reviews. This essay offers a comprehensive summary and critique of the tool's development, deployment, and evaluation, discussing its implications for software engineering.

Introduction and Motivation

The core motivation behind RovoDev Code Reviewer lies in addressing the growing complexity and resource-intensiveness of manual code reviews. As software projects scale, traditional code review processes become bottlenecks, slowing feature delivery and impacting productivity. With advances in LLM-powered automation, RovoDev aims to transform code review workflows by automating comment generation, thus expediting feedback cycles and enhancing software quality.

System Architecture and Components

The RovoDev Code Reviewer employs three key components to generate high-quality code review comments:

  1. Zero-Shot Context-Aware Review-Guided Comment Generation: Utilizing structured prompting, RovoDev generates context-sensitive comments without fine-tuning on proprietary data, ensuring compliance with data privacy and security regulations. This feature leverages available pull request and Jira issue data to guide comment generation.
  2. Comment Quality Check on Factual Correctness: To mitigate LLM hallucination issues, RovoDev employs an LLM-as-a-Judge framework, verifying that generated comments are factually aligned with the respective code.
  3. Comment Quality Check on Actionability: A ModernBERT-based model filters out non-actionable or vague comments, retaining those that are likely to lead to code modifications. This ensures that generated comments provide clear, actionable insights (Figure 1).

    Figure 1: An overview of our RovoDev Code Reviewer.
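The three components above can be read as a single filtering pipeline: generate, judge factual correctness, then gate on actionability. The sketch below is a minimal illustration, not Atlassian's implementation: every function and parameter name is hypothetical, the LLM calls are passed in as plain callables, and the actionability gate (a ModernBERT classifier in the paper) is abstracted as a predicate.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewComment:
    file: str
    line: int
    text: str

def generate_comments(llm: Callable[[str], str], diff: str, guidelines: str) -> list[ReviewComment]:
    """Stage 1: zero-shot, review-guided generation from a structured prompt."""
    prompt = (
        "You are a senior code reviewer.\n"   # persona
        f"Review guidelines:\n{guidelines}\n"  # structured review guidelines
        f"Code changes:\n{diff}\n"             # the PR diff (the real system also adds Jira context)
        "Return one finding per line as 'file:line: comment'."
    )
    comments = []
    for raw in llm(prompt).splitlines():
        path, line_no, text = raw.split(":", 2)
        comments.append(ReviewComment(path, int(line_no), text.strip()))
    return comments

def passes_factual_check(judge: Callable[[str], str], comment: ReviewComment, diff: str) -> bool:
    """Stage 2: LLM-as-a-Judge verifies the comment against the actual diff."""
    verdict = judge(
        f"Diff:\n{diff}\nComment: {comment.text}\n"
        "Does the comment accurately describe this code? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")

def review(llm, judge, is_actionable, diff, guidelines):
    """Full pipeline: generate, then keep only factual AND actionable comments."""
    return [
        c for c in generate_comments(llm, diff, guidelines)
        if passes_factual_check(judge, c, diff) and is_actionable(c)
    ]
```

Passing the LLM, judge, and actionability gate as separate callables mirrors the paper's separation of generation from the two independent quality checks, and makes each stage swappable.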

Large-Scale Online Evaluation

Effectiveness and Workflow Impact

RovoDev Code Reviewer was integrated across more than 1,900 Atlassian repositories, generating over 54,000 comments. Its effectiveness was quantified through two primary metrics: the code resolution rate and improvements in workflow metrics.

  • Code Resolution Rate: RovoDev achieved a code resolution rate of 38.70%. Although lower than the human baseline of 44.45%, this rate shows that a substantial share of RovoDev's comments prompt real code changes in subsequent commits (Figure 2).

Figure 2: The code resolution rate of RovoDev-generated comments and human-written comments over a one-year period.

  • Workflow Metrics: Adoption of RovoDev reduced the median pull request (PR) cycle time by 30.8% and decreased human-written comments by 35.6%, underscoring its role in accelerating development cycles and alleviating reviewer workloads.
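The code resolution rate can be made concrete with a small sketch. The linkage rule below is a hypothetical line-overlap heuristic, since the paper's exact method for linking comments to subsequent commits is not described here: a comment counts as resolved if any later commit touches the commented line.

```python
def code_resolution_rate(comments, later_commits):
    """Fraction of review comments followed by a change at the commented line.

    comments: list of (file, line) pairs where a review comment was left.
    later_commits: one dict per subsequent commit, in chronological order,
    mapping file path -> set of changed line numbers in that commit.
    """
    if not comments:
        return 0.0
    resolved = sum(
        1 for file, line in comments
        if any(line in commit.get(file, set()) for commit in later_commits)
    )
    return resolved / len(comments)
```

A production version would also need to handle renames, multi-line edits, reverts, and unrelated commits, which is exactly the linkage ambiguity the paper's evaluation has to contend with.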

User Perception

Qualitative feedback from developers indicated that the tool's accurate error detection and actionable suggestions were well received. However, challenges remain in contextual understanding, such as correctly recognizing the programming languages and environments used within the codebase.

Evaluation of Human-Alignment

A significant aspect of the study evaluated the alignment between RovoDev-generated and human-written comments. Despite only 7% of pull requests containing fully human-aligned comments, the tool's high resolution rate signifies its utility beyond traditional alignment metrics. This suggests a shift in evaluating automated code review tools, emphasizing practical usefulness over surface similarity to human comments (Figure 3).

Figure 3: An evaluation of LLM-human comment alignment, measured by %HAC (the percentage of RovoDev-generated comments that are aligned with human-written comments, capturing both location and semantic similarity).
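The %HAC metric combines location and semantic matching. A minimal sketch, under the assumption that "aligned" means same file, nearby line, and a passing similarity check; the paper's actual line tolerance and similarity model are not specified here, so the predicate is left pluggable.

```python
def pct_hac(generated, human, similar, line_tolerance=2):
    """Percentage of generated comments aligned with some human comment.

    generated/human: lists of (file, line, text) triples.
    similar: any text-similarity predicate, e.g. an embedding-cosine
    threshold; a word-overlap check works for a toy example.
    """
    if not generated:
        return 0.0
    aligned = 0
    for g_file, g_line, g_text in generated:
        for h_file, h_line, h_text in human:
            # Location match: same file and within a few lines.
            if g_file == h_file and abs(g_line - h_line) <= line_tolerance:
                # Semantic match: delegated to the caller-supplied predicate.
                if similar(g_text, h_text):
                    aligned += 1
                    break
    return 100.0 * aligned / len(generated)
```

Because the denominator is the set of generated comments, a low %HAC can coexist with a high resolution rate: the tool may surface valid issues humans simply did not comment on, which is the interpretive point the paper makes.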

Discussion

Impact of Prompt Components and Quality Checks

Detailed analysis revealed that structured review guidelines, task definitions, and the code changes themselves were critical prompt components, each significantly impacting the tool's effectiveness. Interestingly, the actionability check was more influential than the factual correctness check in refining comment alignment and quality (Figure 4).

Figure 4: The impact of the prompt components and comment quality checks on the effectiveness of RovoDev Code Reviewer, measured by the absolute percentage difference relative to the control configuration (%Δ_control).
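The ablation metric, the absolute percentage-point difference of each variant against the control (full pipeline), reduces to simple arithmetic. The variant names and rates below are purely illustrative, not figures from the paper:

```python
def pct_delta_vs_control(control_rate, variant_rate):
    """Percentage-point change when a component is removed vs the full pipeline."""
    return (variant_rate - control_rate) * 100.0

control = 0.387  # full-pipeline resolution rate (from the paper's headline result)
ablation = {
    "no_guidelines": 0.31,          # hypothetical rate with guidelines removed
    "no_actionability_gate": 0.29,  # hypothetical rate with the gate removed
}
deltas = {name: pct_delta_vs_control(control, rate) for name, rate in ablation.items()}
```

A larger negative delta marks a more influential component, which is how "the actionability check was more influential than the factual correctness check" would show up in such a table.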

Conclusion

RovoDev Code Reviewer represents a substantial advancement in code review automation, demonstrating the efficacy of integrating LLM-driven automation within industrial settings. Future work should focus on expanding the tool's contextual capabilities and enhancing its adaptability to diverse programming environments to further augment its practical utility.

Knowledge Gaps
Knowledge gaps, limitations, and open questions

The study leaves several aspects unresolved. Future work could address:

  • External validity beyond Atlassian: Evaluate RovoDev across other organizations, open-source repositories, and diverse development workflows to determine generalizability of observed benefits (CRR, PR cycle time, reduced human comments).
  • Comparative baselines: Benchmark RovoDev against fine-tuned LLMs, RAG-based systems, static analyzers/linters, and rule-based reviewers; include head-to-head comparisons under identical conditions.
  • Component ablation: Quantify the marginal impact of each pipeline element (persona, review guidelines, chain-of-thought, factual-correctness judge, actionability gate) via controlled ablation studies.
  • Reliability of LLM-as-a-Judge: Validate the factual-correctness and semantic-similarity judges against expert human annotations; report precision/recall, calibration, inter-annotator agreement, and failure modes (e.g., adversarial or ambiguous code changes).
  • Actionability gate performance: Provide model metrics (AUC, precision/recall, calibration curves) and cross-project generalization for the ModernBERT quality gate; assess bias introduced by training on “comments that led to resolution” and quantify false positives/false negatives (useful comments filtered out or poor comments passed).
  • Definition and detection of “code resolution”: Disclose the algorithm used to link comments to subsequent code changes, handle multi-line edits, merges, reverts, and unrelated commits; quantify linkage accuracy and ambiguity.
  • Causal inference on PR cycle time and human comments: Replace observational analyses with randomized A/B tests or difference-in-differences/synthetic control designs to isolate causal effects and control for concurrent process changes or tooling rollouts.
  • Downstream software quality: Measure post-merge defect rates, security vulnerabilities, rework/churn, rollback frequency, and long-term maintainability to ensure accelerated cycles don’t degrade quality.
  • Negative side effects and risk: Quantify harms from incorrect or non-actionable comments (e.g., wasted effort, misguidance), cognitive load, and trust erosion; establish safeguards and escalation paths for high-risk suggestions.
  • Language and framework coverage: Systematically evaluate performance across programming languages, frameworks, and stacks; address misclassification (e.g., PHP vs JavaScript) with robust language detection and stack-specific prompts/models.
  • Context limitations: Investigate context enrichment strategies under LLM context-window constraints (e.g., code graphs, dependency summaries, semantic slicing), and compare zero-shot prompting to lightweight RAG in cold-start projects.
  • Comment type distribution and prioritization: Analyze which categories (bugs, readability, maintainability, tests, security) RovoDev handles well or poorly; tune the pipeline to avoid overemphasis on nitpicks and under-detection of high-impact issues.
  • Operational cost and latency: Report inference latency, computational cost per PR, throughput under load, and the impact on developer wait times; conduct cost–benefit analyses at scale.
  • Privacy and compliance trade-offs: Clarify how ModernBERT fine-tuning on proprietary data complies with privacy constraints; assess risks of data leakage, model inversion, and retention policies; provide guidance for deployments with stricter regulatory requirements.
  • Prompt transparency and reproducibility: Share prompt templates (or sanitized variants), review guidelines, and configuration details to enable replication; evaluate sensitivity to prompt variations and drift over time.
  • Interpretation of low human alignment: With only ~4% HAC, investigate whether divergence from human-written comments reflects better or worse utility; develop utility-focused metrics beyond textual similarity and location matching.
  • Feedback collection robustness: Improve mechanisms for gathering fine-grained user feedback (beyond low click-through on thumbs up/down), e.g., in-line rating prompts, structured surveys, and passive telemetry; address response bias.
  • Security-specific effectiveness: Evaluate RovoDev’s ability to detect security vulnerabilities, insecure patterns, and compliance violations; compare against dedicated security tools and secure coding checklists.
  • Cold-start vs mature projects: Empirically test performance in repos with limited history versus mature codebases; identify strategies that mitigate lack of prior context.
  • Developer learning and team dynamics: Study impacts on reviewer skill development, knowledge transfer, mentoring, and psychological safety; ensure automation doesn’t erode essential human aspects of code review.
  • Governance and explainability: Define processes for accountability, traceability, and explainability of automated comments; provide rationale summaries and confidence indicators to support human oversight.
