AI-Assisted Pull Requests
- AI-assisted pull requests are code contribution workflows in which large language models and autonomous agents generate or augment code, PR titles and descriptions, and review comments.
- They are implemented via GitHub Actions, IDE plugins, and CI/CD pipelines to improve code quality, testing practices, and review efficiency.
- Empirical studies reveal distinct metrics in merge times, review burden, and security validation, highlighting evolving governance and operational challenges.
AI-assisted pull requests (PRs) refer to code contribution requests on collaborative software platforms that are generated, co-authored, or reviewed with the aid of AI systems, especially LLMs and autonomous coding agents. These PRs span a spectrum from interactive human-in-the-loop workflows—such as AI-augmented code suggestions, automatic drafting of titles and descriptions, or AI-generated review comments—to end-to-end agent-authored PRs in which the code and metadata are produced and submitted autonomously by LLM-based systems. The advent of such workflows, now occurring at scale in open-source and enterprise contexts, introduces new dynamics for code quality, reviewer effort, trust, governance, and software engineering methodology.
1. Scope, Definition, and Data Landscape
AI-assisted PRs leverage one or more automated tools to generate or augment code, generate or refine PR metadata (titles, descriptions), or support review processes. “Agentic PRs” denote those in which the entire PR, including commits and metadata, is generated by autonomous agents (e.g., OpenAI Codex, GitHub Copilot, Devin, Cursor, Claude Code) as annotated in the AIDev dataset (over 933,000 PRs, with the high-quality “AIDev-pop” subset comprising 33,596 PRs in popular repositories) (Haque et al., 7 Jan 2026, Li et al., 20 Jul 2025).
A wide array of data resources support empirical scrutiny of these workflows, most notably AIDev, which provides structured metadata: code changes, review timelines, author/agent identity, review comments, and integration outcomes.
2. Automated Generation of PR Metadata
Description and Title Generation
LLMs are applied to automate both PR descriptions and titles, which are crucial for reviewer triage, comprehension, and downstream tooling.
- Title Generation: Formulated as a one-sentence abstractive summarization problem, models such as BART (facebook/bart-base) have been fine-tuned to synthesize a concise, informative title using the concatenation of the PR description, commit messages, and associated issue titles as input (Irsan et al., 2022, Zhang et al., 2022). Fine-tuned BART achieves ROUGE-1/2/L F1 of 47.22/25.27/43.12, outperforming pointer-generator and iTAPE baselines by 24.6–40.5% on various metrics (Irsan et al., 2022, Zhang et al., 2022).
- Description Generation: Automated systems (PRSummarizer, T5-based models) generate abstractive PR descriptions using sequences of commit messages and code comments as “source documents.” Pointer-generator networks handle out-of-vocabulary (OOV) identifiers, and reinforcement learning on ROUGE metrics further boosts semantic fidelity. T5-based models outperform classic extractive baselines such as LexRank, with F1 score improvements of +6.86 (R-1), +7.99 (R-2), and +5.10 (R-L) (Sakib et al., 2024, Liu et al., 2019).
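The ROUGE scores cited above measure n-gram overlap between generated and reference texts. A minimal ROUGE-1 F1 computation (a simplified sketch for intuition, not the official scoring packages used in the cited studies) looks like:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: scoring a generated PR title against the reference title
print(round(rouge1_f1("fix null pointer in parser",
                      "fix null pointer dereference in the parser"), 3))  # 0.833
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 structure over bigrams and longest common subsequences, respectively.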
Data Cleaning and Model Performance: Rigorous heuristics (removing merge-commit and trivial commit messages, bare or boilerplate descriptions, PRs with low description–commit overlap, and other minimal-content PRs) yield 8–12% gains in description quality as measured by ROUGE (Tire et al., 2 May 2025). Such filters are essential for effective model training and deployment.
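Cleaning heuristics of this kind might be sketched as follows; the thresholds and field names here are illustrative assumptions, not the exact criteria from the cited study:

```python
def keep_pr(pr: dict, min_desc_words: int = 5, min_overlap: float = 0.1) -> bool:
    """Return True if a PR sample passes illustrative cleaning heuristics."""
    desc = pr.get("description", "").strip()
    commits = [m.strip() for m in pr.get("commit_messages", [])]

    # Drop merge commits and trivial one-word messages ("fix", "update", ...)
    commits = [m for m in commits
               if not m.lower().startswith("merge") and len(m.split()) > 1]
    if not commits:
        return False

    # Drop bare/trivial descriptions
    if len(desc.split()) < min_desc_words:
        return False

    # Require minimal lexical overlap between description and commit messages
    desc_tokens = set(desc.lower().split())
    commit_tokens = set(" ".join(commits).lower().split())
    overlap = len(desc_tokens & commit_tokens) / max(len(desc_tokens), 1)
    return overlap >= min_overlap

# A trivial PR is filtered out; a substantive one is kept
print(keep_pr({"description": "Fix", "commit_messages": ["fix"]}))
print(keep_pr({"description": "Add retry logic to the HTTP client",
               "commit_messages": ["add retry logic for transient errors"]}))
```

In practice such filters run once over the training corpus before fine-tuning, trading dataset size for label quality.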
Integration and Practical Considerations
AI-generated metadata is typically integrated via GitHub Actions, IDE plugins, or CI/CD system hooks, supporting workflows such as automatic PR description completion, one-click title generation, and review comment drafting (Irsan et al., 2022, Sakib et al., 2024, Liu et al., 2019).
3. Testing, Quality Control, and Agentic PR Practices
Test Code Contribution
Autonomous agents increasingly embed software tests in their PRs, reflecting maturation in AI-driven development (Haque et al., 7 Jan 2026). The fraction of test-containing PRs grew from 31% in January 2025 to 52% by July 2025. Test PRs are significantly larger (3–10× more LOC churn) and slower (3–11× longer median turnaround), but their merge rates approximate those of non-test PRs for most agents except Devin.
Test-to-production churn ratio (median lines changed in tests vs. production code) varies by agent—from 0.42 (Claude/Cursor) to 0.87 (Copilot), indicating diverse testing strategies. However, a substantial fraction of tests (34–59%) require revision post-submission, emphasizing continued need for human oversight. Suggested improvements include prompt engineering and fine-tuning for higher-quality test generation.
Security-Related Practices
Agent-authored security-relevant PRs represent ~4% of agentic submissions, with agents more frequently performing supportive security hardening (testing, docs, config) than direct vulnerability patching (Siddiq et al., 1 Jan 2026). Security PRs have notably lower merge rates (61.5% vs. 77.3% for non-security) and longer latency due to increased human scrutiny, with rejection driven primarily by PR size/verbosity rather than explicit security signaling.
Energy Efficiency Awareness
Agentic PRs also address energy concerns in code, contributing explicit energy monitoring, configuration adjustments, code-level optimizations (21 distinct techniques), and documentation (Mitul et al., 31 Dec 2025). Optimizations, though beneficial, are less likely to be merged when they result in large or complex diffs impacting maintainability.
4. Review Dynamics: Effort, Triage, and Social Factors
Reviewer Effort and Triage
Agent-generated PRs display a bifurcated pattern: 28.3% are merged instantly with no review (narrow automation), whereas the rest require iterative human review, and reviewer "ghosting" (threads abandoned without resolution) is common. Review burden is predicted more accurately by structural features (added/deleted LOC, files changed, entropy) than by the semantic content of PR messages (AUC = 0.957 for structural-only models) (Minh et al., 2 Jan 2026). A "Circuit Breaker" triage model catches 69% of high-effort PRs at a 20% review budget, enabling zero-latency governance.
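A budget-constrained triage of this kind can be sketched as ranking incoming PRs by a structural-effort score and flagging only the top fraction for mandatory review. The scoring function and weights below are illustrative assumptions, not the cited paper's trained model:

```python
import math

def triage(prs: list[dict], budget: float = 0.2) -> list[dict]:
    """Rank PRs by a structural-effort score and flag the top `budget`
    fraction for mandatory human review (an illustrative circuit breaker)."""

    def effort_score(pr: dict) -> float:
        churn = pr["loc_added"] + pr["loc_deleted"]
        # Log-scaled churn plus files touched approximates structural complexity
        return math.log1p(churn) + 0.5 * pr["files_changed"]

    ranked = sorted(prs, key=effort_score, reverse=True)
    cutoff = max(1, int(len(ranked) * budget))
    for i, pr in enumerate(ranked):
        pr["needs_review"] = i < cutoff
    return ranked

prs = [{"id": i, "loc_added": a, "loc_deleted": d, "files_changed": f}
       for i, (a, d, f) in enumerate([(5, 1, 1), (400, 120, 12),
                                      (30, 10, 3), (900, 300, 25), (12, 4, 2)])]
flagged = [pr["id"] for pr in triage(prs) if pr["needs_review"]]
print(flagged)  # only the largest PR is flagged under a 20% budget
```

The appeal of structural-only scoring is that it needs no model inference at submission time, which is what makes "zero-latency" gating feasible.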
DeputyDev, a deployed AI reviewer, reduces average per-PR review time by 23.09% and per-LOC review time by 40.13% in controlled A/B tests. The system accepts PR metadata, ASTs, and project docs as context, and employs multiple agent evaluators (security, maintainability, …) whose comments are blended and filtered via confidence thresholds. Empirical results confirm large, statistically significant reductions in review time (Khare et al., 13 Aug 2025).
Review and Merge Patterns
Large-scale empirical work reveals that agentic PRs are reviewed and merged more quickly, with less human commentary than human-only PRs (Gao et al., 20 Jan 2026). In human+AI PRs, 79% of merges occur with no human comment or review—especially for newcomers without prior code ownership, the reverse of traditional OSS practices.
Acceptance rates for agentic PRs, while high for maintenance tasks (docs, CI, build, 74–92%), are lower for complex tasks (feature, fix, perf, 35–65%). Larger or more complex PRs, those failing CI, or those duplicating other work are less likely to be merged (Ehsani et al., 21 Jan 2026, Watanabe et al., 18 Sep 2025).
5. PR Message-Code Alignment and Reviewer Trust
Analysis of message-code inconsistency (PR-MCI) in agent-authored PRs demonstrates that high inconsistency PRs (e.g., descriptions claiming “phantom changes,” overstating/omitting actual modifications, file-type/task-type mismatches) are associated with a 51.7 percentage-point drop in acceptance and 3.5× slower merge times (Gong et al., 8 Jan 2026). Most prominent issues are Phantom Changes (45.4%), Scope Understatement (22.0%), and Placeholder Descriptions (18.8%).
Automated PR-MCI detection (using scope, file-type, and task-type metrics) offers a pathway to pre-emptively filter misleading PRs, thus sustaining reviewer trust and productivity. Recommendations include integrating MCI detection pipelines and explicit "explain-then-patch" phases, alongside closing the loop with reviewers’ feedback to enable continuous agent learning.
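A minimal scope-consistency check along these lines might compare the files a PR description claims to touch against the files actually changed. The path-extraction heuristic and labels below are illustrative assumptions, not the detection pipeline from the cited study:

```python
import re

def scope_inconsistency(description: str, changed_files: list[str]) -> dict:
    """Flag 'phantom' file mentions (claimed but not changed) and
    'understated' changes (changed but never mentioned)."""
    # Naive heuristic: treat path-like tokens in the description as file claims
    mentioned = set(re.findall(r"[\w/.-]+\.\w+", description))
    changed = set(changed_files)
    return {
        "phantom": sorted(mentioned - changed),      # claimed, absent from diff
        "understated": sorted(changed - mentioned),  # in diff, never described
        "consistent": mentioned == changed,
    }

result = scope_inconsistency(
    "Refactors auth/session.py and updates docs/api.md",
    ["auth/session.py", "auth/token.py"],
)
print(result["phantom"])      # a claimed change that is not in the diff
print(result["understated"])  # a changed file the description omits
```

A real detector would add task-type and file-type signals (e.g., a "bugfix" PR touching only docs), but even this file-level diff against the description surfaces the Phantom Change and Scope Understatement categories described above.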
6. Practical Implications, Governance, and Future Research
Best Practices for Integration
- Complexity Budgeting: Restrict agentic code changes, especially in early deployments, to well-scoped, low-complexity tasks to ensure high acceptance and reviewer trust (Li et al., 20 Jul 2025, Watanabe et al., 18 Sep 2025).
- Pre-PR Validation: Mandate that agents run all CI checks, linters, and supply test results and provenance prior to PR submission (Ehsani et al., 21 Jan 2026, Mitul et al., 31 Dec 2025).
- Documentation and Metadata Quality: Use data cleaning to ensure high information quality in generated titles and descriptions. Encourage agents to add plan/execution rationales and risk/safety summaries (Irsan et al., 2022, Tire et al., 2 May 2025, Ehsani et al., 21 Jan 2026).
- Governance and Policy: Despite widespread use, only ~13% of popular OSS repositories surveyed had any explicit AI agent usage guidelines. This gap supports the need for codified governance, disclosure practices, and policy-driven PR templates (Gao et al., 20 Jan 2026).
Open Research Challenges
- End-to-End Benchmarks: Moving beyond static code-completion to integration-oriented and reviewer-centered benchmarks (merge rate, review latency, defect rate, and reviewer satisfaction) (Li et al., 20 Jul 2025).
- Semantic Fidelity Metrics: Improving automated consistency and correctness checking between PR messages and code diffs (Gong et al., 8 Jan 2026).
- Multi-Agent Orchestration: Instrumenting and managing workflows involving multiple agents, human–AI hand-offs, and CI/system interactions (Watanabe et al., 18 Sep 2025).
- Human Factors: Assessing latent costs in trust, cognitive load, and quality assurance introduced by increased automation in code review (Li et al., 20 Jul 2025, Ehsani et al., 21 Jan 2026).
7. Tables: Selected Quantitative Metrics
| Metric | Agentic PRs | Human PRs | Reference |
|---|---|---|---|
| Acceptance Rate (“AIDev-pop”) | 48.9–65.3% (agents) | 76.8% | (Li et al., 20 Jul 2025) |
| Median Time-to-Merge | 0.3–6.9 h (agents) | 3.9 h | (Li et al., 20 Jul 2025) |
| Security PR Merge Rate | 61.5% | 77.3% (non-sec PRs) | (Siddiq et al., 1 Jan 2026) |
| Test Inclusion (July 2025) | 52% | – | (Haque et al., 7 Jan 2026) |
| Energy PR Merge Rate | 87% | 92% (non-energy) | (Mitul et al., 31 Dec 2025) |
Agent-Specific Test and Review Results
| Agent | Test PR Churn (LOC) | Non-Test PR Churn (LOC) | Test PR Merge Rate | Review Time (Test PR) |
|---|---|---|---|---|
| Claude | 1,736 | 183 | 70% | 4.15 h |
| Codex | 133 | 39 | 86% | 0.01 h |
| Copilot | 323 | 49 | 54% | 24.09 h |
| Cursor | 852 | 139 | 76% | 7.04 h |
| Devin | 335 | 78 | 44.1% | 38.72 h |
References
- (Haque et al., 7 Jan 2026) Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests.
- (Gong et al., 8 Jan 2026) Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests.
- (Li et al., 20 Jul 2025) The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering.
- (Gao et al., 20 Jan 2026) On Autopilot? An Empirical Study of Human-AI Teaming and Review Practices in Open Source.
- (Watanabe et al., 18 Sep 2025) On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub.
- (Mitul et al., 31 Dec 2025) How Do Agentic AI Systems Deal With Software Energy Concerns? A Pull Request-Based Study.
- (Ehsani et al., 21 Jan 2026) Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub.
- (Khare et al., 13 Aug 2025) DeputyDev -- AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity.
- (Tire et al., 2 May 2025) Evaluating the Impact of Data Cleaning on the Quality of Generated Pull Request Descriptions.
- (Zhang et al., 2022, Irsan et al., 2022) Automatic Pull Request Title Generation; AutoPRTitle: A Tool for Automatic Pull Request Title Generation.
- (Sakib et al., 2024, Liu et al., 2019) Automatic Pull Request Description Generation Using LLMs; Automatic Generation of Pull Request Descriptions.
For further specifics and methodological details, consult the individual cited works.