Agentic Pull Requests
- Agentic pull requests are autonomous code contributions generated entirely by AI agents, which plan, stage, and submit changes without direct human curation.
- Empirical analyses indicate that agentic PRs feature concentrated commit changes, increased symbol churn, and distinct review dynamics compared to human submissions.
- Large-scale studies from the AIDev dataset show that agentic PRs are increasingly used for tests, documentation, and configuration, prompting new governance and quality controls.
Agentic pull requests (PRs) are pull requests authored and submitted by autonomous coding agents—software entities capable of generating, staging, and proposing code or documentation changes to repositories without direct human authorship of the content. These agents (e.g., OpenAI Codex, GitHub Copilot, Devin, Cursor, Claude Code) are now deployed at scale on platforms such as GitHub, introducing a new paradigm of software contribution. Distinct from AI-assisted PRs—which are ultimately curated, described, and submitted by humans—agentic PRs involve end-to-end agent autonomy, encompassing code synthesis, test generation, commit construction, and PR submission. This entry reviews empirical findings concerning agentic PRs: their prevalence, code and test contributions, review dynamics, code quality, acceptance rates, failure patterns, and workflow implications, with references to leading studies from the AIDev dataset corpus and related empirical analyses.
1. Definition and Scope of Agentic Pull Requests
Agentic PRs are operationalized as pull requests where the attributed author(s) in commit or PR metadata correspond to AI agent identities and no human directly rewrites or submits the principal changes. Canonical agentic PR authors include identities such as "github-copilot[bot]", "openai-codex[bot]", "devin[bot]", "cursor[bot]", and "claude-code[bot]". Strict agentic PRs involve autonomous planning, staging, and submission by the agent; hybrid cases (e.g., "co-authored by" agents alongside humans) are sometimes included in broader agentic analyses but are typically distinguished from fully autonomous submissions (Haque et al., 7 Jan 2026, Yoshioka et al., 26 Jan 2026, Ehsani et al., 21 Jan 2026, Ogenrwot et al., 24 Jan 2026).
Dataset coverage has rapidly expanded, with the AIDev dataset comprising over 933,000 agentic PRs across 61,000 repositories as of mid-2025. These contributions span general code, tests, configuration, documentation, and platform-specific artifacts.
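The operationalization above can be sketched as a simple metadata filter. This is a minimal illustration, not the cited studies' actual extraction pipeline: the agent logins are the canonical ones named above, and the strict/hybrid split follows the definition that no human may author the principal changes.

```python
# Hypothetical filter distinguishing strict agentic PRs from hybrid ones
# based on PR and commit author identities. The agent logins follow
# GitHub's "[bot]" suffix convention; the list here is illustrative.
AGENT_LOGINS = {
    "github-copilot[bot]",
    "openai-codex[bot]",
    "devin[bot]",
    "cursor[bot]",
    "claude-code[bot]",
}

def is_strict_agentic(pr_author: str, commit_authors: list[str]) -> bool:
    """Strict agentic PR: the PR author and every commit author resolve
    to known agent identities (no human co-authorship)."""
    authors = {pr_author, *commit_authors}
    return authors <= AGENT_LOGINS

def is_hybrid(pr_author: str, commit_authors: list[str]) -> bool:
    """Hybrid PR: at least one agent identity and at least one human author."""
    authors = {pr_author, *commit_authors}
    return bool(authors & AGENT_LOGINS) and bool(authors - AGENT_LOGINS)
```

A real pipeline would resolve identities through commit `Co-authored-by:` trailers as well, since hybrid cases often surface there rather than in the PR author field.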
2. Empirical Characteristics: Size, Structure, and Diff Alignment
Agentic PRs differ from human PRs along multiple structural dimensions. Across >24,000 merged agentic PRs and >5,000 merged human PRs, agentic submissions exhibit:
- Fewer commits per PR but a higher relative concentration of changes per commit (Cliff's δ=0.54, large effect).
- Slightly more localized code modifications, with medium effect sizes for files touched and lines deleted, but smaller effect for lines added (Ogenrwot et al., 24 Jan 2026).
- PR descriptions produced by agents exhibit high semantic similarity to their code diffs (CodeBERT median cosine >0.93) and, on average, are marginally more aligned than human PRs according to lexical and semantic similarity metrics (TF-IDF cosine, Okapi BM25, CodeBERT/GraphCodeBERT cosines), although the lexical overlap remains low (median TF-IDF cosine ≈0.1) (Ogenrwot et al., 24 Jan 2026, Pham et al., 24 Jan 2026).
- Symbol churn is substantially higher: agentic PR-introduced functions and classes are removed more often (7.33% of symbols vs. 4.10% for human-introduced), and much faster (median removal in 3 days vs. 34 days for humans), indicating more short-lived code and high activity on documentation and test infrastructure (Pham et al., 24 Jan 2026).
PR-level summaries by agents show a micro/macro precision gap: commit-level messages are more aligned with their respective code changes (higher sim_commit), but full-PR descriptions are less coherent with the aggregate commit set than those authored by humans (lower sim_PR, median 0.86 agent vs. 0.88 human) (Pham et al., 24 Jan 2026).
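The lexical-alignment metric referenced above can be sketched as a cosine similarity over token counts. This is a stdlib term-frequency simplification, not the studies' exact setup: the cited TF-IDF variant additionally weights terms by inverse document frequency over the full PR corpus, and the semantic variants (CodeBERT/GraphCodeBERT cosines) require pretrained models and are omitted here.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    # Lowercased word/identifier tokens; a real pipeline would use a
    # code-aware tokenizer for diff hunks.
    return re.findall(r"[a-z0-9_]+", text.lower())

def tf_cosine(description: str, diff: str) -> float:
    """Cosine similarity of raw term-frequency vectors between a PR
    description and its diff (illustrative stand-in for TF-IDF cosine)."""
    ca, cb = Counter(tokens(description)), Counter(tokens(diff))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

The low median lexical overlap reported (≈0.1) is consistent with descriptions summarizing intent in natural language while diffs carry identifiers and syntax, which is why the semantic embedding metrics separate the populations more clearly.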
3. Review Dynamics, Acceptance, and Reviewer Behavior
Merge rates for agentic PRs are lower than human baselines and highly agent-dependent. In large population studies (Yoshioka et al., 26 Jan 2026, Ehsani et al., 21 Jan 2026, Watanabe et al., 18 Sep 2025):
| Submitter Type | Merge Rate (%) |
|---|---|
| Human PRs | ~79–91 |
| Codex agent | ~63–86 |
| Copilot agent | ~48–56 |
| Devin agent | ~44–57 |
| Claude Code agent | ~58–72 |
Agentic PRs show distinct review dynamics:
- Self-merge prevalence is very high: ~77.5% of merged agentic PRs are merged by the submitting agent identity, compared to 57.6% for humans (Yoshioka et al., 26 Jan 2026).
- Reviewer comments have opposite effects: for human PRs, each additional reviewer comment increases merge odds (+2.7% per comment, p<0.001); for agentic PRs, each additional comment decreases merge odds (–2.8% per comment, p<0.001) (Yoshioka et al., 26 Jan 2026).
- Review intensity for accepted agentic PRs is substantial, though core developers engage more deeply than peripheral ones; most review topics cluster around core implementation, evolvability (code organization, alternative solutions), documentation gaps, and test coverage (Cynthia et al., 27 Jan 2026, Haider et al., 27 Jan 2026).
- Thematic analysis finds the top review tags on agentic PRs are function/logic (feat, 38.5%), refactoring (14%), documentation (11.4%), style (10.3%), and undo/revert (10%) (Haider et al., 27 Jan 2026).
Instant merge is common (28.3% of agentic PRs merge in under 1 minute), but iterative review loops and abandonment ("ghosting") are more frequent in the remaining 71.7%, especially when change scope is large, tests/CI are touched, or structural complexity is high (Minh et al., 2 Jan 2026).
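The opposed per-comment effects above are stated as changes in merge odds, which compound multiplicatively. A small sketch shows how a per-comment odds ratio translates into merge probability; the baseline probability used below is assumed for illustration and is not taken from the studies.

```python
def merge_probability(base_prob: float, odds_ratio_per_comment: float,
                      n_comments: int) -> float:
    """Implied merge probability after n_comments, given a baseline merge
    probability and a per-comment odds ratio. (Baselines are assumptions;
    the cited coefficients are +2.7%/comment for humans, -2.8% for agents.)"""
    odds = base_prob / (1.0 - base_prob)          # convert probability to odds
    odds *= odds_ratio_per_comment ** n_comments  # compound the per-comment effect
    return odds / (1.0 + odds)                    # convert back to probability

# Human PRs: odds ratio ~1.027 per comment; agentic PRs: ~0.972 per comment.
```

Under this reading, a long comment thread on an agentic PR is a negative signal (accumulating objections), whereas on a human PR it tracks constructive iteration toward merge.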
4. Task Distribution, Testing, and Specialized Contributions
Agentic PRs span diverse tasks, with the following patterns:
- Testing: Test inclusion in agentic PRs grew from 31% in January 2025 to 52% in July 2025. These test-containing PRs (test PRs) are statistically larger and take substantially longer to complete, with median LOC churn 2–10× higher than non-test PRs and median turnaround times 4–10× longer depending on agent. Merge rates for test PRs are steady or even higher (except for Devin), indicating no explicit reviewer penalty for test contribution (Haque et al., 7 Jan 2026).
- Documentation: Agentic PRs account for ~74% of documentation-only PRs in repositories with ≥500 stars, and agent-authored documentation is typically integrated with very little human revision (mean retention of agent-supplied lines is 87%). This raises concerns about the rigor and reliability of documentation quality assurance in agent-driven workflows (Yamasaki et al., 28 Jan 2026).
- CI/CD and configuration: Only 3.25% of agentic PR file changes touch YAML, and most of those are not genuine CI/CD configuration; of genuine agent CI/CD edits, 96.8% target GitHub Actions. Merge and build success rates remain comparable between CI/CD and non-CI/CD PRs, with Copilot showing apparent specialization and superior merge/success rates for configuration PRs (Ghaleb, 24 Jan 2026).
- Libraries and dependencies: 29.5% of agentic PRs include at least one import, but only 1.3% add a new external dependency—when they do, 75% specify versions, outperforming direct LLM prompting by a wide margin. Agents rely on a broad set of libraries, particularly for testing and developer tooling (Twist, 12 Dec 2025).
- Security: Security-relevant agentic PRs comprise ~4% of activity, focusing more on hardening/documentation than vulnerability fixes. Merge rates are significantly lower than non-security PRs (61.5% vs. 77.3%), and review latency is tenfold higher (median 3.9h vs. 0.1h), reflecting deeper scrutiny (Siddiq et al., 1 Jan 2026).
- Energy and performance: Energy-explicit agentic PRs demonstrate correct application of energy profiling and optimization techniques but have lower acceptance rates when optimizations impact maintainability (82% vs. ~92% overall) (Mitul et al., 31 Dec 2025). Performance-related PRs are mainly development-phase optimizations, with 63.5% acceptance but higher rejection for UI, analytics, and AI-inference tweaks (Opu et al., 31 Dec 2025).
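Task-distribution analyses like those above typically start by labeling each changed file. A minimal path-based heuristic is sketched below; the patterns are assumptions for illustration and are not the classification rules used in the cited studies.

```python
import re

def classify_file(path: str) -> str:
    """Heuristic task label for a changed file (test / docs / ci / config / code).
    The path patterns here are illustrative assumptions."""
    p = path.lower()
    # Check CI first so workflow YAML is not swallowed by the generic config rule.
    if p.startswith(".github/workflows/"):
        return "ci"
    if re.search(r"(^|/)(tests?|spec)/|(_test|\.test|_spec)\.", p):
        return "test"
    if p.endswith((".md", ".rst")) or p.startswith("docs/") or "/docs/" in p:
        return "docs"
    if p.endswith((".yml", ".yaml", ".toml", ".ini", ".cfg")):
        return "config"
    return "code"
```

Aggregating these labels over a PR's file list yields the task mix (test PR, doc-only PR, CI/CD PR) that the prevalence figures above are computed from.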
5. Failure Modes, Quality, and Review Bottlenecks
The main determinants of agentic PR failure are socio-technical and workflow misalignment, rather than raw code correctness (Ehsani et al., 21 Jan 2026, Gong et al., 8 Jan 2026):
- Primary merge-blocking patterns include lack of reviewer engagement (38%), duplication (23%), CI/test failures (17%), and misalignment with reviewer or contribution norms.
- Unmerged agentic PRs are larger, touch more files, and fail CI checks more frequently; each failed CI check reduces merge odds by ~15%.
- High message–code inconsistency (PR-MCI) remains rare (1.7% of PRs), but severely penalizes acceptance (28.3% merge rate for high-MCI vs. 80.0% for low-MCI) and incurs a 3.5× longer time to merge. The dominant inconsistency is phantom changes, where the PR description claims test or code additions not substantiated in the diff (45% of high-MCI cases) (Gong et al., 8 Jan 2026).
- Reviewer effort is predicted primarily by static properties of the PR diff: size, files touched, and presence of CI/tests. Semantic features from PR titles, descriptions, and BERT embeddings are negligible contributors to review-effort prediction (AUC ≈ 0.52–0.57) (Minh et al., 2 Jan 2026).
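The dominant inconsistency type named above, phantom changes, can be illustrated with a simple heuristic that cross-checks a description's claims against the changed file list. This is a hypothetical sketch of the phantom-test case only; the study's PR-MCI measurement is more comprehensive than this single rule.

```python
import re

# Description phrases that claim test additions (pattern is an assumption).
TEST_CLAIM = re.compile(r"\b(add(ed|s)?|includ(e|ed|es)|new)\s+(unit\s+)?tests?\b",
                        re.IGNORECASE)

def phantom_test_claim(description: str, changed_files: list[str]) -> bool:
    """Flag a 'phantom change': the PR description claims test additions,
    but no changed file looks like a test (path heuristic is an assumption)."""
    claims_tests = bool(TEST_CLAIM.search(description))
    touches_tests = any(
        re.search(r"(^|/)tests?/|(_test|\.test|_spec)\.", f.lower())
        for f in changed_files
    )
    return claims_tests and not touches_tests
```

Such a check could run as a pre-merge gate, since high-MCI PRs carry both a sharply lower merge rate and a 3.5× longer time to merge.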
6. Best Practices, Review Policy, and Workflow Integration
Empirical recommendations for integrating agentic PRs into development pipelines include:
- Enforcement of small, task-scoped PRs: Merge rates and review outcomes are superior for small, localized, CI-verified agentic PRs (Ehsani et al., 21 Jan 2026, Haque et al., 7 Jan 2026, Cynthia et al., 27 Jan 2026).
- Automated gating and triage: Early-stage classifiers using only static code metrics ("Circuit Breaker" models) can flag high-burden PRs at creation time, enabling maintainers to allocate review resources efficiently (Minh et al., 2 Jan 2026).
- Differential quality checks: Size-aware static analysis with remediation thresholds (e.g., SonarQube, flake8) should be embedded pre-merge for all agentic PRs, and code-quality trends monitored longitudinally (Cynthia et al., 27 Jan 2026).
- Explicit human oversight for agentic self-merges: Reviewer or maintainer approval policies must prevent unconditional self-merges by agents, closing a major workflow vulnerability (Yoshioka et al., 26 Jan 2026).
- Review prompt engineering: Agents should be guided to generate precise commit-level messages and full-PR summaries to reduce the micro/macro fidelity gap and mitigate documentation lapses (Pham et al., 24 Jan 2026, Haider et al., 27 Jan 2026).
- Documentation and test workflow governance: Buddy review, integration of doc-quality linters, and enforcing doc-only PR scope are necessary to sustain quality for agent-generated documentation (Yamasaki et al., 28 Jan 2026).
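The "Circuit Breaker" idea above relies only on static properties available at PR creation. The sketch below is a threshold-based stand-in: the feature names and cutoffs are assumptions, whereas the cited work trains classifiers on such features.

```python
from dataclasses import dataclass

@dataclass
class PRMetrics:
    # Static properties of the diff available at creation time (names illustrative).
    lines_changed: int
    files_touched: int
    touches_ci: bool
    has_tests: bool

def review_burden_flag(m: PRMetrics,
                       max_lines: int = 400,
                       max_files: int = 10) -> bool:
    """Flag likely high-review-burden PRs from static diff metrics alone.
    Thresholds are assumed; a learned model would replace this scoring."""
    score = 0
    score += m.lines_changed > max_lines   # large scope
    score += m.files_touched > max_files   # wide blast radius
    score += m.touches_ci                  # CI/workflow edits draw scrutiny
    score += not m.has_tests               # untested changes cost reviewers more
    return score >= 2
```

Consistent with the finding that semantic features add little (AUC ≈ 0.52–0.57), this sketch deliberately ignores titles and descriptions and gates on diff shape alone.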
7. Cross-Agent and Developer Role Variation
Agentic PR production and downstream integration vary substantially by agent and by developer social role:
| Agent | Test-inclusion rate trend | Test-to-code churn ratio | Merge rate |
|---|---|---|---|
| Codex | 39%→58% | 0.61 | 63–86% |
| Copilot | 35%→44% | 0.87 | 48–56% |
| Cursor | 14%→23% | 0.42 | 71–76% |
| Claude | 37%→55% | 0.42 | 59–72% |
| Devin | ≈31% (constant) | 0.56 | 44–57% |
Peripheral developers employ agents more evenly across tasks, including bug fixing and feature addition, and are less likely to run CI before merging than core developers. Core developers use agents mainly for documentation and tests, require stricter CI gating, and more frequently merge to main branches (Cynthia et al., 27 Jan 2026).
In summary, agentic pull requests constitute a new, empirically distinct regime in collaborative software engineering, exhibiting unique dynamics in code structure, review, test and doc practices, quality outcomes, and team integration (Haque et al., 7 Jan 2026, Yoshioka et al., 26 Jan 2026, Ehsani et al., 21 Jan 2026, Ogenrwot et al., 24 Jan 2026, Pham et al., 24 Jan 2026, Watanabe et al., 18 Sep 2025). Successful adoption demands both automated governance mechanisms and policy/process evolution to ensure that agentic contributions remain reviewable, maintainable, and aligned with human contributor standards.