Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Published 9 Feb 2026 in cs.SE | (2602.08915v1)

Abstract: The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).

Summary

  • The paper shows that task type strongly influences PR acceptance rates, with documentation tasks achieving 82.1% compared to 66.1% for feature tasks.
  • The study employs temporal analysis, revealing that Devin improved its acceptance rate by 0.77% weekly while other agents maintained stable performance.
  • By using task-stratified comparisons and statistical tests, the paper highlights that no single AI agent excels across all task types, emphasizing context-aware evaluations.

Introduction

The advancement of AI-powered coding assistants is reshaping software development practices, necessitating systematic evaluations of these tools across various contexts. The paper "Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance" (2602.08915) undertakes an empirical study examining five mainstream AI coding agents—OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code—through an analysis of 7,156 pull requests (PRs) from the AIDev dataset. The study provides insights into temporal performance trends, factors influencing PR acceptance rates, and comparative evaluations of agent efficacy across different task types.

The findings highlight the strong influence of task type on acceptance rates, with documentation tasks accepted notably more often than feature-related tasks. Furthermore, Devin showed a consistent improvement in acceptance rate over time, distinguishing itself from the other agents, which remained largely stable. Importantly, OpenAI Codex exhibited uniformly strong performance across task categories (Figure 1).

Figure 1: Acceptance rate over time per agent.

Methodology

The analysis was conducted using the AIDev dataset, filtered to include 7,156 PRs meeting specific quality criteria. The study employed temporal trend analysis and task-stratified statistical comparisons to discern variations in AI agent performance. Temporal performance was gauged using linear regression models and LOESS smoothing techniques. Task-stratified comparisons were executed using Pearson's Chi-square tests to evaluate inter-agent performance efficacy across distinct task types.
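A minimal sketch of the temporal trend step: fit a linear model to weekly acceptance rates, where the slope estimates the weekly change. The data below are synthetic, seeded to resemble the reported +0.77%/week trend; they are not the AIDev measurements, and the paper's LOESS smoothing step is omitted here.

```python
# Hedged sketch of the linear temporal trend analysis on synthetic data.
import numpy as np
from scipy.stats import linregress

weeks = np.arange(32)                       # 32-week observation window
rng = np.random.default_rng(0)
# Hypothetical weekly acceptance rates (%) with a built-in +0.77%/week trend
acceptance = 60 + 0.77 * weeks + rng.normal(0, 2, size=32)

fit = linregress(weeks, acceptance)         # slope = estimated % change per week
print(f"slope: {fit.slope:+.2f}% per week, p = {fit.pvalue:.3g}")
```

In practice this would be run per agent, with LOESS (e.g., statsmodels' `lowess`) overlaid to visualize non-linear drift.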

The study primarily focused on three research questions:

  1. RQ1: How does the performance of AI coding agents evolve over time?
  2. RQ2: Which factors, such as task type, significantly affect performance?
  3. RQ3: How do the agents compare under task-stratified analysis?

Results

The analysis revealed divergent performance patterns among the agents. Devin was noted for continuous improvement, showing a +0.77% weekly increase in acceptance rate, while OpenAI Codex and GitHub Copilot maintained high, stable rates (Figure 1).

Task type emerged as a predominant factor influencing PR outcomes, with a 16 percentage-point gap observed between documentation (82.1% acceptance) and feature tasks (66.1% acceptance). Review frequency also varied, impacting acceptance rates, though potential confounders such as task complexity or repository policies were noted.

RQ3: Task-Stratified Agent Comparisons

The study conducted pairwise agent comparisons, revealing that no single agent consistently outperformed the others across all task types. OpenAI Codex demonstrated high acceptance rates across task categories, whereas Claude Code and Cursor excelled in specific domains such as documentation and test tasks, respectively (Figure 2).
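The pairwise comparisons above can be sketched as a 2x2 chi-square test of accepted versus rejected PR counts for two agents within one task category, with a phi effect size and a Bonferroni adjustment over the nine task categories. All counts below are invented for illustration and are not taken from the AIDev dataset.

```python
# Hypothetical 2x2 stratified comparison of two agents within one task type.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: agent A, agent B; columns: accepted, rejected (one task category)
table = np.array([[310, 90],    # hypothetical agent A: 77.5% accepted
                  [250, 150]])  # hypothetical agent B: 62.5% accepted

chi2, p, dof, expected = chi2_contingency(table)
phi = np.sqrt(chi2 / table.sum())   # effect size for a 2x2 table
p_adjusted = min(p * 9, 1.0)        # Bonferroni correction over 9 categories
print(f"chi2={chi2:.2f}, p={p:.2e}, phi={phi:.3f}, adjusted p={p_adjusted:.2e}")
```

A significant adjusted p-value with a non-trivial phi would indicate a real per-category advantage rather than a multiple-testing artifact.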

Figure 2: Acceptance rates (%) by agent and task type. "†" indicates <20 PRs; "—" indicates zero PRs.

Discussion

The paper underscores the significance of task type in assessing AI coding agents' performance, advocating task-stratification as essential for accurate evaluations. Temporal analysis indicated Devin's unique improvement trajectory, while other agents displayed plateauing trends. The findings suggest practitioners should consider task context and temporal dynamics when selecting tools, while researchers should incorporate task-stratified methodologies and complementary metrics such as static analysis for comprehensive evaluations.

Conclusion

The study illustrates how AI coding agents' performance varies significantly across tasks and over time. It establishes the necessity of task-stratified comparisons and highlights the influence of temporal factors, giving practitioners and researchers robust insights for selection and evaluation strategies in AI-assisted software development.

Overall, these results advocate a nuanced approach to evaluating and deploying AI coding assistants, stressing the importance of context-aware methodologies and sensitivity to temporal dynamics.

For further details and supplementary materials, refer to the study's GitHub repository.

Practical Applications

Immediate Applications

The following bullet points summarize practical, deployable uses of the paper’s findings and methods across sectors. Each item notes relevant sectors and any key tools/workflows and assumptions that affect feasibility.

  • Task-aware agent selection and routing in software teams
    • Sectors: software, finance, healthcare, robotics, energy
    • Use case: Route tasks to agents based on empirically strong categories (e.g., documentation to Claude Code; bug fixes to OpenAI Codex or Cursor; features to Claude Code; refactors to Codex).
    • Tools/workflows: “Agent Router” service in CI/CD; PR templates that include a task-type label; GitHub Actions to tag agent and task type on PRs and dispatch to the preferred assistant.
    • Assumptions/dependencies: Requires reliable task-type classification (e.g., automated labeling in PR templates), access to multiple agents, and organizational willingness to maintain routing rules. Acceptance rate is used as a proxy for success; quality must be safeguarded via static analysis and testing.
  • Adaptive code review policies by task type
    • Sectors: software, finance, healthcare (regulated domains)
    • Use case: Increase review depth for task types with lower acceptance rates (e.g., performance, tests, features); relax review gates for documentation/chore tasks while maintaining security checks.
    • Tools/workflows: CI gates that enforce minimum review counts based on task type; reviewer assignment algorithms that prioritize senior reviewers for fix/perf tasks; automated static analysis and security scans as mandatory checks for AI-authored code.
    • Assumptions/dependencies: Review frequency correlates with task complexity; requires integration with repository policies and security tooling. Acceptance rates do not equal code quality—static analysis and risk checks are essential.
  • Agent performance monitoring dashboards
    • Sectors: software, product tool vendors, enterprise IT
    • Use case: Track acceptance rate over time per agent (e.g., Devin’s improving trend) and by task type; monitor model drift and identify when an agent’s strengths change.
    • Tools/workflows: A PR analytics dashboard that charts acceptance by agent-task-week, LOESS smoothing for trends, and alerts when acceptance materially deviates; tagged PRs including agent identity and task type.
    • Assumptions/dependencies: Requires consistent tagging of PRs; observational nature implies non-causal interpretations. Data access and instrumentation across repositories must be standardized.
  • Combined-agent workflows for open-source maintainers
    • Sectors: software/open source
    • Use case: Publish contribution guidelines that specify which tasks are suitable for AI-generated PRs (e.g., docs/chore) and require extra evidence/tests for fix/perf tasks submitted by agents.
    • Tools/workflows: Contribution policy pages with task-type guidance; labels for “AI-authored” PRs; auto-comment bots that request tests or benchmarks for high-risk task types.
    • Assumptions/dependencies: Community governance support and maintainers’ capacity to enforce policies; accurate task categorization.
  • Procurement and tool evaluation checklists for organizations
    • Sectors: finance, healthcare, government IT, enterprise software
    • Use case: Avoid relying solely on global metrics when comparing coding assistants; require task-stratified performance reports, temporal trend analysis, and evidence of statistical significance.
    • Tools/workflows: RFP templates that demand per-task acceptance rates, Bonferroni-adjusted statistical tests, and effect sizes (phi) for core activities like fix and feat.
    • Assumptions/dependencies: Vendors can provide task-stratified data; teams can instrument trials to collect comparable per-task outcomes.
  • Developer-level best practices for daily use
    • Sectors: daily life (individual developers, educators)
    • Use case: Choose agents strategically per task (docs/chore to agents with higher acceptance; fixes to Cursor/Codex; avoid trusting AI fully on performance-critical code); maintain a local tracking sheet of acceptance outcomes.
    • Tools/workflows: IDE macros or extensions to select agents per task; local scripts that log agent, task type, and outcome; habit of running static/security scans on AI-generated changes.
    • Assumptions/dependencies: Access to multiple assistants; willingness to instrument personal workflow; data privacy constraints when using cloud-based agents.
  • Vendor and research communications with task-stratified leaderboards
    • Sectors: software tool vendors, academia
    • Use case: Publish per-task leaderboards instead of “one-number” rankings; highlight strengths (e.g., Codex on fix/refactor; Cursor on tests; Claude Code on docs/features).
    • Tools/workflows: Public documentation and benchmark pages using the paper’s stratification; regular updates with temporal plots; replication materials for transparency.
    • Assumptions/dependencies: Ongoing data collection; acceptance rates interpreted alongside quality signals.
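The "Agent Router" idea from the first application above can be sketched as a lookup keyed by a PR's task-type label. The mapping mirrors the per-task leaders reported in the paper, but the label names, default choice, and function are hypothetical policy decisions, not part of the study.

```python
# Hypothetical task-type -> preferred-agent routing table.
ROUTING = {
    "docs": "Claude Code",      # leads documentation tasks (92.3%)
    "feat": "Claude Code",      # leads feature tasks (72.6%)
    "fix": "Cursor",            # excels in fix tasks (80.4%)
    "refactor": "OpenAI Codex", # consistently strong across categories
}

def route_task(task_type: str, default: str = "OpenAI Codex") -> str:
    """Return the preferred agent for a task-type label, falling back to
    the agent with consistently high acceptance across all categories."""
    return ROUTING.get(task_type, default)

print(route_task("docs"))  # Claude Code
print(route_task("perf"))  # OpenAI Codex (fallback)
```

A real deployment would pair this with automated task-type labeling in PR templates and with the quality gates (tests, static analysis) noted in the assumptions above.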

Long-Term Applications

The following longer-horizon applications will benefit from further research, scaling, standardization, or development before widespread deployment.

  • Multi-agent orchestration platforms (“AgentOps” control planes)
    • Sectors: software, enterprise IT, devtool vendors
    • Use case: Automatically classify tasks and route them to the best-performing agent; observe temporal changes and dynamically update routing; integrate quality gates (tests, static analysis) to auto-merge low-risk tasks (e.g., docs/chore) while enforcing human-in-the-loop for high-risk ones.
    • Tools/products: Agent marketplace with per-task ratings; orchestration layer for VS Code/JetBrains; policy-as-code modules for CI pipelines.
    • Assumptions/dependencies: Robust task classification, standardized agent APIs, reliable quality metrics beyond acceptance, privacy/security compliance across agents.
  • Standardized, task-stratified benchmarking consortia
    • Sectors: academia, tool vendors, standards bodies
    • Use case: Establish community benchmarks that report per-task acceptance and quality metrics; extend datasets (e.g., SWE-bench variants) to include stratified categories and longitudinal tracking.
    • Tools/products: Benchmark suites with replication packages; public leaderboards with effect sizes and multiple-comparison controls; mixed-effects models that account for repository-level variance.
    • Assumptions/dependencies: Shared datasets across diverse projects (beyond 100+ star repos), consensus on reporting standards, and sustained funding.
  • Causal and quality-aware evaluation frameworks
    • Sectors: academia, regulated industries
    • Use case: Move beyond acceptance to include maintainability, security, and technical debt; incorporate mixed-effects models to mitigate repository clustering; use difference-in-differences and quasi-experimental designs to estimate impact.
    • Tools/products: Research toolkits combining static analysis, complexity metrics, vulnerability scanning, and longitudinal drift detection.
    • Assumptions/dependencies: Access to fine-grained code quality signals; agreement on validity of proxies; institutional support for controlled trials.
  • Sector-specific AI coding governance and compliance
    • Sectors: healthcare, finance, energy, government
    • Use case: Define policies that restrict autonomous agent contributions on safety-critical/performance-sensitive modules; require per-task stratified evidence in audits and change control; mandate disclosure of AI-generated code.
    • Tools/products: Compliance templates and audit checklists; automated policy enforcement in CI; documentation standards for AI-authored PRs.
    • Assumptions/dependencies: Regulatory buy-in; integration with existing SDLC controls; clear accountability lines for AI contributions.
  • Adaptive review allocation algorithms
    • Sectors: software, enterprise IT
    • Use case: Use historical per-task acceptance and quality data to allocate reviewer expertise and count dynamically; prioritize senior reviewers for fix/perf tasks where agents underperform.
    • Tools/products: ML services that predict review needs; workload balancers that reduce bottlenecks while protecting quality.
    • Assumptions/dependencies: Access to labeled historical data; cultural acceptance of algorithmic workload assignment; safeguards against bias.
  • Curriculum and training programs on task-stratified AI-assisted software engineering
    • Sectors: education, professional upskilling
    • Use case: Teach developers to interpret stratified metrics, design agent routing policies, and recognize when acceptance rates can mislead; train on integrating quality gates for AI code.
    • Tools/products: Course modules, labs using public replication packages, capstone projects building AgentOps pipelines.
    • Assumptions/dependencies: Academic-industry collaboration; accessible datasets and tools; evolving best practices.
  • Cost-performance and risk calculators for AI coding assistants
    • Sectors: enterprise IT, finance, tool vendors
    • Use case: Model ROI by task mix (e.g., documentation-heavy vs. fix-heavy teams), factoring acceptance, review costs, and post-merge maintenance; aid procurement and budgeting decisions.
    • Tools/products: Dashboards that simulate outcomes under different task distributions; per-task cost/quality tradeoff reports.
    • Assumptions/dependencies: Accurate cost models for review and maintenance; incorporation of quality signals beyond acceptance.
  • Longitudinal drift detection and change management
    • Sectors: software, devtool vendors
    • Use case: Monitor agents for performance drift over time (e.g., Devin’s improving trend vs. other stable agents) and trigger controlled rollouts, A/B tests, or routing updates.
    • Tools/products: Observability pipelines for agent performance; governance processes for safe agent updates.
    • Assumptions/dependencies: Continuous instrumentation; disciplined release management; agreement on drift thresholds.
  • Safe auto-merge policies for low-risk task categories
    • Sectors: open source, enterprise software
    • Use case: Auto-merge AI-generated documentation/chore PRs that pass static checks and linters; require manual gates for fix/feat/perf tasks.
    • Tools/products: CI policies, rule engines, “trust tiers” by task type.
    • Assumptions/dependencies: High-confidence task classification; robust static analysis coverage; clear rollback procedures.
  • Repository-level modeling and fairness safeguards
    • Sectors: academia, tool vendors
    • Use case: Develop methods that explicitly model repository clustering and policy heterogeneity; ensure fair cross-agent comparisons and mitigate bias from project selection.
    • Tools/products: Statistical libraries for mixed-effects modeling; data schemas that capture repository characteristics.
    • Assumptions/dependencies: Rich metadata on repositories; community agreement on fairness criteria.
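The "safe auto-merge" and "trust tier" policies sketched in the list above could be prototyped as a simple gate: auto-merge only low-risk task categories whose automated checks pass. The tier membership and check input are assumptions for illustration, not the paper's recommendation.

```python
# Hypothetical trust-tier gate for AI-authored PRs.
LOW_RISK = {"docs", "chore"}  # assumed low-risk tiers eligible for auto-merge

def can_auto_merge(task_type: str, checks_passed: bool) -> bool:
    """Auto-merge only low-risk categories with green static checks;
    everything else (fix/feat/perf) requires human review."""
    return task_type in LOW_RISK and checks_passed

print(can_auto_merge("docs", True))   # True: auto-merge
print(can_auto_merge("fix", True))    # False: manual gate
```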

