Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
Abstract: The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas the other agents remain largely stable. Our analysis suggests that PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features, a 16-percentage-point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
Practical Applications
Immediate Applications
The following bullet points summarize practical, deployable uses of the paper’s findings and methods across sectors. Each item notes the relevant sectors, key tools/workflows, and the assumptions that affect feasibility.
- Task-aware agent selection and routing in software teams
- Sectors: software, finance, healthcare, robotics, energy
- Use case: Route tasks to agents based on empirically strong categories (e.g., documentation to Claude Code; bug fixes to OpenAI Codex or Cursor; features to Claude Code; refactors to Codex).
- Tools/workflows: “Agent Router” service in CI/CD; PR templates that include a task-type label; GitHub Actions to tag agent and task type on PRs and dispatch to the preferred assistant.
- Assumptions/dependencies: Requires reliable task-type classification (e.g., automated labeling in PR templates), access to multiple agents, and organizational willingness to maintain routing rules. Acceptance rate is used as a proxy for success; quality must be safeguarded via static analysis and testing.
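The routing rules described above can be sketched as a simple lookup. The mapping below is an illustrative assumption drawn from the paper's per-task findings, not an API of any agent or a definitive policy; real deployments would load it from configuration and pair it with quality gates.

```python
# Illustrative task-to-agent routing table based on the per-task strengths
# reported in the paper. The mapping is an assumption for demonstration only.
PREFERRED_AGENT = {
    "docs": "Claude Code",      # leads documentation acceptance (92.3%)
    "feat": "Claude Code",      # leads feature tasks (72.6%)
    "fix": "Cursor",            # excels in fix tasks (80.4%)
    "refactor": "OpenAI Codex", # consistently strong across categories
    "chore": "OpenAI Codex",
}

def route_task(task_type: str, default: str = "OpenAI Codex") -> str:
    """Return the preferred agent for a PR task type, falling back to a
    consistently strong all-rounder when the category is unknown."""
    return PREFERRED_AGENT.get(task_type, default)
```

A CI hook could call `route_task` with the PR's task-type label and dispatch the work item accordingly, with the table updated as monitoring data accumulates.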
- Adaptive code review policies by task type
- Sectors: software, finance, healthcare (regulated domains)
- Use case: Increase review depth for task types with lower acceptance rates (e.g., performance, tests, features); relax review gates for documentation/chore tasks while maintaining security checks.
- Tools/workflows: CI gates that enforce minimum review counts based on task type; reviewer assignment algorithms that prioritize senior reviewers for fix/perf tasks; automated static analysis and security scans as mandatory checks for AI-authored code.
- Assumptions/dependencies: Review frequency correlates with task complexity; requires integration with repository policies and security tooling. Acceptance rates do not equal code quality—static analysis and risk checks are essential.
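A minimal sketch of such a task-type-aware review gate follows; the risk-tier sets and reviewer counts are illustrative assumptions, and any real policy would be calibrated against a team's own acceptance and defect data.

```python
# Task types with lower acceptance rates get deeper review; tiers and
# counts are assumptions for illustration, not values from the paper.
HIGH_RISK = {"perf", "test", "feat", "fix"}

def required_reviewers(task_type: str, ai_authored: bool) -> int:
    """Minimum reviewer count for a PR, given its task type and whether
    it was AI-authored. High-risk AI-authored changes get an extra
    (ideally senior) reviewer."""
    base = 2 if task_type in HIGH_RISK else 1
    if ai_authored and task_type in HIGH_RISK:
        return base + 1
    return base
```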
- Agent performance monitoring dashboards
- Sectors: software, product tool vendors, enterprise IT
- Use case: Track acceptance rate over time per agent (e.g., Devin’s improving trend) and by task type; monitor model drift and identify when an agent’s strengths change.
- Tools/workflows: A PR analytics dashboard that charts acceptance by agent-task-week, LOESS smoothing for trends, and alerts when acceptance materially deviates; tagged PRs including agent identity and task type.
- Assumptions/dependencies: Requires consistent tagging of PRs; observational nature implies non-causal interpretations. Data access and instrumentation across repositories must be standardized.
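The core dashboard aggregation can be sketched in a few lines. The record schema (`agent`, `week`, `merged`) is a hypothetical convention, not the AIDev dataset format; smoothing (e.g., LOESS) and alerting would sit on top of these per-week rates.

```python
from collections import defaultdict

def weekly_acceptance(prs):
    """Aggregate tagged PR records into acceptance rates per (agent, week).
    Each record is a dict with 'agent', 'week', and boolean 'merged' keys
    (a hypothetical schema for illustration)."""
    counts = defaultdict(lambda: [0, 0])  # (agent, week) -> [merged, total]
    for pr in prs:
        key = (pr["agent"], pr["week"])
        counts[key][0] += int(pr["merged"])
        counts[key][1] += 1
    return {k: merged / total for k, (merged, total) in counts.items()}
```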
- Combined-agent workflows for open-source maintainers
- Sectors: software/open source
- Use case: Publish contribution guidelines that specify which tasks are suitable for AI-generated PRs (e.g., docs/chore) and require extra evidence/tests for fix/perf tasks submitted by agents.
- Tools/workflows: Contribution policy pages with task-type guidance; labels for “AI-authored” PRs; auto-comment bots that request tests or benchmarks for high-risk task types.
- Assumptions/dependencies: Community governance support and maintainers’ capacity to enforce policies; accurate task categorization.
- Procurement and tool evaluation checklists for organizations
- Sectors: finance, healthcare, government IT, enterprise software
- Use case: Avoid relying solely on global metrics when comparing coding assistants; require task-stratified performance reports, temporal trend analysis, and evidence of statistical significance.
- Tools/workflows: RFP templates that demand per-task acceptance rates, Bonferroni-adjusted statistical tests, and effect sizes (phi) for core activities like fix and feat.
- Assumptions/dependencies: Vendors can provide task-stratified data; teams can instrument trials to collect comparable per-task outcomes.
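The phi effect size requested in such RFP templates is a standard statistic computable directly from a 2x2 agent-by-outcome table; the table layout in the docstring is just one convention.

```python
import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi effect size for a 2x2 contingency table:

               merged  not merged
      agent 1    a         b
      agent 2    c         d
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0
```

In an evaluation pipeline this would accompany the Chi-square p-value (with a Bonferroni adjustment across task categories) so that statistical significance is reported alongside effect magnitude.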
- Developer-level best practices for daily use
- Sectors: daily life (individual developers, educators)
- Use case: Choose agents strategically per task (docs/chore to agents with higher acceptance; fixes to Cursor/Codex; avoid trusting AI fully on performance-critical code); maintain a local tracking sheet of acceptance outcomes.
- Tools/workflows: IDE macros or extensions to select agents per task; local scripts that log agent, task type, and outcome; habit of running static/security scans on AI-generated changes.
- Assumptions/dependencies: Access to multiple assistants; willingness to instrument personal workflow; data privacy constraints when using cloud-based agents.
- Vendor and research communications with task-stratified leaderboards
- Sectors: software tool vendors, academia
- Use case: Publish per-task leaderboards instead of “one-number” rankings; highlight strengths (e.g., Codex on fix/refactor; Cursor on tests; Claude Code on docs/features).
- Tools/workflows: Public documentation and benchmark pages using the paper’s stratification; regular updates with temporal plots; replication materials for transparency.
- Assumptions/dependencies: Ongoing data collection; acceptance rates interpreted alongside quality signals.
Long-Term Applications
The following longer-horizon applications require further research, scaling, standardization, or development before widespread deployment.
- Multi-agent orchestration platforms (“AgentOps” control planes)
- Sectors: software, enterprise IT, devtool vendors
- Use case: Automatically classify tasks and route them to the best-performing agent; observe temporal changes and dynamically update routing; integrate quality gates (tests, static analysis) to auto-merge low-risk tasks (e.g., docs/chore) while enforcing human-in-the-loop for high-risk ones.
- Tools/products: Agent marketplace with per-task ratings; orchestration layer for VS Code/JetBrains; policy-as-code modules for CI pipelines.
- Assumptions/dependencies: Robust task classification, standardized agent APIs, reliable quality metrics beyond acceptance, privacy/security compliance across agents.
- Standardized, task-stratified benchmarking consortia
- Sectors: academia, tool vendors, standards bodies
- Use case: Establish community benchmarks that report per-task acceptance and quality metrics; extend datasets (e.g., SWE-bench variants) to include stratified categories and longitudinal tracking.
- Tools/products: Benchmark suites with replication packages; public leaderboards with effect sizes and multiple-comparison controls; mixed-effects models that account for repository-level variance.
- Assumptions/dependencies: Shared datasets across diverse projects (beyond 100+ star repos), consensus on reporting standards, and sustained funding.
- Causal and quality-aware evaluation frameworks
- Sectors: academia, regulated industries
- Use case: Move beyond acceptance to include maintainability, security, and technical debt; incorporate mixed-effects models to mitigate repository clustering; use difference-in-differences and quasi-experimental designs to estimate impact.
- Tools/products: Research toolkits combining static analysis, complexity metrics, vulnerability scanning, and longitudinal drift detection.
- Assumptions/dependencies: Access to fine-grained code quality signals; agreement on validity of proxies; institutional support for controlled trials.
- Sector-specific AI coding governance and compliance
- Sectors: healthcare, finance, energy, government
- Use case: Define policies that restrict autonomous agent contributions on safety-critical/performance-sensitive modules; require per-task stratified evidence in audits and change control; mandate disclosure of AI-generated code.
- Tools/products: Compliance templates and audit checklists; automated policy enforcement in CI; documentation standards for AI-authored PRs.
- Assumptions/dependencies: Regulatory buy-in; integration with existing SDLC controls; clear accountability lines for AI contributions.
- Adaptive review allocation algorithms
- Sectors: software, enterprise IT
- Use case: Use historical per-task acceptance and quality data to allocate reviewer expertise and count dynamically; prioritize senior reviewers for fix/perf tasks where agents underperform.
- Tools/products: ML services that predict review needs; workload balancers that reduce bottlenecks while protecting quality.
- Assumptions/dependencies: Access to labeled historical data; cultural acceptance of algorithmic workload assignment; safeguards against bias.
- Curriculum and training programs on task-stratified AI-assisted software engineering
- Sectors: education, professional upskilling
- Use case: Teach developers to interpret stratified metrics, design agent routing policies, and recognize when acceptance rates can mislead; train on integrating quality gates for AI code.
- Tools/products: Course modules, labs using public replication packages, capstone projects building AgentOps pipelines.
- Assumptions/dependencies: Academic-industry collaboration; accessible datasets and tools; evolving best practices.
- Cost-performance and risk calculators for AI coding assistants
- Sectors: enterprise IT, finance, tool vendors
- Use case: Model ROI by task mix (e.g., documentation-heavy vs. fix-heavy teams), factoring acceptance, review costs, and post-merge maintenance; aid procurement and budgeting decisions.
- Tools/products: Dashboards that simulate outcomes under different task distributions; per-task cost/quality tradeoff reports.
- Assumptions/dependencies: Accurate cost models for review and maintenance; incorporation of quality signals beyond acceptance.
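The core of such a calculator is a weighted average over the team's task mix. The example rates below reuse figures from the paper's abstract; treating them as stable planning inputs is an assumption, and a real model would add review and maintenance costs per task type.

```python
def expected_acceptance(task_mix: dict, per_task_rates: dict) -> float:
    """Expected acceptance rate for a team, given the share of each task
    type in its workload and per-task acceptance rates (both as fractions).
    Unknown task types contribute zero, a conservative assumption."""
    return sum(
        share * per_task_rates.get(task, 0.0)
        for task, share in task_mix.items()
    )
```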
- Longitudinal drift detection and change management
- Sectors: software, devtool vendors
- Use case: Monitor agents for performance drift over time (e.g., Devin’s improving trend vs. other stable agents) and trigger controlled rollouts, A/B tests, or routing updates.
- Tools/products: Observability pipelines for agent performance; governance processes for safe agent updates.
- Assumptions/dependencies: Continuous instrumentation; disciplined release management; agreement on drift thresholds.
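A simple baseline-versus-recent comparison is enough to sketch such a drift trigger; the window size and threshold below are illustrative assumptions that each team would tune to its own agreed drift criteria.

```python
def acceptance_drift(weekly_rates: list, window: int = 4,
                     threshold: float = 0.05) -> bool:
    """Flag drift when the mean acceptance rate of the last `window` weeks
    deviates from the preceding `window`-week baseline by more than
    `threshold`. Returns False when there is not enough history."""
    if len(weekly_rates) < 2 * window:
        return False
    recent = sum(weekly_rates[-window:]) / window
    baseline = sum(weekly_rates[-2 * window:-window]) / window
    return abs(recent - baseline) > threshold
```

A drift flag would then feed governance processes: pause auto-routing to the affected agent, run an A/B comparison, or roll the routing table back.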
- Safe auto-merge policies for low-risk task categories
- Sectors: open source, enterprise software
- Use case: Auto-merge AI-generated documentation/chore PRs that pass static checks and linters; require manual gates for fix/feat/perf tasks.
- Tools/products: CI policies, rule engines, “trust tiers” by task type.
- Assumptions/dependencies: High-confidence task classification; robust static analysis coverage; clear rollback procedures.
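The "trust tiers" idea reduces to a small predicate in a CI rule engine. The tier membership below is an assumption motivated by the paper's acceptance-rate findings, and passing automated checks is a hard precondition regardless of tier.

```python
# Low-risk task types eligible for auto-merge when AI-authored; tier
# membership is an illustrative assumption, not a universal policy.
AUTO_MERGE_TIER = {"docs", "chore"}

def can_auto_merge(task_type: str, checks_passed: bool,
                   ai_authored: bool) -> bool:
    """Decide whether a PR may bypass manual review. All automated checks
    (static analysis, linters, tests) must pass; AI-authored PRs are
    additionally restricted to the low-risk tier."""
    if not checks_passed:
        return False
    if ai_authored:
        return task_type in AUTO_MERGE_TIER
    return True  # human-authored PRs follow the normal merge policy
```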
- Repository-level modeling and fairness safeguards
- Sectors: academia, tool vendors
- Use case: Develop methods that explicitly model repository clustering and policy heterogeneity; ensure fair cross-agent comparisons and mitigate bias from project selection.
- Tools/products: Statistical libraries for mixed-effects modeling; data schemas that capture repository characteristics.
- Assumptions/dependencies: Rich metadata on repositories; community agreement on fairness criteria.