AIDev Public Dataset Overview

Updated 1 February 2026
  • AIDev Public Dataset is a large-scale resource capturing autonomous coding agents’ pull requests and interactions in real-world GitHub environments.
  • It aggregates detailed metrics on code changes, review timings, and acceptance rates to enable robust analysis of human–AI collaboration.
  • The dataset’s extensible design supports benchmarking, trust calibration, and governance studies in evolving SE 3.0 workflows.

The AIDev Public Dataset is a large-scale, structured resource capturing the emergence and in situ operation of autonomous coding agents—such as OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code—across the diverse real-world environments of open source software engineering. Unlike synthetic or curated code benchmarks, AIDev empirically records how such agentic teammates initiate, review, and integrate code within actual multi-developer, multi-repository workflows, supporting research in agent benchmarking, pacing, collaboration modeling, productivity optimization, and governance in “SE 3.0” environments (Li et al., 20 Jul 2025).

1. Scope, Coverage, and Motivation

AIDev was motivated by critical empirical gaps left unaddressed by function-level or token-level LLM benchmarks, such as SWE-bench. It aggregates 456,535 autonomous pull requests (PRs), or “Agentic-PRs,” generated by five leading AI coding agents spanning 61,453 GitHub repositories and 47,303 identified human developers. The dataset’s comprehensive structure enables quantification of speed, acceptance, complexity, and collaboration dynamics unique to the interplay of human developers and agentic teammates.

A central tenet is providing ground-truth data that reflect not only agent productivity and technical output, but also the nuanced sociotechnical phenomena arising from real-world use—such as reduced merge rates despite increased code throughput and structural simplicity of agent-generated code. This supports systematic study of trust, review mechanics, and workflow integration in AI-native software engineering environments (Li et al., 20 Jul 2025).

2. Data Acquisition, Schema, and Storage

AIDev collects its data via targeted GitHub REST API queries using branch prefixes (e.g., head:codex/), PR bodies (e.g., “Co-Authored-By: Claude”), and bot account naming conventions. The data span each agent’s public release date through June 22, 2025.
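The branch-prefix strategy above can be sketched as follows. The exact queries AIDev used are not reproduced in this article; this is an illustrative construction of a GitHub search-API request for PRs whose head branch carries an agent prefix, with the `AGENT_BRANCH_PREFIXES` mapping assumed from the examples given in the text.

```python
# Illustrative sketch of branch-prefix harvesting (not AIDev's actual code).
import urllib.parse

GITHUB_SEARCH = "https://api.github.com/search/issues"

# Hypothetical prefix-to-agent mapping, based on the head:codex/ example above.
AGENT_BRANCH_PREFIXES = {"codex": "codex/"}

def build_search_query(prefix: str, until: str = "2025-06-22") -> str:
    """Build a search URL for PRs whose head branch starts with `prefix`,
    created up to the dataset's cutoff date."""
    q = f"is:pr head:{prefix} created:<={until}"
    return f"{GITHUB_SEARCH}?q={urllib.parse.quote(q)}&per_page=100"

url = build_search_query(AGENT_BRANCH_PREFIXES["codex"])
# A real harvester would now GET `url` with an auth token and paginate,
# then cross-check PR bodies and bot account names as described above.
```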

The schema centers on the pull_request.csv table, indexed by unique PR identifier. Core fields include PR/repo/user IDs, an agent flag, creation/closure timestamps, merge status, commit identifiers, and free-text title/body. Linked supporting tables cover repository metadata, user details (bot/human), timeline event audit trails (opened, merged, labeled), PR review records (reviewer, timestamp, action), per-commit code diffs (additions, deletions, patch text), PR-to-issue links, and additional issue metadata.

All data are delivered as CSV files, with SQL schema recreation scripts and pandas-compatible Python notebooks included in the repository (https://github.com/SAILResearch/AI_Teammates_in_SE3).

Table              Key Fields                                          Example Purpose
pull_request.csv   pr_id, repo_id, author_id, agent_flag, timestamps   Core PR-level analytics
pr_commit_details  commit_id, file_path, patch_text, code metrics      Fine-grained code change analysis
pr_reviews         pr_id, review_id, reviewer_id, state, submitted_at  Review and gating process
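The recommended pull_request–pr_commit_details join can be sketched with pandas, the tooling the repository's notebooks use. The toy rows below merely mimic the schema above; a real analysis would read the released CSVs, and the exact column names are assumptions based on the table.

```python
# Toy join of the two core tables (real analyses would read the CSVs).
import pandas as pd

pull_request = pd.DataFrame({
    "pr_id":      [1, 2, 3],
    "repo_id":    [10, 10, 11],
    "agent_flag": ["codex", "devin", "human"],
    "merged":     [True, False, True],
})

pr_commit_details = pd.DataFrame({
    "pr_id":     [1, 1, 2],       # PR 1 has two commits
    "additions": [12, 3, 40],
    "deletions": [2, 0, 8],
})

# Aggregate per-commit diffs up to the PR level, then attach to PR metadata.
per_pr = (pr_commit_details
          .groupby("pr_id", as_index=False)[["additions", "deletions"]]
          .sum())
stats = pull_request.merge(per_pr, on="pr_id", how="left")
```

A left join keeps PRs with no recorded diffs (a known pitfall, see Section 6) as rows with missing code statistics rather than dropping them silently.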

3. Agent Distribution, PR Flow, and Acceptance Metrics

Codex PRs account for ∼90% (411,621) of Agentic-PRs, with Devin (24,893), Copilot (16,531), Cursor (1,981), and Claude Code (1,509) comprising the remainder. These PRs are distributed across projects in all major programming languages and repository maturities.

Task-type distributions mirror human contributions (>55% of agent PRs are features or fixes). However, Copilot’s output skews towards fix tasks (42.2%), while Cursor (41.7%) and Claude Code (49.5%) are more feature-focused.

Acceptance (merge) rates are consistently lower for agents: Codex (64% merged), Devin (49%), Copilot (35%)—vs. 79% for human PRs. Notably, agent PRs involving documentation are merged at slightly higher rates than those of human authors (Codex docs merged at 88.6% vs. 76.5% for humans), suggesting task-type-dependent boundary conditions for trust and acceptance (Li et al., 20 Jul 2025).

Median review-to-resolution times are typically shorter for accepted agent PRs (Codex: 0.3 h; humans: 3.9 h), but rejected PRs from agents are either quickly dismissed or languish longer (Copilot rejected: 4.6 h vs. human: 27.6 h).
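The two headline metrics in this section, per-agent merge rate and median open-to-close time, reduce to a short groupby over the PR table. The snippet below runs on toy rows with assumed column names; the numbers are illustrative, not the dataset's.

```python
# Per-agent merge rate and median resolution time on toy PR rows.
import pandas as pd

prs = pd.DataFrame({
    "agent_flag": ["codex", "codex", "human", "human"],
    "merged":     [True, False, True, True],
    "created_at": pd.to_datetime(["2025-05-01 10:00", "2025-05-01 11:00",
                                  "2025-05-02 09:00", "2025-05-03 09:00"], utc=True),
    "closed_at":  pd.to_datetime(["2025-05-01 10:18", "2025-05-01 16:00",
                                  "2025-05-02 13:00", "2025-05-03 12:00"], utc=True),
})

prs["hours_open"] = (prs["closed_at"] - prs["created_at"]).dt.total_seconds() / 3600
summary = prs.groupby("agent_flag").agg(
    merge_rate=("merged", "mean"),        # fraction of PRs merged
    median_hours=("hours_open", "median"),  # median open-to-close time
)
```

Splitting the same aggregation by merge outcome reproduces the accepted-vs-rejected asymmetry reported above.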

4. Code Complexity, Structural Analysis, and Metrics

AIDev supports detailed measurement of code change complexity. Cyclomatic complexity (v(G) = E − N + 2P, with E edges and N nodes in the control flow graph, and P connected components) and Halstead volume (based on operator/operand counts) are computed from PR diffs. Only 9.1% of Codex PRs alter cyclomatic complexity compared to 23.3% of human PRs, and the variance among non-zero ΔCC values is substantially narrower for agentic code interventions. A plausible implication is that agents tend to generate structurally simpler, more boilerplate-like changes, with rare excursions into higher-complexity modifications (Li et al., 20 Jul 2025).
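The cyclomatic-complexity formula above is a direct edge/node count over a control flow graph, as this minimal sketch shows:

```python
# v(G) = E - N + 2P, computed from control-flow-graph counts.
def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe cyclomatic complexity for a control flow graph with
    `edges` edges, `nodes` nodes, and `components` connected components."""
    return edges - nodes + 2 * components

# Straight-line code (start -> body -> end): 3 nodes, 2 edges -> v(G) = 1.
# A single if/else (condition, two branches, merge point):
# 4 nodes, 4 edges -> v(G) = 2, i.e. one added decision point.
```

Per-PR ΔCC is then the difference between v(G) computed on a function's control flow graph before and after the diff.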

5. Trust, Collaboration, and Workflow Dynamics

Despite strong performance on curated benchmarks (>70% solution rates), agent PRs exhibit a substantial “trust gap” in live workflows: real-world merge rates for agent PRs remain in the 35–65% range. Human review remains the dominant gatekeeping mechanism, with 58.2% of agentic PRs receiving no explicit review, and only 21.8% reviewed by humans alone. There is a pronounced polarization for Codex PRs—rapid acceptance or rapid rejection—suggesting that reviewers may be both highly discerning and responsive to agent-originated submissions. Subgroup analyses indicate that documentation and boilerplate contributions are more likely to be accepted, while functionally or structurally complex contributions arouse greater scrutiny (Li et al., 20 Jul 2025).

Furthermore, developer throughput accelerates markedly in the presence of agents: e.g., one developer produced 164 Codex PRs in three days, compared to 176 human PRs over three years, underscoring a fundamental shift in the human–AI productivity dynamic.

6. Usage Recommendations, Pitfalls, and Extensibility

The dataset is designed for extensibility and direct analytics. Recommended filters include restricting analysis to mature projects (e.g., star_count ≥ 500) and normalizing all timestamps to UTC. Users should join pull_request with pr_commit_details for per-PR code statistics, and be aware of the following pitfalls: missing agent_flag metadata (e.g., if “Co-Authored-By” is disabled), partial or malformed PR diffs, bot account renaming, and latency artifacts from CI/CD throttling.
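The recommended preprocessing can be sketched in a few pandas lines. The star_count threshold comes from the text; the repository-table and timestamp column names are assumptions consistent with the schema sketched earlier.

```python
# Recommended preprocessing: mature projects only, timestamps on UTC.
import pandas as pd

repos = pd.DataFrame({"repo_id": [10, 11], "star_count": [1200, 40]})
prs = pd.DataFrame({
    "pr_id":      [1, 2],
    "repo_id":    [10, 11],
    "created_at": ["2025-05-01T10:00:00+02:00", "2025-05-01T09:00:00-05:00"],
})

# Keep only PRs from mature repositories (star_count >= 500).
mature = repos.loc[repos["star_count"] >= 500, ["repo_id"]]
prs = prs.merge(mature, on="repo_id", how="inner")

# Normalize mixed-offset timestamps to UTC for cross-repo comparability.
prs["created_at"] = pd.to_datetime(prs["created_at"], utc=True)
```

An inner join against the filtered repository table doubles as a guard against PRs whose repository metadata is missing entirely.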

Key research directions include benchmarking new agent designs, classifying failure/rejection modes by mining feedback, modeling human–agent code reviews, designing triage systems to optimize human review allocation, and longitudinal analyses linking agent interventions to post-merge defects.

AIDev thus anchors empirical investigation into SE 3.0—supporting analysis of real-world agent impact, governance, and hybrid team workflow optimization at unprecedented scale and fidelity (Li et al., 20 Jul 2025).

7. Significance, Applications, and Future Directions

AIDev enables a spectrum of research and development use cases:

  • Benchmarking upcoming autonomous coding agents against in-the-wild PR performance and acceptance rates.
  • Robust statistical modeling of collaboration patterns, review bottlenecks, and developer adaptation.
  • Design and evaluation of governance frameworks, trust calibration mechanisms, and review automation in large-scale industrial repositories.
  • Analysis of code complexity evolution and agent-driven productivity surges in multi-repository contexts.
  • Longitudinal tracking of agent-generated code quality via downstream signals such as bug reports and issue reopening rates.

The dataset is continually extensible and provided as a living resource, supporting upgrades with additional agent categories, extended time spans, and integration with further code metrics and review taxonomy. Its open-access structure, rich metadata, and direct connection to real-world SE workflows mark it as a foundational asset for SE 3.0 and human–AI collaboration research (Li et al., 20 Jul 2025).
