- The paper finds that AI agents generate predominantly beginner-level (A1/A2) code, with over 90% of constructs falling in this range.
- It employs a large-scale CEFR-based pycefr analysis on 591 PRs to systematically compare AI-generated and human-authored Python code.
- The study reveals that complex, advanced constructs arise mainly in feature addition and bug fix tasks, suggesting targeted review strategies.
Assessing Python Code Proficiency in AI Agent-Generated Code
Introduction
The adoption of AI-based code generation agents (e.g., GitHub Copilot, Cursor, Devin) is transforming core workflows in contemporary software engineering, shifting the developer's role toward code review and maintenance. Despite the high throughput of AI code generation observed in real-world development (e.g., over 400,000 PRs by Codex in under two months), there has been little empirical assessment of the linguistic and structural complexity of the generated code. This study conducts a large-scale analysis of Python code generated by AI agents, employing the pycefr tool to classify code constructs according to a CEFR-inspired six-level proficiency schema (A1 to C2). The central objectives are to quantify the proficiency profile of generated code, compare AI-generated code with human-authored code, and identify the PR task types that drive the production of advanced constructs.
Proficiency Profiling with CEFR and pycefr
The analytical framework is built on the Common European Framework of Reference for Languages (CEFR), which segments language proficiency into six progressive levels (A1, A2, B1, B2, C1, C2). The pycefr static analysis tool extends this taxonomy to Python code by mapping language constructs (e.g., list comprehensions, generator expressions, decorators) to these levels, thereby estimating the implicit proficiency required for comprehension and maintenance.
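To make this mapping concrete, the sketch below pairs common Python constructs with the kind of levels such a taxonomy assigns. The level labels in the comments are illustrative approximations for intuition, not pycefr's official assignment table.

```python
# Illustrative pairing of Python constructs with CEFR-style levels.
# The level labels below are approximate, not pycefr's exact table.

# Roughly A1/A2: simple assignment, a for loop, a function call
total = 0
for n in [1, 2, 3]:
    total += n

# Roughly B1: a list comprehension replacing the loop above
squares = [n * n for n in range(10) if n % 2 == 0]

# Roughly B2/C1: a generator expression and a decorator
lazy_squares = (n * n for n in range(10))

def log_call(func):
    """Decorator that prints the wrapped function's name before calling it."""
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

@log_call
def add(a, b):
    return a + b
```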
The dataset is curated from the AIDev corpus, filtering to 591 PRs from 145 starred Python repositories linked to AI agent authorship. The extraction pipeline reconstructs both human and AI-generated code diffs to isolate newly introduced code, ensuring analytic granularity at the construct level for reliable proficiency assessment.
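A minimal sketch of the added-line extraction step is shown below, assuming unified-diff input. It is an illustrative reconstruction of the idea of isolating newly introduced code, not the authors' exact pipeline.

```python
def added_python_lines(unified_diff: str) -> list[str]:
    """Return lines newly introduced by a diff (prefixed with '+'),
    skipping file headers ('+++'), so only the added code is analyzed."""
    added = []
    for line in unified_diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
    return added

# Example: only the newly added list comprehension is kept for analysis.
diff = """\
--- a/util.py
+++ b/util.py
@@ -1,1 +1,1 @@
-result = []
+result = [x * 2 for x in values]
"""
print(added_python_lines(diff))  # ['result = [x * 2 for x in values]']
```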
Figure 1: Study overview illustrating data curation, code extraction, and pycefr-based proficiency analysis pipelines.
Distribution of Proficiency Levels in AI-Generated Code
The dominant finding is that AI agents overwhelmingly generate code at the lower end of the proficiency spectrum. Specifically, over 90% of code constructs are classified as A1 (Breakthrough) or A2 (Waystage), with less than 1% at C2 (Mastery). Intermediate (B1/B2) and advanced (C1/C2) constructs are rare across all agents, with C1 peaking at 1.18% (Copilot) and C2 never exceeding 0.44% for any agent. Statistical analysis indicates that, although the differences in proficiency distributions between agents are significant, their effect size is negligible.
These results establish that current-generation AI agents produce code whose construct-level complexity is accessible to developers with only basic Python proficiency.
Comparative Analysis of AI versus Human-Authored Code
When controlling for repository and project alignment, human-written and agent-generated code exhibit broadly similar proficiency distributions. Both predominantly employ A1- and A2-level constructs, with intermediate and advanced constructs together comprising less than 10%. Notably, AI agents produce marginally more C1 code (1.12% versus 0.85%), while humans introduce slightly more C2 constructs (0.57% versus 0.42%). Both differences are statistically significant yet have minimal effect sizes; hence, in practical terms, the review and maintenance burden of AI-generated code mirrors that of human-written code at scale.
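One plausible way to obtain a result that is significant yet negligible in effect size is a chi-square test on the level distributions paired with Cramér's V, sketched below on hypothetical construct counts. The paper's exact statistical procedure may differ, and the numbers here are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical construct counts per CEFR level (A1..C2) for the two groups;
# these figures are made up for illustration, not the paper's data.
counts = np.array([
    [52000, 31000, 4200, 2500, 980, 370],   # AI-generated
    [48000, 30000, 4600, 2700, 740, 500],   # human-authored
])

chi2, p, dof, _ = chi2_contingency(counts)

# Cramér's V as an effect-size measure: values near 0 indicate a negligible
# association even when p is tiny due to the very large sample size.
n = counts.sum()
cramers_v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))

print(f"p = {p:.3g}, Cramér's V = {cramers_v:.4f}")
```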
PR Task Typology and Emergence of High-Proficiency Code
Analysis of outlier PRs—those containing abnormally high numbers of C1 or C2 constructs—reveals a strong association with feature additions and bug fixes. Feature addition PRs account for the bulk of high-proficiency code, while build and chore tasks exhibit the highest average advanced construct counts per PR. Refactoring and performance tasks, although less frequent, also contribute to the advanced code footprint.
Figure 2: Task distribution among outlier agent PRs with extreme counts of proficient code (log-scaled C1 + C2 occurrences).
This pattern suggests that AI agents produce complex, high-proficiency code primarily under tasks demanding substantial structural or behavioral extension.
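A minimal sketch of how such outlier PRs could be flagged is given below, using an IQR rule over per-PR counts of C1 + C2 constructs. The study's precise outlier criterion is not reproduced here, so both the rule and the PR counts are illustrative assumptions.

```python
import numpy as np

def flag_outlier_prs(c1c2_counts: dict[str, int]) -> list[str]:
    """Flag PRs whose C1 + C2 construct count exceeds Q3 + 1.5 * IQR.
    The IQR rule is an illustrative choice, not the study's definition."""
    values = np.array(list(c1c2_counts.values()))
    q1, q3 = np.percentile(values, [25, 75])
    threshold = q3 + 1.5 * (q3 - q1)
    return [pr for pr, count in c1c2_counts.items() if count > threshold]

# Hypothetical per-PR counts of advanced (C1 + C2) constructs.
counts = {"PR-101": 0, "PR-102": 1, "PR-103": 0, "PR-104": 2, "PR-105": 14}
print(flag_outlier_prs(counts))  # ['PR-105']
```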
Representative Example of Complex AI-Generated Constructs
A concrete instance is a code edit in the "AGI-Alpha-Agent-v0" repository, where an AI agent introduces a nested list comprehension with an inline conditional (see Figure 3). The idiom is concise but poses a comprehension barrier for less proficient Python developers due to its compactness and syntactic density.
Figure 3: AI agent code change deploying a list comprehension with an inline if clause.
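The actual diff appears in Figure 3; the snippet below is a hypothetical example of the same idiom, a nested list comprehension with an inline conditional, and is not the code from the AGI-Alpha-Agent-v0 repository.

```python
# Hypothetical illustration of the idiom in question: a nested list
# comprehension with an inline conditional expression. Dense one-liners
# like this raise the proficiency needed to review the change.
matrix = [[1, -2, 3], [-4, 5, -6]]
clipped = [[x if x > 0 else 0 for x in row] for row in matrix]
print(clipped)  # [[1, 0, 3], [0, 5, 0]]

# A more verbose equivalent that A1/A2-level readers can follow easily:
clipped_verbose = []
for row in matrix:
    new_row = []
    for x in row:
        new_row.append(x if x > 0 else 0)
    clipped_verbose.append(new_row)
```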
Such constructs, while idiomatic, increase the necessary review proficiency and may impede onboarding or maintenance efforts unless reviewers possess commensurate expertise.
Implications and Future Directions
These findings demonstrate that, while code-generation agents predominantly produce code accessible to programmers with elementary Python proficiency, specific PRs, especially those adding new features or complex fixes, may require advanced review skills. From a process-management perspective, PR triage and code review assignment may warrant explicit stratification by measured code proficiency, so that code complexity is matched to reviewer skill.
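As a minimal sketch of such proficiency-aware triage, assuming per-PR level counts are already available from a pycefr-style analysis, a PR could be routed to a reviewer pool by the highest level present in its added code. The pool names and routing policy below are hypothetical, not a recommendation from the study.

```python
# Hypothetical triage rule: route a PR by the most advanced CEFR level
# present in its newly added code. Pool names and thresholds are illustrative.
LEVEL_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def route_pr(level_counts: dict[str, int]) -> str:
    """Assign a reviewer pool based on the most advanced construct level
    that actually occurs in the PR's added code."""
    highest = "A1"
    for level in LEVEL_ORDER:
        if level_counts.get(level, 0) > 0:
            highest = level
    if highest in ("C1", "C2"):
        return "senior-review"
    if highest in ("B1", "B2"):
        return "standard-review"
    return "lightweight-review"

print(route_pr({"A1": 40, "A2": 12, "C1": 2}))  # 'senior-review'
```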
On the research front, these results motivate further inquiry into the impact of code proficiency on review outcomes, defect rates, and project maintainability in AI-augmented workflows. Investigations into agent prompting strategies, fine-tuning for consistent code proficiency, and task-specific code complexity profiling are also logical continuations.
Conclusion
Systematic analysis of AI-generated Python code in open-source PRs indicates a strong predominance of beginner- and elementary-level constructs, yielding a maintenance scenario comparable to that of human-authored code. However, advanced constructs emerge in the context of complex tasks, suggesting the need for enhanced review practices and targeted upskilling. The proficiency-aware assessment approach outlined here provides a solid baseline for future empirical and tool-based research at the intersection of AI agent coding and collaborative software development.