- The paper demonstrates that AI-generated code shows varied levels of correctness and runtime performance across nine programming languages.
- The paper uses tasks like numerical integration, a conjugate gradient solver, and a parallel heat equation solver to rigorously benchmark ChatGPT’s code generation, highlighting language-specific challenges.
- The paper finds that although Matlab, C++, and Java yield superior code quality, AI systems still face hurdles in parallel programming and error handling.
Evaluating AI-Generated Code Across Multiple Programming Languages
This essay explores the assessment of AI-generated code, focusing on the ability of ChatGPT versions 3.5 and 4.0 to generate code in C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust. The study evaluates accuracy and efficiency in producing functional scientific programs, specifically a numerical integration routine, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver.
Introduction to AI Code Generation
AI code generation represents a transformative approach in software development, providing rapid and cost-effective solutions tailored to specific requirements. Tools like GitHub Copilot have emerged to assist developers, offering functionalities such as code autocompletion and interactive coding support through tiered subscriptions. However, the emergence of AI code generators challenges traditional programming education and practices. While they simplify coding tasks, issues of code correctness and maintainability persist, necessitating comprehensive evaluation.
ChatGPT, built on OpenAI's generative pre-trained transformer architecture, is a prominent player, capable of producing human-like text and code. The paper investigates ChatGPT's coding capabilities, emphasizing the importance of accuracy in compiling and executing generated code across diverse programming languages.
Methodology of Code Evaluation
The evaluation involved instructing ChatGPT to generate code for three computational problems: numerical integration, a conjugate gradient solver, and a parallel 1D heat equation solver. These tasks were chosen for their relevance in scientific computing and varied complexity.
For numerical integration, the objective was to compute the area under sin(x) from −π to 32π. The conjugate gradient solver involved solving Ax = b for a symmetric positive definite matrix A. Lastly, the heat equation solver assessed ChatGPT's proficiency in parallel programming, crucial for high-performance computing applications.
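The first task can be illustrated with a composite trapezoidal rule. The sketch below (in Python, one of the nine studied languages) uses the bounds stated above; it is an illustration of the task, not the paper's own generated code:

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule for the integral of f over [a, b] with n panels."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h

# Bounds as stated in the essay: -pi to 32*pi.
# For these bounds the exact value is -cos(32*pi) + cos(-pi) = -2.
approx = trapezoid(math.sin, -math.pi, 32 * math.pi, 200_000)
```

With 200,000 panels the trapezoidal error for this smooth integrand is well below 1e-4.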
The paper extensively analyzed code compilation, runtime performance, and correctness across all languages. While most of the generated code compiled successfully, runtime errors and incorrect outputs highlighted limitations in AI-generated programming.
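For the second task, the conjugate gradient method itself is standard; a minimal pure-Python sketch (again illustrative, not the generated code under study) looks like this:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A (lists of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual b - A x, with initial guess x = 0
    p = r[:]          # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x
```

In exact arithmetic the method converges in at most n iterations, which is why it is a common benchmark for linear-algebra code generation.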
Compilation and Runtime
For numerical integration, Fortran faced compilation issues, whereas Java and C++ codes executed correctly. Conjugate gradient solvers generally compiled well, except for Rust and Fortran in version 3.5. Parallel heat equation solvers revealed challenges in parallel programming, with significant runtime errors and incorrect outputs, underscoring the complexity of such tasks.
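The heat equation task reduces to an explicit 1D stencil update; the paper's prompts asked for parallel versions, which distribute this update across workers. A minimal serial sketch of the core update in Python (illustrative only, with grid spacing and time step absorbed into the diffusion coefficient as an assumption):

```python
def heat_step(u, alpha):
    """One explicit Euler step of u_t = alpha * u_xx on a 1D grid.

    dx and dt are absorbed into alpha; the scheme is stable for alpha <= 0.5.
    Boundary values are held fixed (Dirichlet conditions).
    """
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

# Diffuse an initial heat spike at the center of the grid.
u = [0.0] * 21
u[10] = 1.0
for _ in range(100):
    u = heat_step(u, 0.25)
```

Parallel variants split the interior loop across threads, but neighboring points must be exchanged at each step, which is exactly the coordination that the generated parallel solvers frequently got wrong.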
Quality Metrics
The study utilized Constructive Cost Model (COCOMO) metrics and line-of-code analyses to assess quality. Matlab and R produced concise code, in contrast to Go and the C++ variant using the Kokkos library, which produced considerably longer programs. Regarding code quality, Matlab, C++, and Java demonstrated superior performance, suggesting robustness for complex computational tasks.
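Basic COCOMO estimates development effort from source line counts. The coefficients the paper used are not given in this summary, so the sketch below assumes the standard organic-mode constants (a = 2.4, b = 1.05):

```python
def cocomo_basic_effort(sloc, a=2.4, b=1.05):
    """Basic COCOMO effort estimate in person-months.

    sloc is the source-line count; a and b are the organic-mode
    constants (an assumption, as the study's exact variant is not
    stated in this summary).
    """
    kloc = sloc / 1000.0
    return a * kloc ** b
```

Under this model, a program of 1,000 lines corresponds to 2.4 person-months, and the superlinear exponent means longer generated programs are penalized disproportionately, which favors the concise Matlab and R outputs.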
Implications and Future Directions
The study underscores both the potential and the current limitations of AI in code generation. While certain languages showed promise, broader adoption requires addressing the accuracy and complexity challenges in AI-generated code. Improved AI models with better comprehension of parallel computing constructs could substantially advance scientific programming.
Future research could focus on refining AI models to better handle parallel computing constructs and enhance code correctness. Additionally, specialized metrics for assessing parallel programming quality could provide more insightful evaluations, benefiting both developers and researchers.
Conclusion
The study highlights the capabilities and challenges of AI-generated code, with findings emphasizing the need for ongoing improvement in AI systems for scientific programming. While advancements are notable, practical adoption requires meticulous attention to code quality, accuracy, and performance, paving the way for robust AI-assisted development in diverse programming environments.