
CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Published 25 May 2021 in cs.SE and cs.AI (arXiv:2105.12655v2)

Abstract: Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase software development productivity and modernize legacy applications. Advances in deep learning and machine learning algorithms have enabled numerous breakthroughs, motivating researchers to leverage AI techniques to improve software development efficiency. Thus, the fast-emerging research area of AI for Code has garnered new interest and gathered momentum. In this paper, we present a large-scale dataset CodeNet, consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages, which is aimed at teaching AI to code. In addition to its large scale, CodeNet has a rich set of high-quality annotations to benchmark and help accelerate research in AI techniques for a variety of critical coding tasks, including code similarity and classification, code translation between a large variety of programming languages, and code performance (runtime and memory) improvement techniques. Additionally, CodeNet provides sample input and output test sets for 98.5% of the code samples, which can be used as an oracle for determining code correctness and potentially guide reinforcement learning for code quality improvements. As a usability feature, we provide several pre-processing tools in CodeNet to transform source code into representations that can be readily used as inputs into machine learning models. Results of code classification and code similarity experiments using the CodeNet dataset are provided as a reference. We hope that the scale, diversity and rich, high-quality annotations of CodeNet will offer unprecedented research opportunities at the intersection of AI and Software Engineering.

Citations (191)

Summary

  • The paper introduces CodeNet, a dataset with over 14M code samples across 55 languages that sets a new benchmark for AI in coding tasks.
  • It employs rich metadata and preprocessing tools to support tasks like code classification, similarity analysis, and program translation.
  • Baseline experiments with models such as BERT and graph neural networks (GNNs) show that CodeNet-trained models generalize better than those trained on earlier datasets.

Overview of CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

The publication introduces CodeNet, a large-scale dataset designed to enhance AI techniques applied to software engineering and code-related tasks. With a collection of over 14 million code samples across 55 different programming languages, CodeNet serves as a rich resource for accelerating AI research in coding domains. The dataset aims to address critical coding tasks including code similarity, classification, translation, and runtime/memory optimization.

CodeNet stands out for its scale and diversity, surpassing previous datasets such as POJ-104 and GCJ-297 in both the number of code samples and the languages covered. Annotated metadata, together with sample input/output test cases for 98.5% of the code samples, makes the dataset well suited for benchmarking and for checking code correctness.
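The sample input/output pairs can serve as a correctness oracle in the obvious way: run a candidate program on the sample input and compare its output to the expected output. A minimal sketch of that check (the command, timeout, and whitespace normalization here are illustrative assumptions, not CodeNet tooling):

```python
import subprocess

def is_correct(cmd, sample_input, expected_output, timeout=2.0):
    """Run a candidate program on a sample input and compare its
    stdout against the expected output (whitespace-normalized)."""
    try:
        result = subprocess.run(
            cmd, input=sample_input, capture_output=True,
            text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a timeout as an incorrect submission
    return result.stdout.strip() == expected_output.strip()

# e.g. is_correct(["python3", "solution.py"], "1 2\n", "3\n")
```

Such a pass/fail signal is what the abstract alludes to when it suggests the test sets could guide reinforcement learning for code quality improvements.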

Statistical Summary and Dataset Characteristics

  • Code Samples and Languages: CodeNet comprises 13.9 million submissions for 4,053 problems, supporting 55 languages with C++, Python, Java, and C being predominant.
  • Annotations and Metadata: Each code sample is accompanied by metadata detailing problem descriptions, submission outcomes, and technical constraints like CPU time and memory usage.
  • Data Quality and Usability: By including preprocessing tools like tokenizers and simplified parse trees (SPTs), CodeNet facilitates the transformation of source code into machine learning-compatible representations. Efforts have been made to address duplicates and similar submissions to ensure data quality.
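To illustrate the kind of token stream such preprocessing produces, here is a minimal tokenizer built on Python's standard `tokenize` module. CodeNet's actual tokenizers cover many languages and their output format may differ; this sketch handles Python source only:

```python
import io
import tokenize

def token_stream(source: str):
    """Tokenize Python source into (type, text) pairs, dropping
    layout-only tokens, as a stand-in for CodeNet's tokenizers."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER}
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[t.type], t.string)
            for t in toks if t.type not in skip]

print(token_stream("x = a + 1"))
# → [('NAME', 'x'), ('OP', '='), ('NAME', 'a'), ('OP', '+'), ('NUMBER', '1')]
```

A flat token sequence like this feeds sequence models directly, while the simplified parse trees (SPTs) mentioned above supply the structure that graph-based models consume.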

Comparative Evaluation

In comparison to existing datasets, CodeNet provides several significant advantages:

  • Scale and Variety: The dataset is an order of magnitude larger than its peers, offering more comprehensive coverage of coding problems and languages.
  • Annotations: Comprehensive metadata facilitates numerous applications, from learning code semantics to optimizing code performance.
  • Data Quality: CodeNet incorporates significant data cleansing measures, including the identification and removal of near-duplicates and similar problems.
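The summary above does not specify the exact deduplication procedure; one common approach to near-duplicate detection, shown here as a hedged sketch, is Jaccard similarity over token shingles (the shingle size and threshold are illustrative choices):

```python
def shingles(tokens, k=4):
    """k-grams over a token sequence."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def near_duplicate(tokens_a, tokens_b, threshold=0.9):
    """Flag two submissions as near-duplicates when their
    token-shingle Jaccard similarity meets a threshold."""
    return jaccard(shingles(tokens_a), shingles(tokens_b)) >= threshold
```

Comparing shingle sets rather than raw text makes the check robust to superficial edits such as renamed whitespace-separated identifiers being reordered slightly, which is why shingle-based methods are a standard tool for this kind of cleansing.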

Use Cases and Implications

The dataset opens up various avenues for research and application, including:

  • Code Search and Clone Detection: The rich variety of type-4 similarity data supports advancements in code search algorithms.
  • Program Translation: CodeNet's extensive programming language variety offers a fertile ground for developing program translation models using techniques inspired by natural language processing.
  • Performance Enhancement: Metadata on runtime and memory use facilitates the development of models for predicting and optimizing code performance.
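For the program-translation use case, parallel data can be mined by pairing accepted solutions to the same problem written in two different languages. The sketch below assumes submissions are represented as dicts with `problem_id`, `language`, `status`, and `code` fields; these are hypothetical field names, not CodeNet's exact schema:

```python
def translation_pairs(submissions, src_lang="Java", tgt_lang="Python"):
    """Pair accepted solutions to the same problem across two
    languages, a common way to mine parallel data for program
    translation (field names are illustrative, not CodeNet's schema)."""
    by_problem = {}
    for s in submissions:
        if s["status"] == "Accepted":
            by_problem.setdefault(s["problem_id"], {}) \
                      .setdefault(s["language"], []).append(s["code"])
    pairs = []
    for langs in by_problem.values():
        for src_code in langs.get(src_lang, []):
            for tgt_code in langs.get(tgt_lang, []):
                pairs.append((src_code, tgt_code))
    return pairs
```

Because the paired programs solve the same problem but were written independently, such pairs are semantically aligned rather than literally aligned, which is precisely the setting unsupervised translation techniques from NLP are designed for.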

Experimental Insights

The authors conducted several baseline experiments using subsets of CodeNet, including code classification, code similarity analysis, and token inference via masked language modeling. Models such as BERT and graph neural networks (GNNs) achieved varying degrees of success on these tasks. Notably, the results suggest that models trained on CodeNet generalize better than those trained on other datasets.
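To make the code-classification task concrete, here is a deliberately tiny stand-in for those baselines: nearest-neighbor classification over bag-of-tokens cosine similarity. This is not the paper's method, only a minimal illustration of mapping a code sample to a problem label:

```python
from collections import Counter

def bag_of_tokens(code):
    """Whitespace bag-of-tokens; real pipelines use proper tokenizers."""
    return Counter(code.split())

def cosine(c1, c2):
    """Cosine similarity between two token-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(code, labeled_examples):
    """Return the label of the most similar (label, code) example."""
    vec = bag_of_tokens(code)
    return max(labeled_examples,
               key=lambda ex: cosine(vec, bag_of_tokens(ex[1])))[0]
```

In the paper's setting the label space is the set of problems and the models are learned (MLP/CNN/GNN-style architectures over tokens or parse trees); the sketch only shows the shape of the task, not a competitive approach.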

Future Prospects

The paper outlines plans for community engagement through contests and challenges aimed at driving innovation in AI for code. By fostering partnerships with initiatives such as Women in Data Science, the project emphasizes diversity and capacity building within the AI research community.

In conclusion, CodeNet represents a significant contribution to the AI-driven exploration of code and software engineering. Its extensive scale, rich annotations, and preprocessing support promise to advance numerous areas of research within AI for code. The dataset not only sets a new benchmark for datasets in this domain but also invites further collaboration and exploration among researchers.
