
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X

Published 30 Mar 2023 in cs.LG, cs.AI, and cs.SE | arXiv:2303.17568v2

Abstract: Large pre-trained code generation models, such as OpenAI Codex, can generate syntactically and functionally correct code, making programmers more productive and bringing our pursuit of artificial general intelligence closer. In this paper, we introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation. CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages as of June 2022. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on both code generation and translation on HumanEval-X. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, generating 4.7 billion tokens per week for tens of thousands of active users. Our user study demonstrates that CodeGeeX helps increase coding efficiency for 83.4% of its users. Finally, CodeGeeX is publicly accessible, and in Sep. 2022 we open-sourced its code, model weights (the 850B-token version), API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.

Citations (257)

Summary

  • The paper introduces CodeGeeX, a 13B parameter model trained on 850B tokens across 23 languages to improve code generation and translation.
  • It employs a 39-layer transformer architecture to deliver superior performance on the HumanEval-X benchmark with functional correctness metrics.
  • The model’s integration into popular IDEs enhances developer productivity, demonstrating its practical utility for real-world coding tasks.

CodeGeeX: A Multilingual Pre-Trained Model for Code Generation

The paper introduces CodeGeeX, a 13-billion-parameter multilingual model for code generation that operates across 23 programming languages. Evaluations on the HumanEval-X benchmark show that it outperforms existing multilingual code models of similar scale on both code generation and translation.

Model Architecture and Training

CodeGeeX is a 39-layer transformer decoder in the GPT style, trained with an autoregressive language-modeling objective. It has a hidden size of 5,120, a vocabulary of 52,224 tokens, and a maximum sequence length of 2,048 tokens.
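
As a sanity check, these reported dimensions imply a parameter count in the right ballpark. The sketch below is illustrative only; the function name and the 12·h² rule of thumb are our own shorthand, not the authors' implementation.

```python
def approx_decoder_params(num_layers: int, hidden: int, vocab: int) -> int:
    """Rough parameter estimate for a decoder-only transformer:
    ~12 * hidden^2 weights per layer (attention + MLP) plus the
    token-embedding matrix. Ignores layer norms, biases, and
    positional embeddings, so it slightly undercounts."""
    per_layer = 12 * hidden ** 2
    return num_layers * per_layer + vocab * hidden

# CodeGeeX's reported dimensions: 39 layers, hidden size 5120, vocab 52,224.
estimate = approx_decoder_params(39, 5120, 52224)
print(f"~{estimate / 1e9:.1f}B parameters")  # lands near the reported 13B
```

The small gap between this estimate and the reported 13B is absorbed by the terms the rule of thumb ignores.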

CodeGeeX was pre-trained on 850 billion tokens spanning 23 programming languages, using 1,536 Ascend 910 AI processors. The corpus mixes public code datasets with supplementary data crawled directly from GitHub, giving a diverse and comprehensive pre-training distribution.

Multilingual Capabilities and HumanEval-X

To rigorously evaluate CodeGeeX's performance in multilingual settings, the researchers developed HumanEval-X, an extension of the Python-only HumanEval benchmark that adds C++, Java, JavaScript, and Go. With 164 problems translated into these languages, HumanEval-X supports evaluation of both code generation and translation, using functional correctness (pass@k) as the primary metric. CodeGeeX achieved favorable results, outperforming comparable open models such as GPT-J-6B, GPT-NeoX-20B, and InCoder-6.7B on multilingual code generation tasks.
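
The functional-correctness metric behind these results is pass@k: the probability that at least one of k samples passes the problem's test suite. A minimal implementation of the standard unbiased estimator (introduced in the Codex paper, computed from n generated samples of which c pass) looks like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of which are correct) passes.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 correct out of 10 samples, pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # 0.3
```

Averaging this quantity over all 164 problems yields the benchmark score for a given language.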

CodeGeeX Applications and User Interaction

CodeGeeX has been integrated into several development environments, including Visual Studio Code and JetBrains IDEs, through extensions that offer code generation, translation, and explanation features. In the accompanying user study, 83.4% of users reported that these tools improved their coding efficiency.
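
To illustrate the completion workflow such an extension performs, the sketch below is entirely hypothetical (the function name, defaults, and chars-per-token heuristic are ours, not from the paper): it assembles a left-context prompt from the editor buffer, trimmed to fit the model's 2,048-token window.

```python
def build_prompt(file_text: str, cursor: int,
                 max_tokens: int = 2048, chars_per_token: int = 4) -> str:
    """Hypothetical sketch of prompt assembly for autoregressive code
    completion: take the text before the cursor and keep only the most
    recent characters that fit a rough token budget."""
    prefix = file_text[:cursor]               # left context only
    budget = max_tokens * chars_per_token     # crude char-level budget
    return prefix[-budget:]

source = "def add(a, b):\n    return "
prompt = build_prompt(source, len(source))
print(repr(prompt))
```

A real extension would use the model's actual tokenizer rather than a character heuristic, but the trimming step is the same idea: the prompt plus the generated continuation must fit the 2,048-token sequence length.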

The model's practical utility is evident from its rapid adoption: the extensions generate 4.7 billion tokens per week for tens of thousands of active users, demonstrating CodeGeeX's reliability and adaptability for real-world programming tasks.

Conclusion and Future Perspectives

While CodeGeeX's multilingual approach highlights the potential of generating solutions in many formal languages, the paper notes open questions about how much model capacity multilingual code generation requires and how to improve transfer between languages. As the research community explores techniques such as chain-of-thought prompting, CodeGeeX provides a robust foundation for both academic inquiry and practical advances in AI-driven code generation.
