
ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Published 3 Aug 2023 in cs.CL and cs.AI (arXiv:2308.01861v2)

Abstract: In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct ClassEval, the first class-level code generation benchmark, comprising 100 class-level Python code generation tasks built with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation compared to standalone method-level code generation benchmarks like HumanEval; and the method-level coding ability cannot equivalently reflect the class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominant superiority over other LLMs on class-level code generation, while the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder, with very similar performance. Third, we find that generating the entire class all at once (i.e. holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is a better strategy for the other models, which have limited ability to understand long instructions and utilize intermediate information. Lastly, we find that models have limited ability to generate method-dependent code, and we discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.


Summary

  • The paper introduces ClassEval, the first benchmark specifically designed to assess class-level code generation in LLMs.
  • It evaluates 11 prominent models using holistic, incremental, and compositional strategies, revealing a notable drop in performance compared to method-level tasks.
  • The study highlights the need for enhanced LLM training on complex, interdependent code structures to improve real-world software development capabilities.

Evaluating LLMs on Class-Level Code Generation with ClassEval

The rapidly evolving field of LLMs presents promising advances in code generation capabilities. Recent studies predominantly focus on function-level or statement-level code generation, often represented by benchmarks such as HumanEval. However, these benchmarks do not capture the intricacies of generating structured, multi-method code such as classes. The paper "ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation" addresses this gap by introducing ClassEval, the first benchmark specifically designed to evaluate class-level code generation tasks.

ClassEval provides a test suite and canonical implementations for 100 manually constructed class-level Python coding tasks, covering diverse topics such as management systems and game development. Constructed over approximately 500 person-hours, ClassEval challenges models to generate classes comprising multiple interdependent methods. This complex context is intended to replicate real-world software development scenarios, where code units are not isolated but interact with each other at various levels.
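To make "interdependent methods" concrete, here is a minimal, hypothetical task in the style the paper describes (not an actual ClassEval task): one method depends on state written by another, so generating `cancel` correctly requires understanding both the constructor's fields and `book`'s behavior.

```python
class TicketSystem:
    """Hypothetical ClassEval-style task: methods share mutable state,
    so each method can only be implemented correctly in class context."""

    def __init__(self):
        # Field dependency: several methods read and write this dict.
        self.bookings = {}

    def book(self, movie, seats):
        """Record a booking of `seats` seats; reject non-positive counts."""
        if seats <= 0:
            return False
        self.bookings[movie] = self.bookings.get(movie, 0) + seats
        return True

    def cancel(self, movie):
        """Method dependency: only meaningful for state written by book()."""
        return self.bookings.pop(movie, None) is not None
```

A model generating `cancel` in isolation has no way to know that bookings live in a dict keyed by movie name, which is exactly the kind of dependency the benchmark stresses.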

Empirical Evaluations

The study evaluates 11 prominent LLMs within the ClassEval framework, assessing their performance under three distinct generation strategies: holistic, incremental, and compositional generation. GPT-4 and GPT-3.5 demonstrated superior performance, yet every model declined observably when tasked with class-level code as opposed to method-level benchmarks like HumanEval. Class-level Pass@1 rates for the GPT models (37.0% for GPT-4 and 27.0% for GPT-3.5) were notably lower than their method-level results, reflecting the increased complexity. This substantial dip highlights the limits of translating function-level proficiency to class-level contexts.
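Pass@1 above follows the standard unbiased pass@k estimator popularized by HumanEval (Chen et al., 2021), evaluated at k=1; assuming ClassEval uses that convention, the metric can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per task,
    c of which pass all unit tests. Probability that at least one
    of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures for k draws to miss every pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

At k=1 this reduces to the pass fraction c/n, averaged over tasks to produce figures like the 37.0% reported for GPT-4.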

Among different generation strategies, the results indicate that holistic generation performed best for models like GPT-4 and GPT-3.5, which excel at incorporating extensive context. In contrast, other models benefited more from incremental and compositional strategies, likely due to challenges in processing long contextual instructions inherent in holistic approaches. The study reveals interesting insights, such as the ability of models to handle field dependencies more effectively than method dependencies, indicating where future model training efforts might be directed.
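The three strategies can be sketched as prompt-assembly loops around a model call. The prompt wording and the `model` callable below are hypothetical stand-ins for illustration, not the paper's exact templates:

```python
def holistic_prompt(skeleton: str) -> str:
    # Holistic: one prompt asks the model for the entire class at once.
    return f"Complete every method of this class:\n{skeleton}"

def incremental_generate(skeleton: str, method_sigs: list, model) -> str:
    # Incremental: generate method by method; each prompt includes the
    # class as generated so far, so later methods see earlier ones.
    klass = skeleton
    for sig in method_sigs:
        body = model(f"{klass}\n# Implement next: {sig}")
        klass += "\n" + body
    return klass

def compositional_generate(skeleton: str, method_sigs: list, model) -> str:
    # Compositional: each method is generated independently from the
    # skeleton alone, then all outputs are assembled afterwards.
    bodies = [model(f"{skeleton}\n# Implement alone: {sig}") for sig in method_sigs]
    return skeleton + "\n" + "\n".join(bodies)
```

The trade-off the paper reports falls out of this structure: holistic prompts are longest and reward strong long-context models, while the method-by-method variants keep each prompt short at the cost of cross-method visibility.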

Implications and Future Work

This work suggests that while LLMs have advanced in their ability to generate method-level code, the same cannot be assumed for the more sophisticated task of class-level code generation. The findings highlight that the method-level coding abilities of LLMs are not adequate indicators of their capabilities in generating class-level code effectively. Furthermore, the results provide valuable insights into which generation strategies may benefit specific types of LLMs, based on their capability profiles.

The ClassEval benchmark opens avenues for developing more robust LLMs that can handle complex coding tasks involving multiple interdependencies within a class structure. Future research could explore enhancing LLM architectures to better process long contextual inputs and improve understanding and integration of interdependent code structures. A focus on these areas may yield models capable of performing class-level code generation with the same adeptness seen in simpler code-generation tasks. This study shines a light on the importance of benchmark diversification in fully assessing LLM capabilities and steering model advancement in practical application domains.
