
Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach

Published 11 May 2025 in cs.SE (arXiv:2505.06880v1)

Abstract: Code LLMs (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLMs' program synthesis capability has generally relied on manually curated benchmarks. However, there is a substantial gap between real-world scenarios and benchmark settings. Existing benchmarks typically provide only a single input prompt for each synthesis problem. In practice, however, a problem can be described in various ways, including with typos, and developers may struggle to understand certain descriptions and seek clarification to find more suitable wording. Such varied descriptions can lead to variations in the performance of CLLMs on the same question, resulting in a biased evaluation when using existing benchmarks. In this paper, we explore these pitfalls with the goal of revisiting and enhancing future benchmark designs. To simulate real-world variations in problem descriptions, we propose 10 mutation strategies and introduce three new metrics to evaluate their impact on code generation. We then assess five popular CLLMs using 12,834 generated prompt variants and find a significant performance discrepancy between the results from existing benchmarks and those from mutated benchmarks containing perturbations and variations. This finding underscores the need for more robust evaluation methods and benchmarks.

Summary

Evaluation of Code Generation Benchmarks: Exploring Mutation Strategies

The paper "Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach" addresses the limitations of current methods for evaluating Code Large Language Models (CLLMs) in the context of program synthesis. The authors critique the prevalent use of static, manually curated benchmarks that employ a single input prompt for each coding problem, which inadequately reflects the variety of real-world scenarios where problems can be presented in diverse ways. This limitation can produce a gap between the reported performance of CLLMs and their actual performance in practical use.

The authors introduce a novel evaluation framework that employs mutation strategies to simulate real-world variations and perturbations in input prompts. Their approach creates prompt variants through methods such as typo simulation, synonym substitution, paraphrasing, summarization, and example manipulation. Specifically, the paper introduces ten mutation strategies and three new metrics, Correctness Variability (CV), Mutation Bias (MB), and Best Pass@k (Pass@k_b), to gauge how these mutations affect CLLM performance in a more nuanced manner than existing benchmarks like HumanEval.
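To make the mutation idea concrete, the following is a minimal sketch of one such strategy, typo simulation, applied to a prompt. The paper's actual mutation operators are not specified in this summary, so this adjacent-character-swap implementation is an illustrative stand-in, not the authors' code.

```python
import random

def typo_mutation(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Introduce typing-slip typos into a prompt by swapping adjacent letters.

    Illustrative stand-in for a typo-simulation mutation strategy; the
    swap rate and seeding scheme here are assumptions, not the paper's.
    """
    rng = random.Random(seed)  # seeded for reproducible variants
    chars = list(prompt)
    for i in range(len(chars) - 1):
        # Only swap inside words, so spacing and punctuation stay intact.
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Different seeds yield different variants of the same problem description.
variants = [typo_mutation("Return the sum of a list of integers.", rate=0.1, seed=s)
            for s in range(5)]
```

Running the same model on each variant, rather than on a single canonical prompt, is what exposes the sensitivity the paper measures.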

Key findings highlight significant inconsistencies between the performance of CLLMs evaluated with traditional benchmarks and those using mutated input prompts. Notably, even slight modifications in the phrasing of problem descriptions can lead to substantial differences in the performance of various models, with certain CLLMs showing improvement or decline depending on specific types of mutations. For instance, the paper describes how typos in variable names can sometimes enhance, rather than detract from, the performance of models.

The analysis covers five popular CLLMs (DeepSeek, Llama3.1, CodeLlama, CodeGen, and InCoder), 12,834 prompt variants, and 10 mutation strategies, and it yields several insights. A key observation is that higher-performing models are more susceptible to performance fluctuations when problem descriptions are mutated, indicating a reliance on specific input formulations. Conversely, lower-performing models appear more sensitive to changes in function and variable names, suggesting varying degrees of semantic understanding.
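The three metrics are not formally defined in this summary. One plausible reading, assuming CV captures the spread of a problem's pass rates across its prompt variants, MB the average shift relative to the original prompt, and Pass@k_b the best-performing variant, can be sketched as follows; these formulas are assumptions for illustration, not the paper's definitions.

```python
from statistics import mean, pstdev

def correctness_variability(variant_pass_rates: list[float]) -> float:
    # Assumed reading of CV: spread of pass rates across one problem's variants.
    return pstdev(variant_pass_rates)

def mutation_bias(original_pass_rate: float, variant_pass_rates: list[float]) -> float:
    # Assumed reading of MB: average shift induced by mutation, signed so that
    # a positive value means the mutated prompts helped the model.
    return mean(variant_pass_rates) - original_pass_rate

def best_pass_rate(variant_pass_rates: list[float]) -> float:
    # Assumed reading of Pass@k_b: score the problem by its best variant.
    return max(variant_pass_rates)
```

Under this reading, a model scoring well on the original prompt but with high CV and large negative MB is exactly the "reliance on specific formulations" the paper reports for stronger models.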

This study's implications are far-reaching for both the practical applications and theoretical understanding of CLLMs. By revealing the embedded biases in existing evaluation methodologies and highlighting the need for more comprehensive benchmarks that incorporate a wider range of prompt variations, this research advocates for evolving the evaluation landscape of code synthesis tasks. Future developments could incorporate an iterative approach where models are assessed on their ability to synthesize code only after confirming understanding through corrective feedback loops, potentially leading to more robust assessments of model capabilities that better mirror real-world performance scenarios.

This paper is an important contribution to the field, underscoring the necessity for rigorous assessment techniques that account for the variability of natural language and code language ambiguities that CLLMs encounter. As research in artificial intelligence and code synthesis progresses, incorporating these nuanced evaluation methodologies could pave the way for more equitable model comparisons and encourage development efforts across diverse model architectures and training paradigms.
