
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Published 30 Dec 2024 in cs.SE and cs.CL | arXiv:2412.21199v2

Abstract: We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They must solve the base problem and then utilize its solution to address the more complex one. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on self-invoking code generation. Second, from the analysis of experimental results over twenty LLMs on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On self-invoking code generation tasks, instruction-tuned models demonstrate only marginal improvements over the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks and provide a new direction for future research on enhancing LLMs' code reasoning capabilities.

Summary

  • The paper introduces novel benchmarks to assess LLMs on self-invoking code tasks, expanding traditional evaluation methods.
  • It demonstrates significant performance drops, with models like o1-mini falling from 96.2% pass@1 on HumanEval to 76.2% on its self-invoking counterpart, HumanEval Pro.
  • The study identifies key failure modes and suggests future research directions to enhance recursive reasoning and multi-step coding performance.

Evaluating LLMs on Self-Invoking Code Generation

The paper "HumanEval Pro and MBPP Pro: Evaluating LLMs on Self-invoking Code Generation" explores a novel dimension in the evaluation of LLMs: their ability to engage in self-invoking code generation. The authors introduce self-invoking code generation as a task to assess the progressive reasoning and problem-solving capabilities of LLMs, highlighting the intricacies involved in such processes compared to traditional code generation tasks.
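To make the task format concrete, a self-invoking problem pairs a base function with a harder problem whose reference solution calls that function. The sketch below is an illustrative example only (the function names `sort_numbers` and `sort_sublists` are hypothetical, not items from the actual benchmarks):

```python
# Base problem: return the list sorted in ascending order.
def sort_numbers(nums):
    return sorted(nums)

# Self-invoking problem: sort every sub-list of a nested list.
# The intended solution reuses the base solution, so the model
# must both solve the base problem and invoke it correctly.
def sort_sublists(list_of_lists):
    return [sort_numbers(sub) for sub in list_of_lists]
```

The failure the paper probes is precisely this composition step: a model that can write `sort_numbers` in isolation may still mishandle the call to it inside `sort_sublists`.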

Summary of Contributions

The paper contributes to the field through three main avenues:

  1. Introduction of New Benchmarks: The researchers propose benchmarks—HumanEval Pro, MBPP Pro, and BigCodeBench-Lite Pro—that build upon existing datasets by introducing more complex, self-invoking tasks. These benchmarks are carefully curated to rigorously test LLMs' abilities to invoke previously generated functions to solve related, more intricate problems.
  2. Analysis of LLM Performance: Experimental evaluation is conducted on a comprehensive set of over 20 LLMs, revealing a notable discrepancy in performance between traditional code generation tasks and self-invoking tasks. The study underscores the underperformance of models such as o1-mini, which exhibits a stark drop from 96.2% pass rate on HumanEval to 76.2% on HumanEval Pro, demonstrating the challenge of self-invocation.
  3. Identification of Failure Modes: The research identifies distinct failure modes within LLM outputs on these benchmarks, such as assertion errors and undefined references, which frequently hinder successful task completion. The paper suggests that instruction-tuned models offer only marginal improvements over base models, particularly in self-invoking contexts, highlighting a gap for further research.
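The reported scores use the pass@k metric common to HumanEval-style evaluation. As a reference point (this is the standard unbiased estimator from the original HumanEval work, not code from this paper), pass@k over n sampled completions of which c pass the tests can be computed as:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples of a problem contain c correct ones."""
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with n = 10 samples and c = 5 passing, pass@1 is 0.5; a benchmark-level score such as o1-mini's 76.2% is the average of this quantity over all problems.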

Implications for Future Research

This study opens avenues for advancing LLM design and training methodologies. By pinpointing the gap in handling self-invoking code generation, the paper highlights the need for models better equipped to autonomously manage context and apply previously derived solutions to novel problems. Future work could therefore focus on improving the reasoning capabilities intrinsic to LLMs, perhaps through enhanced training regimens or architectural modifications geared specifically toward recursion and multi-step reasoning.

Additionally, the promising but limited gains from instruction-tuned models suggest that alternative approaches might be necessary to achieve substantial improvements in self-invoking tasks. Techniques such as iterative learning with dynamic memory, self-reflection, or leveraging more sophisticated error correction mechanisms could be potential research directions.

Practical Applications

From a practical standpoint, advancements in solving self-invoking tasks could lead to more robust automated software engineering tools, significantly enhancing developers' efficiency by enabling better function synthesis and optimization in complex project environments. Such models could transition from simple auto-completions to sophisticated collaborative coding partners that understand and integrate within the broader coding context. This transition could profoundly impact the workflow in large-scale software development settings, contributing to more efficient, error-resistant code creation and maintenance.

Conclusion

The findings of this paper represent a significant step towards a more nuanced understanding of LLM code generation capabilities, revealing fundamental limitations in current models' reasoning abilities. By focusing on self-invoking tasks, this research highlights critical areas requiring innovation, ensuring that future models are more adept and versatile in handling complexities akin to those encountered in real-world applications.
