
Exploring Data-Efficient Adaptation of Large Language Models for Code Generation

Published 29 Feb 2024 in cs.SE, cs.AI, and cs.CL (arXiv:2403.00046v3)

Abstract: Although LLMs have made significant progress in code generation, they still struggle with code generation tasks in specific scenarios. These scenarios usually necessitate the adaptation of LLMs to fulfill specific needs, but the limited training data available in practice leads to poor code generation performance. Therefore, how to effectively adapt LLMs to new scenarios with limited training data is a major challenge for current code generation. In this paper, we propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation. DEED leverages the errors made by LLMs as learning opportunities, using error revision to overcome their own shortcomings, thus achieving efficient learning. Specifically, DEED involves identifying erroneous code generated by LLMs, employing Self-Revise for code revision, optimizing the model with revised code, and iterating the process for continuous improvement. Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with limited training data, showing an average relative improvement of 46.2% in Pass@1 on multiple code generation benchmarks. We also validate the effectiveness of Self-Revise, which generates revised code that optimizes the model more efficiently than the code samples from datasets. Moreover, DEED consistently demonstrates strong performance across various LLMs, underscoring its applicability.


Summary

  • The paper's main contribution is the DEED approach, which employs error-driven learning to enhance LLM performance with limited training data.
  • The method iteratively collects error code, automatically revises it, and fine-tunes the model, achieving superior results on benchmarks such as MBPP and HumanEval.
  • The approach outperforms traditional fine-tuning techniques, offering a promising trajectory for efficient model adaptation in data-scarce environments.

Exploring Data-Efficient Adaptation of LLMs for Code Generation

Introduction

The field of code generation, which leverages LLMs to translate human requirements expressed in natural language into executable code, has witnessed substantial advancements. Despite these advancements, LLMs often face challenges when dealing with specific scenarios, particularly when training data is limited due to industry constraints or resource scarcity. This limitation leads to suboptimal performance, highlighting the necessity for effective adaptation techniques. The paper "Exploring Data-Efficient Adaptation of LLMs for Code Generation" (2403.00046) introduces a novel adaptation method termed Data-Efficient adaptation with Error-Driven learning (hereafter referred to as DEED) aimed at enhancing LLM performance even with scarce training data.

Methodology

DEED leverages an error-driven learning approach to optimize LLMs through four iterative steps: Error Code Collection, Automatic Code Revision, Model Optimization, and Iterative Adaptation. The process begins by identifying erroneous outputs from LLMs and using these errors as learning opportunities to refine the model, ultimately improving its performance with minimal data. This strategy differs from traditional fine-tuning methods by focusing on revising critical errors rather than exhaustively learning from complete datasets.

Figure 1: An overview of the proposed DEED and its differences from traditional fine-tuning methods.

Error Code Collection

The initial step collects erroneous code generated by LLMs, using rejection sampling guided by test criteria. Model-generated outputs that fail the specified tests are identified as error codes, providing insight into the model's weaknesses.
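The collection step described above can be sketched as a rejection-sampling filter over candidate programs. The sampler below is a stub standing in for real LLM generation (a hypothetical helper); only the filter logic mirrors the step in the text.

```python
def sample_candidates(requirement):
    # Stand-in for LLM sampling; a real system would query the model
    # with the requirement and draw multiple completions.
    return [
        "def add(a, b):\n    return a - b",  # buggy candidate
        "def add(a, b):\n    return a + b",  # correct candidate
    ]

def passes_tests(code, tests):
    """Run a candidate and its unit tests in a fresh namespace."""
    env = {}
    try:
        exec(code, env)
        for test in tests:
            exec(test, env)
        return True
    except Exception:
        return False

def collect_error_codes(requirement, tests):
    """Rejection sampling: keep only candidates that fail the tests.
    These failures become the error codes driving the revision step."""
    return [c for c in sample_candidates(requirement)
            if not passes_tests(c, tests)]

errors = collect_error_codes("Add two numbers.", ["assert add(1, 2) == 3"])
```

Here only the buggy candidate is retained, since the correct one passes the test and is rejected as a learning signal.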

Automatic Code Revision

Automatic Code Revision is central to DEED's approach, wherein erroneous code is revised using a method named Self-Revise. Self-Revise combines various inputs, including the requirement, the error code, test feedback, and correct solutions from the dataset, to generate revised code that overcomes the identified errors.

Figure 2: Illustration of automatic code revision.
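Combining those inputs amounts to assembling a revision prompt. The field names and layout below are illustrative assumptions, not the paper's exact Self-Revise template; only the set of inputs (requirement, error code, test feedback, reference solution) comes from the description above.

```python
def build_revision_prompt(requirement, error_code, feedback, solution):
    """Assemble the revision inputs into a single prompt for the model.
    The section labels and ordering here are assumptions."""
    return (
        f"Requirement:\n{requirement}\n\n"
        f"Faulty code:\n{error_code}\n\n"
        f"Test feedback:\n{feedback}\n\n"
        f"Reference solution:\n{solution}\n\n"
        "Revise the faulty code so that it meets the requirement."
    )

prompt = build_revision_prompt(
    "Return the sum of two numbers.",
    "def add(a, b):\n    return a - b",
    "assert add(1, 2) == 3 failed: got -1",
    "def add(a, b):\n    return a + b",
)
```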

Model Optimization

The revised code is then used to fine-tune the base model, enabling it to focus on learning corrections from critical errors. This iterative adaptation enhances the model's proficiency in specific scenarios with limited data.
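Before fine-tuning, the revised code must be cast as supervised training examples. A minimal sketch, assuming a simple prompt/completion layout (the paper's exact formatting may differ):

```python
def to_training_examples(revised_pairs):
    """Pair each requirement with its revised (now correct) code to
    form supervised fine-tuning examples. The '# Task:' prefix is an
    illustrative convention, not the paper's template."""
    return [
        {"prompt": f"# Task: {req}\n", "completion": code}
        for req, code in revised_pairs
    ]

examples = to_training_examples([
    ("Return the sum of two numbers.", "def add(a, b):\n    return a + b"),
])
```

Training on these pairs concentrates the loss on exactly the corrections the model previously failed to make, which is the source of DEED's data efficiency.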

Iterative Adaptation

The iterative process continues until successive rounds yield diminishing improvements, ensuring a stable model training process by leveraging data from previous iterations.
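The stopping rule can be sketched as an outer loop with a diminishing-returns threshold. Here `evaluate` and `train_round` are injected callables standing in for the real collect-revise-optimize pipeline, and `min_gain` is an illustrative threshold, not a value from the paper.

```python
def adapt_iteratively(evaluate, train_round, min_gain=0.01, max_rounds=10):
    """Repeat adaptation rounds until the score gain falls below
    min_gain, keeping the full score history for inspection."""
    score = evaluate()
    history = [score]
    for _ in range(max_rounds):
        train_round()
        new_score = evaluate()
        history.append(new_score)
        if new_score - score < min_gain:
            break  # improvement has plateaued; stop adapting
        score = new_score
    return history

# Toy run: scores improve, then plateau, so the loop stops early
# instead of exhausting all rounds.
scores = iter([0.30, 0.40, 0.45, 0.452, 0.46])
history = adapt_iteratively(lambda: next(scores), lambda: None)
```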

Evaluation

The paper presents extensive evaluations across multiple public code generation benchmarks, including HumanEval, MBPP, and DS-1000. DEED demonstrates considerable relative improvements across these datasets, outperforming mainstream adaptation methods such as full-parameter and LoRA fine-tuning, as well as prompting techniques.

Figure 3: Performance of direct generation, fine-tuning, and DEED on the MBPP dataset under limited-data conditions. The numbers on the bars indicate the amount of training data used by each method.
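The Pass@1 numbers reported above follow the standard unbiased pass@k estimator from the Codex evaluation protocol (Chen et al., 2021), of which pass@1 is the special case c/n:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k
    samples drawn from n generations (c of them correct) passes all
    tests: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures for any k-subset to be all-wrong
    return 1.0 - comb(n - c, k) / comb(n, k)

score = pass_at_k(10, 3, 1)  # pass@1 with 3 of 10 samples correct: 0.3
```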

Implications and Future Work

DEED's approach of using error revision offers a promising avenue for efficiently adapting LLMs to new scenarios without the need for large datasets. The results indicate potential applications in industries where training samples are scarce. Additionally, incorporating LLMs' revisions into training processes could redefine adaptation techniques across various domains.

Future work could explore DEED's applicability to project-level code generation with domain-specific evaluations, further enhancing its robustness and expanding its utility in real-world applications. Improved methods for test case generation to support the error-driven learning process could also strengthen model adaptation.

Figure 4: Performance analysis with varying sizes of training data on the MBPP dataset.

Conclusion

The DEED approach significantly refines code generation performance under limited data conditions, offering an effective and efficient trajectory for model adaptation. By harnessing the potential of error-driven learning, DEED sets a precedent for future explorations in enhancing LLM efficacy, particularly in data-constrained environments. Its ability to consistently improve performance across different LLMs underscores the versatility and applicability of this method in advancing artificial intelligence capabilities.
