Prompt Engineering or Fine-Tuning: An Empirical Assessment of LLMs for Code

Published 11 Oct 2023 in cs.SE | (2310.10508v2)

Abstract: The rapid advancements in LLMs have greatly expanded the potential for automated code-related tasks. Two primary methodologies are used in this domain: prompt engineering and fine-tuning. Prompt engineering involves applying different strategies to query LLMs, like ChatGPT, while fine-tuning further adapts pre-trained models, such as CodeBERT, by training them on task-specific data. Despite the growth in the area, there remains a lack of comprehensive comparative analysis between the approaches for code models. In this paper, we evaluate GPT-4 using three prompt engineering strategies -- basic prompting, in-context learning, and task-specific prompting -- and compare it against 17 fine-tuned models across three code-related tasks: code summarization, generation, and translation. Our results indicate that GPT-4 with prompt engineering does not consistently outperform fine-tuned models. For instance, in code generation, GPT-4 is outperformed by fine-tuned models by 28.3% points on the MBPP dataset. It also shows mixed results for code translation tasks. Additionally, a user study was conducted involving 27 graduate students and 10 industry practitioners. The study revealed that GPT-4 with conversational prompts, incorporating human feedback during interaction, significantly improved performance compared to automated prompting. Participants often provided explicit instructions or added context during these interactions. These findings suggest that GPT-4 with conversational prompting holds significant promise for automated code-related tasks, whereas fully automated prompt engineering without human involvement still requires further investigation.

Citations (25)

Summary

  • The paper demonstrates that GPT-4 prompt engineering can outperform fine-tuned models in code summarization, achieving up to an 8.33% BLEU improvement over the top-ranked fine-tuned models in some cases.
  • Conversational prompting with human feedback enhances GPT-4 outputs, improving code comment generation BLEU scores by approximately 15.8% compared to automated methods.
  • The study highlights trade-offs between prompt engineering and fine-tuning, urging developers to balance performance benefits against setup costs and computational demands.

Overview

This paper presents a comprehensive empirical assessment of GPT-4 against fine-tuned models on automated software engineering (ASE) tasks: code generation, summarization, and translation. Through both quantitative and qualitative analyses, the paper evaluates the efficacy of three prompt engineering techniques (basic, in-context, and task-specific prompting) against 17 fine-tuned LLMs.

Figure 1: Overall workflow of our quantitative and qualitative studies.
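The three automated prompting styles can be illustrated with hypothetical templates for a code summarization task. These are sketches of each style, not the paper's actual prompts:

```python
# Hypothetical examples of the three prompting strategies compared in the
# paper, applied to a code summarization task. The authors' exact prompt
# wording is not reproduced here.

CODE = "def add(a, b):\n    return a + b"

# Basic prompting: a single direct instruction.
basic = f"Summarize the following code:\n{CODE}"

# In-context learning: prepend one or more input/output demonstrations.
in_context = (
    "Code: def sq(x): return x * x\n"
    "Summary: Returns the square of x.\n\n"
    f"Code: {CODE}\nSummary:"
)

# Task-specific prompting: instructions tailored to the task and output format.
task_specific = (
    "You are a documentation assistant. Write a one-sentence docstring-style "
    f"summary of what this Python function computes:\n{CODE}"
)

for name, prompt in [("basic", basic), ("in-context", in_context),
                     ("task-specific", task_specific)]:
    print(f"--- {name} ---\n{prompt}\n")
```

The key distinction is how much guidance accompanies the query: basic prompting supplies only the instruction, in-context learning adds demonstrations, and task-specific prompting adds role and format constraints.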

Quantitative Analysis

Quantitative comparisons were performed on standard benchmarks, including the CodeXGLUE, HumanEval, and MBPP datasets. Results indicated that basic prompting with GPT-4 improved performance in specific scenarios such as code summarization, outperforming the top-ranked fine-tuned models by 8.33% in BLEU score. However, fine-tuned models clearly excelled over GPT-4 in other tasks, such as code translation from Java to C#, where the best fine-tuned models achieved a 29.69% higher BLEU score on average than GPT-4 with basic prompting.
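BLEU, the metric behind these comparisons, scores a candidate summary or translation by its n-gram overlap with a reference. A minimal, self-contained sketch of sentence-level BLEU with add-one smoothing (production evaluations typically use a corpus-level implementation such as sacreBLEU):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of smoothed
    1..max_n-gram precisions, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        # add-one smoothing so one missing n-gram order doesn't zero the score
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    # brevity penalty discourages overly short candidates
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec)

print(bleu("returns the sum of a and b", "returns the sum of a and b"))  # 1.0
```

Percentage differences like the 8.33% and 29.69% figures above are differences between such scores averaged over a benchmark's test set.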

Qualitative Analysis

The study incorporated feedback from 37 participants (27 graduate students and 10 industry practitioners) to evaluate GPT-4's responses under conversational prompting. The analysis revealed that conversational interaction notably improves GPT-4's output across ASE tasks. For instance, in comment generation, conversational prompts improved BLEU scores by approximately 15.8% over automated task-specific prompts. This underscores the potential of human-in-the-loop methodologies to refine LLM outputs in code-related tasks.
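The conversational-prompting workflow from the user study reduces to a simple loop: the model's first answer is iteratively refined with explicit human feedback. A minimal sketch, in which `query_llm` is a hypothetical stand-in for a real chat-model API (stubbed here so the control flow runs standalone):

```python
# Sketch of the human-in-the-loop refinement described in the user study.
# `query_llm` is a hypothetical placeholder; a real system would call a
# chat-model API with the accumulated message history.

def query_llm(messages):
    # Stub: returns a canned reply reflecting the conversation length.
    return f"[reply to {len(messages)} message(s)]"

def conversational_session(task_prompt, feedback_rounds):
    messages = [{"role": "user", "content": task_prompt}]
    reply = query_llm(messages)
    for feedback in feedback_rounds:  # each round adds human feedback
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": feedback})
        reply = query_llm(messages)
    return reply

result = conversational_session(
    "Generate a Python function that parses ISO dates.",
    ["Add type hints.", "Also handle timezone offsets."],
)
print(result)
```

Participants' feedback in the study took exactly these forms: explicit instructions ("Add type hints.") or added context, which the automated prompting strategies could not supply.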

Findings and Recommendations

The paper provides several insights and practical recommendations:

  1. Prompt Strategy Efficacy: Task-specific engineering prompts outperformed basic and in-context prompts for GPT-4 in numerous instances, especially in niche areas of code summarization and translation.
  2. Conversational Prompting: Human feedback remains paramount to optimizing LLM outputs for ASE tasks, indicating that conversational models may eventually serve as a standard for maximizing output quality.
  3. Trade-offs: While prompt engineering offers flexibility and lower setup costs, fine-tuned models generally deliver superior performance in computationally heavy tasks. Developers and researchers must weigh these trade-offs based on accuracy, cost, ease of use, and control.

Implications and Future Work

The paper substantiates that while GPT-4 holds substantial promise for enhancing ASE, prompt engineering techniques require further refinement. Automated prompt engineering leveraging reinforcement learning and systematic pattern extraction from developer feedback could potentially bridge current gaps. Future research will also need to verify these results across wider datasets and in more diverse real-world ASE scenarios.

Conclusion

Overall, the study underscores that though conversational and task-specific prompts with LLMs like GPT-4 can surpass traditional fine-tuned approaches in specific ASE tasks, substantial gaps in fully automated LLM performance remain. The paper’s findings suggest a hybrid future where human interaction loops and advanced prompt engineering form core components of ASE workflows.
