
CodeJudge: Evaluating Code Generation with Large Language Models

Published 3 Oct 2024 in cs.LG, cs.CL, and cs.SE | (2410.02184v1)

Abstract: LLMs have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing "slow thinking" to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub https://github.com/VichyTong/CodeJudge.

Summary

  • The paper presents a framework that evaluates generated code without traditional test cases by directly assessing its semantic correctness.
  • It introduces "slow thinking" techniques that guide LLMs toward thorough, step-by-step reasoning, improving the reliability of the evaluation.
  • The framework was validated across four LLM evaluators, four datasets, and five programming languages, outperforming methods built on larger models.

The paper "CodeJudge: Evaluating Code Generation with LLMs" addresses the challenge of evaluating code generated by LLMs, a problem that has become increasingly critical as LLMs are adopted for programming tasks. Traditional evaluation relies on test cases, which may be unavailable or incomplete and thus fail to fully capture the semantic correctness of generated code. This paper introduces CodeJudge, a framework that improves the evaluation process by leveraging LLMs themselves as evaluators.

Key Contributions:

  • Code Evaluation without Test Cases: CodeJudge offers a method to evaluate code without relying on pre-existing or manually crafted test cases. Instead, it uses LLMs to assess the semantic correctness of the code, which allows for more flexible and potentially more accurate evaluations.
  • "Slow Thinking" Strategies: The paper investigates techniques to enhance the reasoning capabilities of LLMs during the evaluation process, termed "slow thinking." These techniques are designed to guide LLMs to perform thorough and thoughtful analyses, improving the reliability of their evaluations.
  • Diverse Evaluation Settings: The authors experimented with four different LLMs as evaluators. They tested the framework across four code generation datasets and five programming languages, demonstrating the versatility and robustness of CodeJudge in various programming scenarios.
  • Comparison with Existing Methods: CodeJudge was compared against state-of-the-art methods, including a GPT-3.5-based evaluator. Notably, CodeJudge outperformed these existing methods even when using a much smaller model, Llama-3-8B-Instruct, highlighting the efficiency and effectiveness of the proposed framework.
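The contributions above can be illustrated with a minimal sketch of a two-step LLM-as-judge loop: first elicit a detailed analysis ("slow thinking"), then ask for a verdict grounded in that analysis. The prompt wording, function names, and the stubbed model below are illustrative assumptions, not the authors' actual implementation; a real deployment would route `llm` to a model such as Llama-3-8B-Instruct.

```python
def build_analysis_prompt(task: str, code: str) -> str:
    # Step 1 ("slow thinking"): ask the model to reason about the code in depth
    # before committing to any judgment. Wording is an assumption.
    return (
        "You are a careful code reviewer. Analyze the code step by step and "
        "list any semantic mismatches with the task requirements.\n"
        f"Task:\n{task}\n\nCode:\n{code}\n"
    )

def build_verdict_prompt(analysis: str) -> str:
    # Step 2: a binary verdict conditioned on the prior analysis.
    return (
        "Based on the analysis below, answer with a single word, "
        "CORRECT or INCORRECT.\n\n" + analysis
    )

def judge_code(task: str, code: str, llm) -> bool:
    """Evaluate semantic correctness without test cases, using any callable
    `llm(prompt) -> str` as the judge. Returns True if judged correct."""
    analysis = llm(build_analysis_prompt(task, code))
    verdict = llm(build_verdict_prompt(analysis)).strip().upper()
    # Check INCORRECT first, since "CORRECT" is a substring of "INCORRECT".
    return "INCORRECT" not in verdict and "CORRECT" in verdict

# Stub judge for demonstration only; it flags a hard-coded bug.
def stub_llm(prompt: str) -> str:
    if "Analyze" in prompt:
        return "The code returns a - b, but the task asks for a sum."
    return "INCORRECT"

result = judge_code(
    "Return the sum of two integers.",
    "def add(a, b): return a - b",
    stub_llm,
)
print(result)  # False: the stub judge flags the code as incorrect
```

Passing the model as a plain callable keeps the evaluation logic independent of any particular inference API, which mirrors the paper's finding that the framework works across different evaluator LLMs.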

Practical Implications:

The results indicate that CodeJudge has the potential to significantly advance code evaluation methods, offering more refined and accurate assessments without the heavy reliance on test cases. The framework could be particularly useful in educational settings, automated code review systems, and other applications where assessing code semantic correctness is crucial.

The availability of the code and datasets on GitHub provides an opportunity for further research and development, allowing others to explore and build on this innovative approach.
