
Revisit Self-Debugging with Self-Generated Tests for Code Generation

Published 22 Jan 2025 in cs.SE and cs.AI (arXiv:2501.12793v1)

Abstract: LLMs have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of code generation by leveraging execution feedback from tests. Despite its promise, the availability of high-quality tests in real-world scenarios is limited. In this context, self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential. Therefore, we investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests. On the other hand, in-execution self-debugging enables LLMs to mitigate the bias by solely leveraging intermediate states during execution, thereby enhancing code generation.

Summary

  • The paper introduces and formalizes two paradigms for self-debugging in code generation: post-execution, which uses test outputs and errors, and in-execution, which leverages intermediate runtime states.
  • Experiments show post-execution debugging struggles with self-generated test bias, while in-execution debugging mitigates this bias and improves performance on both basic and complex programming tasks.
  • Generating accurate self-tests, particularly outputs, remains challenging, and the efficacy of in-execution debugging depends heavily on the language model's code reasoning capabilities.

The paper "Revisit Self-Debugging with Self-Generated Tests for Code Generation" explores the efficacy of self-debugging techniques in LLMs for code generation, focusing on scenarios where high-quality tests are unavailable. The study investigates self-debugging with self-generated tests across diverse programming problems, introducing post-execution and in-execution self-debugging paradigms.

The key contributions and findings are:

  • The paper introduces and formalizes two distinct paradigms for self-debugging:
    • Post-execution self-debugging: This approach validates code correctness by comparing execution outputs with expected outputs. It uses the failed test case, the execution output, and any error messages to refine the program, generating a revised version (a minimal code sketch is given after this list):

      $\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$

      where:

      • $\Tilde{C}$ is the revised program
      • $\mathrm{M}$ is the LLM
      • $C$ is the initial program
      • $X_i$ is the input for the $i$-th test
      • $Y_i$ is the expected output for the $i$-th test
      • $\Tilde{Y}_i$ is the execution output for the $i$-th test
    • In-execution self-debugging: This approach leverages intermediate runtime states during program execution, dividing the program into basic blocks $C = [B^1, B^2, ..., B^K]$, where $B^k$ is the $k$-th basic block and $K$ is the total number of blocks. For each test input $X_i$, the executor $\mathrm{E}$ updates the variable set iteratively: $V_i^{k+1} = \mathrm{E}(B^k, V_i^k)$, where $V_i^k$ denotes the set of variables after executing block $B^k$. The sequence of intermediate states, represented as the execution trace $T = [B^1, V_i^1, ..., B^K, V_i^K]$, provides insights for the LLM $\mathrm{M}$ to refine the program (a corresponding sketch follows this list):

      $\Tilde{C}=\mathrm{M}(C, X_i, T)$

      where:

      • $\Tilde{C}$ is the updated program
      • $\mathrm{M}$ is the LLM
      • $C$ is the initial program
      • $X_i$ is the test input
      • $T$ is the execution trace
  • Experimental results on self-contained Python programming tasks from HumanEval, MBPP, and LiveCodeBench, using GPT-4o (2024-05-13), Claude-3.5-Sonnet, Llama-3-70B-Instruct, and Qwen2.5-Coder-7B-Instruct, reveal the following:
    • Post-execution self-debugging struggles on the basic problems in HumanEval and MBPP, but shows potential for improvement on the more complex problems in LiveCodeBench.
    • The discrepancy is attributed to the bias introduced by self-generated tests, which refers to the misalignment between self-testing labels and true labels. The efficacy of post-execution self-debugging relies on the model's ability to reflect on feedback and recognize faulty feedback.
    • In-execution self-debugging minimizes bias by focusing solely on intermediate states during execution, leading to improvements on both basic and competitive tasks.
  • The paper analyzes the accuracy of self-generated tests, noting that predicting test outputs is more challenging than generating test inputs. For instance, GPT-4o achieves 97.63% input accuracy and 89.77% output accuracy on HumanEval, with an overall test suite accuracy of 59.15%. Similar trends are observed for other models and benchmarks.
  • The study investigates label changes after the first iteration of self-debugging, observing that self-testing on HumanEval and MBPP is more likely to result in false negative labels, while LiveCodeBench shows more true negative labels due to its challenging problems.
  • The paper finds that post-execution self-debugging using label feedback leads to improvements across all difficulty levels on LiveCodeBench for GPT-4o. However, including detailed feedback can decrease performance on easier problems.
  • In-execution self-debugging proves potentially effective by leveraging runtime execution information on both basic and competitive programming problems. It mitigates the bias introduced by self-generated tests but depends heavily on the LLM's code reasoning capabilities.
  • The paper discusses future research directions, including enhancing the quality of LLM-generated tests, implementing iterative refinement processes, and designing sophisticated methods for collecting and analyzing runtime information.
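
The post-execution paradigm can be made concrete with a short sketch. The harness below runs a candidate stdin/stdout program against one self-generated test and, on failure, feeds the failed test, the execution output, and the error message back to a model, mirroring $\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$. The `llm` callable, the subprocess-based execution, and the prompt wording are illustrative assumptions, not the paper's actual setup.

```python
import subprocess
import sys

def run_candidate(code: str, test_input: str, timeout: float = 5.0):
    """Run the candidate program in a subprocess, feeding the test input on stdin.

    Returns (stdout, stderr) so the caller can compare against the expected output.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "", f"timed out after {timeout}s"
    return proc.stdout.strip(), proc.stderr.strip()

def post_execution_debug(llm, code: str, test_input: str, expected_output: str):
    """One round of post-execution self-debugging: C~ = M(C, X_i, Y_i, Y~_i).

    `llm` is a placeholder for any callable mapping a prompt string to a revised
    program; the prompt text below is illustrative rather than the paper's.
    """
    actual_output, error = run_candidate(code, test_input)
    if actual_output == expected_output and not error:
        return code  # the self-generated test passes; nothing to repair

    # Feed the failed test, the execution output, and any error message back to the model.
    prompt = (
        "The following program fails a test case. Revise it.\n\n"
        f"Program:\n{code}\n\n"
        f"Test input:\n{test_input}\n\n"
        f"Expected output (self-generated, so possibly wrong):\n{expected_output}\n\n"
        f"Actual output:\n{actual_output}\n\n"
        f"Error message:\n{error or 'none'}\n\n"
        "Return only the corrected program."
    )
    return llm(prompt)
```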
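
For the in-execution paradigm, a minimal sketch is given below. It approximates the block-level trace $T = [B^1, V_i^1, ..., B^K, V_i^K]$ with line-level variable snapshots collected via Python's `sys.settrace`, then prompts a model with only the program, the test call, and the trace, mirroring $\Tilde{C} = \mathrm{M}(C, X_i, T)$; no self-generated expected output is involved. The `llm` callable, the `test_call` argument, and the prompt wording are again assumptions for illustration, and a faithful implementation would partition the program into basic blocks as the paper describes.

```python
import sys

def collect_trace(code: str, max_steps: int = 200):
    """Execute `code` under a tracing hook, recording (statement, variables) pairs.

    Line-level approximation of the block-level execution trace described above.
    """
    source_lines = code.splitlines()
    trace = []

    def tracer(frame, event, arg):
        # Only trace frames that belong to the candidate program itself.
        if frame.f_code.co_filename != "<candidate>":
            return None
        if event == "line" and len(trace) < max_steps:
            lineno = frame.f_lineno
            stmt = source_lines[lineno - 1].strip() if 0 < lineno <= len(source_lines) else ""
            snapshot = {k: repr(v) for k, v in frame.f_locals.items() if not k.startswith("__")}
            trace.append((stmt, snapshot))
        return tracer

    compiled = compile(code, "<candidate>", "exec")
    sys.settrace(tracer)
    try:
        exec(compiled, {})
    except Exception as exc:
        trace.append(("<exception>", {"error": repr(exc)}))
    finally:
        sys.settrace(None)
    return trace

def in_execution_debug(llm, code: str, test_call: str):
    """One round of in-execution self-debugging: C~ = M(C, X_i, T).

    `test_call` invokes the program on a self-generated test input, e.g.
    "solve([3, 1, 2])". Only intermediate states are shown to the model.
    """
    trace = collect_trace(code + "\n" + test_call)
    rendered = "\n".join(f"{stmt}  ->  {variables}" for stmt, variables in trace)
    prompt = (
        "Below are a program, a test call, and the variable states observed while "
        "executing it. Decide whether the execution matches the intended behaviour "
        "and return a corrected program if it does not.\n\n"
        f"Program:\n{code}\n\n"
        f"Test call:\n{test_call}\n\n"
        f"Execution trace (statement -> variables):\n{rendered}"
    )
    return llm(prompt)
```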
