
Revisit Self-Debugging with Self-Generated Tests for Code Generation

Published 22 Jan 2025 in cs.SE and cs.AI (arXiv:2501.12793v1)

Abstract: LLMs have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of code generation by leveraging execution feedback from tests. Despite its promise, the availability of high-quality tests in real-world scenarios is limited. In this context, self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential. Therefore, we investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests. On the other hand, in-execution self-debugging enables LLMs to mitigate the bias by solely leveraging intermediate states during execution, thereby enhancing code generation.

Summary

  • The paper introduces and formalizes two paradigms for self-debugging in code generation: post-execution, which uses test outputs and errors, and in-execution, which leverages intermediate runtime states.
  • Experiments show post-execution debugging struggles with self-generated test bias, while in-execution debugging mitigates this bias and improves performance on both basic and complex programming tasks.
  • Generating accurate self-tests, particularly outputs, remains challenging, and the efficacy of in-execution debugging depends heavily on the language model's code reasoning capabilities.

The paper "Revisit Self-Debugging with Self-Generated Tests for Code Generation" explores the efficacy of self-debugging techniques in LLMs for code generation, focusing on scenarios where high-quality tests are unavailable. The study investigates self-debugging with self-generated tests across diverse programming problems, introducing post-execution and in-execution self-debugging paradigms.

The key contributions and findings are:

  • The paper introduces and formalizes two distinct paradigms for self-debugging:
    • Post-execution self-debugging: This approach validates code correctness by comparing execution outputs with expected outputs. It uses the failed test case, the execution output, and any error messages to refine the program, generating a revised version (a minimal code sketch is given after this list):

      $\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$

      where:

      • $\Tilde{C}$ is the revised program
      • $\mathrm{M}$ is the LLM
      • $C$ is the initial program
      • $X_i$ is the input for the $i$-th test
      • $Y_i$ is the expected output for the $i$-th test
      • $\Tilde{Y}_i$ is the execution output for the $i$-th test
    • In-execution self-debugging: This approach leverages intermediate runtime states during program execution, dividing the program into basic blocks $C = [B^1, B^2, ..., B^K]$, where $B^k$ is the $k$-th basic block and $K$ is the total number of blocks. For each test input $X_i$, the executor $\mathrm{E}$ updates the variable set iteratively: $V_i^{k+1} = \mathrm{E}(B^k, V_i^k)$, where $V_i^k$ denotes the set of variables after executing block $B^k$. The sequence of intermediate states, represented as the execution trace $T = [B^1, V_i^1, ..., B^K, V_i^K]$, provides insights for the LLM $\mathrm{M}$ to refine the program (a corresponding sketch follows this list):

      $\Tilde{C}=\mathrm{M}(C, X_i, T)$

      where:

      • $\Tilde{C}$ is the updated program
      • $\mathrm{M}$ is the LLM
      • $C$ is the initial program
      • $X_i$ is the test input
      • $T$ is the execution trace
  • Experimental results on self-contained Python programming tasks from HumanEval, MBPP, and LiveCodeBench, using GPT-4o (2024-05-13), Claude-3.5-Sonnet, Llama-3-70B-Instruct, and Qwen2.5-Coder-7B-Instruct, reveal the following:
    • Post-execution self-debugging struggles on the basic problems in HumanEval and MBPP, but shows potential for improvement on the more complex problems in LiveCodeBench.
    • The discrepancy is attributed to the bias introduced by self-generated tests, which refers to the misalignment between self-testing labels and true labels. The efficacy of post-execution self-debugging relies on the model's ability to reflect on feedback and recognize faulty feedback.
    • In-execution self-debugging minimizes bias by focusing solely on intermediate states during execution, leading to improvements on both basic and competitive tasks.
  • The paper analyzes the accuracy of self-generated tests, noting that predicting test outputs is more challenging than generating test inputs. For instance, GPT-4o achieves 97.63% input accuracy and 89.77% output accuracy on HumanEval, with an overall test suite accuracy of 59.15%. Similar trends are observed for other models and benchmarks.
  • The study investigates label changes after the first iteration of self-debugging, observing that self-testing on HumanEval and MBPP is more likely to result in false negative labels, while LiveCodeBench shows more true negative labels due to its challenging problems.
  • The paper finds that post-execution self-debugging using label feedback leads to improvements across all difficulty levels on LiveCodeBench for GPT-4o. However, including detailed feedback can decrease performance on easier problems.
  • In-execution self-debugging proves potentially effective by leveraging runtime execution information on both basic and competitive programming problems. It mitigates the bias introduced by self-generated tests but depends heavily on the LLM's code reasoning capabilities.
  • The paper discusses future research directions, including enhancing the quality of LLM-generated tests, implementing iterative refinement processes, and designing sophisticated methods for collecting and analyzing runtime information.
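
The post-execution paradigm can be made concrete with a short sketch. The harness below runs a candidate stdin/stdout program against one self-generated test and, on failure, feeds the failed test, the execution output, and the error message back to a model, mirroring $\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$. The `llm` callable, the subprocess-based execution, and the prompt wording are illustrative assumptions, not the paper's actual setup.

```python
import subprocess
import sys

def run_candidate(code: str, test_input: str, timeout: float = 5.0):
    """Run the candidate program in a subprocess, feeding the test input on stdin.

    Returns (stdout, stderr) so the caller can compare against the expected output.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "", f"timed out after {timeout}s"
    return proc.stdout.strip(), proc.stderr.strip()

def post_execution_debug(llm, code: str, test_input: str, expected_output: str):
    """One round of post-execution self-debugging: C~ = M(C, X_i, Y_i, Y~_i).

    `llm` is a placeholder for any callable mapping a prompt string to a revised
    program; the prompt text below is illustrative rather than the paper's.
    """
    actual_output, error = run_candidate(code, test_input)
    if actual_output == expected_output and not error:
        return code  # the self-generated test passes; nothing to repair

    # Feed the failed test, the execution output, and any error message back to the model.
    prompt = (
        "The following program fails a test case. Revise it.\n\n"
        f"Program:\n{code}\n\n"
        f"Test input:\n{test_input}\n\n"
        f"Expected output (self-generated, so possibly wrong):\n{expected_output}\n\n"
        f"Actual output:\n{actual_output}\n\n"
        f"Error message:\n{error or 'none'}\n\n"
        "Return only the corrected program."
    )
    return llm(prompt)
```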
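
For the in-execution paradigm, a minimal sketch is given below. It approximates the block-level trace $T = [B^1, V_i^1, ..., B^K, V_i^K]$ with line-level variable snapshots collected via Python's `sys.settrace`, then prompts a model with only the program, the test call, and the trace, mirroring $\Tilde{C} = \mathrm{M}(C, X_i, T)$; no self-generated expected output is involved. The `llm` callable, the `test_call` argument, and the prompt wording are again assumptions for illustration, and a faithful implementation would partition the program into basic blocks as the paper describes.

```python
import sys

def collect_trace(code: str, max_steps: int = 200):
    """Execute `code` under a tracing hook, recording (statement, variables) pairs.

    Line-level approximation of the block-level execution trace described above.
    """
    source_lines = code.splitlines()
    trace = []

    def tracer(frame, event, arg):
        # Only trace frames that belong to the candidate program itself.
        if frame.f_code.co_filename != "<candidate>":
            return None
        if event == "line" and len(trace) < max_steps:
            lineno = frame.f_lineno
            stmt = source_lines[lineno - 1].strip() if 0 < lineno <= len(source_lines) else ""
            snapshot = {k: repr(v) for k, v in frame.f_locals.items() if not k.startswith("__")}
            trace.append((stmt, snapshot))
        return tracer

    compiled = compile(code, "<candidate>", "exec")
    sys.settrace(tracer)
    try:
        exec(compiled, {})
    except Exception as exc:
        trace.append(("<exception>", {"error": repr(exc)}))
    finally:
        sys.settrace(None)
    return trace

def in_execution_debug(llm, code: str, test_call: str):
    """One round of in-execution self-debugging: C~ = M(C, X_i, T).

    `test_call` invokes the program on a self-generated test input, e.g.
    "solve([3, 1, 2])". Only intermediate states are shown to the model.
    """
    trace = collect_trace(code + "\n" + test_call)
    rendered = "\n".join(f"{stmt}  ->  {variables}" for stmt, variables in trace)
    prompt = (
        "Below are a program, a test call, and the variable states observed while "
        "executing it. Decide whether the execution matches the intended behaviour "
        "and return a corrected program if it does not.\n\n"
        f"Program:\n{code}\n\n"
        f"Test call:\n{test_call}\n\n"
        f"Execution trace (statement -> variables):\n{rendered}"
    )
    return llm(prompt)
```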
