- The paper demonstrates that uncertainty estimation can serve as a reliable proxy for the correctness of LLM-generated code.
- It adapts entropy and mutual information methods with semantic clustering and symbolic execution to capture nuanced code behavior.
- Experimental results on LiveCodeBench show a strong negative correlation between uncertainty estimates and correctness, outperforming baseline methods.
Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation
Introduction
The paper "Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation" (arXiv 2502.11620) investigates whether uncertainty estimation can be used to assess the correctness of code generated by LLMs. The work extends existing methodologies from natural language generation (NLG) to code, where unique challenges arise from the distinct properties of programming languages.
Methodology
The authors adapt two state-of-the-art uncertainty estimation techniques, one based on entropy and another on mutual information, to the domain of code generation. These adaptations involve modifications that cater to the semantic nuances of code, such as implementing a semantic equivalence check using symbolic execution.
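The semantic equivalence check can be illustrated with a simplified stand-in. The paper uses symbolic execution; the sketch below instead treats two candidate functions as behaviorally equivalent when they agree on a set of probe inputs, and greedily groups candidates into equivalence classes. The probe-based check and all function names here are assumptions for illustration, not the paper's implementation:

```python
def behaviorally_equivalent(fn_a, fn_b, probe_inputs):
    """Treat two candidate functions as equivalent if they agree on every probe.
    A lightweight stand-in for a symbolic-execution equivalence check."""
    for x in probe_inputs:
        try:
            if fn_a(x) != fn_b(x):
                return False
        except Exception:
            # Any runtime failure is conservatively treated as disagreement.
            return False
    return True

def cluster_by_behavior(candidates, probe_inputs):
    """Greedily group candidate functions into behavioral equivalence classes,
    comparing each new candidate against one representative per cluster."""
    clusters = []
    for fn in candidates:
        for cluster in clusters:
            if behaviorally_equivalent(fn, cluster[0], probe_inputs):
                cluster.append(fn)
                break
        else:
            clusters.append([fn])
    return clusters
```

For example, `lambda x: x * 2` and `lambda x: x + x` land in one cluster while `lambda x: x ** 2` forms its own, even though the first two are syntactically different.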
Entropy-Based Technique
In the entropy-based approach, the uncertainty of LLM-generated code is estimated by calculating the entropy of semantic clusters of code snippets. Semantic clustering is achieved through symbolic execution, which groups snippets based on equivalence in behavior rather than syntax. The paper introduces a simplified version of this technique that assumes a uniform probability distribution over responses, reducing reliance on LLM-reported log probabilities while maintaining effectiveness.
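Under the simplified uniform assumption, the entropy depends only on the sizes of the semantic clusters: each sampled snippet counts equally, so a cluster's probability is its share of the total samples. A minimal sketch (the function name is mine):

```python
import math

def semantic_entropy(cluster_sizes):
    """Entropy over semantic clusters under a uniform distribution across
    samples: a cluster containing k of n samples has probability k / n."""
    total = sum(cluster_sizes)
    probs = [size / total for size in cluster_sizes]
    return -sum(p * math.log(p) for p in probs)
```

If all samples collapse into one cluster, the entropy is 0 (the model is semantically consistent); it is maximal when every sample lands in its own cluster.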
Mutual Information-Based Technique
The mutual information-based method differentiates between epistemic and aleatoric uncertainty by iteratively prompting the model and clustering its responses into semantic equivalence classes. By assessing dependencies between successive responses, it isolates the epistemic component, which reflects the model's own lack of knowledge and is therefore the more informative signal for gauging confidence and correctness.
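The dependence between rounds can be quantified with a plug-in mutual information estimate over the semantic classes of paired responses. The sketch below is an illustrative estimator of that underlying quantity, not the paper's exact formulation; the pair-based input format is an assumption:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of mutual information between the semantic class of a
    first-round response and that of a follow-up response, from observed
    (first, second) class pairs."""
    n = len(pairs)
    joint = Counter(pairs)                      # joint counts over (x, y)
    count_x = Counter(x for x, _ in pairs)      # marginal counts of x
    count_y = Counter(y for _, y in pairs)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi
```

Independent rounds yield an estimate near zero, while responses whose semantic class is fully determined by the first round yield the entropy of that class distribution.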
Experimental Evaluation
The proposed techniques were evaluated on the LiveCodeBench dataset, chosen specifically for its contamination-free nature, which ensures the problems were not included in the training data of contemporary LLMs. Results demonstrated that both modified techniques outperform a baseline that relies solely on the LLMs' log probabilities. The adapted methods also showed a consistent negative correlation between uncertainty and correctness scores, reinforcing the validity of uncertainty as a proxy for correctness.
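Checking such a negative correlation amounts to correlating per-problem uncertainty scores with correctness scores across the benchmark. A minimal sketch using Pearson correlation (the choice of coefficient here is illustrative; the paper's concrete metric and any numbers in the usage example are not taken from it):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists,
    e.g. per-problem uncertainty estimates vs. correctness scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

A value near -1 would indicate that higher uncertainty reliably accompanies lower correctness, which is the pattern the evaluation looks for.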
Implications and Future Work
The findings suggest that uncertainty estimation methods tailored for code generation are more reliable than those initially designed for NLG. This advancement offers a systematic way to improve the quality of LLM-generated code: outputs whose estimated uncertainty exceeds a predefined threshold can be rejected, reducing the number of incorrect outputs that reach the user. Future work could explore further refinements to these techniques, possibly integrating other forms of semantic analysis or extending them to more complex or ambiguous coding tasks.
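Such a rejection policy is simple to express once an uncertainty score is available. A minimal sketch, assuming a scoring function and a threshold tuned on held-out data (both hypothetical here):

```python
def filter_by_uncertainty(candidates, uncertainty_fn, threshold):
    """Accept generations whose estimated uncertainty is at or below the
    threshold; reject the rest. The threshold is application-specific and
    would be tuned on held-out data."""
    accepted, rejected = [], []
    for code in candidates:
        if uncertainty_fn(code) <= threshold:
            accepted.append(code)
        else:
            rejected.append(code)
    return accepted, rejected
```

Raising the threshold trades fewer abstentions for more incorrect outputs slipping through, so the operating point depends on how costly a wrong program is in the target application.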
Conclusion
This paper pioneers the adaptation of uncertainty estimation techniques for code generation, demonstrating that these methods can serve as effective proxies for correctness without the need for external oracles. The promising results pave the way for further exploration and optimization in leveraging uncertainty estimation for improving the reliability and safety of LLM-generated code.