Papers
Topics
Authors
Recent
Search
2000 character limit reached

Disproving Program Equivalence with LLMs

Published 5 Feb 2025 in cs.SE and cs.LG | (2502.18473v1)

Abstract: To evaluate LLMs for code, research has used manually created unit test-based benchmarks. However, these tests are often inadequate, missing corner cases and other implementation-specific oddities. This work introduces ProbeGen, a whitebox method that takes two or more executable pieces of code and searches for counterexamples to their equivalence. Comparing code semantics requires a deep understanding of code. We demonstrate that LLMs with execution feedback perform well at this task. In a common code synthesis benchmark, ProbeGen disproves 18% of samples considered equivalent to the ground truth by the benchmark-provided unit tests. Additionally, using ProbeGen, we can semantically cluster LLM samples for semantic self-consistency, improving pass@1 by 10% by unifying syntactically distinct but semantically similar samples.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.