Generalizability of template-based LLM-assisted mechanized proof development

Ascertain the degree to which template-based, LLM-assisted mechanized proof development generalizes to settings where one or more enabling preconditions are absent, specifically: the absence of a closely related, complete proof serving as a template; the absence of a domain expert able to identify correspondences and provide targeted guidance; or the absence of a sufficiently capable large language model. Evaluate performance and reliability when adapting proofs in contexts such as transformations without close templates or domains where human expertise is limited.

Background

The paper reports a case study in which an LLM (Claude Code) adapted a known proof technique from the CPS transformation to prove correctness of the ANF transformation in the CertiCoq compiler. The success of the experiment depended on three preconditions: the existence of a closely related proof to serve as a template, the availability of a domain expert to guide the process, and a sufficiently capable LLM.

While the approach succeeded under these conditions, the paper explicitly notes uncertainty about how well such template-driven, LLM-assisted proof development would perform when one or more of these preconditions are missing, for example when no similarly structured proof exists to guide adaptation or when expert guidance is limited. It characterizes this uncertainty as an important open question regarding generalizability.

References

The success of our experiment depends on several preconditions: the existence of a closely related, complete proof serving as a template; a domain expert able to identify correspondences and provide targeted guidance; and a sufficiently capable LLM. How well this approach generalizes to settings where one or more of these preconditions are absent, e.g., proofs without a close template, or domains where the human lacks deep expertise, is an important open question.

Machine-Generated, Machine-Checked Proofs for a Verified Compiler (Experience Report) (2602.20082 - Paraskevopoulou, 23 Feb 2026) in Section 6 (Discussion), Generalizability