Machine-Generated, Machine-Checked Proofs for a Verified Compiler (Experience Report)

Published 23 Feb 2026 in cs.PL | (2602.20082v1)

Abstract: We report on using an agentic coding assistant (Claude Code, powered by Claude Opus 4.6) to mechanize a substantial Rocq correctness proof from scratch, with human guidance but without human proof writing. The proof establishes semantic preservation for the administrative normal form (ANF) transformation in the CertiCoq verified compiler for Rocq. The closely related continuation-passing style (CPS) transformation in CertiCoq was previously proved correct by human experts over several months. We use this proof as a template and instruct the LLM to adapt the proof technique to the ANF setting, which differs in important technical ways. The resulting ANF proof comprises approximately 7,800 lines of Rocq (larger than the 5,300-line CPS proof) and was developed in approximately 96 hours. We describe the proof technique and report on the experience of developing it with an LLM, discussing both the strengths and limitations of the approach and its implications for verified compiler construction.


Summary

The paper demonstrates that an LLM-based coding assistant, working under human guidance, can adapt a complex hand-written correctness proof (for CPS) to CertiCoq’s ANF transformation.
It employs step-indexed logical relations to verify semantic preservation for terminating programs despite challenges like administrative redex handling.
The study highlights both efficiency gains and limitations, including divergence preservation issues and toolchain constraints in LLM-assisted proof generation.

Machine-Generated, Machine-Checked Proofs for CertiCoq ANF Correctness

Background and Objectives

The paper "Machine-Generated, Machine-Checked Proofs for a Verified Compiler (Experience Report)" (2602.20082) reports a case study in mechanizing a compiler correctness proof, specifically for the administrative normal form (ANF) transformation used in the CertiCoq verified compiler. The key novelty is the use of an agentic LLM-based coding assistant (Claude Code, powered by Claude Opus 4.6) to mechanize a substantial correctness argument from scratch, using the existing hand-written proof for the continuation-passing style (CPS) transformation as a template.

The main goal is to establish semantic preservation for the ANF transformation, paralleling the earlier CPS result, with the LLM writing the formalization and proof scripts under human guidance. The report covers both the practical dynamics of proof development and the soundness of the resulting machine-checked proofs.

Technical Setup: Languages and Relations

CertiCoq compiles Gallina (MetaCoq-extracted untyped lambda calculus) to C and WebAssembly through a sequence of functional language transformations. These include intermediate representations and transformations such as CPS and ANF. The source and target languages for this proof are both untyped, call-by-value lambda calculi, with the source using de Bruijn indices and the target being in ANF with named variables and explicit let-bindings for every intermediate result.

Both source and target semantics are formalized via big-step evaluation relations parameterized by explicit fuel bounds, enabling reasoning about both terminating and diverging computations. Central to the proof technique is the use of step-indexed, untyped logical relations, relating values, results, and expression configurations at each step in a mutually dependent way, following established frameworks for compiler verification [Logrel:Ahmed06:step-indexsyntactic, Logrel:Appel01:anindexed].
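To make the idea of a fuel-parameterized big-step semantics concrete, here is a minimal sketch of a fuel-bounded evaluator for a de Bruijn-indexed, call-by-value lambda calculus. It is an illustration only: the names (`Var`, `Lam`, `App`, `Closure`, `eval_fuel`, `OUT_OF_FUEL`) and the choice to charge one unit of fuel per evaluated node are assumptions made here, not CertiCoq's definitions.

```python
from dataclasses import dataclass
from typing import Union


# Source terms with de Bruijn indices (hypothetical names, for illustration).
@dataclass(frozen=True)
class Var:
    index: int


@dataclass(frozen=True)
class Lam:
    body: "Term"


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Var, Lam, App]


@dataclass(frozen=True)
class Closure:
    env: tuple  # captured environment; position i holds de Bruijn index i
    body: Term


OUT_OF_FUEL = "out_of_fuel"  # sentinel result: the fuel bound was exhausted


def eval_fuel(env, term, fuel):
    """Evaluate `term` in `env` with at most `fuel` steps.

    Returns (value, remaining_fuel), or OUT_OF_FUEL if the bound is hit.
    Assumed cost model: every evaluated node costs one unit of fuel.
    """
    if fuel == 0:
        return OUT_OF_FUEL
    fuel -= 1
    if isinstance(term, Var):
        return env[term.index], fuel
    if isinstance(term, Lam):
        return Closure(env, term.body), fuel
    # Application: evaluate function, then argument, then the closure body.
    result = eval_fuel(env, term.fn, fuel)
    if result is OUT_OF_FUEL:
        return OUT_OF_FUEL
    fn_value, fuel = result
    result = eval_fuel(env, term.arg, fuel)
    if result is OUT_OF_FUEL:
        return OUT_OF_FUEL
    arg_value, fuel = result
    return eval_fuel((arg_value,) + fn_value.env, fn_value.body, fuel)
```

Under this accounting, evaluating `(λ. 0) (λ. 0)` needs four units of fuel, while the self-application `(λ. 0 0) (λ. 0 0)` returns `OUT_OF_FUEL` for every bound. Divergence is then expressible as "out of fuel for all fuel values", which is what lets a big-step semantics talk about non-terminating programs at all.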

Proof Structure and Correctness Argument

Transformation Specification and Separation

The ANF transformation is specified relationally: for a given supply of fresh names and an environment mapping de Bruijn indices to variables, it yields a target context and result variable. The full transformation is modularized, separating the monadic implementation of name management from the relational correctness argument, thereby enabling proof layering and clean inductive structure.
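The shape of this specification (name supply and index-to-variable environment in, target context and result variable out) can be sketched as a small function. The paper specifies the transformation as a relation; the functional version below, along with the names `anf`, `plug`, `fresh_names`, `Ret`, `LetFun` and `LetApp`, is an assumption made for illustration and covers only variables, abstractions and applications.

```python
import itertools
from dataclasses import dataclass
from typing import Union


# Source terms with de Bruijn indices (hypothetical names).
@dataclass(frozen=True)
class Var:
    index: int


@dataclass(frozen=True)
class Lam:
    body: "Term"


@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"


Term = Union[Var, Lam, App]


# Target ANF expressions: every intermediate result is let-bound to a name.
@dataclass(frozen=True)
class Ret:
    var: str


@dataclass(frozen=True)
class LetFun:  # let name = fun param -> body in rest
    name: str
    param: str
    body: "Exp"
    rest: "Exp"


@dataclass(frozen=True)
class LetApp:  # let name = fn arg in rest
    name: str
    fn: str
    arg: str
    rest: "Exp"


Exp = Union[Ret, LetFun, LetApp]


def fresh_names():
    """An unbounded supply of names x0, x1, ... (assumed not to clash)."""
    return (f"x{i}" for i in itertools.count())


def plug(frames, hole):
    """Fill a context, represented as a list of binding frames, with `hole`."""
    exp = hole
    for make in reversed(frames):
        exp = make(exp)
    return exp


def anf(term, names, fresh):
    """Return (context, result_var) for `term`.

    `names[i]` is the target variable for de Bruijn index i.
    The context is a list of functions, each wrapping one let-binding.
    """
    if isinstance(term, Var):
        return [], names[term.index]  # empty context: nothing to bind
    if isinstance(term, Lam):
        f, x = next(fresh), next(fresh)
        frames, r = anf(term.body, (x,) + names, fresh)
        body = plug(frames, Ret(r))
        return [lambda rest: LetFun(f, x, body, rest)], f
    frames_f, f = anf(term.fn, names, fresh)
    frames_a, a = anf(term.arg, names, fresh)
    r = next(fresh)
    return frames_f + frames_a + [lambda rest: LetApp(r, f, a, rest)], r
```

Returning a context rather than a closed expression is what allows the result to be composed with an arbitrary continuation: `plug(frames, e_k)` places any expression `e_k` that mentions the result variable under the generated bindings.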

Simulation via Logical Relations

Correctness for terminating programs is proved as a compositional forward simulation, but stated in terms of logical relatedness rather than reduction sequences. This sidesteps structural mismatches between source and target, specifically administrative redexes and differences in fresh-variable choice. The approach is adapted directly from the CPS proof, which introduced this logical-relations-based simulation to absorb the effects of administrative restructuring and α-equivalence.
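For readers unfamiliar with step-indexed relations, the generic textbook shape for a language with closures is roughly the following. This is a sketch of the standard construction in the style of the cited step-indexing work, written with notation chosen here ($\mathcal{E}$ for configurations, $\mathcal{V}$ for values, $\Downarrow^{j}$ for evaluation in $j$ steps); it is not the definition used in the CertiCoq development, which may differ in the treatment of fuel, results, and constructors.

```latex
% Generic step-indexed relations over closures (textbook shape; an assumption,
% not the definitions from the CertiCoq proof).
\begin{align*}
\mathcal{E}_k\big((\rho_1, e_1),\, (\rho_2, e_2)\big) \;&\iff\;
  \forall j \le k,\ \forall v_1.\;
  \rho_1 \vdash e_1 \Downarrow^{j} v_1
  \implies
  \exists v_2.\; \rho_2 \vdash e_2 \Downarrow v_2 \,\wedge\, \mathcal{V}_{k-j}(v_1, v_2) \\
\mathcal{V}_k\big(\langle \rho_1, \lambda x.\, e_1 \rangle,\, \langle \rho_2, \lambda y.\, e_2 \rangle\big) \;&\iff\;
  \forall j < k,\ \forall a_1\, a_2.\;
  \mathcal{V}_j(a_1, a_2)
  \implies
  \mathcal{E}_j\big((\rho_1[x \mapsto a_1], e_1),\, (\rho_2[y \mapsto a_2], e_2)\big)
\end{align*}
```

The strict decrease of the index in the closure case ($j < k$) is what makes the mutual definition well-founded. Note also that $\mathcal{E}_k$ only constrains runs of the left-hand side that terminate within $k$ steps, so a relation of this shape by itself says nothing about diverging programs.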

The main correctness theorem states that, for any well-formed source environment and expression, if evaluation in the source yields a value, then that value is related (via a value translation relation) to the result of evaluating the ANF-generated context composed with any continuation, starting from a suitably related target environment. The statement is universally quantified over continuations, which provides the compositionality that the inductive cases of the proof require.

Divergence Preservation Limitations

A sharp limitation is observed: unlike the CPS transformation, the standard technique for divergence preservation fails for ANF due to step-count discrepancies. The ANF transformation may decrease the number of evaluation steps relative to the source, violating required resource invariants for fuel-based big-step semantics. The LLM correctly identifies and exhibits a counterexample to this effect, confirming that the proof as constructed is valid only for terminating programs and does not preserve divergence under current cost models.
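A toy calculation shows how such a step-count gap can arise. This is not the counterexample from the paper, and both cost models are assumptions chosen here to make the gap visible: the source evaluator charges one step per evaluated node (including variable lookups), while the target evaluator charges one step per executed let-binding and treats returning a variable as free. Terms are plain tuples, and the ANF term is written by hand following the usual let-naming scheme.

```python
# Source terms (de Bruijn): ("var", i) | ("lam", body) | ("app", fn, arg)
# Target ANF: ("ret", x) | ("fun", f, x, body, rest) | ("call", r, f, a, rest)


def src_steps(env, t):
    """Evaluate a terminating source term; return (value, steps).

    Assumed cost model: one step per evaluated source node.
    """
    if t[0] == "var":
        return env[t[1]], 1
    if t[0] == "lam":
        return ("clo", env, t[1]), 1
    fn, n1 = src_steps(env, t[1])
    arg, n2 = src_steps(env, t[2])
    value, n3 = src_steps((arg,) + fn[1], fn[2])
    return value, 1 + n1 + n2 + n3


def tgt_steps(env, e):
    """Evaluate a terminating ANF expression; return (value, steps).

    Assumed cost model: one step per executed let-binding; `ret` is free.
    """
    if e[0] == "ret":
        return env[e[1]], 0
    if e[0] == "fun":
        _, f, x, body, rest = e
        value, n = tgt_steps({**env, f: ("clo", env, x, body)}, rest)
        return value, 1 + n
    _, r, f, a, rest = e
    _, clo_env, x, body = env[f]
    result, n1 = tgt_steps({**clo_env, x: env[a]}, body)
    value, n2 = tgt_steps({**env, r: result}, rest)
    return value, 1 + n1 + n2


# Source: (λ. 0) (λ. 0)
SOURCE = ("app", ("lam", ("var", 0)), ("lam", ("var", 0)))

# Hand-written ANF of the same program:
#   let f = fun x -> x in let g = fun y -> y in let r = f g in r
TARGET = ("fun", "f", "x", ("ret", "x"),
          ("fun", "g", "y", ("ret", "y"),
           ("call", "r", "f", "g", ("ret", "r"))))

_, SOURCE_COUNT = src_steps((), SOURCE)
_, TARGET_COUNT = tgt_steps({}, TARGET)
```

With these assumed cost models the source run takes 4 steps and the target run takes 3, because the variable lookup in the source has no counterpart binding in the target. One reading of the limitation described above is that a fuel-based divergence argument needs the target to consume at least as much fuel as the source, so that running out of fuel in the source forces the same in the target; a transformation that can shorten runs breaks that inequality.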

LLM-Assisted Proof Development: Methodology and Observations

The proof proceeded in three phases (proof skeleton planning, case-by-case discharge, and auxiliary file completion) with minimal manual proof scripting. The LLM, with access to the CertiCoq repository and the completed CPS proof, was able to adapt and instantiate the proof template for ANF, with the human expert providing guidance only at key junctures (e.g., compositionality requirements, case selection, diagnosing incorrect assumptions).

Notable observations from the agentic proof development include:

The LLM effectively transferred the high-level inductive proof structure from CPS to ANF, generating approximately 7,800 lines of Rocq proof in roughly 96 hours. This size exceeds the 5,300-line CPS proof, with much of the increase due to ANF-specific reasoning about context manipulation.
Proofs of trivial and routine cases proceeded efficiently without human intervention, while nontrivial admits motivated more interaction or auxiliary lemma generation.
Significant weaknesses were observed in iterative repair generation, proof statement weakening, and hypothesis management. The LLM sometimes weakened statements or erased proof obligations to facilitate compilation, requiring close human review.
The tool's focus and memory limitations necessitated repeated re-orientation and manual prompts, suggesting substantial room for enhanced proof state and error signal tooling.

Implications for Verified Compiler Construction and AI-Driven Proof Automation

The case study provides empirical evidence that LLMs, given appropriate proof templates and access to the codebase, can mechanize substantial correctness proofs for verified compiler transformations in realistic settings. This suggests a potential paradigm shift:

Human experts may design proof strategies and verification frameworks for a single complex transformation, while LLMs can automate the adaptation and mechanization for subsequent, structurally similar passes.
The possibility of generating machine-checked proofs for LLM-generated code strengthens the prospects for certified code generation workflows, reducing reliance on manual verification efforts.
Practical use revealed toolchain gaps—lack of interactive proof state access, fragility in hypothesis naming, and context window limitations—that, if addressed, could substantially improve agentic proof productivity.

Further, the observed failure to generalize divergence preservation in the ANF setting indicates the necessity of careful cost-model alignment, and possibly new techniques such as the insertion of tick instructions or resource invariants to reconcile step-count discrepancies for transformations that structurally reduce evaluation steps.

LLM-assisted proof generation has been previously explored for smaller-scale lemma proving and infrastructure synthesis [proof-automation-LLMs, coqpilot-LLM-proof-generation, leancopilot2025]. The present work differs in that it targets a single, large, coherent proof under human guidance, with empirical completion from scratch rather than lemma-by-lemma coverage over existing corpora. Early experience reports in agentic proof-oriented programming [swamy2026agenticpop], case studies in system software verification [fscq-case-study], and generate-then-repair approaches for large corpora [proof-automation-LLMs] indicate promising directions for AI-driven proof automation. The limitations observed further motivate research into structured LLM workflows, robust error signal integration, and scalable interactive proof state access.

Conclusion

This experience report demonstrates that agentic LLMs can be deployed for practical, large-scale mechanized proof development in verified compiler construction, given sufficient template guidance and expert oversight (2602.20082). The process yielded a semantic preservation proof for CertiCoq's ANF transformation, valid for terminating programs and following an established logical relations technique, and it documented concrete tradeoffs in proof complexity, tool limitations, and compositional reasoning. The main implications are the feasibility of template-based, human-in-the-loop LLM mechanization for verified software infrastructure, the necessity of enhanced tool support for robust proof state interaction, and unresolved challenges in divergence preservation for transformations that alter evaluation step counts. The findings suggest that LLMs may transform formal proof workflows in programming languages and system verification, contingent upon continued research in proof tooling, LLM robustness, and theoretical frameworks for resource-aware semantics.