
Verifiable CoT via Execution Traces

Updated 31 January 2026
  • The paper demonstrates that grounding each reasoning step in deterministic execution traces eliminates logical hallucinations and greatly improves interpretability.
  • The methodology employs systematic trace collection and structured LLM prompting, ensuring each step of reasoning reflects concrete program state transitions.
  • Empirical findings show significant performance gains in code and tool-use applications, with improved intermediate accuracy, reduced overthinking, and robust validation.

Generating verifiable Chain-of-Thought (CoT) from execution traces is a methodology aimed at constructing supervision signals for LLMs in which each reasoning step is grounded in, and faithfully reflects, the concrete sequence of operations performed by an executing program. This paradigm addresses the limitations of plausible-sounding but potentially unfaithful CoT data, ensuring that each natural language rationale step corresponds exactly to the computed behavior of code or tool sequences. Recent work has refined and operationalized this approach with rigor across logic, programming, and tool-use domains, yielding datasets and frameworks that enable stepwise verifiability, eliminate hallucinations, and improve reasoning accuracy and reliability (Thakur et al., 28 Nov 2025, Jung et al., 12 Jun 2025, Hao et al., 4 Jan 2026).

1. Formalization of Execution Traces and Soundness Guarantees

An execution trace $\tau$ captures the temporal evolution of a program or environment state under deterministic code execution. For code reasoning, the trace is structured as a sequence of triples:

$$\tau = [(s_0, e_1, s_1), (s_1, e_2, s_2), \ldots, (s_{T-1}, e_T, s_T)]$$

where $s_i$ is the program state after event $i$, and $e_i$ is an atomic event (executed line, assignment, branch, or function call/return) (Thakur et al., 28 Nov 2025, Jung et al., 12 Jun 2025). For tool-use agents, traces generalize to composed API invocations and parameter flows through an API graph, with each step representing calls and the resulting environment updates (Hao et al., 4 Jan 2026).
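As a concrete sketch, a trace of this shape can be recorded for a single pure Python function with the standard-library `sys.settrace` hook (the helper names `capture_trace` and `triple` are illustrative, not from the papers; note that `settrace` fires a "line" event *before* a line executes, so the transition for each line is emitted when the next event arrives):

```python
import sys

def capture_trace(fn, *args):
    """Record tau as a list of (s_before, event, s_after) triples,
    where each state is a snapshot of the function's local variables."""
    trace = []
    prev = None  # (lineno, state snapshot) of the line about to execute

    def tracer(frame, event, arg):
        nonlocal prev
        if frame.f_code is not fn.__code__:
            return tracer
        snap = dict(frame.f_locals)
        if event == "line":
            if prev is not None:
                # the previous line has now finished: emit its transition
                trace.append((prev[1], ("ExecLine", prev[0]), snap))
            prev = (frame.f_lineno, snap)
        elif event == "return" and prev is not None:
            # flush the transition for the final executed line
            trace.append((prev[1], ("ExecLine", prev[0]), snap))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, trace

def triple(n):       # toy deterministic program to trace
    x = n + 1
    y = x * 3
    return y

result, tau = capture_trace(triple, 2)
# tau[i] has the form (s_{i-1}, ("ExecLine", lineno), s_i); the second
# transition, for example, records that y becomes 9 after `y = x * 3`.
```

Production pipelines use richer instrumentation (pysnooper/Snoop emit assignments, branches, and call/return events), but the triple structure is the same.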

Soundness is enforced by requiring that each generated CoT step $R_i$ asserting, e.g., "variable $x$ has value $v$ after line $\ell$," be directly supported by the trace:

$$\forall i,\quad \mathrm{Assert}(\texttt{MentionedValue}(R_i, x, v)) \implies \exists\, (s_{j-1}, e_j, s_j) \in \tau : e_j = \texttt{ExecLine}(\ell),\; s_j(x) = v$$

Completeness requires that every transition affecting the final output has a corresponding narrative step (Thakur et al., 28 Nov 2025). Together, these constraints eliminate logical hallucinations by grounding every inference in verifiable state transitions.
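Given a trace in the triple format above, the soundness predicate is mechanically checkable; a minimal sketch (the function name and toy trace are illustrative, not from the papers):

```python
def mentioned_value_supported(trace, lineno, var, value):
    """Check Assert(MentionedValue(R_i, x, v)): some transition in tau
    executes the given line and leaves var bound to value."""
    return any(
        event == ("ExecLine", lineno) and s_after.get(var) == value
        for (s_before, event, s_after) in trace
    )

# Toy trace of a two-line program: line 10: x = n + 1; line 11: y = x * 3
tau = [
    ({"n": 2},         ("ExecLine", 10), {"n": 2, "x": 3}),
    ({"n": 2, "x": 3}, ("ExecLine", 11), {"n": 2, "x": 3, "y": 9}),
]

assert mentioned_value_supported(tau, 10, "x", 3)       # faithful CoT claim
assert not mentioned_value_supported(tau, 11, "y", 12)  # hallucinated value
```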

2. Trace Collection and Instrumentation Protocols

The foundation for verifiable CoT generation is comprehensive and deterministic trace capture. For program reasoning, code is instrumented with tools such as pysnooper or Snoop to emit events at every function entry/exit, line execution, assignment, and return. Only code passing deterministic execution filters (e.g., exclusion of random modules, timeouts, tractability thresholds) is retained (Thakur et al., 28 Nov 2025, Jung et al., 12 Jun 2025).
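The determinism filter can be approximated with a static pre-pass; the sketch below (the module blocklist is illustrative, not the papers' exact criteria) rejects candidates that import common sources of nondeterminism before any tracing is attempted:

```python
import ast

# Illustrative blocklist: modules whose use breaks deterministic replay
NONDETERMINISTIC = {"random", "time", "secrets", "os", "datetime"}

def passes_determinism_filter(source: str) -> bool:
    """Reject code that fails to parse or imports a blocklisted module."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in NONDETERMINISTIC
                   for alias in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in NONDETERMINISTIC:
                return False
    return True
```

In practice this static pass would be combined with a wall-clock timeout and a trace-length cap enforced at execution time.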

In tool-use agents, trace collection is framed as sampling legal API call sequences from a dynamic API graph $G = (T, D, P)$ constructed from observed agent failure cases:

  • $T$: the set of APIs exhibiting failures;
  • $D$: directed edges encoding invocation order and dependencies;
  • $P$: parameter signature constraints ensuring call validity.

Sampling “hard traces” focuses on subgraphs corresponding to difficult, failure-prone behaviors for model improvement (Hao et al., 4 Jan 2026).
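Under these definitions, the legal call sequences of $G$ are exactly the orderings consistent with the dependency edges $D$, i.e. topological orderings of the graph. A small enumeration sketch (the API names and edges are hypothetical):

```python
def legal_sequences(apis, deps):
    """Enumerate call orders consistent with dependency edges D,
    i.e. topological orderings of the API graph."""
    results = []

    def extend(seq, remaining):
        if not remaining:
            results.append(tuple(seq))
            return
        for api in sorted(remaining):
            # api may be called now iff every prerequisite already appears
            if all(pre in seq for (pre, post) in deps if post == api):
                extend(seq + [api], remaining - {api})

    extend([], set(apis))
    return results

# Hypothetical failure-prone subgraph: login -> search -> book -> pay
T = {"login", "search_flights", "book_flight", "pay"}
D = {("login", "search_flights"),
     ("search_flights", "book_flight"),
     ("book_flight", "pay")}
# This chain admits exactly one legal order; denser subgraphs admit
# many orders, from which "hard traces" can be sampled.
```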

3. Trace-to-CoT Transformation and Data Generation Pipelines

Conversion from traces to CoT rationales employs structured prompting of LLMs with synthesized or real trace events. The high-level pipeline comprises:

  • Problem synthesis: Extraction of programming concepts; LLM-driven generation of problem descriptions, reference signatures, and multiple candidate solutions plus tests.
  • Execution-based verification: Dual agreement via cross-execution of all (solution, test suite) pairs; solution clusters passing maximal overlapping tests are retained as ground-truth (Thakur et al., 28 Nov 2025).
  • Instrumentation and trace extraction: Each verified solution is executed on passing tests, emitting a log τ\tau.
  • Narrative mapping (trace-to-CoT): For each trace, an LLM is prompted to generate a CoT in which each step refers to state changes in $\tau$, in both the forward ("Given input $x$, what is output $y$?") and backward ("Given output $y$, what input $x$ could have produced it?") directions (Thakur et al., 28 Nov 2025).
  • Filtering and abstraction: In tool-use, primitive traces are abstracted into “advanced tools” and “hard queries” via agentic modularization, and Reasoner–Verifier feedback loops enforce that the CoT exactly reconstructs the successful trace (Hao et al., 4 Jan 2026).

This process produces supervision datasets where each (problem, trace, CoT, answer) tuple is guaranteed correct-by-construction.
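The execution-based verification step above can be sketched as follows (a simplification of the papers' pipeline: real systems sandbox execution and run full generated test suites; all names here are illustrative):

```python
from collections import defaultdict

def run_safely(fn, args):
    """Execute one (solution, test) pair, treating any exception as a failure."""
    try:
        return fn(*args)
    except Exception:
        return None

def dual_agreement(solutions, tests):
    """Cross-execute all (solution, test) pairs and keep the cluster of
    solutions agreeing on the largest set of passing tests."""
    clusters = defaultdict(list)
    for name, fn in solutions.items():
        passing = frozenset(
            i for i, (args, expected) in enumerate(tests)
            if run_safely(fn, args) == expected
        )
        clusters[passing].append(name)
    # prefer the cluster passing the most tests; break ties by cluster size
    best = max(clusters, key=lambda passed: (len(passed), len(clusters[passed])))
    return sorted(clusters[best])

solutions = {
    "sol_a": lambda n: n * (n + 1) // 2,   # correct triangular number
    "sol_b": lambda n: sum(range(n + 1)),  # also correct
    "sol_c": lambda n: n * n // 2,         # buggy candidate
}
tests = [((4,), 10), ((5,), 15), ((0,), 0)]
# dual_agreement retains sol_a and sol_b; sol_c disagrees on the tests.
```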

4. Verification Frameworks and Formal Approaches

Verifiability extends beyond matching against execution logs. Typed frameworks inspired by the Curry–Howard correspondence encode each CoT step as a type-annotated logical inference of the form

$$\Gamma \vdash t : T$$

where each inference rule (e.g., $\mathsf{Compute\_Add}$, $\mathsf{Therefore}$) matches a reasoning rule with explicit preconditions and an output type (Perrier, 1 Oct 2025). Type checkers, constructed with domain-specific combinators and units, provide machine-verifiable certification of CoT faithfulness. This strengthens the guarantee from empirical soundness (trace matching) to formal validity under a compositional proof calculus, allowing auditors or proof assistants to reconstruct the logical flow from premises to conclusion (Perrier, 1 Oct 2025).
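A toy rendering of such checking (the rule names echo the paper's $\mathsf{Compute\_Add}$ style, but this checker is a hypothetical sketch, not the published calculus):

```python
def check_steps(env, steps):
    """Replay typed inference steps against an environment of premises.
    Each step is (rule, operand names, result name, claimed value);
    a step is admitted only if its rule's precondition actually holds."""
    for rule, operands, out, claimed in steps:
        if rule == "Compute_Add":
            a, b = operands
            if env[a] + env[b] != claimed:
                return False
        elif rule == "Compute_Mul":
            a, b = operands
            if env[a] * env[b] != claimed:
                return False
        else:
            return False  # unknown rule: refuse to certify
        env[out] = claimed  # the conclusion extends the context Gamma
    return True

# Certified chain: 2 + 3 = 5, then 5 * 4 = 20
ok = check_steps({"x": 2, "y": 3, "k": 4},
                 [("Compute_Add", ("x", "y"), "s", 5),
                  ("Compute_Mul", ("s", "k"), "p", 20)])
# An unfaithful step such as ("Compute_Add", ("x", "y"), "s", 6) is rejected.
```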

5. Evaluation, Empirical Findings, and Ablation Results

Empirical studies consistently demonstrate substantial performance gains for models trained with verifiable, execution-trace-grounded CoT:

Model/Setting                 Output Prediction   Input Prediction   HumanEval@1   LiveCodeBench-Exec
Granite-3.3-8B (base)         15.5%               14.3%              62.0%         18.3%
  + Bi-dir Trace-CoT (25k)    45.7%               42.1%              81.5%         44.3%
Qwen2.5-Coder-7B (base)       45.3%               47.5%              62.0%         46.3%
  + Bi-dir Trace-CoT (25k)    59.7%               61.9%              81.5%         68.2%

Ablations indicate that:

  • Trace-grounded CoT reduces logical hallucination, ensures high intermediate step accuracy (91.5% vs. 73.0% for ungrounded CoT), and lessens “overthinking” (token count reductions up to 20%) (Jung et al., 12 Jun 2025).
  • In tool-use domains, advanced abstraction and feedback-based verification increase multi-turn accuracy by up to 17 points over baseline and 19 points over proprietary large models (Hao et al., 4 Jan 2026).
  • Typed proof-based frameworks achieve strict certification rates with answer precision exceeding 91.6% (Perrier, 1 Oct 2025).

6. Challenges, Limitations, and Open Directions

While the approach provides strong verifiability, several practical challenges remain:

  • Current pipelines are typically limited to single-language (Python) and single-function scopes; expanding to multi-module, multi-language, or I/O-intensive code requires language-agnostic tracers and scalable storage solutions (Thakur et al., 28 Nov 2025).
  • Redundant or non-contributory trace events present signal-to-noise problems; lossless but compact trace pruning remains an open problem.
  • Symbolic execution and deeper formal methods could supplement empirical verification for total coverage.
  • Adapting trace-to-CoT pipelines to richer domains (e.g., probabilistic programs, sets, graphs) demands new type systems, rulesets, and mapping grammars (Perrier, 1 Oct 2025).

A plausible implication is that as model and benchmark complexity scale, precise verifiability will become essential for both interpretability and safety—especially in critical domains.

7. Significance and Impact on Reasoning-Centric AI

Grounding CoT in execution traces establishes a paradigm for constructing reasoning supervision that is transparent, reproducible, and robust against the failure cases induced by plausible but unfaithful rationalization. Empirically, this leads to marked improvements in output prediction, explanation, and generalization in both code and tool-use settings. Formally, this paves the way for proof-carrying CoT, offering machine-checkable certificates of faithfulness and evidence for correct intermediate logical steps. This methodology also underpins ongoing efforts to benchmark and audit “reasoning reliability” of LLMs at scale, offering a foundation for future research integrating execution-grounded rationales with formal proof-checking infrastructure (Thakur et al., 28 Nov 2025, Jung et al., 12 Jun 2025, Hao et al., 4 Jan 2026, Perrier, 1 Oct 2025).
