
EG-CFG: Execution-Guided Code Generation

Updated 23 January 2026
  • Execution-Guided Classifier-Free Guidance is a paradigm that integrates runtime execution feedback with LLM inference, enabling iterative correction of code errors.
  • It employs a multi-stage pipeline including candidate sampling, execution signal extraction, and classifier-free guidance to enhance task success rates.
  • Empirical benchmarks on MBPP and HumanEval demonstrate significant improvements in accuracy and efficiency compared to conventional decoding methods.

Execution-Guided Classifier-Free Guidance (EG-CFG) is a neural code generation paradigm that infuses real-time runtime execution signals into LLM inference. Unlike standard token generation pipelines that rely exclusively on learned syntax and pattern recognition, EG-CFG tightly incorporates line-by-line execution feedback, emulating the iterative, error-correcting workflow of expert programmers. The framework consists of a multi-stage pipeline that synthesizes candidate code completions, extracts granular runtime information through active code execution, and re-injects this feedback into the generation process via classifier-free guidance mechanisms. This approach achieves substantial gains in task success rates and supports native parallelism, enabling broad exploration of solution space and efficient computation (Lavon et al., 12 Jun 2025).

1. Formalization and Motivation

A code generation task in EG-CFG is structured as a prompt $p_\mathrm{inst} = (p_0,\, p_\mathrm{task},\, T,\, f_\mathrm{name})$, where $p_0$ defines the instruction template, $p_\mathrm{task}$ specifies the problem, $T=\{t_1,\dots,t_{|T|}\}$ denotes the test cases used to check correctness, and $f_\mathrm{name}$ is the target function name. The principal objective is to autoregressively emit a token sequence $w^* = [w_0^*, w_1^*, \dots, w_{N-1}^*]$ such that, when assembled as a Python program, every test $t_j \in T$ passes:

$$\forall\,t_j \in T:\quad \mathrm{Execute}(w^*, t_j) = \text{success}.$$

Traditional LLM decoding strategies (greedy, temperature, top-$p$) postpone runtime validation until after program completion. This design restricts a model's ability to repair semantic and logical faults at intermediate steps. EG-CFG bridges this gap by injecting execution feedback into the decoding of every token, with the feedback refreshed line by line, guiding generation toward executable, correct solutions.
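The success criterion above can be illustrated with a minimal Python sketch. The `TaskPrompt` container, the `passes_all_tests` helper, and the choice to represent tests as `assert` strings are all assumptions made for this illustration; they are not taken from the EG-CFG implementation, and the sketch has no sandboxing or timeouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskPrompt:
    """Hypothetical container mirroring p_inst = (p_0, p_task, T, f_name)."""
    instruction_template: str  # p_0
    task_description: str      # p_task
    tests: tuple               # T, assumed here to be assert statements as strings
    function_name: str         # f_name


def passes_all_tests(program: str, task: TaskPrompt) -> bool:
    """Return True only if every test in T runs without raising."""
    for test in task.tests:
        namespace = {}
        try:
            exec(program, namespace)  # define the candidate function
            exec(test, namespace)     # run one assertion against it
        except Exception:
            return False
    return True


task = TaskPrompt(
    instruction_template="Write a Python function for the task below.",
    task_description="Return the sum of two integers.",
    tests=("assert add(1, 2) == 3", "assert add(-1, 1) == 0"),
    function_name="add",
)
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
print(passes_all_tests(good, task), passes_all_tests(bad, task))
```

A production setting would run candidates in an isolated process with resource limits; the in-process `exec` here only keeps the example short.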

2. EG-CFG Multistage Pipeline

EG-CFG proceeds in three tightly interleaved stages for each line of code:

  • Stage 1: Line-by-Line Beam Candidate Sampling. The model $M$, configured with temperature $t$, emits $s$ candidate continuations of $d$ new lines via beam search:

$$c^j \sim M(c \mid p_\mathrm{sol};\, d, t),\qquad j=1,\dots,s.$$

Each $c^j$ is a raw code segment.

  • Stage 2: Execution Signal Extraction. Each candidate $c^j$ is parsed into executable form $\hat c^j = \mathrm{ExtractExecutable}(c^j)$ (e.g., via AST modification or truncation to ensure syntactic validity). Duplicates are removed:

$$C = \mathrm{Unique}\{\hat c^1, \dots, \hat c^s\}.$$

For every $\hat c^j \in C$ and test case $t_m \in T$, the code fragment is executed:

$$e^{j,m} = \mathrm{ExtractExecutionFeedback}(\hat c^j, t_m),$$

yielding traces such as variable states, outputs, or error signals.

  • Stage 3: Classifier-Free Guidance with Dynamic Execution Signals. The execution feedback is encoded as a guidance prompt $p_\mathrm{signal} = [p_\mathrm{dyn-inst}, \{ (\hat c^j, t_m, e^{j,m}) \}]$, which is spliced into the current solution prompt at index $i_\mathrm{dyn}$ to form $p_\mathrm{dyn}$. CFG merges the unconditional prior $M(w_i \mid p_\mathrm{sol})$ and the conditional distribution $M(w_i \mid p_\mathrm{dyn})$, amplifying tokens favored by successful partial executions.
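Stage 2 can be sketched in plain Python under a few stated assumptions. The helper names `extract_executable`, `unique`, and `execution_feedback` are hypothetical, truncation is the only repair strategy shown (the text also mentions AST modification), and the trace format (line number plus local variables, collected with `sys.settrace`) is an illustrative choice rather than the format used by the authors.

```python
import ast
import sys


def extract_executable(candidate: str) -> str:
    """Drop trailing lines until the remainder parses as Python."""
    lines = candidate.splitlines()
    while lines:
        source = "\n".join(lines)
        try:
            ast.parse(source)
            return source
        except SyntaxError:  # also covers IndentationError
            lines.pop()
    return ""


def unique(fragments):
    """Remove duplicates and empty fragments, keeping first-seen order."""
    return [f for f in dict.fromkeys(fragments) if f]


def execution_feedback(fragment: str, test_call: str) -> dict:
    """Run the fragment plus one test call; record locals per line and any error."""
    trace, error = [], None
    filename = "<candidate>"

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != filename:
            return None  # ignore frames that are not candidate code
        if event == "line" and frame.f_code.co_name != "<module>":
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    code = compile(fragment + "\n" + test_call + "\n", filename, "exec")
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(code, {})
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return {"trace": trace, "error": error}


complete = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc"
candidates = [complete, complete + "\n    if (", "def total(xs):\n    return xs[0] +"]
C = unique(extract_executable(c) for c in candidates)
feedback = execution_feedback(C[0], "total([1, 2])")
failure = execution_feedback(C[0], "total(None)")
print(len(C), feedback["error"], failure["error"])
```

Here the second candidate collapses onto the first after truncation and the third has no parseable prefix, so only one fragment is executed per test.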

3. Mathematical Characterization of CFG with Execution Signals

Execution feedback is integrated through a weighted combination of log-probabilities. For each token position $i$:

$$\log M_\mathrm{CFG}(w_i \mid p_\mathrm{sol}, p_\mathrm{dyn}) = (1-\gamma)\, \log M(w_i \mid p_\mathrm{sol}) + \gamma\, \log M(w_i \mid p_\mathrm{dyn}),$$

where $\gamma \ge 0$ determines the strength of execution-guided conditioning. The combination is convex only for $0 \le \gamma \le 1$; for $\gamma > 1$ it extrapolates beyond the conditional distribution. When $\gamma=0$, decoding reduces to the unconditional prior, $\gamma=1$ recovers the conditional distribution, and larger $\gamma$ values shift generation further toward code fragments empirically validated against test cases. Token selection follows either

$$w_i = \arg\max_{w}\; M_\mathrm{CFG}(w \mid p_\mathrm{sol}, p_\mathrm{dyn}),$$

or stochastic sampling, per design.
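The guidance formula can be checked numerically with a small, dependency-free sketch. The three-token vocabulary and the probabilities are made-up illustrative values, and the renormalization step is an added assumption: the formula as written yields unnormalized scores, which is sufficient for the arg max but not for sampling.

```python
import math


def cfg_log_probs(logp_uncond, logp_cond, gamma):
    """Combine two log-distributions as (1 - gamma) * uncond + gamma * cond, then renormalize."""
    mixed = [(1.0 - gamma) * u + gamma * c for u, c in zip(logp_uncond, logp_cond)]
    peak = max(mixed)  # subtract the maximum for a numerically stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(m - peak) for m in mixed))
    return [m - log_z for m in mixed]


def greedy_token(log_probs):
    """Index of the highest-scoring token."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)


vocab = ["return", "print", "pass"]                          # toy vocabulary
logp_uncond = [math.log(p) for p in (0.3, 0.5, 0.2)]         # M(w | p_sol), illustrative
logp_cond = [math.log(p) for p in (0.6, 0.3, 0.1)]           # M(w | p_dyn), illustrative

for gamma in (0.0, 1.0, 3.0):
    choice = vocab[greedy_token(cfg_log_probs(logp_uncond, logp_cond, gamma))]
    print(gamma, choice)
```

With these numbers, $\gamma=0$ selects the unconditional favorite, while $\gamma=1$ and $\gamma=3$ select the token preferred under the execution-conditioned prompt.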

4. Signal Propagation and Refreshing

Execution signals are computed only once per line and reused for all tokens within that line, providing coherent local guidance. At line boundaries, the signals are refreshed. The core pseudocode is as follows:

# Inputs: model M, prefix prompt p_pre, tests, instruction p_dyn_inst,
#         hyperparameters (s, d, t, gamma), and i_dyn, the index in p_pre
#         at which execution signals are injected.
p_sol = p_pre
while not end_of_code:
    # Stage 1: sample s candidates for the next d lines
    candidates = BeamSearch(M, p_sol, d, t, s)

    # Stage 2: extract execution signals once, at the start of the line
    executables = [ExtractExecutable(c) for c in candidates]
    C = Unique(executables)
    signals = []
    for c_j in C:
        for t_m in tests:
            trace = RunAndTrace(c_j, t_m)
            signals.append((c_j, t_m, trace))
    p_signal = [p_dyn_inst, signals]
    p_dyn = splice(p_sol, i_dyn, p_signal)

    # Stage 3: generate the tokens of this line under CFG, reusing p_signal
    while not at_line_end:
        logits_uncond = M.logits(p_sol)
        logits_cond   = M.logits(p_dyn)
        logits_cfg    = logits_uncond + gamma * (logits_cond - logits_uncond)
        w_next        = argmax(logits_cfg)  # or sample from softmax(logits_cfg)
        p_sol        += w_next
        p_dyn         = splice(p_sol, i_dyn, p_signal)  # same signals, longer solution
    # The next outer iteration starts a new line and recomputes p_signal.
The persistent use of p_signal within a line ensures the invariance of execution feedback across token positions, while systematic recomputation at line starts admits timely correction.

5. Native Parallelism: Multiple Independent Agents

EG-CFG inherently supports parallel exploration across diverse agent configurations: hyperparameters such as beam size $s$, horizon $d$, temperature $t$, classifier-free guidance strength $\gamma$, and instruction template $p_0$ can be systematically enumerated. Each $(s, d, t, \gamma, p_0)$ tuple constitutes a fully independent agent, which processes the coding task in isolation. The parallel execution model allows for early termination upon solution discovery (the first agent to pass all tests wins), unlike iterative multi-agent systems that exhibit sequential dependencies.
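The independent-agent scheme can be sketched with the Python standard library. Everything specific here is an assumption for illustration: `run_agent` is a hypothetical stand-in whose pass/fail rule is a toy, the grid values are arbitrary, and threads substitute for whatever processes or machines a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def run_agent(config):
    """Hypothetical stand-in for one full EG-CFG run; returns (config, solved)."""
    s, d, t, gamma, template = config
    solved = gamma >= 1.0 and s >= 3  # toy rule, not an empirical result
    return config, solved


def first_solution(configs, max_workers=4):
    """Run agents concurrently and return the first configuration that passes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_agent, cfg) for cfg in configs]
        for future in as_completed(futures):
            config, solved = future.result()
            if solved:
                for other in futures:
                    other.cancel()  # only prevents agents that have not started yet
                return config
    return None


# Enumerate every (s, d, t, gamma, p_0) tuple; values are illustrative only.
grid = list(product((2, 3), (2,), (0.7,), (0.0, 1.0), ("default",)))
winner = first_solution(grid)
print(winner)
```

Because the agents share no state, no coordination is needed beyond collecting the first success; `Future.cancel` cannot interrupt an agent that is already running, so a real system would need its own stop signal for that.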

6. Empirical Performance and Benchmarking

On MBPP (500 tasks) with DeepSeek-V3-0324, standard prompting yields 82.8% accuracy; EG-CFG elevates task success to 96.6%, a state-of-the-art result. For MBPP-ET, performance rises from 64.8% to 73.0%. On HumanEval (164 tasks), EG-CFG increases accuracy from 82.9% to 96.95%; HumanEval-ET from 79.2% to 87.2%. On CodeContests, EG-CFG achieves 58.18% compared to 41.81%—an absolute improvement of 16.4 points. These results markedly surpass previous self-debugging and multi-agent methods, even when restricted to a single open-source model.

7. Computational Complexity and Overhead

EG-CFG imposes additional computational requirements relative to conventional decoding:

  • Beam search for $s$ candidate completions per line
  • AST parsing and duplicate filtering
  • $s \times |T|$ executions per line
  • Two forward LLM passes per token (unconditional, conditional distributions)

With typical settings ($s=3$, $|T| \approx 3$), this amounts to roughly nine executions per line, which remains tractable, while the two LLM evaluations per token roughly double the forward-pass cost of decoding. MBPP mean per-task runtime under full parallel scheduling is approximately 271 s (DeepSeek-V3-0324), modestly lower than MapCoder (283 s) and substantially faster than MGDebugger (842 s). Early termination further reduces wall-clock time.
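The overhead terms listed above can be turned into a back-of-the-envelope estimate. This is a rough sketch under simplifying assumptions: the function name and the example line and token counts are hypothetical, deduplication is ignored (so the execution count is an upper bound), and the cost of beam search itself is not modeled.

```python
def overhead_estimate(s: int, num_tests: int, num_lines: int, tokens_per_line: int) -> dict:
    """Count executions and guided forward passes for one generated solution."""
    executions_per_line = s * num_tests            # s x |T|, before deduplication
    guided_passes_per_line = 2 * tokens_per_line   # unconditional + conditional per token
    return {
        "executions_per_line": executions_per_line,
        "total_executions": executions_per_line * num_lines,
        "total_guided_passes": guided_passes_per_line * num_lines,
    }


# s = 3 and |T| = 3 follow the typical settings in the text;
# 10 lines of 12 tokens each are made-up numbers for illustration.
estimate = overhead_estimate(s=3, num_tests=3, num_lines=10, tokens_per_line=12)
print(estimate)
```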

In conclusion, EG-CFG integrates semantic runtime feedback with generative token modeling, producing executable code in an efficient, guided fashion. The combination of linewise sampling, AST-based execution tracing, and classifier-free conditional weighting forms a robust architecture for high-fidelity neural code synthesis (Lavon et al., 12 Jun 2025).
