EG-CFG: Execution-Guided Code Generation
- Execution-Guided Classifier-Free Guidance is a paradigm that integrates runtime execution feedback with LLM inference, enabling iterative correction of code errors.
- It employs a multi-stage pipeline including candidate sampling, execution signal extraction, and classifier-free guidance to enhance task success rates.
- Empirical benchmarks on MBPP and HumanEval demonstrate significant improvements in accuracy and efficiency compared to conventional decoding methods.
Execution-Guided Classifier-Free Guidance (EG-CFG) is a neural code generation paradigm that injects runtime execution signals into LLM inference. Unlike standard token generation pipelines that rely exclusively on learned syntax and pattern recognition, EG-CFG incorporates line-by-line execution feedback, emulating the iterative, error-correcting workflow of expert programmers. The framework is a multi-stage pipeline that samples candidate code completions, extracts fine-grained runtime information by executing them, and feeds this information back into generation via classifier-free guidance. The approach achieves substantial gains in task success rates and supports native parallelism, enabling broad exploration of the solution space and efficient computation (Lavon et al., 12 Jun 2025).
1. Formalization and Motivation
A code generation task in EG-CFG is posed as a prompt with four components: an instruction template, a problem description, a set of test cases $T$ that define correctness, and the name of the target function. The objective is to autoregressively emit a token sequence that, when assembled into a Python program, passes every test case in $T$.
Traditional LLM decoding strategies (greedy decoding, temperature sampling, top-$k$ or top-$p$ sampling) postpone runtime validation until the program is complete. This design prevents the model from repairing semantic and logical faults at intermediate steps. EG-CFG closes this gap by injecting execution traces during generation: the traces are refreshed at each new line and condition every token within it, steering generation toward executable, correct solutions.
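To make the task structure concrete, the following minimal sketch models a task as a small record and checks whether a finished program passes every test. The names `CodeTask` and `passes_all_tests`, and the choice to represent tests as `assert` strings, are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeTask:
    """Hypothetical container for the four prompt components."""
    instruction: str      # instruction template
    description: str      # natural-language problem statement
    tests: tuple          # assert statements that define correctness
    function_name: str    # name the solution must define


def passes_all_tests(program: str, task: CodeTask) -> bool:
    """Return True only if the program defines the target function
    and every test assertion runs without raising."""
    namespace = {}
    try:
        exec(program, namespace)              # define the candidate function
        if task.function_name not in namespace:
            return False
        for test in task.tests:
            exec(test, namespace)             # AssertionError means failure
    except Exception:
        return False
    return True


task = CodeTask(
    instruction="Write a Python function that solves the task.",
    description="Return the sum of two integers.",
    tests=("assert add(1, 2) == 3", "assert add(-1, 1) == 0"),
    function_name="add",
)
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
print(passes_all_tests(good, task), passes_all_tests(bad, task))
```

In this sketch the check runs only on a complete program, which is exactly the late validation that conventional decoding relies on. Generated code should be executed in a sandbox in any real setting.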
2. EG-CFG Multistage Pipeline
EG-CFG proceeds in three tightly interleaved stages for each line of code:
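- Stage 1: Line-by-Line Beam Candidate Sampling. The model $M$, configured with temperature $t$, emits $s$ candidate continuations of the next $d$ lines via beam search:
  $$\{\hat{c}_1, \dots, \hat{c}_s\} = \mathrm{BeamSearch}(M, p_{\mathrm{sol}}, d, t, s)$$
  Each $\hat{c}_k$ is a raw code segment.
- Stage 2: Execution Signal Extraction. Each candidate is parsed into executable form (e.g., via AST modification or truncation to ensure syntactic validity). Duplicates are removed:
  $$C = \mathrm{Unique}\left(\{\mathrm{ExtractExecutable}(\hat{c}_k)\}_{k=1}^{s}\right)$$
  For every $c_j \in C$ and test case $t_m \in T$, the code fragment is executed:
  $$\mathrm{trace}_{j,m} = \mathrm{RunAndTrace}(c_j, t_m)$$
  yielding traces such as variable states, outputs, or error signals.
- Stage 3: Classifier-Free Guidance with Dynamic Execution Signals. The execution feedback is encoded as a guidance prompt $p_{\mathrm{signal}}$, then spliced into the current solution prompt $p_{\mathrm{sol}}$ at index $i_{\mathrm{dyn}}$ to form the dynamic prompt $p_{\mathrm{dyn}}$. CFG merges the unconditional prior $M(w_i \mid p_{\mathrm{sol}})$ and the conditional distribution $M(w_i \mid p_{\mathrm{dyn}})$, amplifying tokens favored by successful partial executions.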
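Stage 2 can be sketched with the standard library alone. The version below is a simplified illustration, not the authors' implementation: it assumes truncation of trailing lines as the way to reach syntactic validity, represents a test case as a call expression string, and records local variables through `sys.settrace`. The helper names mirror those in the pseudocode but their signatures are assumptions.

```python
import ast
import sys


def extract_executable(candidate: str) -> str:
    """Drop trailing lines until the remainder parses as valid Python."""
    lines = candidate.splitlines()
    while lines:
        source = "\n".join(lines) + "\n"
        try:
            ast.parse(source)
            return source
        except SyntaxError:
            lines.pop()
    return ""


def unique(fragments):
    """Remove duplicates and empty fragments while preserving order."""
    return [f for f in dict.fromkeys(fragments) if f]


def run_and_trace(fragment: str, test_call: str, func_name: str):
    """Execute the fragment, evaluate one test call, and record the local
    variables seen at each traced line of the target function."""
    events = []

    def tracer(frame, event, arg):
        if frame.f_code.co_name != func_name:
            return None                       # ignore unrelated frames
        if event == "line":
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    namespace = {}
    result = {"events": events, "value": None, "error": None}
    try:
        exec(compile(fragment, "<candidate>", "exec"), namespace)
        sys.settrace(tracer)
        try:
            result["value"] = eval(test_call, namespace)
        finally:
            sys.settrace(None)
    except Exception as exc:
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


candidates = [
    "def double(x):\n    y = x * 2\n    return y\n    if y >",  # cut mid-line
    "def double(x):\n    y = x * 2\n    return y\n",
    "def double(x):\n    return x + x\n",
]
fragments = unique(extract_executable(c) for c in candidates)
trace = run_and_trace(fragments[0], "double(3)", "double")
print(len(fragments), trace["value"], trace["events"])
```

The first two candidates collapse to the same executable fragment after truncation, so only two fragments are executed. Runtime errors are captured as part of the signal rather than discarded, since the text lists error signals among the trace contents.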
3. Mathematical Characterization of CFG with Execution Signals
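Execution feedback enters decoding through a weighted combination of two distributions in log space (a convex combination when $0 \le \gamma \le 1$). Let $p_{\mathrm{sol}}$ denote the current solution prompt and $p_{\mathrm{dyn}}$ the same prompt with the execution signals spliced in. For each token position $i$:
$$\log M_{\mathrm{CFG}}(w_i \mid p_{\mathrm{sol}}, p_{\mathrm{dyn}}) = \log M(w_i \mid p_{\mathrm{sol}}) + \gamma \left[ \log M(w_i \mid p_{\mathrm{dyn}}) - \log M(w_i \mid p_{\mathrm{sol}}) \right]$$
where $\gamma \ge 0$ determines the strength of execution-guided conditioning. When $\gamma = 0$, decoding reduces to the unconditional prior; $\gamma = 1$ recovers the conditional distribution; larger values extrapolate beyond it, shifting generation toward code fragments empirically validated against test cases. Token selection either follows
$$w_i = \arg\max_{w} M_{\mathrm{CFG}}(w \mid p_{\mathrm{sol}}, p_{\mathrm{dyn}})$$
or stochastic sampling, per design.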
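A small numeric illustration of the rule `uncond + gamma * (cond - uncond)` applied to log-probabilities may help. The three-entry vocabulary and the logit values are invented for the example, and the final renormalisation step is an assumption made so that the result is a proper distribution.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cfg_log_probs(logits_uncond, logits_cond, gamma):
    """Combine the unconditional and conditional distributions in log
    space, then renormalise the result."""
    lp_u = log_softmax(logits_uncond)
    lp_c = log_softmax(logits_cond)
    mixed = [u + gamma * (c - u) for u, c in zip(lp_u, lp_c)]
    return log_softmax(mixed)


# Toy vocabulary of three possible continuations (hypothetical values).
vocab = ["return a + b", "return a - b", "pass"]
logits_uncond = [1.0, 1.2, 0.5]   # prior slightly prefers the buggy line
logits_cond = [2.5, 0.3, 0.1]     # execution feedback favours the fix

for gamma in (0.0, 1.0, 3.0):
    lp = cfg_log_probs(logits_uncond, logits_cond, gamma)
    best = max(range(len(vocab)), key=lp.__getitem__)
    print(gamma, vocab[best], [round(math.exp(x), 3) for x in lp])
```

With `gamma = 0` the greedy choice follows the prior, while `gamma = 1` and above follow the execution-conditioned distribution with increasing sharpness.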
4. Signal Propagation and Refreshing
Execution signals are computed only once per line and reused for all tokens within that line, providing coherent local guidance. At line boundaries, the signals are refreshed. The core pseudocode is as follows:
```
p_sol = p_pre                    # prompt plus the solution generated so far
i_dyn = <index in p_pre>         # position at which signals are injected
dynamics = None                  # holds the spliced prompt p_dyn for the current line
at_line_start = True

while not end_of_code:
    # Stage 1: sample s candidates for the next d lines
    candidates = BeamSearch(M, p_sol, d, t, s)

    # Stage 2: extract execution signals only once per line
    if at_line_start:
        executables = [ExtractExecutable(c) for c in candidates]
        C = Unique(executables)
        signals = []
        for c_j in C:
            for t_m in tests:
                trace = RunAndTrace(c_j, t_m)
                signals.append((c_j, t_m, trace))
        p_signal = [p_dyn_inst, signals]
        dynamics = splice(p_sol, i_dyn, p_signal)   # form p_dyn
        at_line_start = False

    # Stage 3: generate the tokens of the line under CFG
    while not at_line_end:
        logits_uncond = M.logits(p_sol)
        logits_cond = M.logits(dynamics)
        logits_cfg = logits_uncond + gamma * (logits_cond - logits_uncond)
        w_next = argmax(softmax(logits_cfg))
        p_sol += w_next
        dynamics = update_splice(p_sol, i_dyn, p_signal)

    # After finishing the line, refresh the signals on the next iteration
    at_line_start = True
```
Holding p_signal fixed within a line keeps the execution feedback invariant across token positions, while recomputing it at every line start allows timely correction.
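The once-per-line refresh can be isolated in a few lines. The sketch below is a hypothetical helper, not part of EG-CFG's published code: `LineSignalCache` recomputes its signal only when the number of completed lines changes, and `fake_signals` stands in for the expensive sampling and tracing step.

```python
class LineSignalCache:
    """Hypothetical helper: recompute signals only at line boundaries."""

    def __init__(self, compute_signals):
        self._compute = compute_signals
        self._line_index = None
        self._signal = None
        self.refreshes = 0

    def get(self, partial_code: str):
        line_index = partial_code.count("\n")   # completed lines so far
        if line_index != self._line_index:
            self._signal = self._compute(partial_code)
            self._line_index = line_index
            self.refreshes += 1
        return self._signal


def fake_signals(partial_code: str) -> str:
    """Placeholder for candidate sampling plus execution tracing."""
    return f"trace computed after {len(partial_code)} characters"


cache = LineSignalCache(fake_signals)
tokens = ["def", " f", "(x)", ":", "\n", "    return", " x", "\n"]
code = ""
seen = []
for token in tokens:
    seen.append(cache.get(code))   # same signal for every token of a line
    code += token
print(cache.refreshes, len(set(seen)))
```

Eight tokens spread over two lines trigger only two signal computations, which is the cost profile the text describes.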
5. Native Parallelism: Multiple Independent Agents
EG-CFG inherently supports parallel exploration across diverse agent configurations: hyperparameters such as beam size $s$, horizon $d$, temperature $t$, classifier-free guidance strength $\gamma$, and the instruction template can be systematically enumerated. Each resulting configuration constitutes a fully independent agent that processes the coding task in isolation. The parallel execution model allows early termination as soon as a solution is found (the first agent to pass all tests), unlike iterative multi-agent systems, which have sequential dependencies.
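The enumeration and early-termination pattern can be sketched with `itertools.product` and `concurrent.futures`. Everything specific here is an assumption made for illustration: the grid values are invented, and `run_agent` is a stand-in that pretends exactly one configuration solves the task instead of running a real EG-CFG decode.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_agent(config):
    """Stand-in for one full EG-CFG run; returns (config, solved).
    Hypothetical rule: only one setting passes all tests."""
    s, d, t, gamma = config
    solved = (s == 3 and gamma == 3.0)
    return config, solved


def first_success(configs, max_workers=4):
    """Run agents independently and return the first solving config."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_agent, c) for c in configs]
        for future in as_completed(futures):
            config, solved = future.result()
            if solved:
                for other in futures:
                    other.cancel()   # only stops agents not yet started
                return config
    return None


# Hypothetical grid over (beam size s, horizon d, temperature t, gamma).
configs = list(itertools.product((2, 3), (2,), (0.7,), (1.0, 3.0)))
print(len(configs), first_success(configs))
```

Because the agents share no state, the same pattern carries over to processes or separate machines. Note that `Future.cancel` cannot interrupt an agent that is already running, so a real scheduler would need its own stop signal.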
6. Empirical Performance and Benchmarking
On MBPP (500 tasks) with DeepSeek-V3-0324, standard prompting yields 82.8% accuracy; EG-CFG raises task success to 96.6%, a state-of-the-art result. On MBPP-ET, accuracy rises from 64.8% to 73.0%. On HumanEval (164 tasks), EG-CFG increases accuracy from 82.9% to 96.95%, and on HumanEval-ET from 79.2% to 87.2%. On CodeContests, EG-CFG achieves 58.18% compared to 41.81%, an absolute improvement of about 16.4 points. These results surpass previous self-debugging and multi-agent methods, even though EG-CFG is restricted to a single open-source model.
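The absolute gains implied by these figures can be tabulated directly. The snippet below uses only the accuracies quoted above; the "share of baseline errors removed" column is a derived quantity added here for illustration and is not reported in the source.

```python
# (baseline accuracy %, EG-CFG accuracy %) as quoted in the text.
reported = {
    "MBPP": (82.8, 96.6),
    "MBPP-ET": (64.8, 73.0),
    "HumanEval": (82.9, 96.95),
    "HumanEval-ET": (79.2, 87.2),
    "CodeContests": (41.81, 58.18),
}

absolute_gain = {}
for name, (base, egcfg) in reported.items():
    gain = egcfg - base                    # percentage points
    error_cut = gain / (100.0 - base)      # derived: share of errors removed
    absolute_gain[name] = gain
    print(f"{name:13s} +{gain:5.2f} points, {error_cut:.0%} of baseline errors removed")
```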
7. Computational Complexity and Overhead
EG-CFG imposes additional computational requirements relative to conventional decoding:
- Beam search for $s$ candidate completions per line
- AST parsing and duplicate filtering
- Up to $s \times |T|$ executions per line: one per unique candidate and test case, where $T$ is the set of test cases
- Two forward LLM passes per token (unconditional, conditional distributions)
With typical settings for the number of candidates and the number of test cases per task, execution counts remain tractable, while the two forward passes per token roughly double the LLM inference cost. On MBPP, mean per-task runtime under full parallel scheduling is approximately 271 s (DeepSeek-V3-0324), modestly lower than MapCoder (283 s) and substantially faster than MGDebugger (842 s). Early termination further reduces wall-clock time.
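A back-of-the-envelope count of the per-line overhead follows from the list above: one execution per unique candidate and test case, and two forward passes per generated token. The function name and the example values below are hypothetical, and the cost of the beam search itself is deliberately left out.

```python
def per_line_overhead(num_candidates, num_tests, tokens_in_line,
                      unique_candidates=None):
    """Count code executions and LLM forward passes for one line.
    Deduplication can only lower the execution count."""
    unique = num_candidates if unique_candidates is None else unique_candidates
    if unique > num_candidates:
        raise ValueError("unique_candidates cannot exceed num_candidates")
    return {
        "executions": unique * num_tests,
        "forward_passes": 2 * tokens_in_line,   # unconditional + conditional
    }


# Hypothetical settings: 3 candidates, 3 tests, a 12-token line.
worst_case = per_line_overhead(3, 3, 12)
after_dedup = per_line_overhead(3, 3, 12, unique_candidates=2)
print(worst_case, after_dedup)
```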
In conclusion, EG-CFG integrates semantic runtime feedback with generative token modeling, producing executable code in an efficient, guided fashion. The combination of linewise sampling, AST-based execution tracing, and classifier-free conditional weighting forms a robust architecture for high-fidelity neural code synthesis (Lavon et al., 12 Jun 2025).