Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation

Published 23 Apr 2026 in cs.SE, cs.AI, and cs.LG | (2604.21950v1)

Abstract: Small LLMs (1-3B) are practical to run locally, but individually limited on harder code generation tasks. We ask whether composing them into pipelines can recover some of that lost capability. We study code generation pipelines built from 1-3B models with execution feedback, and use a NEAT-inspired evolutionary search to test whether more complex pipeline structure helps beyond a simple refinement loop. We evaluate on HumanEval (164 problems) and sanitized MBPP (427 problems), all with local inference on a single laptop. Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The gains are narrow in mechanism: refinement fixes many runtime errors (especially NameError and SyntaxError), but rarely fixes logic errors such as AssertionError. Within our tested general-purpose model pool, generator identity mattered less than refiner capability: a 1.5B generator paired with a 3B refiner matched a 3B model doing both roles. Early stopping is essential; without it, every iteration is net-negative. The code-specialized models outperform every general-purpose pipeline configuration, suggesting model specialization matters more than pipeline architecture. Preliminary text-only pipeline experiments without execution feedback did not show gains at this scale. In our constrained search space, evolutionary search mostly rediscovered the same simple generate-execute-refine loop we found manually, with no clearly significant gain from added topology. Single-evaluation fitness inflates results by 5-7 percent, selecting lucky genomes over good ones. On these benchmarks at 1-3B scale, execution feedback mattered more than added pipeline complexity in determining whether composition helped.


Authors (1)

Charles Junichi McAndrews

Summary

The paper demonstrates that execution feedback significantly improves performance, raising the HumanEval pass@1 from 46.7% to 57.3% for a 3B model.
It reveals that refiner capability, rather than generator identity or pipeline topology complexity, is the key driver of iterative self-refinement benefits.
The study highlights the importance of early stopping to prevent destructive iterations and ensure reliable code generation.

Execution Feedback as the Key Mechanism in Small-Scale Code Generation Pipelines

Motivation and Research Objectives

At the 1–3B parameter scale, LLMs are practical for on-device and resource-constrained deployment but are markedly limited on harder code generation tasks. This paper investigates whether composing such small models into pipelines can recover some of that capability, particularly through iterative self-refinement with execution feedback. Previous work on larger-scale or API-based systems (e.g., MoA, DSPy, AgentConductor) documents strong gains for pipeline architectures, but the effectiveness and mechanisms at the 1–3B scale remain poorly understood. The central thesis is that, on the tested benchmarks, execution feedback is what makes composition pay off for small LMs, whereas added pipeline topology and generator identity matter comparatively little within the tested model pool (2604.21950).

Experimental Framework

The experimental protocol centers on code generation for two standard benchmarks: HumanEval (164 tasks) and sanitized MBPP (427 tasks). Pipelines are sequences of small LMs with roles partitioned into generator, (optional) analyzer, and refiner nodes, all orchestrated around execution feedback supplied by a sandboxed executor node. A search inspired by the NEAT (NeuroEvolution of Augmenting Topologies) algorithm explored the assignment of models, prompts, temperatures, and pipeline topology, restricted to linear sequences to limit search instability and noise. The pipeline always enforced early stopping: the system halts as soon as a candidate passes its tests, preventing further (potentially destructive) refinement iterations.
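The generate-execute-refine loop with early stopping described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `execute`, `generate_execute_refine`, and the two stub "models" are hypothetical names, and a plain in-process `exec()` stands in for the sandboxed executor.

```python
import traceback
from typing import Callable


def execute(code: str, tests: str) -> tuple[bool, str]:
    """Run candidate code followed by its tests; return (passed, feedback).

    Assumption: in-process exec() stands in for the sandboxed executor.
    It is NOT a real sandbox and must not be used on untrusted model output.
    """
    namespace: dict = {}
    try:
        exec(code + "\n" + tests, namespace)
        return True, ""
    except Exception:
        # The traceback text is the execution feedback handed to the refiner.
        return False, traceback.format_exc()


def generate_execute_refine(
    prompt: str,
    tests: str,
    generator: Callable[[str], str],
    refiner: Callable[[str, str, str], str],
    max_refinements: int = 3,
) -> tuple[str, bool, int]:
    """Return (final_code, passed, refinements_used)."""
    code = generator(prompt)
    for used in range(max_refinements + 1):
        passed, feedback = execute(code, tests)
        if passed:
            # Early stopping: never refine code that already passes.
            return code, True, used
        if used < max_refinements:
            code = refiner(prompt, code, feedback)
    return code, False, max_refinements


# Hypothetical stand-ins for the generator and refiner models.
def stub_generator(prompt: str) -> str:
    return "def add(a, b):\n    return a + c\n"  # deliberate NameError


def stub_refiner(prompt: str, code: str, feedback: str) -> str:
    if "NameError" in feedback:
        return code.replace("a + c", "a + b")
    return code


final_code, passed, refinements = generate_execute_refine(
    "Write add(a, b).", "assert add(1, 2) == 3", stub_generator, stub_refiner
)
print(passed, refinements)  # True 1
```

In a real pipeline the two stubs would be calls to locally served models, and the executor would run in an isolated process with a timeout.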

Three general-purpose models were considered in the search pool (Gemma3:1B, qwen2.5:1.5B, llama3.2:3B), with the code-specialized qwen2.5-coder:3B used as a fixed reference point outside the search. All experiments were run with purely local inference on a single laptop, which keeps the setup accessible but imposes tight hardware constraints. Model performance was always reported as the five-run mean with standard deviation, capturing stochastic decoding effects.
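One way to picture the constrained search space (a linear sequence of nodes, each with a model and a temperature, plus an optional analyzer) is the genome sketch below. The encoding, the mutation operators, and every name in it (`Node`, `mutate`, `BASE_GENOME`) are assumptions made for illustration; the source does not specify the actual genome format, and prompt variation is omitted here for brevity.

```python
import random
from dataclasses import dataclass, replace

# Model names as listed in the text; the encoding itself is hypothetical.
MODEL_POOL = ["Gemma3:1B", "qwen2.5:1.5B", "llama3.2:3B"]


@dataclass(frozen=True)
class Node:
    role: str          # "generator", "analyzer", or "refiner"
    model: str
    temperature: float


# Simplest linear pipeline: generate, then refine on execution feedback.
BASE_GENOME = (
    Node("generator", "qwen2.5:1.5B", 0.2),
    Node("refiner", "llama3.2:3B", 0.2),
)


def mutate(genome: tuple[Node, ...], rng: random.Random) -> tuple[Node, ...]:
    """Apply one random mutation while keeping the pipeline linear."""
    nodes = list(genome)
    i = rng.randrange(len(nodes))
    kind = rng.choice(["model", "temperature", "toggle_analyzer"])
    if kind == "model":
        nodes[i] = replace(nodes[i], model=rng.choice(MODEL_POOL))
    elif kind == "temperature":
        shifted = nodes[i].temperature + rng.uniform(-0.2, 0.2)
        nodes[i] = replace(nodes[i], temperature=min(1.0, max(0.0, shifted)))
    else:
        analyzers = [k for k, n in enumerate(nodes) if n.role == "analyzer"]
        if analyzers:
            del nodes[analyzers[0]]  # remove the optional analyzer
        else:
            refiner_at = next(k for k, n in enumerate(nodes) if n.role == "refiner")
            nodes.insert(refiner_at, Node("analyzer", rng.choice(MODEL_POOL), 0.2))
    return tuple(nodes)


rng = random.Random(0)
child = mutate(BASE_GENOME, rng)
print([node.role for node in child])
```

Keeping the generator first and the refiner last means every mutated genome is still a valid generate-execute-refine pipeline, which matches the restriction to linear sequences.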

Empirical Findings

Quantitative Results: Dominance of Execution Feedback

Self-refinement using execution feedback yields robust, repeatable improvements exceeding four standard deviations over single-shot generation baselines on both HumanEval and MBPP. Specifically, refining with execution feedback increased HumanEval pass@1 for llama3.2:3B from $46.7\%$ (a mean of 76.6 of 164 problems solved) to $57.3\%$ (94.0 of 164), with a comparable lift on MBPP. These gains appear whether the generator and refiner are the same model or different models; in all tested general-purpose pipelines, refiner capability rather than generator selection determined final performance. Pairing a weaker generator (1.5B) with a stronger refiner (3B) matched a pipeline in which the 3B model filled both roles.
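The arithmetic behind the HumanEval figures above can be checked directly. The sketch below also derives what the "more than four standard deviations" claim would imply, under the assumption (not stated in the source) that the gain is measured against a single run-to-run standard deviation expressed in problems solved.

```python
N_HUMANEVAL = 164


def pass_rate(mean_solved: float, n_problems: int) -> float:
    """Convert a mean count of solved problems into a pass@1 percentage."""
    return 100.0 * mean_solved / n_problems


baseline_solved, refined_solved = 76.6, 94.0  # five-run means from the text

baseline = pass_rate(baseline_solved, N_HUMANEVAL)
refined = pass_rate(refined_solved, N_HUMANEVAL)
gain_problems = refined_solved - baseline_solved

# Assumption: "gain > 4 SD" means gain / SD > 4 with SD in problems solved,
# so the run-to-run SD would have to be below gain / 4.
max_sd_for_four_sigma = gain_problems / 4

print(f"{baseline:.1f}% -> {refined:.1f}% "
      f"(+{refined - baseline:.1f} points, +{gain_problems:.1f} problems)")
# 46.7% -> 57.3% (+10.6 points, +17.4 problems)
print(f"implied SD bound: {max_sd_for_four_sigma:.2f} problems")
# implied SD bound: 4.35 problems
```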

Code-specialized models (qwen2.5-coder:3B) outperform all general-purpose pipelines by a statistically significant margin (e.g., coder self-refine achieves $85.1\%$ on HumanEval), and further self-refinement still provides a small but measurable enhancement.

Mechanistic Analysis: Error Taxonomy

Refinement almost exclusively repairs runtime errors such as NameError and SyntaxError, where explicit tracebacks provide local, actionable repair signals. Logic errors (AssertionError) show very low fix rates, suggesting that a bare assertion failure gives the refiner too little information to debug the underlying logic. Thus, performance improvements are narrow in mechanism even if broad in statistical impact.
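An error taxonomy of this kind can be built by reading the exception class off the last line of each traceback and tallying how often refinement fixed each class. The following is a sketch under assumptions: `error_type` and `fix_rates` are hypothetical helpers, and the sample records are made up for illustration, not taken from the paper's results.

```python
import re
from collections import Counter


def error_type(traceback_text: str) -> str:
    """Return the exception class name from the last line of a traceback."""
    lines = [ln for ln in traceback_text.strip().splitlines() if ln.strip()]
    if not lines:
        return "Unknown"
    match = re.match(r"^([A-Za-z_][\w.]*)(?::|$)", lines[-1].strip())
    return match.group(1).split(".")[-1] if match else "Unknown"


def fix_rates(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each error type to the fraction of failures that refinement fixed."""
    totals: Counter = Counter()
    fixed: Counter = Counter()
    for traceback_text, was_fixed in outcomes:
        kind = error_type(traceback_text)
        totals[kind] += 1
        fixed[kind] += int(was_fixed)
    return {kind: fixed[kind] / totals[kind] for kind in totals}


# Illustrative records only: (initial traceback, fixed after refinement?).
sample = [
    ("Traceback (most recent call last):\n  ...\nNameError: name 'x' is not defined", True),
    ("  File '<string>', line 1\nSyntaxError: invalid syntax", True),
    ("Traceback (most recent call last):\n  ...\nAssertionError", False),
    ("Traceback (most recent call last):\n  ...\nAssertionError", False),
]
print(fix_rates(sample))
# {'NameError': 1.0, 'SyntaxError': 1.0, 'AssertionError': 0.0}
```

The contrast is visible in the inputs themselves: the NameError line names the offending identifier, while a bare AssertionError carries no hint about which part of the logic is wrong.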

Pipeline Topology: Minimal Impact

NEAT-inspired search over pipeline topology yields little advantage over manually designed linear generate-execute-refine loops, with at most marginal, suggestive improvements when more complex elements (e.g., analyzer nodes) are inserted. Across all completed runs, evolutionary search rediscovered or matched simple, human-designed pipelines; added pipeline complexity produced no statistically significant performance gains.

Early Stopping: Critical for Positive Returns

Without early stopping, additional refinement passes are uniformly detrimental: each iteration is net-negative, breaking previously passing code more often than it fixes remaining failures. Multiple iterations help overall only because early stopping freezes a solution once it passes, pruning these destructive steps.
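The net effect of one refinement pass can be made concrete by counting fixes and breaks per problem. The sketch below uses hypothetical helper names and a made-up five-problem example; it only illustrates why freezing passing solutions turns a net-negative pass into a net-positive one.

```python
def refinement_delta(before: list[bool], after: list[bool]) -> tuple[int, int, int]:
    """Return (fixes, breaks, net) for one refinement pass over all problems."""
    fixes = sum(1 for b, a in zip(before, after) if not b and a)
    breaks = sum(1 for b, a in zip(before, after) if b and not a)
    return fixes, breaks, fixes - breaks


def with_early_stopping(before: list[bool], after: list[bool]) -> list[bool]:
    """Passing problems are frozen, so a pass can only fix, never break."""
    return [b or a for b, a in zip(before, after)]


# Illustrative pass/fail status for five problems (not data from the paper).
before = [True, True, True, False, False]
after_refining_everything = [False, False, True, True, False]

print(refinement_delta(before, after_refining_everything))
# (1, 2, -1)  one fix, two regressions: net-negative
print(refinement_delta(before, with_early_stopping(before, after_refining_everything)))
# (1, 0, 1)   same fix, no regressions: net-positive
```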

Evaluation Noise and Evolutionary Search

Single-run fitness evaluations, especially over small validation sets, systematically inflate the apparent quality of pipelines by 5–7% compared to multi-run means. This injects a strong selection bias into evolutionary search: the genome that scores highest on one noisy evaluation is often a lucky one rather than a genuinely strong architecture. Deterministic decoding ( $T=0$ ) or multi-evaluation fitness averaging is required to produce robust search outcomes.
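The inflation mechanism can be reproduced with a toy simulation in which every genome has the same true pass rate, so any difference between them is pure evaluation noise. All parameters and names below are assumptions chosen for illustration; the simulation does not reproduce the paper's 5–7% figure, only the direction of the bias.

```python
import random
import statistics


def simulate_selection(n_genomes: int, n_problems: int, true_rate: float,
                       n_reruns: int, seed: int) -> tuple[list[float], float, float]:
    """Select the best genome on one noisy evaluation, then re-evaluate it.

    Every genome shares the same true pass rate, so the "winner" is
    lucky by construction. Returns (single_scores, best_single, rerun_mean).
    """
    rng = random.Random(seed)

    def evaluate() -> float:
        solved = sum(rng.random() < true_rate for _ in range(n_problems))
        return solved / n_problems

    single_scores = [evaluate() for _ in range(n_genomes)]
    best_single = max(single_scores)  # fitness the search would act on
    rerun_mean = statistics.mean(evaluate() for _ in range(n_reruns))
    return single_scores, best_single, rerun_mean


scores, best_single, rerun_mean = simulate_selection(
    n_genomes=30, n_problems=40, true_rate=0.5, n_reruns=5, seed=0
)
print(f"selected on single run: {best_single:.3f}")
print(f"same genome, multi-run mean: {rerun_mean:.3f}")
print(f"true rate: 0.500")
```

Averaging several evaluations per genome before selection, or decoding deterministically, shrinks the gap between the selected score and the true rate.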

Theoretical Implications and Context in Literature

This study sharpens the picture of where small-model composition helps. Unlike prior results at larger scales, where architecture, model ensembling, or complex chain-of-thought approaches enhance reasoning and generation, at the 1–3B scale such methods showed no gains in the reported experiments unless paired with a reliable external signal. This aligns with and extends studies on mixture-of-agents (e.g., Self-MoA, CYCLE), which show degradation or stagnation when mixing weak models in purely text domains (Li et al., 2 Feb 2025), [cycle2024]. Here, explicit verification feedback is what enables a positive composition effect.

The paper's findings are consistent with the limitations identified in the Small Model Learnability Gap [learnabilitygap2025], which documents the failure of behavioral fine-tuning at sub-2B scale. The broader pattern that emerges is that small models benefit from closed-loop, signal-driven repair only when operated within domains permitting explicit, local, machine-actionable feedback.

Practical Recommendations and Future Directions

For small-model code generation (1–3B), employ iterative refinement only if the domain provides executable, testable, or otherwise explicit external feedback.
Heuristic or text-only pipelines likely confer no benefit or are harmful; code-specialized models are much stronger than general-purpose pipelines at this scale.
Early stopping is mandatory to forestall regressions during self- or cross-refinement.
Investment in the refiner should be prioritized over generator selection.

Potential future work could evaluate whether the pattern generalizes at higher parameter scales (e.g., 7B+) or in domains with more complex, less explicit verification oracles. Applying these frameworks to other verification-rich settings (SQL execution, proof checking, compiler output) is a natural extension but remains untested.

Conclusion

This study demonstrates that, for 1–3B parameter LMs on code generation tasks, execution feedback is the main mechanism enabling meaningful improvements from pipelined model composition. Pipeline topology complexity provides no significant benefit over simple refinement loops, and the gains are largely confined to error types addressable via explicit tracebacks. These insights delineate the limits and prospects for small-model orchestration, focusing the field on verification signals and refiner competence rather than architectural complexity (2604.21950).
