- The paper demonstrates that execution feedback significantly improves performance, raising the HumanEval pass@1 from 46.7% to 57.3% for a 3B model.
- It reveals that refiner capability, rather than pipeline topology complexity, is the key driver for iterative self-refinement benefits.
- The study highlights the importance of early stopping to prevent destructive iterations and ensure reliable code generation.
Execution Feedback as the Key Mechanism in Small-Scale Code Generation Pipelines
Motivation and Research Objectives
At the 1–3B parameter scale, LLMs offer practical utility for on-device and resource-constrained deployments but demonstrate severe limitations in complex code generation tasks. This paper systematically investigates whether pipelined compositions of such small models can recover capability lost relative to larger LMs, particularly by leveraging iterative self-refinement with execution feedback. Previous work on larger-scale or API-based systems (e.g., MoA, DSPy, AgentConductor) documents strong gains for pipeline architectures, but the effectiveness and mechanisms at the 1–3B scale remain poorly understood. The central thesis advanced here is that execution feedback is both necessary and sufficient for significant performance improvements among small LMs, whereas pipeline topology complexity or generator identity is largely irrelevant given the tested model pool (2604.21950).
Experimental Framework
The experimental protocol centers on code generation for two standard benchmarks: HumanEval (164 tasks) and sanitized MBPP (427 tasks). Each pipeline is a sequence of small LMs with roles partitioned into generator, (optional) analyzer, and refiner nodes, all orchestrated around execution feedback supplied by a sandboxed executor node. A search inspired by the NEAT (NeuroEvolution of Augmenting Topologies) algorithm explored the assignment of models, prompts, temperatures, and pipeline topology, but was restricted to linear sequences to mitigate search instability and noise. The pipeline always enforced early stopping: the system halts as soon as the candidate passes its tests, preventing further (potentially destructive) refinement iterations.
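The generate-execute-refine loop with early stopping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `execute`, `refine_loop`, `toy_generate`, and `toy_refine` are hypothetical names, the two toy functions stand in for calls to small LMs, and plain `exec` stands in for the sandboxed executor node.

```python
import traceback
from typing import Callable, Tuple


def execute(code: str, tests: str) -> Tuple[bool, str]:
    """Run candidate code and its tests; return (passed, traceback text).

    Assumption: plain exec() stands in for the sandboxed executor node.
    It is NOT a sandbox and must not be used on untrusted model output.
    """
    namespace: dict = {}
    try:
        exec(code + "\n" + tests, namespace)
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def refine_loop(
    task: str,
    tests: str,
    generate: Callable[[str], str],
    refine: Callable[[str, str, str], str],
    max_refinements: int = 3,
) -> Tuple[str, bool, int]:
    """Generate once, then refine with execution feedback until tests pass."""
    code = generate(task)
    passed, feedback = execute(code, tests)
    refinements = 0
    # Early stopping: the loop body never runs again once the tests pass.
    while not passed and refinements < max_refinements:
        code = refine(task, code, feedback)
        refinements += 1
        passed, feedback = execute(code, tests)
    return code, passed, refinements


# Hypothetical stand-ins for the generator and refiner models.
def toy_generate(task: str) -> str:
    return "def add(a, b):\n    return a + c\n"  # deliberate NameError


def toy_refine(task: str, code: str, feedback: str) -> str:
    # Uses the traceback text as the repair signal.
    if "NameError" in feedback:
        return code.replace("a + c", "a + b")
    return code


final_code, passed, refinements = refine_loop(
    "add two numbers", "assert add(2, 3) == 5", toy_generate, toy_refine
)
print(passed, refinements)
```

In this toy run the first candidate fails with a NameError, the refiner repairs it from the traceback, and the loop stops after the first passing execution.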
Three general-purpose models were considered in the search pool (Gemma3:1B, qwen2.5:1.5B, llama3.2:3B), with the code-specialized qwen2.5-coder:3B used as a fixed benchmark outside the search. All experiments were run with purely local inference, which keeps the setup accessible but subjects it to tight hardware constraints. Performance is always reported as the five-run mean with standard deviation, to capture the effects of stochastic decoding.
Empirical Findings
Quantitative Results: Dominance of Execution Feedback
Self-refinement using execution feedback yields robust, repeatable improvements exceeding four standard deviations over single-shot generation baselines on both HumanEval and MBPP. Specifically, refining with execution feedback increased HumanEval pass@1 from 46.7% (76.6/164) to 57.3% (94.0/164) for llama3.2:3B, with a comparable lift on MBPP. These gains are observed regardless of whether the pipeline uses the same or different models for generator and refiner roles; in all tested general-purpose pipelines, refiner capability dictated final performance rather than generator selection. The use of a stronger model as refiner (e.g., 3B model) with a weaker generator (1.5B) matched a 3B self-refinement pipeline.
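As a quick arithmetic check of the HumanEval figures quoted above, the percentages follow directly from the mean number of solved tasks out of 164. The variable names below are illustrative; only the task count and the two means come from the text.

```python
TASKS = 164                  # HumanEval task count
baseline_mean_passed = 76.6  # five-run mean, single-shot llama3.2:3B
refined_mean_passed = 94.0   # five-run mean, with execution feedback

baseline_rate = baseline_mean_passed / TASKS
refined_rate = refined_mean_passed / TASKS
lift_points = 100 * (refined_rate - baseline_rate)  # percentage points

print(f"baseline pass@1: {baseline_rate:.1%}")
print(f"refined  pass@1: {refined_rate:.1%}")
print(f"absolute lift:   {lift_points:.1f} points")
```

The fractional count 76.6 is consistent with averaging integer pass counts over five runs.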
Code-specialized models (qwen2.5-coder:3B) outperform all general-purpose pipelines by a statistically significant margin (e.g., coder self-refine achieves 85.1% on HumanEval), and further self-refinement still provides a small but measurable enhancement.
Mechanistic Analysis: Error Taxonomy
Refinement almost exclusively repairs errors that surface with an explicit traceback, such as NameError and SyntaxError, where the traceback provides a local, actionable repair signal. Logic errors (AssertionError) show very low fix rates, confirming that ambiguous failure signals cripple the pipeline's capacity for iterative logical debugging. Thus, performance improvements are narrow in mechanism even if broad in statistical impact.
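One way to picture this taxonomy is to bucket each failed execution by the exception type it raises. The sketch below is an assumption-laden illustration: the function name `classify_failure` and the bucket labels are hypothetical and not taken from the paper, and `exec` is used only for brevity (it is not a sandbox).

```python
from typing import Optional, Tuple


def classify_failure(code: str, tests: str) -> Tuple[str, Optional[str]]:
    """Return (bucket, exception name) for one candidate program.

    Buckets (hypothetical labels):
      "passed"          - tests ran without raising
      "logic_error"     - a test assertion failed (AssertionError)
      "traceback_error" - any other exception, e.g. NameError or SyntaxError
    """
    namespace: dict = {}
    try:
        # compile() raises SyntaxError before any code runs.
        exec(compile(code + "\n" + tests, "<candidate>", "exec"), namespace)
    except AssertionError:
        return "logic_error", "AssertionError"
    except Exception as error:
        return "traceback_error", type(error).__name__
    return "passed", None


tests = "assert f(1) == 2"
name_error = classify_failure("def f(x):\n    return x + y\n", tests)
syntax_error = classify_failure("def f(x)\n    return x + 1\n", tests)
logic_error = classify_failure("def f(x):\n    return x - 1\n", tests)
ok = classify_failure("def f(x):\n    return x + 1\n", tests)
print(name_error, syntax_error, logic_error, ok)
```

The first two failures name the offending symbol or line, whereas the assertion failure only reports that the expected value was not produced, which matches the contrast in fix rates described above.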
Pipeline Topology: Minimal Impact
NEAT-inspired search over pipeline topology yields little advantage over manually designed simple linear generate-execute-refine loops, with only marginal, suggestive improvements when more complex elements (e.g., analyzer nodes) are inserted. Across all completed runs, evolutionary search reliably rediscovered or matched simple, human-designed pipelines; pipeline complexity produced no statistically significant performance gains.
Early Stopping: Critical for Positive Returns
Without early stopping, additional refinement passes are uniformly detrimental, with each iteration introducing net-negative changes (i.e., breaking previously passing code more often than fixing additional failures). The apparent paradox that multiple iterations still help overall is resolved by early stopping, which prunes the destructive steps.
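A toy recurrence makes the effect of early stopping concrete. If a refinement pass fixes a failing program with probability f and breaks a passing one with probability b, the pass rate p evolves as p + (1 - p)f - pb without early stopping, and as p + (1 - p)f with it, because passing programs are never touched again. The values of f and b below are hypothetical, chosen only for illustration; the starting pass rate is the single-shot HumanEval figure reported above.

```python
from typing import List


def pass_rate_trajectory(
    p0: float,
    fix_prob: float,
    break_prob: float,
    passes: int,
    early_stopping: bool,
) -> List[float]:
    """Expected pass rate after each refinement pass under the toy model."""
    rates = [p0]
    for _ in range(passes):
        p = rates[-1]
        fixed = (1 - p) * fix_prob
        # With early stopping, passing programs are frozen and cannot break.
        broken = 0.0 if early_stopping else p * break_prob
        rates.append(p + fixed - broken)
    return rates


P0 = 0.467        # single-shot pass rate from the text
FIX_PROB = 0.10   # hypothetical
BREAK_PROB = 0.20 # hypothetical

with_stop = pass_rate_trajectory(P0, FIX_PROB, BREAK_PROB, 3, early_stopping=True)
without_stop = pass_rate_trajectory(P0, FIX_PROB, BREAK_PROB, 3, early_stopping=False)
print([round(r, 3) for r in with_stop])
print([round(r, 3) for r in without_stop])
```

Under these assumed probabilities the pass rate rises monotonically with early stopping and falls with every pass without it. The model ignores per-task differences and is not fitted to the paper's data.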
Evaluation Noise and Evolutionary Search
Single-run fitness evaluations, especially over small validation sets, systematically inflate the perceived quality of pipelines by 5–7% compared to multi-run means. This injects a strong survivorship bias into evolutionary search, favoring lucky genomes over genuinely strong architectures. Deterministic decoding (T=0) or fitness averaged over multiple evaluations is required to produce robust search outcomes.
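The selection effect can be reproduced with a small simulation. Here every candidate pipeline has the same true pass rate, so any score above that rate for the selected "winner" is pure evaluation noise. All parameters (true rate, validation-set size, population size, trial count) are hypothetical, and the simulation is not intended to reproduce the 5–7% figure.

```python
import random
import statistics


def mean_winner_inflation(
    runs_per_candidate: int,
    true_rate: float = 0.5,
    tasks: int = 30,
    candidates: int = 20,
    trials: int = 200,
    seed: int = 0,
) -> float:
    """Average amount by which the best observed fitness exceeds the true rate."""
    rng = random.Random(seed)

    def evaluate_once() -> float:
        # One noisy evaluation: fraction of validation tasks solved.
        solved = sum(rng.random() < true_rate for _ in range(tasks))
        return solved / tasks

    inflations = []
    for _ in range(trials):
        fitness = [
            statistics.mean([evaluate_once() for _ in range(runs_per_candidate)])
            for _ in range(candidates)
        ]
        # Selection keeps the best-scoring candidate, as a search would.
        inflations.append(max(fitness) - true_rate)
    return statistics.mean(inflations)


single_run_inflation = mean_winner_inflation(runs_per_candidate=1)
five_run_inflation = mean_winner_inflation(runs_per_candidate=5)
print(f"single-run winner inflation: {single_run_inflation:.3f}")
print(f"five-run winner inflation:   {five_run_inflation:.3f}")
```

Averaging several evaluations per candidate shrinks the noise that selection can exploit, which is the rationale for multi-evaluation fitness; deterministic decoding removes the decoding-related part of that noise altogether.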
Theoretical Implications and Context in Literature
This study sharpens the boundary drawn in the literature around small-model composition. Unlike prior results at larger scales, where architecture, model ensembling, or complex chain-of-thought approaches enhance reasoning and generation, at the 1–3B scale such methods have negligible effect unless paired with reliable external signals. This aligns with and extends studies on mixture-of-agents (e.g., Self-MoA, CYCLE), which show degradation or stagnation when mixing weak models in purely text domains (Li et al., 2 Feb 2025), [cycle2024]. Here, explicit verification feedback is shown to be a unique enabler for any positive composition effect.
The paper's findings are consistent with the limitations identified in the Small Model Learnability Gap [learnabilitygap2025], which documents the failure of behavioral fine-tuning at sub-2B scale. The broader pattern that emerges is that small models benefit from closed-loop, signal-driven repair only when operated within domains permitting explicit, local, machine-actionable feedback.
Practical Recommendations and Future Directions
- For small-model code generation (1–3B), employ iterative refinement only if the domain provides executable, testable, or otherwise explicit external feedback.
- Heuristic or text-only pipelines are likely to confer no benefit or to be harmful; code-specialized models are much stronger than general-purpose pipelines at this scale.
- Early stopping is mandatory to forestall regressions during self- or cross-refinement.
- Investment in the refiner should be prioritized over generator selection.
Potential future work could evaluate whether the pattern generalizes at higher parameter scales (e.g., 7B+) or in domains with more complex, less explicit verification oracles. Applying these frameworks to other verification-rich settings (SQL execution, proof checking, compiler output) is a natural extension but remains untested.
Conclusion
This study demonstrates that, for 1–3B parameter LMs on code generation tasks, execution feedback is the singular mechanism enabling meaningful improvements from pipelined model composition. Pipeline topology complexity provides no significant benefit over simple refinement loops, and the gains are strictly localized to error types addressable via explicit tracebacks. These insights delineate the limits and prospects for small-model orchestration, focusing the field on leveraging verification signals and refiner competence rather than architectural complexity (2604.21950).