Efficacy of Raw Instruction-Tuned LLM Outputs for Fill-in-the-Middle Code Generation
Key Takeaways
- Raw instruction-tuned LLM outputs perform poorly on FIM tasks without additional processing.
- Supervised fine-tuning nearly doubles pass@1 for the smaller models and boosts the largest model by 40–50%, particularly for full-line infilling.
- Adaptive post-processing remains essential for arbitrary or partial-line infilling to ensure semantically correct code completions.
Overview
This work investigates whether the raw outputs of instruction-tuned code LLMs are sufficient for fill-in-the-middle (FIM) code generation, or if post-processing remains necessary for effective automatic evaluation. The focus is on the Qwen2.5-Coder model family, rigorously evaluated on HumanEval Infilling and SAFIM benchmarks. The methodology centers on supervised fine-tuning using data generated via LLMs, systematically probing the necessity and impact of various post-processing pipelines across different FIM task formulations.
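To make the task formulation concrete, the sketch below shows one way a FIM problem can be posed to an instruction-tuned model as a prefix/suffix instruction. The template wording and the `<prefix>`/`<suffix>` delimiters are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical FIM prompt builder; the exact template used in the paper
# is an assumption made for illustration.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Pose a fill-in-the-middle task as a natural-language instruction."""
    return (
        "Complete the missing code between the prefix and the suffix.\n"
        "Return only the missing middle, with no surrounding code.\n\n"
        f"<prefix>\n{prefix}</prefix>\n"
        f"<suffix>\n{suffix}</suffix>\n"
    )

prompt = build_fim_prompt("def add(a, b):\n", "    return result\n")
```

The model's reply is then compared (after any post-processing) against the ground-truth middle span.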
Motivation and Research Questions
FIM code generation tasks mirror the practical workflows of code editing, where developers frequently insert code between existing segments. State-of-the-art code generation benchmarks, such as HumanEval Infilling and SAFIM, have enforced strict truncation-based post-processing to handle boundary mismatches and superfluous generations by LLMs. However, these heuristic rules may not generalize to all FIM scenarios, and can misjudge alternative yet valid completions by over-truncating outputs. The research is driven by three interrelated questions:
- What is the intrinsic FIM effectiveness of instruction-tuned code LLMs without additional post-processing?
- Does supervised fine-tuning (SFT) on explicitly constructed instruction-response pairs enable models to naturally delimit their infillings and improve automatic evaluation metrics?
- Is there a scenario-dependent need to post-process outputs even after fine-tuning, specifically for partial versus full-line infilling tasks?
Experimental Methodology
The investigation employed supervised fine-tuning of Qwen2.5-Coder LLMs (7B, 14B, 32B) in both base and instruct variants. Training data generation involved extracting Python functions from open-source repositories and programmatically creating prefix-middle-suffix splits using Mixtral-8x22B as a synthetic annotator. The splits conformed to five FIM categories: random span, algorithmic block, control-flow expression, API call, and assignment expression. The final dataset comprised approximately 1 million high-quality instruction-response pairs.
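A minimal sketch of the simplest of the five split categories, random span, is shown below; the AST-guided categories (algorithmic block, control-flow expression, API call, assignment) would pick syntactically meaningful boundaries instead. The function name and exact splitting policy are assumptions.

```python
import random

def random_span_split(source: str, rng: random.Random):
    """Split source code into (prefix, middle, suffix) at two random offsets.

    Illustrates only the 'random span' FIM category; the paper's other
    categories would derive boundaries from the program's AST instead.
    """
    if len(source) < 2:
        raise ValueError("source too short to split")
    i, j = sorted(rng.sample(range(len(source) + 1), 2))
    return source[:i], source[i:j], source[j:]

code = "def add(a, b):\n    return a + b\n"
prefix, middle, suffix = random_span_split(code, random.Random(0))
# The three pieces always reassemble into the original source.
assert prefix + middle + suffix == code
```

The `middle` piece becomes the target response, while `prefix` and `suffix` form the instruction context of each training pair.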
For evaluation, models were benchmarked on HumanEval Infilling (single-line, multi-line, random-span) and SAFIM (algorithm-block, control-flow, API function call) using the standard pass@1 metric, with and without canonical post-processing steps.
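For reference, pass@1 is typically computed with the standard unbiased pass@k estimator (from the original HumanEval work), which for k=1 reduces to the fraction of sampled completions that pass the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n sampled completions, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this is simply c/n: 3 passing samples out of 10 gives 0.3.
score = pass_at_k(n=10, c=3, k=1)
```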
Key Findings
Out-of-the-Box Behavior
Instruction-tuned LLMs exhibit poor out-of-the-box performance on FIM tasks, especially on challenging constructs involving arbitrary context boundaries. Generations frequently echo the surrounding prefix or suffix or overrun the intended span, so post-processing is required for competitive results. This observation is consistent across both HumanEval Infilling and SAFIM.
Impact of Supervised Fine-Tuning
Supervised fine-tuning is the most impactful intervention in the study. SFT yields an approximately twofold increase in pass@1 for the 7B and 14B models and a substantial 40–50% increase for the 32B model, particularly on full-line infilling tasks. Notably, instruct-tuned backbones show marginally superior sample efficiency after SFT.
Raw Output Sufficiency
For tasks involving infilling of complete code lines, raw outputs from SFT models not only become evaluation-ready but often surpass post-processed outputs in automatic metrics. Excessive post-processing, such as truncation to fixed line counts, can inadvertently penalize semantically correct, multi-line completions, leading to lower scores.
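The pitfall can be seen with a toy truncation rule. The sketch below is an illustrative fixed-line-count heuristic, not the benchmarks' exact post-processing: a correct completion spread over two lines is mangled when the pipeline keeps only the first line.

```python
def truncate_to_lines(generation: str, n_lines: int) -> str:
    """Benchmark-style truncation: keep only the first n_lines lines."""
    return "\n".join(generation.splitlines()[:n_lines])

# A semantically correct completion spread over two lines...
gen = "    total = a + b\n    return total"
# ...loses its return statement under single-line truncation,
# so the test harness would now mark it as a failure.
truncated = truncate_to_lines(gen, 1)
```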
For tasks with arbitrary or partial line boundaries (e.g., random-span infilling), post-processing remains essential. Lightweight heuristics that strip overlapping prefixes/suffixes are required to accurately extract the intended middle span and achieve optimal evaluation performance.
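One such lightweight heuristic can be sketched as follows. The greedy longest-match strategy here is an assumption about how overlap stripping might work; the paper's exact rules may differ.

```python
def strip_overlap(generation: str, prefix: str, suffix: str) -> str:
    """Remove any re-stated prefix tail / suffix head from a generation.

    Greedy longest-match heuristic (an illustrative assumption, not the
    paper's exact procedure): drop the longest tail of the prefix that
    the model echoed at the start, then the longest head of the suffix
    it ran into at the end.
    """
    out = generation
    for k in range(min(len(prefix), len(out)), 0, -1):
        if out.startswith(prefix[-k:]):
            out = out[k:]
            break
    for k in range(min(len(suffix), len(out)), 0, -1):
        if out.endswith(suffix[:k]):
            out = out[:-k]
            break
    return out

# Example: the model echoed the four indent spaces from the prefix
# and continued into the suffix's "print".
cleaned = strip_overlap("    return x\nprint", "def f(x):\n    ", "print(f(1))")
```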
Implications and Recommendations
The key implication is that task- and data-aligned fine-tuning substantially closes the boundary-awareness gap in LLMs, enabling direct automatic evaluation for a large subset of practical FIM tasks. This finding challenges the necessity of rigid, dataset-prescribed post-processing for all FIM scenarios, supporting the adoption of more context-sensitive evaluation protocols.
However, the need for post-processing persists in cases where the search space of valid infillings is structurally unconstrained or overlaps are ambiguous (i.e., partial line or arbitrary-span fills). It is recommended to use overlap-removal heuristics only in such cases.
Practical deployment of FIM code LLMs can thus benefit from adaptive post-processing, applied conditionally based on the infilling context or benchmark-specific requirements.
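A conditional pipeline of this kind might be dispatched as below. The scenario labels (`"full_line"`, `"partial"`) and the suffix-stripping rule are illustrative assumptions, not the benchmarks' official categories.

```python
def adaptive_postprocess(generation: str, task_kind: str, suffix: str) -> str:
    """Apply post-processing only where the infilling scenario requires it.

    task_kind labels are illustrative, not official benchmark names.
    """
    if task_kind == "full_line":
        # After SFT, whole-line infills are already evaluation-ready.
        return generation
    # Partial/arbitrary spans: drop the longest suffix head the model
    # ran into at the end of its generation.
    for k in range(min(len(suffix), len(generation)), 0, -1):
        if generation.endswith(suffix[:k]):
            return generation[:-k]
    return generation
```

In a deployment setting, `task_kind` could be inferred from the cursor position (mid-line versus start-of-line) rather than supplied by a benchmark.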
Limitations and Future Directions
The study’s empirical scope is limited to Python and specific FIM benchmarks. Generalization to other programming languages, with their distinct syntactic and semantic issues, is unverified and merits further analysis. Training data diversity and the use of synthetic versus human-curated annotations influence model calibration and generalization; scaling data sources and leveraging multi-modal supervision could further reduce the marginal need for post-processing. Examination of more realistic code editing environments will determine the robustness of these findings.
Conclusion
Supervised fine-tuning of instruction-tuned code LLMs leads to substantial improvements in FIM tasks, often rendering outputs automatically evaluation-ready when the infill region comprises whole code lines. Post-processing can be largely de-emphasized in these settings but remains necessary for arbitrary or partial-span infilling. The results reframe best-practice evaluation protocols for LLM-based code completion and support continued investment in SFT-driven specialization for practical IDE workflows (arXiv:2505.18789).