From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Published 24 May 2025 in cs.SE and cs.CL | (2505.18789v2)

Abstract: Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. Determining an optimal truncation strategy, however, is often intricate, particularly when the scope spans several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that integrates seamlessly with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.

Summary

  • The paper demonstrates that raw instruction-tuned LLM outputs perform poorly for FIM tasks without additional processing.
  • Supervised fine-tuning nearly doubles pass@1 scores for smaller models and boosts larger models by 40–50%, particularly for full-line infilling.
  • Adaptive post-processing remains essential for arbitrary or partial-line infilling to ensure semantically correct code completions.

Efficacy of Raw Instruction-Tuned LLM Outputs for Fill-in-the-Middle Code Generation

Overview

This work investigates whether the raw outputs of instruction-tuned code LLMs are sufficient for fill-in-the-middle (FIM) code generation, or if post-processing remains necessary for effective automatic evaluation. The focus is on the Qwen2.5-Coder model family, rigorously evaluated on HumanEval Infilling and SAFIM benchmarks. The methodology centers on supervised fine-tuning using data generated via LLMs, systematically probing the necessity and impact of various post-processing pipelines across different FIM task formulations.

Motivation and Research Questions

FIM code generation tasks mirror the practical workflows of code editing, where developers frequently insert code between existing segments. State-of-the-art code generation benchmarks, such as HumanEval Infilling and SAFIM, have enforced strict truncation-based post-processing to handle boundary mismatches and superfluous generations by LLMs. However, these heuristic rules may not generalize to all FIM scenarios, and can misjudge alternative yet valid completions by over-truncating outputs. The research is driven by three interrelated questions:

  1. What is the intrinsic FIM effectiveness of instruction-tuned code LLMs without additional post-processing?
  2. Does supervised fine-tuning (SFT) on explicitly constructed instruction-response pairs enable models to naturally delimit their infillings and improve automatic evaluation metrics?
  3. Is there a scenario-dependent need to post-process outputs even after fine-tuning, specifically for partial versus full-line infilling tasks?
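As a concrete illustration of the task setup, a single-line FIM instance can be sketched as follows (a hypothetical Python example, not drawn from either benchmark): the model sees the prefix and suffix and must generate only the missing middle.

```python
# Hypothetical single-line FIM instance: the ground-truth middle is one
# complete line between a fixed prefix and suffix.
prefix = "def fahrenheit_to_celsius(f):\n"
suffix = "    return c\n"
middle = "    c = (f - 32) * 5 / 9\n"  # ground-truth infill

# A completion counts as correct only if prefix + generation + suffix
# passes the task's unit tests, so any extraneous text around the middle
# breaks automatic evaluation unless it is truncated away.
namespace = {}
exec(prefix + middle + suffix, namespace)
print(namespace["fahrenheit_to_celsius"](212))  # 100.0
```

This is why boundary awareness matters: a model that re-emits part of the prefix or continues past the suffix produces code that fails to reassemble, even when the intended logic is correct.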

Experimental Methodology

The investigation employed supervised fine-tuning of Qwen2.5-Coder LLMs (7B, 14B, 32B) in both base and instruct variants. Training data generation involved extracting Python functions from open-source repositories and programmatically creating prefix-middle-suffix splits using Mixtral-8x22B as a synthetic annotator. The splits conformed to five FIM categories: random span, algorithmic block, control-flow expression, API call, and assignment expression. The final dataset comprised approximately 1 million high-quality instruction-response pairs.
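A minimal sketch of a prefix-middle-suffix split, assuming a random character span as the middle (an illustrative simplification: the paper's pipeline uses Mixtral-8x22B to annotate semantically meaningful splits across the five FIM categories):

```python
import random

def random_span_split(source: str, seed: int = 0):
    """Split a code string into (prefix, middle, suffix) by choosing a
    random character span as the middle. A simplified stand-in for the
    random-span FIM category; the other categories (algorithmic block,
    control-flow expression, API call, assignment) would pick
    syntactically meaningful spans instead."""
    rng = random.Random(seed)
    i = rng.randrange(0, len(source))
    j = rng.randrange(i + 1, len(source) + 1)  # j > i, so middle is non-empty
    return source[:i], source[i:j], source[j:]

code = "def add(a, b):\n    return a + b\n"
prefix, middle, suffix = random_span_split(code)
assert prefix + middle + suffix == code  # splits are lossless by construction
```

Each split then becomes one instruction-response pair: the prefix and suffix go into the instruction, and the middle is the target response.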

For evaluation, models were benchmarked on HumanEval Infilling (single-line, multi-line, random-span) and SAFIM (algorithm-block, control-flow, API function call) using the standard pass@1 metric, with and without canonical post-processing steps.
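pass@1 is the k = 1 case of the standard unbiased pass@k estimator introduced with HumanEval, which for k = 1 reduces to the fraction of sampled generations that pass the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct, pass@1 is simply 3/10.
print(pass_at_k(10, 3, 1))  # 0.3 (up to floating-point rounding)
```

Scores with and without post-processing are then compared by applying (or skipping) truncation on each generation before reassembling and testing it.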

Key Findings

Out-of-the-Box Behavior

Instruction-tuned LLMs exhibit poor out-of-the-box performance on FIM tasks, especially on challenging constructs with arbitrary context boundaries. Raw outputs frequently contain extraneous code and overlap with the surrounding context, so post-processing is needed for competitive results. This observation is consistent across both HumanEval Infilling and SAFIM.

Impact of Supervised Fine-Tuning

Supervised fine-tuning proves a decisive intervention. SFT roughly doubles pass@1 for the 7B and 14B models and yields a 40–50% increase for the 32B models, particularly on full-line infilling tasks. Notably, instruct-tuned backbones show marginally superior sample efficiency after SFT.

Raw Output Sufficiency

For tasks involving infilling of complete code lines, raw outputs from SFT models not only become evaluation-ready but often surpass post-processed outputs in automatic metrics. Excessive post-processing, such as truncation to fixed line counts, can inadvertently penalize semantically correct, multi-line completions, leading to lower scores.

For tasks with arbitrary or partial line boundaries (e.g., random-span infilling), post-processing remains essential. Lightweight heuristics that strip overlapping prefixes/suffixes are required to accurately extract the intended middle span and achieve optimal evaluation performance.
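Such a heuristic might look like the following sketch (an assumption about the pipeline, not the paper's exact implementation): drop any tail of the prefix the model re-emitted at the start of its output, and any head of the suffix it re-emitted at the end.

```python
def strip_overlap(generation: str, prefix: str, suffix: str) -> str:
    """Lightweight overlap-removal sketch for arbitrary-span infills.
    Greedily removes the longest duplicated context on each side."""
    # Remove the longest prefix-tail duplicated at the start of the output.
    for i in range(len(prefix), 0, -1):
        if generation.startswith(prefix[-i:]):
            generation = generation[i:]
            break
    # Remove the longest suffix-head duplicated at the end of the output.
    for i in range(len(suffix), 0, -1):
        if generation.endswith(suffix[:i]):
            generation = generation[:-i]
            break
    return generation

# A model that repeats context around its answer:
raw = "x = 1\ny = 2\nreturn y"
print(strip_overlap(raw, "x = 1\n", "\nreturn y"))  # "y = 2"
```

Because the heuristic only removes text the model duplicated from the given context, it avoids the over-truncation risk of fixed line-count rules.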

Implications and Recommendations

The key implication is that task- and data-aligned fine-tuning substantially closes the boundary-awareness gap in LLMs, enabling direct automatic evaluation for a large subset of practical FIM tasks. This finding challenges the necessity of rigid, dataset-prescribed post-processing for all FIM scenarios, supporting the adoption of more context-sensitive evaluation protocols.

However, the need for post-processing persists in cases where the space of valid infillings is structurally unconstrained or overlaps with the context are ambiguous (i.e., partial-line or arbitrary-span fills). Overlap-removal heuristics are recommended only in such cases.

Practical deployment of FIM code LLMs can thus benefit from adaptive post-processing, applied conditionally based on the infilling context or benchmark-specific requirements.
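Such adaptive post-processing could be expressed as a simple dispatch on the infilling category (the task labels and overlap check here are hypothetical placeholders, following the findings above):

```python
def postprocess(generation: str, task_type: str,
                prefix: str = "", suffix: str = "") -> str:
    """Conditional post-processing sketch: full-line infills from SFT
    models are used raw; arbitrary- or partial-span infills get context
    duplicates stripped (a minimal stand-in for a fuller overlap-removal
    heuristic)."""
    if task_type in {"single-line", "multi-line"}:
        return generation  # raw SFT output is already evaluation-ready
    # random-span / partial-line: strip exact context duplicates
    if prefix and generation.startswith(prefix):
        generation = generation[len(prefix):]
    if suffix and generation.endswith(suffix):
        generation = generation[:-len(suffix)]
    return generation

print(postprocess("x = 1\ny = 2", "random-span", prefix="x = 1\n"))  # "y = 2"
```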

Limitations and Future Directions

The study’s empirical scope is limited to Python and specific FIM benchmarks. Generalization to other programming languages, each with distinct syntactic and semantic characteristics, is unverified and merits further analysis. Training data diversity and the use of synthetic versus human-curated annotations influence model calibration and generalization; scaling data sources and leveraging multi-modal supervision could further reduce the marginal need for post-processing. Examining more realistic code-editing environments will determine the robustness of these findings.

Conclusion

Supervised fine-tuning of instruction-tuned code LLMs leads to substantial improvements in FIM tasks, often rendering outputs automatically evaluation-ready when the infill region comprises whole code lines. Post-processing can be largely de-emphasized in these settings but remains necessary for arbitrary or partial-span infilling. The results reframe best-practice evaluation protocols for LLM-based code completion and support continued investment in SFT-driven specialization for practical IDE workflows (2505.18789).
