Evaluating and Achieving Controllable Code Completion in Code LLM

Published 22 Jan 2026 in cs.SE and cs.CL | (2601.15879v1)

Abstract: Code completion has become a central task, gaining significant attention with the rise of LLM-based tools in software engineering. Although recent advances have greatly improved LLMs' code completion abilities, evaluation methods have not advanced equally. Most current benchmarks focus solely on functional correctness of code completions based on given context, overlooking models' ability to follow user instructions during completion-a common scenario in LLM-assisted programming. To address this limitation, we present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench), comprising 2,195 carefully designed completion tasks. Through comprehensive evaluation of over 40 mainstream LLMs across C3-Bench and conventional benchmarks, we reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. Moreover, we develop a straightforward data synthesis pipeline that leverages Qwen2.5-Coder to generate high-quality instruction-completion pairs for supervised fine-tuning (SFT). The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench. Our findings provide valuable insights for enhancing LLMs' code completion and instruction-following capabilities, establishing new directions for future research in code LLMs. To facilitate reproducibility and foster further research in code LLMs, we open-source all code, datasets, and models.

Abstract PDF Upgrade to Chat

Summary

The paper introduces Controllable Code Completion (CCC) and the C³-Bench benchmark to measure instruction adherence in code LLMs.
It employs metrics like Pass@1, Instruction-Following Rate, and Edit Similarity to assess both functional and controllability performance.
Instruction tuning with synthetic pairs significantly improves model performance, addressing gaps in instruction-following capabilities.

Evaluating and Achieving Controllable Code Completion in Code LLMs

Motivation for Controllable Code Completion

The proliferation of code-centric LLMs (Code LLMs) has triggered notable improvements in automated code completion solutions, supporting diverse developer workflows. However, prevailing evaluation methodologies primarily emphasize functional correctness, typically assessed via similarity metrics or unit tests, while neglecting the models' capacity for nuanced instruction adherence—a critical real-world requirement. This disconnect has resulted in systematic overestimation of code LLM capabilities in production scenarios.

To bridge this evaluative gap, the paper introduces the concept of Controllable Code Completion (CCC), expanding the code completion task to explicitly incorporate fine-grained, user-supplied instructions that guide the completion process. CCC enables precise assessment of models' ability to heed specific implementation directives, such as algorithmic strategies, architectural organization, or block-level code scale, beyond passing unit tests.

C $^3$ -Bench: Benchmark Design and Construction

The authors present C $^3$ -Bench, an instruction-guided code completion benchmark engineered to evaluate both functional correctness and controllability of completions. C $^3$ -Bench comprises 2,195 Python CCC test instances, partitioned into:

1. Implementation-Control Completion (ICC):

Tasks require completion of the middle code segment per detailed implementation requirements (structural specification, algorithmic approach, control-flow, or parameter management), sampled from common software engineering demands.

2. Scale-Control Completion (SCC):

Tasks demand completion of code segments of precisely specified scope (line span, multi-line sequence, or statement block), forming a rigorous test for scale adherence.

Automation and quality control are enforced throughout benchmark construction. Middle code extraction leverages AST parsing and tree manipulation to access semantically coherent blocks from HumanEval and SAFIM datasets. Multiple implementations are generated via advanced LLMs, thoroughly validated against unit tests and expert consistency checks. Instruction generation employs both manual curation and LLM assistance, with explicit validation by senior developers and automated systems.

Figure 1: Overview of the construction pipeline of C $^3$ -Bench.

Evaluation Metrics for Controllability

C $^3$ -Bench incorporates three principal metrics:

Pass@1: Execution-based correctness via unit tests for ICC tasks.
Instruction-Following Rate (IF): Measures adherence to specified requirements; assessed semantically for ICC using LLM-based automated judging and structurally for SCC via AST analysis.
Edit Similarity (ES): Token-level static analysis to preliminarily evaluate generation quality.

Extensive validation of the LLM-judging framework reveals 98% agreement with expert evaluations, evidencing reliable automated assessment of instruction adherence.

Data Synthesis and Model Enhancement

To ameliorate the paucity of instruction-completion training pairs for open-source models, the authors devise an automated synthesis pipeline leveraging Qwen2.5-Coder-32B-Instruct, capable of producing large-scale pairs from unsupervised GitHub repositories, bootstrapped from manually curated seeds. This corpus enables supervised fine-tuning to enhance instruction-following abilities.

Figure 2: Training dataset analysis showing (a) sequence length distribution and (b) average component composition (Context Code, Middle Code, and Instruction).

Resulting models, Qwen2.5-Coder-C $^3$ (1.5B and 32B variants), are trained with strict decontamination strategies to preclude benchmark data leakage.

Empirical Findings and Analysis

Performance Disparities

Comprehensive benchmarking across 40+ models—including leading proprietary (GPT series, Claude, Gemini, o1-series) and open-source code-specialized LLMs (Qwen, DeepSeek, StarCoder, CodeStral)—exposes several key phenomena:

Instruction-following gap: Leading open-source models, while showing parity or superiority on similarity-based benchmarks (CrossCodeEval, RepoEval), consistently underperform proprietary models on C $^3$ -Bench. For instance, Qwen2.5-Coder-32B-C $^3$ achieves an IF rate of 52.5% for ICC and 80.7% for SCC, trailing the best proprietary models which reach up to 62.9% IF in ICC and 59.6% in SCC.
Effectiveness of instruction tuning: Fine-tuning with synthetic instruction pairs yields substantial gains in IF rates for open-source models, especially evident in SCC tasks, while ICC performance remains constrained by base model capacity.
Figure 3: Cross-benchmark comparison of model rankings. Instruction-following benchmarks induce substantial performance shifts compared to conventional functional correctness criteria.

Ablation Study on Instruction Guidance

Removing instructions from query prompts causes significant degradation in IF, whereas execution-based correctness (Pass@1) remains largely stable. The pronounced drop in IF unambiguously demonstrates the necessity of explicit instruction for true controllability in code completion.

Figure 4: Impact of instruction guidance on model performance. Pass@1 and IF-rate metrics are compared across models under two conditions: with instructions and without instructions.

Task-level Insights

Differentiated model behavior: Models vary in their ability to follow implementation versus scale-control instructions. Top proprietary models exhibit robust ICC adherence but often struggle with strict SCC constraints, suggesting deficiencies in training objectives or token-level modeling of code scale.
Ranking correlations: C $^3$ -Bench results are strongly correlated with models' performance on tasks requiring deep context understanding (ExecRepoBench) and human preference evaluations (Chatbot Arena), indicating that instruction-following ability is integral to advanced code comprehension.

Technical Implications

The research decisively demonstrates that traditional code completion evaluation metrics are insufficient for gauging instruction adherence. Instruction-guided benchmarks are essential to ascertain the practical effectiveness and controllability of code LLMs in production systems. The data synthesis approach for scalable instruction-completion pair generation offers a tractable strategy for enhancing open-source LLM performance in controllable completion.

Moreover, the work highlights three crucial challenges for future research:

Extension to multi-language and repository-level code contexts, increasing the generalizability of benchmarks.
Refinement of base model capabilities, especially for ICC tasks where bottlenecks in semantic understanding undermine improvements.
Development of lightweight, generalizable benchmarks and automated evaluators that are agnostic to infrastructure constraints.

Conclusion

C $^3$ -Bench establishes a rigorous, instruction-guided evaluation protocol for Code LLMs, revealing significant gaps in instruction-following capabilities and providing actionable strategies to enhance model controllability. The research substantiates that true code intelligence in LLMs must be measured not purely by functional correctness, but also by precise adherence to user intentions and requirements. The implications for AI development are clear: the next era of code LLM research must systematically integrate instruction-following evaluation and enhancement to realize robust, reliable, and user-aligned code generation systems.