- The paper presents M²rc-Eval, a multilingual benchmark covering 18 programming languages for repository-level code completion evaluation.
- It employs fine-grained bucket- and semantic-level annotations based on AST analysis to assess LLM performance across diverse coding contexts.
- The research demonstrates that incorporating cross-file contexts with M²rc-Instruct tuning significantly enhances code completion accuracy across languages.
Evaluating Multilingual Code Completion with M²rc-Eval
The paper presents an in-depth study of repository-level code completion across a wide range of programming languages using a newly developed benchmark, M²rc-Eval. The study addresses current limitations in evaluating the multilingual capabilities of code LLMs by introducing a benchmark that covers 18 programming languages — a significant improvement over existing benchmarks, which focus on fewer languages and provide limited evaluation metrics.
Key Contributions
- Multilingual Benchmark: M²rc-Eval is notable for its expansive language coverage. Its 18 diverse programming languages afford researchers a more comprehensive evaluation framework for code LLMs.
- Fine-Grained Annotations: Two levels of annotation, bucket-level and semantic-level, are derived from each file's abstract syntax tree (AST). They provide insight into the completion capabilities of code LLMs across contexts of differing depth and semantic role.
- Supplementary Dataset: M²rc-Instruct, a multilingual instruction corpus, supports fine-tuning of repository-level code completion models.
- Comprehensive Evaluation: The experiments demonstrate the efficacy of M²rc-Eval in gauging the abilities of popular code LLMs such as StarCoder, DeepSeekCoder, and Code Llama across multiple evaluation metrics.
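The bucket-level annotation idea can be illustrated with a small sketch: assign each AST node a depth and group depths into a fixed number of buckets. The paper builds its annotations with multi-language parsers over 18 languages; this Python-only version using the standard `ast` module, with equal-width depth bands, is an illustrative assumption rather than the paper's exact procedure.

```python
import ast

def depth_buckets(source, n_buckets=3):
    """Map each AST node of a Python snippet to a coarse depth bucket.

    Bucket 0 holds the shallowest nodes (near the module root); higher
    buckets hold progressively deeper, more local nodes.
    """
    tree = ast.parse(source)
    # Collect the depth of every node with an explicit stack (iterative DFS).
    depths = []
    stack = [(tree, 0)]
    while stack:
        node, depth = stack.pop()
        depths.append((node, depth))
        stack.extend((child, depth + 1) for child in ast.iter_child_nodes(node))
    max_depth = max(d for _, d in depths)
    # Split the depth range [0, max_depth] into n_buckets equal-width bands.
    width = max(1, (max_depth + n_buckets) // n_buckets)
    return {id(node): min(d // width, n_buckets - 1) for node, d in depths}

buckets = depth_buckets("def f(x):\n    return x + 1\n")
```

A completion position can then be labeled with the bucket of its enclosing node, which is what lets results be broken down by structural depth.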
Results and Observations
The experimental results highlight the advantage of incorporating cross-file contexts in code completion tasks. Utilizing supplementary contexts significantly improves model performance, underscoring the importance of repository-level understanding in LLMs. Notably, fine-tuning with the M²rc-Instruct dataset achieves considerable improvements, indicating that specific instruction tuning can enhance cross-language code completion capabilities.
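The cross-file setup described above amounts to simple prompt assembly: snippets retrieved from sibling files in the repository are prepended to the in-file prefix/suffix before completion. The sketch below is a minimal illustration; the special-token names, comment-style file headers, and character-budget truncation are assumptions, since each code LLM defines its own fill-in-the-middle format.

```python
def build_prompt(cross_files, prefix, suffix, max_context_chars=4000):
    """Prepend cross-file snippets to an in-file fill-in-the-middle prompt.

    cross_files: list of (path, snippet) pairs retrieved from the repository.
    The <PRE>/<SUF>/<MID> tokens are illustrative placeholders, not any
    particular model's vocabulary.
    """
    context = ""
    for path, snippet in cross_files:
        block = f"# file: {path}\n{snippet}\n"
        if len(context) + len(block) > max_context_chars:
            break  # simple budget: skip snippets once the context is full
        context += block
    return f"{context}<PRE>{prefix}<SUF>{suffix}<MID>"
```

Comparing model scores with and without the `cross_files` portion is exactly the in-file vs. cross-file contrast the experiments report.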
Additionally, the study reveals differing performance levels of code LLMs across programming languages and annotation types, suggesting that language-specific syntactic constructs influence model efficacy. The bucket-level results indicate declining performance as completion targets sit at shallower AST nodes, which correspond to larger code spans, whereas the semantic annotations reveal strong performance on identifiers but weaknesses on language-specific special structures.
Implications and Future Directions
The research contributes fundamentally to the field of multilingual code intelligence by providing tools and resources to better evaluate and develop code LLMs. The findings suggest promising directions for future research, such as the development of models that are better suited to handle syntactic and semantic nuances across diverse programming languages.
The introduction of detailed annotations opens pathways to more granular insights into model performance, offering opportunities to further refine model architectures and training regimens. The reliance on text-similarity metrics such as exact match (EM) and edit similarity (ES), while indicative, hints at the need for execution-based evaluations to capture true semantic equivalence.
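The two metrics mentioned above are standard in code-completion benchmarks: EM checks string equality of prediction and reference, and ES is a normalized Levenshtein-based similarity. A minimal self-contained sketch (whitespace trimming in EM is a common but not universal convention):

```python
def exact_match(pred, ref):
    """EM: 1 if the prediction equals the reference after trimming whitespace."""
    return int(pred.strip() == ref.strip())

def edit_similarity(pred, ref):
    """ES: 1 - normalized Levenshtein edit distance between the two strings."""
    m, n = len(pred), len(ref)
    if m == 0 and n == 0:
        return 1.0
    # Classic dynamic-programming edit distance with one rolling row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return 1.0 - prev[n] / max(m, n)
```

Both scores reward surface similarity, which is precisely why a prediction can score well here yet behave differently when executed.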
In conclusion, M²rc-Eval offers a substantial step forward in the assessment of LLM capabilities, promoting advancements in code completion technologies and software automation across multilingual environments. As the field progresses, such comprehensive evaluations will be crucial in aligning LLM development with practical software engineering needs.