- The paper presents M²rc-Eval, a multilingual benchmark covering 18 programming languages for repository-level code completion evaluation.
- It employs fine-grained bucket- and semantic-level annotations based on AST analysis to assess LLM performance across diverse coding contexts.
- The research demonstrates that incorporating cross-file contexts with M²rc-Instruct tuning significantly enhances code completion accuracy across languages.
Evaluating Multilingual Code Completion with M²rc-Eval
The paper presents an in-depth study of repository-level code completion across a wide range of programming languages using a newly developed benchmark, M²rc-Eval. The study addresses current limitations in evaluating the multilingual capabilities of code LLMs by introducing a benchmark that covers 18 programming languages — a significant improvement over existing benchmarks, which focus on fewer languages and provide limited evaluation metrics.
Key Contributions
- Multilingual Benchmark: M²rc-Eval is notable for its expansive language coverage. Its 18 diverse programming languages afford researchers a more comprehensive evaluation framework for code LLMs.
- Fine-Grained Annotations: Two levels of annotation, bucket-level and semantic-level, are derived from each file's abstract syntax tree (AST). They provide insight into the completion capabilities of code LLMs across contexts of differing depth and semantic role.
- Supplementary Dataset: M²rc-Instruct, a multilingual instruction corpus, supports fine-tuning of repository-level code completion models.
- Comprehensive Evaluation: The experiments demonstrate the efficacy of M²rc-Eval in gauging the abilities of popular code LLMs such as StarCoder, DeepSeekCoder, and Code Llama across multiple evaluation metrics.
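The bucket-level annotation idea can be illustrated with a small sketch: assign each AST node a depth and group depths into a fixed number of buckets. The paper builds its annotations with multi-language parsers over 18 languages; this Python-only version using the standard `ast` module, with equal-width depth bands, is an illustrative assumption rather than the paper's exact procedure.

```python
import ast

def depth_buckets(source, n_buckets=3):
    """Map each AST node of a Python snippet to a coarse depth bucket.

    Bucket 0 holds the shallowest nodes (near the module root); higher
    buckets hold progressively deeper, more local nodes.
    """
    tree = ast.parse(source)
    # Collect the depth of every node with an explicit stack (iterative DFS).
    depths = []
    stack = [(tree, 0)]
    while stack:
        node, depth = stack.pop()
        depths.append((node, depth))
        stack.extend((child, depth + 1) for child in ast.iter_child_nodes(node))
    max_depth = max(d for _, d in depths)
    # Split the depth range [0, max_depth] into n_buckets equal-width bands.
    width = max(1, (max_depth + n_buckets) // n_buckets)
    return {id(node): min(d // width, n_buckets - 1) for node, d in depths}

buckets = depth_buckets("def f(x):\n    return x + 1\n")
```

A completion position can then be labeled with the bucket of its enclosing node, which is what lets results be broken down by structural depth.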
Results and Observations
The experimental results highlight the advantage of incorporating cross-file contexts in code completion tasks. Utilizing supplementary contexts significantly improves model performance, underscoring the importance of repository-level understanding in LLMs. Notably, fine-tuning with the M²rc-Instruct dataset achieves considerable improvements, indicating that specific instruction tuning can enhance cross-language code completion capabilities.
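The cross-file setup described above amounts to simple prompt assembly: snippets retrieved from sibling files in the repository are prepended to the in-file prefix/suffix before completion. The sketch below is a minimal illustration; the special-token names, comment-style file headers, and character-budget truncation are assumptions, since each code LLM defines its own fill-in-the-middle format.

```python
def build_prompt(cross_files, prefix, suffix, max_context_chars=4000):
    """Prepend cross-file snippets to an in-file fill-in-the-middle prompt.

    cross_files: list of (path, snippet) pairs retrieved from the repository.
    The <PRE>/<SUF>/<MID> tokens are illustrative placeholders, not any
    particular model's vocabulary.
    """
    context = ""
    for path, snippet in cross_files:
        block = f"# file: {path}\n{snippet}\n"
        if len(context) + len(block) > max_context_chars:
            break  # simple budget: skip snippets once the context is full
        context += block
    return f"{context}<PRE>{prefix}<SUF>{suffix}<MID>"
```

Comparing model scores with and without the `cross_files` portion is exactly the in-file vs. cross-file contrast the experiments report.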
Additionally, the study reveals differing performance levels of code LLMs across programming languages and annotation types, suggesting that language-specific syntactic constructs influence model efficacy. The bucket-level results indicate declining performance as completion targets sit at shallower AST nodes, which correspond to larger code spans, whereas the semantic annotations reveal strong performance on identifiers but weaknesses on language-specific special structures.
Implications and Future Directions
The research contributes fundamentally to the field of multilingual code intelligence by providing tools and resources to better evaluate and develop code LLMs. The findings suggest promising directions for future research, such as the development of models that are better suited to handle syntactic and semantic nuances across diverse programming languages.
The introduction of detailed annotations opens pathways to more granular insights into model performance, offering opportunities to further refine model architectures and training regimens. The reliance on text-similarity metrics such as exact match (EM) and edit similarity (ES), while indicative, hints at the need for execution-based evaluations to capture true semantic equivalence.
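The two metrics mentioned above are standard in code-completion benchmarks: EM checks string equality of prediction and reference, and ES is a normalized Levenshtein-based similarity. A minimal self-contained sketch (whitespace trimming in EM is a common but not universal convention):

```python
def exact_match(pred, ref):
    """EM: 1 if the prediction equals the reference after trimming whitespace."""
    return int(pred.strip() == ref.strip())

def edit_similarity(pred, ref):
    """ES: 1 - normalized Levenshtein edit distance between the two strings."""
    m, n = len(pred), len(ref)
    if m == 0 and n == 0:
        return 1.0
    # Classic dynamic-programming edit distance with one rolling row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return 1.0 - prev[n] / max(m, n)
```

Both scores reward surface similarity, which is precisely why a prediction can score well here yet behave differently when executed.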
In conclusion, M²rc-Eval offers a substantial step forward in the assessment of LLM capabilities, promoting advancements in code completion technologies and software automation across multilingual environments. As the field progresses, such comprehensive evaluations will be crucial in aligning LLM development with practical software engineering needs.