A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Published 21 May 2025 in cs.SE and cs.AI (arXiv:2505.15469v1)

Abstract: LLMs are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code LLMs in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models, CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2 across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identified a taxonomy of 26 distinct error categories in model-generated code comments. They highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.

Summary

  • The paper presents a qualitative analysis of 5 LLMs generating multilingual code comments across 5 languages and introduces a 26-category error taxonomy.
  • Analysis reveals significant linguistic accuracy decrease in non-English languages for LLMs and shows current automatic metrics are unreliable for evaluating multilingual outputs.
  • Findings highlight the need for improved training data diversity, linguistically sensitive evaluation tools, and human judgment in assessing multilingual LLM performance.

This paper qualitatively examines the ability of LLMs to generate code comments across natural languages. It assesses five state-of-the-art models, CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2, on multilingual comment generation. The study spans five natural languages, Chinese, Dutch, English, Greek, and Polish, highlighting performance discrepancies and error tendencies when LLMs operate outside a predominantly English context.

Key Findings

  1. Error Taxonomy: The paper introduces a taxonomy comprising 26 distinct error categories identified through rigorous analysis of 12,500 model-generated comments. The taxonomy categorizes errors into model-specific, linguistic, semantic, and syntax-related, providing a structured understanding of the common pitfalls encountered in multilingual comment generation.
  2. Language-specific Performance Variances: The analysis reveals a substantial drop in linguistic accuracy for non-English languages. Errors tied to the grammatical nuances of target languages such as Greek and Polish increased markedly relative to English, suggesting that models need to better capture the linguistic intricacies of non-English programming contexts.
  3. Discrepancies in Automatic Metrics: Notably, the study questions the reliability of current automatic metrics for evaluating the quality of model outputs. Neural metrics, both embedding-based and model-based, struggled to differentiate coherent outputs from random ones, underscoring their inadequacy for assessing non-English comment generation.
  4. Expert Judgment Alignment: The findings underline the importance of human evaluation in assessing model performance for multilingual outputs. The divergence observed between metric scores and expert evaluations underscores the necessity for refining automatic evaluation tools to better mirror human judgments.
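The score-overlap problem described in points 3 and 4 can be made concrete with a standard separability check: if a metric's scores for expert-rated correct comments heavily overlap its scores for incorrect ones, the metric cannot serve as a judgment criterion. Below is a minimal, hypothetical sketch (not from the paper) that quantifies this overlap as an AUROC-style statistic, the probability that a randomly chosen correct comment outscores a randomly chosen incorrect one; the score distributions are simulated placeholders, not real metric outputs.

```python
import random

def auroc(scores_pos, scores_neg):
    """Probability that a randomly chosen positive example outscores a
    randomly chosen negative one (ties count as half). 1.0 = perfect
    separation; 0.5 = the metric is no better than chance."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical illustration: simulated metric scores for expert-rated
# correct vs. incorrect comments with heavily overlapping distributions,
# mimicking the overlap the paper reports.
random.seed(0)
correct = [random.gauss(0.62, 0.10) for _ in range(200)]
incorrect = [random.gauss(0.58, 0.10) for _ in range(200)]

# Far below 1.0: the metric barely separates the two expert-rated groups.
print(f"AUROC = {auroc(correct, incorrect):.3f}")
```

An AUROC near 0.5 on real expert-labeled data would support the paper's conclusion that such metrics cannot reliably distinguish correct from incorrect multilingual comments.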

Implications

The implications of this research are wide-ranging. On the practical front, software engineering practices in multilingual environments could benefit from refined LLMs that better handle language-specific challenges, particularly in syntactic and semantic comprehension. On the theoretical side, the paper advances understanding of how differing natural-language structures affect model behavior within software development tools.

Furthermore, the paper highlights substantial room for improvement in the metrics used to evaluate LLM outputs. Their current inadequacy calls for evaluation methodologies that incorporate diverse linguistic settings, enabling a more holistic approach to model assessment.

Future Research Directions

The paper suggests several avenues for future exploration. There is a pressing need to diversify the training data of code LLMs, enriching them with comprehensive multilingual datasets that accurately reflect real-world programming scenarios. Equally important is the development of reliable, linguistically sensitive evaluation tools, including adapting existing metrics to better handle nuanced cross-language discrepancies.

Conclusion

This study marks a significant contribution to understanding the challenges faced by LLMs in generating non-English code comments. By providing a detailed qualitative analysis and identifying shortcomings in current model evaluations, it lays the groundwork for future enhancements in the field of multilingual AI-driven software tools. Researchers and practitioners alike are encouraged to leverage this investigation to push the boundaries of multilingual support in code models.