
Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

Published 24 Aug 2024 in cs.SE, cs.AI, and cs.PL (arXiv:2408.14504v1)

Abstract: Language models (LMs) have exhibited impressive abilities in generating code from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation capabilities, in addition to functional correctness. Despite its practical implications, there is a lack of studies focused on assessing the diversity of generated code, which overlooks its importance in the development of code LMs. We propose a systematic approach to evaluate the diversity of generated code, utilizing various metrics for inter-code similarity as well as functional correctness. Specifically, we introduce a pairwise code similarity measure that leverages large LMs' capabilities in code understanding and reasoning, and demonstrate that it has the highest correlation with human judgment. We extensively investigate the impact of various factors on the quality of generated code, including model size, temperature, training approach, prompting strategy, and the difficulty of input problems. Our consistent observation of a positive correlation between the test pass score and the inter-code similarity score indicates that current LMs tend to produce functionally correct code with limited diversity.
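The evaluation the abstract describes reduces to a simple loop: sample several completions per problem, run each against unit tests to get the test pass score, and average a pairwise similarity over all unordered pairs of samples to get the inter-code similarity score. Below is a minimal sketch of that loop, assuming nothing beyond the abstract: the helper names (`generate`, `similarity`, `passes_tests`) and the toy token-overlap judge are hypothetical stand-ins for the paper's actual components, in particular its LLM-based similarity measure.

```python
# Minimal sketch of diversity-aware evaluation: pass rate + mean pairwise
# inter-code similarity over sampled solutions. Helper names are hypothetical
# placeholders, not the paper's actual API.
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, List


def inter_code_similarity(
    codes: List[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Average similarity over all unordered pairs of sampled solutions."""
    pairs = list(combinations(codes, 2))
    if not pairs:  # fewer than 2 samples: no pairs to compare
        return 0.0
    return mean(similarity(a, b) for a, b in pairs)


def evaluate_problem(
    prompt: str,
    n_samples: int,
    generate: Callable[[str], str],           # one sampled completion
    similarity: Callable[[str, str], float],  # pairwise judge in [0, 1]
    passes_tests: Callable[[str], bool],      # functional-correctness check
) -> Dict[str, float]:
    codes = [generate(prompt) for _ in range(n_samples)]
    return {
        "pass_rate": mean(passes_tests(c) for c in codes),
        "inter_code_similarity": inter_code_similarity(codes, similarity),
    }


if __name__ == "__main__":
    # Toy stand-ins: a fixed "model" and a lexical token-overlap similarity.
    def generate(prompt: str) -> str:
        return "def add(a, b):\n    return a + b"

    def token_overlap(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(len(ta | tb), 1)

    def passes_tests(code: str) -> bool:
        ns: dict = {}
        exec(code, ns)  # run the candidate, then check one unit test
        return ns["add"](2, 3) == 5

    print(evaluate_problem("Write add(a, b).", 5, generate, token_overlap, passes_tests))
```

In the paper's setting, `similarity` would be the LLM-based pairwise judge reported to correlate best with human judgment. The positive correlation the authors observe means the two numbers in this dict tend to rise together, which is why they argue for reporting diversity alongside correctness.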
