Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'
Abstract: Recently, a number of repository-level code-generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of LLMs beyond standalone benchmarks like HumanEval and MBPP. A natural question follows: would LLMs perform as well on real-world coding tasks as they do on these benchmarks? Unfortunately, these benchmarks cannot answer that question, since they consist of short completions or synthetic examples, or focus on limited-scale repositories, and thus fail to represent real-world coding tasks. To address these challenges, we create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in large real-world projects, paired with appropriate metrics for evaluating generated source code. It comprises 980 whole-function generation tasks from 11 popular projects, 50.8% of which require repository-level context, and each instance includes 314 developer-written test cases on average for more rigorous evaluation. We evaluate ten LLMs on REPOCOD and find that none achieves more than 30% pass@1, indicating the necessity of building stronger LLMs that can help developers in real-world software development. In addition, we find that retrieval-augmented generation achieves better results than using the target function's dependencies as context.
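As context for the headline number, pass@k scores like the pass@1 reported above are conventionally computed with the unbiased estimator introduced with HumanEval ("Evaluating large language models trained on code", cited below). The sketch below is illustrative only, assuming n samples are generated per task and c of them pass all of that task's developer-written tests; the function and the (n, c) values are hypothetical, not REPOCOD's actual harness or results.

```python
from math import comb
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a task
    c: samples that pass all developer-written tests
    k: evaluation budget (k=1 for pass@1)
    """
    if n - c < k:  # every size-k draw must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative aggregation: the benchmark score is the mean over tasks.
# The (n, c) pairs here are made-up numbers, not REPOCOD results.
results = [(10, 3), (10, 0), (10, 1)]
score = mean(pass_at_k(n, c, k=1) for n, c in results)
print(f"pass@1 = {score:.3f}")  # 0.133
```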
- GPT-4 technical report. Preprint, arXiv:2303.08774.
- Automatic semantic augmentation of language model prompts (for code summarization). In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13.
- Program synthesis with large language models. Preprint, arXiv:2108.07732.
- ChatGPT is a knowledgeable but inexperienced solver: An investigation of commonsense problem in large language models. Preprint, arXiv:2303.16421.
- Long Code Arena: a set of benchmarks for long-context code models. Preprint, arXiv:2406.11612.
- tree-sitter/tree-sitter: v0.22.6.
- Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
- Large language models are edge-case fuzzers: Testing deep learning libraries via FuzzGPT. Preprint, arXiv:2304.02014.
- CrossCodeEval: A diverse and multilingual benchmark for cross-file code completion. Preprint, arXiv:2310.11248.
- Evaluating large language models in class-level code generation. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), pages 982–994, Los Alamitos, CA, USA. IEEE Computer Society.
- The Llama 3 herd of models. Preprint, arXiv:2407.21783.
- DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. Preprint, arXiv:2401.14196.
- Measuring coding challenge competence with APPS. NeurIPS.
- A deep dive into large language models for automated bug localization and repair. Preprint, arXiv:2404.11595.
- Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1430–1442.
- SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations.
- InferFix: End-to-end program repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1646–1656.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension, ICPC ’20, page 184–195, New York, NY, USA. Association for Computing Machinery.
- Enabling programming thinking in large language models toward code generation. Preprint, arXiv:2305.06599.
- StarCoder: may the source be with you! Preprint, arXiv:2305.06161.
- Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, volume 36, pages 21558–21572. Curran Associates, Inc.
- RepoBench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations.
- StarCoder 2 and The Stack v2: The next generation. Preprint, arXiv:2402.19173.
- WizardCoder: Empowering code large language models with Evol-Instruct. In The Twelfth International Conference on Learning Representations.
- T.J. McCabe. 1976. A complexity measure. IEEE Transactions on Software Engineering, SE-2(4):308–320.
- CodeGen: An open large language model for code with multi-turn program synthesis. Preprint, arXiv:2203.13474.
- LLM is like a box of chocolates: the non-determinism of ChatGPT in code generation. Preprint, arXiv:2308.02828.
- Understanding the effectiveness of large language models in code translation. Preprint, arXiv:2308.03109.
- CodeBLEU: a method for automatic evaluation of code synthesis. Preprint, arXiv:2009.10297.
- Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.
- Unsupervised translation of programming languages. In Advances in Neural Information Processing Systems, volume 33, pages 20601–20611. Curran Associates, Inc.
- Code Llama: Open foundation models for code. Preprint, arXiv:2308.12950.
- Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
- Repoformer: Selective retrieval for repository-level code completion. Preprint, arXiv:2403.10059.
- Top leaderboard ranking = top coding proficiency, always? EvoEval: Evolving coding benchmarks via LLM. Preprint, arXiv:2403.19114.
- Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494.
- Better test cases for better automated program repair. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, page 831–841, New York, NY, USA. Association for Computing Machinery.
- Exploring and unleashing the power of large language models in automated code translation. Proc. ACM Softw. Eng., 1(FSE).
- Retrieval-based neural source code summarization. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, page 1385–1397, New York, NY, USA. Association for Computing Machinery.
- Learning-based widget matching for migrating gui test cases. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, volume 66 of ICSE ’24, page 1–13. ACM.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623.