
InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

Published 8 Jul 2024 in cs.CL, cs.AI, and cs.SE | (arXiv:2407.05700v2)

Abstract: Recent advances in open-source code LLMs have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether a fine-tuned open-source model can itself generate additional data to augment its instruction-tuning dataset. We make two observations: (1) a code snippet can serve as the response to different instructions, and (2) instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique in which a fine-tuned LLM generates additional instructions for the code responses in its own training dataset. The new instruction-response pairs are added to the original dataset, and a stronger code LLM is obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000, and MultiPL-E), showing that it consistently improves the base models.
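The augmentation loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `generate_instruction` is a hypothetical stand-in for prompting the fine-tuned code LLM to summarize a code response into a new natural-language instruction, and the dataset is represented as a plain list of instruction-response dicts.

```python
def generate_instruction(code: str) -> str:
    """Placeholder for the fine-tuned LLM's code-to-instruction step.

    In practice this would prompt the model with the code snippet and
    ask for a natural-language instruction it could answer.
    """
    first_line = code.strip().splitlines()[0]
    return f"Write a Python function like: {first_line}"


def inverse_instruct(dataset: list[dict]) -> list[dict]:
    """Augment an instruction-tuning dataset with inverted pairs.

    For each existing (instruction, response) pair, the model produces a
    *new* instruction for the same code response; the new pair is added
    alongside the originals, exploiting the observation that one code
    snippet can answer several different instructions.
    """
    augmented = list(dataset)  # keep all original pairs
    for pair in dataset:
        new_instruction = generate_instruction(pair["response"])
        augmented.append({"instruction": new_instruction,
                          "response": pair["response"]})
    return augmented


seed = [{"instruction": "Sum a list of numbers.",
         "response": "def total(xs):\n    return sum(xs)"}]
augmented = inverse_instruct(seed)
print(len(augmented))  # one original pair plus one inverted pair
```

The stronger model is then obtained by running ordinary instruction fine-tuning on `augmented` instead of `seed`; the paper additionally filters the generated instructions, a step omitted from this sketch.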

