Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing
Abstract: The rapid advancement of LLMs has significantly improved code completion tasks, yet the trade-off between accuracy and computational cost remains a critical challenge. While using larger models and incorporating inference-time self-testing algorithms can significantly improve output accuracy, they incur substantial computational expense. Furthermore, real-world serving scenarios typically involve dynamic preferences over the cost-accuracy trade-off, depending on budget, bandwidth, concurrent user volume, and users' sensitivity to incorrect answers. In this work, we introduce a novel framework that combines model cascading with inference-time self-feedback algorithms to identify multiple near-optimal self-testing options along the cost-accuracy trade-off in LLM-based code generation. Our approach leverages self-generated tests both to enhance accuracy and to evaluate model cascading decisions. As a black-box inference-time method, it requires no access to internal model parameters. We further propose a threshold-based algorithm for deciding when to deploy larger models, and a heuristic that optimizes the number of solutions, test cases, and test lines generated per model under budget constraints. Experimental results show that our cascading approach reduces costs by 26% on average, and by up to 70% in the best case, across various model families and datasets, while maintaining or improving accuracy in code generation tasks compared to both random and optimal single-model self-testing schemes. To our knowledge, this is the first work to provide a range of choices for optimizing the cost-accuracy trade-off in LLM code generation with self-testing.
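The threshold-based cascade described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the `models` list, the stub generator interface `(prompt, n_solutions, n_tests) -> (solutions, tests)`, and the pass-rate scoring are all hypothetical names chosen here for clarity.

```python
def run_test(solution_code, test_code):
    """Execute one self-generated test against a candidate solution.

    Both inputs are Python source strings; the test is assumed to raise
    (e.g. via assert) when the solution is wrong.
    """
    env = {}
    try:
        exec(solution_code, env)
        exec(test_code, env)
        return True
    except Exception:
        return False


def cascade_generate(prompt, models, threshold=0.8, n_solutions=3, n_tests=5):
    """Try models from cheapest to most expensive.

    `models` is a list of (name, generate) pairs, where `generate` is a
    black-box callable producing candidate solutions and self-generated
    tests. A candidate is accepted when its fraction of passed
    self-generated tests reaches `threshold`; otherwise we escalate to
    the next (larger) model. All parameter choices here are illustrative.
    """
    best_name, best_sol, best_score = None, None, -1.0
    for name, generate in models:
        solutions, tests = generate(prompt, n_solutions, n_tests)
        for sol in solutions:
            passed = sum(1 for t in tests if run_test(sol, t))
            score = passed / max(len(tests), 1)
            if score > best_score:
                best_name, best_sol, best_score = name, sol, score
        # Confident enough: stop early and avoid the larger models' cost.
        if best_score >= threshold:
            return name, best_sol, best_score
    # Fall back to the best candidate seen across the whole cascade.
    return best_name, best_sol, best_score
```

In this sketch the escalation decision is purely a pass-rate threshold; the paper's heuristic additionally tunes how many solutions, test cases, and test lines each model generates under a budget constraint, which is omitted here.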