
Code Generation with Small Language Models: A Deep Evaluation on Codeforces

Published 9 Apr 2025 in cs.SE (arXiv:2504.07343v1)

Abstract: LLMs have demonstrated capabilities in code generation, potentially boosting developer productivity. However, their widespread adoption remains limited by high computational costs, significant energy demands, and security risks such as data leakage and adversarial attacks. As a lighter-weight alternative, Small Language Models (SLMs) offer faster inference, lower deployment overhead, and better adaptability to domain-specific tasks, making them an attractive option for real-world applications. While prior research has benchmarked LLMs on competitive programming tasks, such evaluations often focus narrowly on metrics like Elo scores or pass rates, overlooking deeper insights into model behavior, failure patterns, and problem diversity. Furthermore, the potential of SLMs to tackle complex tasks such as competitive programming remains underexplored. In this study, we benchmark five open SLMs - LLAMA 3.2 3B, GEMMA 2 9B, GEMMA 3 12B, DEEPSEEK-R1 14B, and PHI-4 14B - across 280 Codeforces problems spanning Elo ratings from 800 to 2100 and covering 36 distinct topics. All models were tasked with generating Python solutions. PHI-4 14B achieved the best performance among SLMs, with a pass@3 of 63.6%, approaching the proprietary O3-MINI-HIGH (86.8%). In addition, we evaluated PHI-4 14B on C++ and found that combining outputs from both Python and C++ increases its aggregated pass@3 to 73.6%. A qualitative analysis of PHI-4 14B's incorrect outputs revealed that some failures were due to minor implementation issues - such as handling edge cases or correcting variable initialization - rather than deeper reasoning flaws.

Summary

  • The paper demonstrates that Phi-4 14B reaches up to a 73.6% pass@3 score when combining Python and C++ solutions on diverse Codeforces challenges.
  • The methodology uses pass@k and semantic consistency metrics to rigorously assess performance across multiple difficulty levels and programming topics.
  • The study highlights that small language models, with minor implementation tweaks, offer a cost-effective alternative to resource-intensive large models in competitive programming.

Code Generation with Small Language Models: A Deep Evaluation on Codeforces

Overview

The paper "Code Generation with Small Language Models: A Deep Evaluation on Codeforces" provides a comprehensive evaluation of Small Language Models (SLMs) on competitive programming tasks. It presents a detailed analysis of five open SLMs—Llama 3.2 3B, Gemma 2 9B, Gemma 3 12B, DeepSeek-R1 14B, and Phi-4 14B—using 280 Codeforces problems across different difficulty levels and topics. The study emphasizes the viability of SLMs as alternatives to LLMs by highlighting their lower computational and deployment costs.

Methodology

The study evaluates the selected SLMs on Codeforces, a platform known for diverse and complex programming challenges. The SLMs were tasked with generating Python code and assessed based on pass rates across different problems. Metrics such as pass@k and semantic consistency were used to gauge the models' performance. Problems were selected based on a wide range of topics and difficulty levels, ensuring a comprehensive evaluation of the models' capabilities.
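The pass@k metric mentioned above is commonly computed with the unbiased estimator introduced for HumanEval-style evaluations: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples passes. A minimal sketch (the function name and signature here are illustrative, not taken from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n -- total samples generated for a problem
    c -- number of samples that passed all tests
    k -- evaluation budget (e.g. k=3 for pass@3)
    """
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is then the mean of `pass_at_k` over all problems; with n = k = 3 it reduces to "did any of the three attempts pass".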

Findings

Phi-4 14B emerged as the leading model, achieving a pass@3 of 63.6% for Python code, nearing the performance of the proprietary o3-mini-high model, which achieved 86.8%. When both Python and C++ solutions were considered, Phi-4 14B's performance increased to a pass@3 of 73.6%. This indicates that SLMs, particularly Phi-4 14B, demonstrate an effective balance between performance and computational efficiency.
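The cross-language aggregation can be read as: a problem counts as solved if any attempt in either language passes. A hypothetical helper (not the authors' code) makes the bookkeeping explicit:

```python
def aggregated_pass(python_ok: dict[str, bool], cpp_ok: dict[str, bool]) -> float:
    """Fraction of problems solved in at least one language.

    python_ok / cpp_ok map a problem id to whether any attempt
    in that language passed (e.g. the pass@3 outcome per problem).
    """
    problems = set(python_ok) | set(cpp_ok)
    solved = sum(
        python_ok.get(p, False) or cpp_ok.get(p, False) for p in problems
    )
    return solved / len(problems)
```

Because the union can only add solved problems, the aggregated score is always at least the better single-language score, which is how Phi-4 14B moves from 63.6% to 73.6%.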

The analysis of erroneous outputs revealed that some of the model's failures arose from minor implementation issues - such as mishandled edge cases or incorrect variable initialization - rather than deeper reasoning flaws. This suggests that with small corrections, SLMs could close part of the gap to more advanced models.
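To illustrate the failure class the paper describes (this toy example is ours, not drawn from the paper's outputs): initializing an accumulator to 0 instead of the first element is a classic variable-initialization slip that only surfaces on the all-negative edge case, while the underlying algorithm is otherwise sound.

```python
def max_subarray_sum(a: list[int]) -> int:
    """Kadane's algorithm for the maximum subarray sum.

    Initializing best/cur to a[0] (rather than 0) is what makes
    the all-negative edge case, e.g. [-3, -1, -2], come out right.
    """
    best = cur = a[0]  # starting at 0 here would wrongly return 0 for all-negative input
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best
```

A one-token fix of this kind turns a wrong submission into an accepted one, which is why the authors read such failures as implementation slips rather than reasoning errors.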

Implications and Future Directions

The paper demonstrates the potential of SLMs to act as efficient and viable alternatives to LLMs in specialized tasks such as competitive programming. The highlighted balance of performance and efficiency positions SLMs as promising tools in resource-constrained environments.

Future research could extend the benchmark to higher Elo-rated problems, enhance prompting strategies, explore language-specific optimizations, and develop lightweight tuning techniques. Expanding the evaluation to more diverse problem sets from other coding platforms could further validate the adaptability and effectiveness of SLMs in different contexts.

Conclusion

The evaluation of code generation capabilities of SLMs on Codeforces challenges underscores their potential as reliable alternatives to more resource-intensive LLMs. While proprietary models currently outperform open models, the study's findings indicate that through careful optimization and the combination of outputs across multiple languages, SLMs like Phi-4 14B can deliver competitive results. This research paves the way for further advancements and applicability of SLMs in real-world programming tasks, offering a direction for future work in AI-assisted coding.
