Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks

Published 3 Jul 2025 in cs.SE | (2507.03160v4)

Abstract: The recent advancements of Small Language Models (SLMs) have opened new possibilities for efficient code generation. SLMs offer lightweight and cost-effective alternatives to LLMs, making them attractive for use in resource-constrained environments. However, empirical understanding of SLMs, particularly their capabilities, limitations, and performance trade-offs in code generation, remains limited. This study presents a comprehensive empirical evaluation of 20 open-source SLMs ranging from 0.4B to 10B parameters on five diverse code-related benchmarks (HumanEval, MBPP, Mercury, HumanEvalPack, and CodeXGLUE). The models are assessed along three dimensions: i) functional correctness of generated code, ii) computational efficiency, and iii) performance across multiple programming languages. The findings of this study reveal that several compact SLMs achieve competitive results while maintaining a balance between performance and efficiency, making them viable for deployment in resource-constrained environments. However, achieving further improvements in accuracy requires switching to larger models. These models generally outperform their smaller counterparts, but they require much more computational power. We observe that for 10% performance improvements, models can require nearly a 4x increase in VRAM consumption, highlighting a trade-off between effectiveness and scalability. Besides, the multilingual performance analysis reveals that SLMs tend to perform better in languages such as Python, Java, and PHP, while exhibiting relatively weaker performance in Go, C++, and Ruby. However, statistical analysis suggests these differences are not significant, indicating a generalizability of SLMs across programming languages. Based on the findings, this work provides insights into the design and selection of SLMs for real-world code generation tasks.


Summary

The paper presents an empirical study evaluating 20 open-source small language models on five code generation benchmarks, revealing performance differences grouped by model size.
It employs a unified zero-shot experimental setup to measure accuracy, VRAM usage, and inference time, thereby quantifying efficiency trade-offs.
The study finds that optimized smaller models can rival larger ones, offering significant benefits for resource-constrained coding environments.

An Empirical Study of Small Language Models for Code Generation

Introduction

The paper "Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks" (2507.03160) presents an empirical analysis of Small Language Models (SLMs) on code generation tasks. Lightweight models are increasingly used because of their efficiency, so the study evaluates 20 open-source SLMs, ranging from 0.4 billion to 10 billion parameters, across five diverse code-related benchmarks. The goal is to understand the balance between performance and memory efficiency that these models offer.

Methodology

The methodological approach comprises a structured evaluation pipeline divided into three phases: model and benchmark selection, unified experimental setup, and systematic data analysis.

Model and Benchmark Selection

The study includes 20 open-source decoder-only SLMs, selected based on criteria such as release timeline, community engagement, and open-source licensing. These models are divided into three groups by parameter size for direct comparison:

Group 1: Up to 1.5B parameters
Group 2: More than 1.5B to 3B parameters
Group 3: More than 3B to 10B parameters
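The grouping rule above can be written as a small lookup. This is a minimal sketch: the function name `size_group` and the model names and sizes are hypothetical placeholders, not the models evaluated in the paper; only the three boundaries (1.5B, 3B, 10B) come from the text.

```python
def size_group(params_billion: float) -> str:
    """Map a parameter count (in billions) to one of the three size groups."""
    if not 0 < params_billion <= 10:
        raise ValueError("parameter count is outside the range covered by the groups")
    if params_billion <= 1.5:
        return "Group 1"  # up to 1.5B
    if params_billion <= 3:
        return "Group 2"  # more than 1.5B, up to 3B
    return "Group 3"      # more than 3B, up to 10B


# Hypothetical model names and sizes, for illustration only.
models = {"tiny-coder": 0.5, "mid-coder": 1.5, "base-coder": 3.0, "big-coder": 7.0}
groups = {name: size_group(size) for name, size in models.items()}
print(groups)
```

Note that the boundaries are inclusive on the upper side, so a 1.5B model falls in Group 1 and a 3B model in Group 2.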

Five benchmarks were chosen based on their ability to cover a wide scope of code generation tasks and support multiple programming languages. These include HumanEval, MBPP, Mercury, CodeXGLUE, and HumanEvalPack.

Figure 1: Overview of the research design and evaluation workflow.

Experimental Setup

The study uses zero-shot prompting under specified decoding configurations: each model receives only the task prompt, with no in-context examples. Experiments ran on two distinct hardware platforms with compatible VRAM and CPUs, and an evaluation framework managed the runs so that VRAM usage and inference time were measured consistently.

Experimental Results

The experimental evaluation provides insight into the performance, efficiency, and multilingual capabilities of SLMs.

Code Generation Performance and Stability

The results indicate that the largest models typically achieve the highest functional correctness. However, several smaller SLMs from Groups 1 and 2 reach competitive accuracy, which suggests that careful design and optimization can partly compensate for a smaller parameter count. The study concludes that model size has a significant effect on performance, and that this effect holds across the three major coding benchmarks rather than depending heavily on which benchmark is used.

Figure 2: Standard deviation (SD) and coefficient of variation (COV) of the models. Subplots show the scores for (a) pass@1, (b) pass@2, (c) pass@5, and (d) pass@10.
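For readers unfamiliar with the metrics in Figure 2, the sketch below shows one common way to compute them. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests; this summary does not state which estimator the paper uses. The coefficient of variation is taken here as the sample standard deviation divided by the mean, which is also an assumption. All numbers are made up.

```python
from math import comb
from statistics import mean, stdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation relative to the mean (unitless)."""
    return stdev(scores) / mean(scores)


# Hypothetical run: 10 samples for a problem, 3 of them pass.
for k in (1, 2, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")

# Hypothetical pass@1 scores of one model on three benchmarks.
print(f"COV = {coefficient_of_variation([0.40, 0.50, 0.60]):.2f}")
```

A low COV means a model's scores vary little relative to their average, which is how the figure expresses stability.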

Performance-Efficiency Trade-offs

The analysis reveals a non-linear relationship between model size and VRAM usage: the authors observe that a 10% performance improvement can require nearly a 4x increase in VRAM consumption. This highlights the scalability challenges that larger models face in resource-constrained environments. Notably, some smaller SLMs strike a favorable balance between accuracy and computational efficiency, making them well suited to memory-limited applications.

Figure 3: Performance-Efficiency Trade-offs: pass@1 vs. Inference Time with VRAM Usage.
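One way to act on such a trade-off when choosing a model is to keep only the candidates that no other candidate beats on both accuracy and memory (a Pareto front). The paper summary does not describe this procedure; it is offered here as an illustrative sketch, and the `Result` type, model names, and numbers are all hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    pass_at_1: float  # higher is better
    vram_gb: float    # lower is better


def pareto_front(results: list[Result]) -> list[Result]:
    """Return the results that are not dominated by any other result."""
    front = []
    for r in results:
        dominated = any(
            o.pass_at_1 >= r.pass_at_1
            and o.vram_gb <= r.vram_gb
            and (o.pass_at_1 > r.pass_at_1 or o.vram_gb < r.vram_gb)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Hypothetical measurements, for illustration only.
results = [
    Result("model-a", 0.30, 2.0),
    Result("model-b", 0.45, 4.0),
    Result("model-c", 0.40, 6.0),   # dominated: model-b is more accurate and smaller
    Result("model-d", 0.50, 16.0),  # more accurate than model-b, but 4x the VRAM
]
print([r.name for r in pareto_front(results)])
```

In this made-up data, moving from model-b to model-d buys a modest accuracy gain at four times the memory, which is the kind of decision the figure is meant to inform.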

Multilingual Consistency

Language-specific evaluations on benchmarks such as HumanEvalPack show better performance for Python, Java, and PHP, whereas Go, C++, and Ruby prove harder, possibly because of their syntactic complexity. Nonetheless, statistical analysis indicates that these differences across languages are not statistically significant.

Figure 4: Distribution of pass@k scores across programming languages in the HumanEvalPack benchmark.

Figure 5: Frequency of programming languages appearing in top BLEU score rankings in the CodeXGLUE Benchmark.
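To make the idea of testing whether per-language differences are significant concrete, here is a simplified permutation test on the spread of per-language mean scores. This summary does not name the test the authors used, so this is a stand-in technique rather than the paper's method; the function names and the scores are hypothetical.

```python
import random
from statistics import mean


def spread_of_means(groups: list[list[float]]) -> float:
    """Test statistic: gap between the highest and lowest group mean."""
    means = [mean(g) for g in groups]
    return max(means) - min(means)


def permutation_p_value(groups: list[list[float]],
                        n_permutations: int = 2000,
                        seed: int = 0) -> float:
    """Share of random relabelings whose spread is at least the observed one."""
    rng = random.Random(seed)
    observed = spread_of_means(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small tolerance so floating-point ties count as "at least as large".
        if spread_of_means(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical pass@1 scores of five models in three languages.
scores = {
    "Python": [0.42, 0.55, 0.31, 0.60, 0.48],
    "Go":     [0.38, 0.50, 0.29, 0.57, 0.41],
    "Ruby":   [0.36, 0.52, 0.27, 0.55, 0.44],
}
p = permutation_p_value(list(scores.values()))
print(f"p-value = {p:.3f}")
```

A large p-value means the observed gap between languages is easily produced by chance relabeling, which matches the kind of conclusion reported above. Because the same models are scored in every language, a paired test would be a closer fit in practice; the unpaired version is used here only for brevity.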

Discussion

The study yields several key observations:

Model Size vs. Performance: Smaller SLMs can approximate the performance of larger models when optimized effectively, suggesting potential benefits of architectural refinements over simple parameter scaling.
Efficiency Trade-offs: For resource-constrained environments, practical deployments may do better to choose models that balance accuracy against memory use rather than those with the highest raw accuracy.
Multilingual Consistency: Although scores vary by language, generalization across programming languages is broadly consistent, which suggests that SLMs remain effective in diverse coding environments.

Conclusion

This study advances our understanding of SLMs for code generation, revealing that strategic architectural enhancements and training regimens can achieve high performance even with limited parameters. Future investigations might focus on expanding model robustness and dynamic task adaptability beyond the constraints of current benchmark assessments. These findings underscore the potential of SLMs to reconcile performance, efficiency, and versatility in practical applications.
