
SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

Published 9 Sep 2025 in cs.AI (arXiv:2509.07858v1)

Abstract: Existing code LLMs often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.

Summary

  • The paper demonstrates an iterative self-distillation framework that enables small-scale open-source LLMs to synthesize high-quality code instruction data.
  • Models trained on the resulting data achieve competitive performance on benchmarks like HumanEval and MBPP while reducing dependency on expensive proprietary LLMs.
  • The methodology, validated through ablation studies and theoretical analysis, offers a cost-efficient and rapidly converging approach to code LLM development.


Introduction

The paper introduces SCoder, a methodology and model family that addresses the high cost and dependency on proprietary LLMs for code instruction data synthesis in code LLM development. The central contribution is an iterative self-distillation framework that enables small-scale open-source LLMs (7B–14B parameters) to serve as effective code instruction data synthesizers. This approach reduces reliance on large, closed-source models (e.g., GPT-3.5/4) and demonstrates that small models, when properly bootstrapped, can generate high-quality instruction data for code LLM fine-tuning.

Motivation and Problem Statement

Instruction tuning is critical for code LLMs, but the prevailing paradigm depends on large-scale, high-quality instruction datasets distilled from proprietary LLMs. This process is cost-prohibitive and limits accessibility. The paper investigates whether small-scale open-source LLMs can be transformed into competitive data synthesizers, thus democratizing the construction of code instruction datasets and reducing costs.

Methodology

Data Synthesizer Bootstrapping

The process begins by training small-scale LLMs (e.g., Qwen2.5-Coder-7B/14B, Llama3.1-8B) on a limited set of high-quality instruction data distilled from proprietary LLMs. This initial "enhanced synthesizer" is then iteratively improved via self-distillation, eliminating further dependence on proprietary data.

Iterative Self-Distillation Framework

Each iteration of the self-distillation process consists of:

  1. Multi-Checkpoint Sampling: For each code snippet and prompt, outputs are sampled from multiple checkpoints and multiple decoding runs, increasing diversity and robustness.
  2. Multi-Aspect Scoring: Candidate outputs are evaluated using a learned scorer that aggregates multiple quality aspects (e.g., problem-solution consistency, correctness) into a weighted score. The weights are optimized via ridge regression to maximize downstream code LLM performance.
  3. Gradient-Based Influence Estimation: To select the most influential samples, the cosine similarity between the gradient induced by a candidate sample and the average gradient of proprietary LLM-distilled samples is computed (using LoRA-adapted reference models and Johnson-Lindenstrauss projections for efficiency). Samples with the highest influence are retained for the next training iteration.
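The multi-aspect scoring step (2) can be sketched as a closed-form ridge fit over per-sample aspect scores. Everything below is an illustrative stand-in, not the paper's implementation: the aspect matrix, the downstream-performance proxy signal, and the regularization strength are all assumptions.

```python
import numpy as np
from numpy.linalg import solve

# Hypothetical per-sample aspect scores (rows: candidate samples; columns:
# aspects such as problem-solution consistency, correctness, ...).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
# Proxy downstream-performance signal (in the paper this comes from
# measured code LLM performance; here it is simulated for illustration).
y = X @ np.array([0.5, 0.3, 0.15, 0.05]) + rng.normal(0, 0.01, 200)

# Ridge regression in closed form: w = (X^T X + lam * I)^{-1} X^T y
lam = 1.0
w = solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Weighted aggregate score per candidate; keep the top-k for the next stage.
scores = X @ w
top_k = np.argsort(scores)[::-1][:50]
```

The closed-form solve keeps the example dependency-free; a library estimator (e.g., scikit-learn's `Ridge`) would fit the same weights.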

This process is repeated, with each iteration generating a larger, higher-quality self-distilled dataset, which is then used to further train the synthesizer.
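The gradient-based influence filter (step 3) can be sketched as cosine similarity between Johnson-Lindenstrauss-projected gradients. The gradient vectors below are random stand-ins for real per-sample LoRA gradients, and the dimensions and cutoffs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 10_000, 256  # full (LoRA) gradient dimension, projected dimension

# Johnson-Lindenstrauss projection: a fixed random Gaussian matrix scaled
# by 1/sqrt(k) approximately preserves inner products and norms.
P = rng.normal(0, 1, size=(k, d)) / np.sqrt(k)

def project(g):
    return P @ g

# Stand-ins for real gradients: reference gradients would come from
# proprietary-distilled samples, candidates from the synthesizer's outputs.
ref_grads = rng.normal(0, 1, size=(32, d))
cand_grads = rng.normal(0, 1, size=(500, d))

ref_mean = project(ref_grads.mean(axis=0))

def influence(g):
    # Cosine similarity in the projected space.
    p = project(g)
    return p @ ref_mean / (np.linalg.norm(p) * np.linalg.norm(ref_mean) + 1e-8)

scores = np.array([influence(g) for g in cand_grads])
keep = np.argsort(scores)[::-1][:100]  # retain the most aligned samples
```

Projecting once with a shared matrix makes the per-sample cost linear in `d`, which is what makes influence estimation tractable at scale.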

SCoder Model Family

Using the instruction datasets generated by the bootstrapped synthesizers, the authors fine-tune DeepSeek-Coder-6.7B-Base to produce the SCoder family. This family includes variants corresponding to different synthesizer backbones (e.g., SCoder-Q7-DS-6.7B, SCoder-Q14-DS-6.7B).

Experimental Results

Benchmarks and Baselines

SCoder models are evaluated on HumanEval, MBPP, LiveCodeBench, and BigCodeBench, using pass@1 as the primary metric. Baselines include both proprietary models (GPT-4-Turbo, GPT-o1) and state-of-the-art open-source models (DeepSeek-Coder-6.7B-Instruct, MagicoderS, WizardCoder-GPT-4, etc.).
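The pass@1 metric reported above is commonly computed with the standard unbiased pass@k estimator; with one generation per problem it reduces to the fraction of problems solved. A minimal sketch (the per-problem outcomes are made up for illustration):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn from n generations of which c are correct, passes.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# With n=1 generation per problem, pass@1 is just the fraction solved:
results = [1, 0, 1, 1]  # hypothetical per-problem correctness counts
score = sum(pass_at_k(1, c, 1) for c in results) / len(results)
# score == 0.75
```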

Main Findings

  • Performance: SCoder models trained on 60K–80K self-distilled samples from small synthesizers match or outperform open-source baselines that rely on 75K–110K proprietary LLM-distilled samples. For example, SCoder-Q14-DS-6.7B achieves 80.5% on HumanEval and 81.0% on MBPP, surpassing all open-source baselines of comparable size.
  • Ablation Studies: Removing multi-checkpoint sampling, multi-aspect scoring, or gradient-based influence estimation leads to significant performance drops (up to 8.9% on BigCodeBench), confirming the necessity of each component.
  • Data Scaling: Increasing the size of self-distilled data leads to monotonic improvements, with diminishing returns after two iterations, indicating convergence of the self-distillation process.
  • Cost Efficiency: The approach reduces proprietary LLM API usage by an order of magnitude (10K vs. 150K–200K samples), with the main cost being the one-time fine-tuning of the synthesizer. The total cost for synthesizer training is estimated at ~$260 on commodity cloud GPUs, compared to thousands of dollars for equivalent proprietary LLM API usage.

Data Quality Analysis

Human and LLM-based evaluations show that the self-distilled data from bootstrapped synthesizers scores higher across all quality aspects compared to standard open-source instruction datasets (e.g., evol-codealpaca-v1).

Theoretical Analysis

The paper provides a formal analysis of the iterative self-distillation process, modeling it as a contraction mapping in the space of model parameters. Under reasonable Lipschitz continuity and contraction assumptions, the process is shown to converge to a unique fixed point, which can be interpreted as a Nash equilibrium between the teacher (synthesizer) and student (target model). The process naturally balances exploration (diverse data generation) and exploitation (retraining from a fixed initialization), with empirical results supporting rapid convergence and stability.
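The convergence claim can be illustrated with a toy numerical fixed-point iteration. The operator below is a deliberately simple linear contraction, not the paper's actual update; the contraction factor and fixed point are arbitrary assumptions chosen to show the geometric error decay that the analysis predicts:

```python
import numpy as np

L = 0.5                              # assumed contraction factor (< 1)
target = np.array([1.0, -2.0, 0.5])  # the (a priori unknown) fixed point

def T(theta):
    # A linear contraction toward `target`: ||T(a)-T(b)|| = L * ||a-b||.
    return target + L * (theta - target)

theta = np.zeros(3)
errors = []
for _ in range(20):
    theta = T(theta)
    errors.append(np.linalg.norm(theta - target))

# By the Banach fixed-point theorem the iterates converge to the unique
# fixed point, with the error shrinking by the factor L each iteration.
```

The geometric decay (`errors[t+1] ≈ L * errors[t]`) mirrors the rapid empirical convergence the paper reports after two iterations.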

Implementation Considerations

  • Synthesizer Training: Fine-tuning small LLMs on 10K proprietary samples, followed by two rounds of self-distillation (20K and 40K samples), is sufficient for strong performance.
  • Sampling and Scoring: Multi-checkpoint and multi-aspect strategies require additional inference passes but are computationally tractable for 7B–14B models.
  • Gradient Influence: LoRA adaptation and gradient projection make influence estimation feasible on a single A100 GPU within hours.
  • Target Model Fine-Tuning: SCoder models are fine-tuned on a mix of standard open-source and self-distilled data, using standard SFT hyperparameters.

Implications and Future Directions

Practical Implications

  • Democratization: The methodology enables organizations without access to proprietary LLMs to build competitive code LLMs using only small open-source models and a modest initial investment in proprietary data.
  • Cost Reduction: The approach dramatically reduces the cost of instruction data synthesis, making large-scale code LLM development more accessible.
  • Generalization: The framework is robust to the choice of reference model for influence estimation and generalizes across different target model architectures.

Theoretical Implications

  • Self-Distillation Dynamics: The formal analysis provides a foundation for understanding convergence and stability in iterative self-distillation, with potential applications beyond code generation.
  • Sample Selection: The integration of gradient-based influence estimation with multi-aspect scoring offers a principled approach to data selection in self-supervised and semi-supervised learning.

Future Work

  • Extension to Other Domains: While the current study focuses on code generation, the methodology may be adapted to other instruction-following tasks, though domain-specific challenges (e.g., data availability, evaluation) must be addressed.
  • Integration with Alternative Paradigms: Combining self-distillation with methods like Self-Instruct or Evol-Instruct could further enhance data diversity and quality.
  • Scaling Laws and Model Size: Further exploration of the relationship between synthesizer size, data quality, and downstream performance is warranted.

Conclusion

SCoder demonstrates that small-scale open-source LLMs, when bootstrapped via iterative self-distillation, can serve as effective and efficient code instruction data synthesizers. This approach enables the construction of high-quality instruction datasets at a fraction of the cost and dependency of prior methods, yielding code LLMs that match or exceed the performance of models trained on large-scale proprietary data. The methodology is theoretically grounded, empirically validated, and broadly applicable, representing a significant advance in scalable, accessible code LLM development.
