Capacity limitations of DeepSeek-LLM 1.3B under mixed code-and-math one-stage training

Determine whether the degradation in mathematical reasoning without tool use, observed when DeepSeek-LLM 1.3B is trained in a single stage on a mixture of code and mathematical data, is caused by the model's limited capacity to assimilate both code and mathematical data simultaneously.

Background

In the Discussion section, the authors compare two-stage training (code then math) against one-stage training where code and math data are mixed. They observe that mixing code and math in a single stage compromises mathematical reasoning without tool use for the 1.3B-parameter model.
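
To make the two setups concrete, below is a minimal, hypothetical sketch of the schedules being compared. The Stage structure, corpus names, and mixture weights are illustrative placeholders, not the authors' actual training configuration or token budgets.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Stage:
    name: str
    mixture: Dict[str, float]  # corpus -> sampling proportion (sums to 1.0)

# Two-stage schedule: train on code first, then on mathematical data.
two_stage: List[Stage] = [
    Stage("code_stage", {"code": 1.0}),
    Stage("math_stage", {"math": 1.0}),
]

# One-stage schedule: code and math mixed in a single training run.
# The 50/50 split is an arbitrary placeholder, not the paper's ratio.
one_stage: List[Stage] = [
    Stage("mixed_stage", {"code": 0.5, "math": 0.5}),
]

def summarize(schedule: List[Stage]) -> None:
    """Print each stage and its data mixture."""
    for i, stage in enumerate(schedule, start=1):
        mix = ", ".join(f"{c}: {w:.0%}" for c, w in stage.mixture.items())
        print(f"  stage {i} ({stage.name}): {mix}")

print("Two-stage (code then math):")
summarize(two_stage)
print("One-stage (mixed):")
summarize(one_stage)
```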

To explain this degradation, the authors explicitly conjecture that the 1.3B model may lack sufficient capacity to learn both kinds of data simultaneously. This stated uncertainty motivates a targeted investigation into capacity limits under mixed training regimes.
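
One way to make such an investigation concrete is to repeat the one-stage mixed run at more than one model scale and compare the resulting drop in tool-free math accuracy against a math-only control. The sketch below is only an illustration of that comparison: the helpers train_and_eval and mixing_penalty, the model scales, and the mixture weights are assumptions, not a protocol from the paper.

```python
MODEL_SIZES = ("1.3B", "7B")          # scales to compare (illustrative)
REGIMES = {
    "math_only": {"math": 1.0},
    "mixed": {"code": 0.5, "math": 0.5},  # placeholder mixture weights
}

def train_and_eval(model_size: str, mixture: dict) -> float:
    """Placeholder: train a model of `model_size` on `mixture` in one stage
    and return its tool-free math accuracy. Not implemented here."""
    raise NotImplementedError

def mixing_penalty(model_size: str) -> float:
    """Drop in tool-free math accuracy attributable to mixing code into
    the training data, at a given model scale."""
    return (train_and_eval(model_size, REGIMES["math_only"])
            - train_and_eval(model_size, REGIMES["mixed"]))

# If the capacity conjecture holds, mixing_penalty("1.3B") should clearly
# exceed mixing_penalty("7B"); a penalty that persists at larger scale would
# point to causes other than limited capacity.
```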

References

One conjecture is that DeepSeek-LLM 1.3B, due to its limited scale, lacks the capacity to fully assimilate both code and mathematical data simultaneously.

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (arXiv:2402.03300, Shao et al., 2024), Section 6.1, Code Training Benefits Mathematical Reasoning (Results)