CWM: An Open-Weights LLM for Research on Code Generation with World Models

Published 30 Sep 2025 in cs.SE, cs.AI, and cs.LG | (2510.02387v1)

Abstract: We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.


Summary

The paper introduces CWM, a 32-billion-parameter open-weights LLM trained to model what code does when it runs, not only what it looks like as static text.
Training proceeds in stages: pre-training, mid-training on Python execution traces and agentic Docker trajectories, supervised fine-tuning (SFT), and multi-task reinforcement learning (RL).
CWM reaches 65.8% pass@1 on SWE-bench Verified with test-time scaling, ahead of open-weight models of similar size and competitive with much larger or closed-weight models.

Introduction

The paper introduces the Code World Model (CWM), a 32-billion-parameter open-weights LLM designed for research on code generation with world models. Conventional code models treat code as static text and learn to predict it token by token, which teaches them little about what the code does when executed. CWM addresses this limitation by mid-training on a large amount of observation-action trajectories collected from Python interpreters and agentic Docker environments, with the aim of improving code understanding and reasoning in computational environments. The paper presents its results as first steps: how far world modeling improves code generation in practice remains an open research question.

Figure 1: Overview of the CWM training stages and the model checkpoints that we release.

Training Methodology

CWM is trained in several stages. The model first undergoes general pre-training, then mid-training on Python execution traces and agentic interaction data, then supervised fine-tuning (SFT), and finally reinforcement learning (RL). Checkpoints are released after mid-training, SFT, and RL. Mid-training is the stage that sets CWM apart: it exposes the model to a large amount of execution traces and agentic trajectories, so that its predictions are grounded in how program state actually changes as code runs.

Figure 2: CWM format for Python traces. Given a source code context and a marker of the trace starting point, CWM predicts a series of stack frames representing the program states and the actions (executed code).
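
To make the idea of an execution trace concrete, the following minimal sketch records the local variables of a function at each executed line using Python's standard `sys.settrace` hook. It illustrates the kind of observation-action data described above, under stated assumptions: the `trace_execution` helper, the `running_sum` example, and the dictionary layout of each step are hypothetical and are not CWM's actual trace format or data pipeline.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record a snapshot of its locals at each step.

    Hypothetical helper for illustration only; not CWM's trace format.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the function under study.
        if frame.f_code is not code:
            return None
        if event in ("line", "return"):
            steps.append({
                "event": event,
                # Line offset relative to the function's "def" line.
                "line": frame.f_lineno - code.co_firstlineno,
                # Copy the locals so later mutations do not alter the snapshot.
                "locals": dict(frame.f_locals),
            })
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Restore whatever tracer was active before (often None).
        sys.settrace(previous)
    return result, steps


def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total


result, steps = trace_execution(running_sum, 3)
for step in steps:
    print(step["event"], step["line"], step["locals"])
```

Each "line" snapshot shows the state just before that line executes, so a sequence of snapshots pairs every executed line (the action) with the program state around it (the observation).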

Reinforcement Learning Approach

Post-training further refines CWM's reasoning on complex programming tasks. It consists of supervised fine-tuning followed by agentic multi-task RL in verifiable coding, math, and multi-turn software engineering environments, where CWM learns to solve tasks through tool use and multi-turn interaction. The RL stage uses a variant of Group Relative Policy Optimization (GRPO) that incorporates several recent modifications to keep asynchronous training efficient.
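
The core idea behind GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The sketch below shows the group-relative advantage in its commonly described baseline form (reward minus group mean, divided by group standard deviation). This is an assumption for illustration: the paper uses a GRPO variant whose details are not given here, and it may normalize differently. The function name and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each reward against the other samples for the same prompt.

    Baseline GRPO-style normalization (assumed form, not CWM's exact variant).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions for one prompt: two pass the tests, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Passing samples receive a positive advantage and failing ones a negative advantage, while a group in which every sample gets the same reward contributes no learning signal.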

Figure 3: Async RL systems overview. Worker nodes generate trajectory batches from multiple RL environments and send them to trainer nodes via a transfer queue.
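
The worker-to-trainer hand-off in the caption above can be pictured as a producer-consumer pattern. The toy sketch below uses threads and a bounded `queue.Queue` from the Python standard library in place of worker nodes, trainer nodes, and the transfer queue. Everything in it is an assumption made for illustration: the function names, the batch layout, and the queue size are hypothetical, and the real system is distributed across machines.

```python
import queue
import threading


def worker(env_id, num_batches, transfer_queue):
    """Stand-in for a worker node: generates trajectory batches for one environment."""
    for step in range(num_batches):
        batch = {"env": env_id, "step": step, "trajectories": [f"traj-{env_id}-{step}"]}
        # put() blocks when the queue is full, which throttles fast workers.
        transfer_queue.put(batch)


def trainer(transfer_queue, expected, consumed):
    """Stand-in for a trainer node: consumes batches as soon as they arrive."""
    for _ in range(expected):
        consumed.append(transfer_queue.get())


def run_demo(num_workers=3, num_batches=2):
    transfer_queue = queue.Queue(maxsize=4)
    consumed = []
    trainer_thread = threading.Thread(
        target=trainer, args=(transfer_queue, num_workers * num_batches, consumed)
    )
    worker_threads = [
        threading.Thread(target=worker, args=(i, num_batches, transfer_queue))
        for i in range(num_workers)
    ]
    trainer_thread.start()
    for t in worker_threads:
        t.start()
    for t in worker_threads:
        t.join()
    trainer_thread.join()
    return consumed


batches = run_demo()
print(len(batches), "batches consumed")
```

Because generation and training are decoupled by the queue, neither side waits for the other to finish a full round, which is the property an asynchronous design is after.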

Results and Performance

On SWE-bench Verified, CWM reaches a pass@1 score of 65.8% with test-time scaling, ahead of open-weight models with similar parameter counts and competitive with much larger or closed-weight LLMs. It also performs strongly on competitive programming and mathematical reasoning, scoring 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024.
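
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled solutions to a problem is correct. A widely used way to estimate it from n samples, of which c are correct, is the unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator. Whether the scores reported here were computed exactly this way is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct ones.

    Standard unbiased estimator; assumed here, not confirmed as the paper's procedure.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples for one problem, 3 of them correct.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that are correct, and a benchmark-level score is the mean of this value over all problems.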

Figure 4: On SWE-bench Verified, CWM outperforms open-weight models with similar parameter counts and is competitive with much larger or closed-weight LLMs.

Implications and Future Directions

CWM is intended as a testbed for research on how world modeling can improve code generation through reasoning and planning in computational environments. The paper presents first steps in three directions: using world models to benefit agentic coding, simulating Python code execution step by step, and early results on how reasoning can benefit from such simulation. Further research is encouraged to clarify how much world models contribute to reasoning and planning.

Conclusion

CWM is an early step toward integrating world models into code generation, narrowing the gap between static code representation and an understanding of dynamic execution. Its staged training shows how grounding learning in execution dynamics can improve model capabilities, and the released checkpoints (after mid-training, SFT, and RL) give researchers a basis for further work on code world modeling.
