Improve transfer of RWML world-model knowledge in weaker base LLMs

Develop training strategies within Reinforcement World Model Learning (RWML) that enable weaker base large language models, such as Qwen2.5-7B, to transfer action-conditioned world-model knowledge, learned via sim-to-real gap rewards, to downstream decision-making in long-horizon, tool-use environments like τ²-bench. The goal is to reduce the observed dependence on base-model capability by improving transfer in weaker models.

Background

The paper reports that on the challenging τ²-bench environment, the benefits of RWML—training an LLM to align simulated next states with observed environment states using embedding-based rewards—depend on the capability of the base model. Weaker models (e.g., Qwen2.5-7B) struggle to transfer learned world-model knowledge to decision-making, whereas stronger models (e.g., Qwen3-8B, Qwen3-30B-A3B) exhibit substantial gains.
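The exact reward function is not specified here; a minimal sketch of an embedding-based sim-to-real gap reward, assuming cosine similarity between embeddings of the simulated and observed next states (the `embed` function below is a toy stand-in for whatever encoder RWML actually uses):

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-characters embedding, normalized to unit length.
    # RWML would use a learned sentence/state encoder instead.
    vec = [0.0] * 128
    for ch in text:
        vec[ord(ch) % 128] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def sim_to_real_reward(simulated_state: str, observed_state: str) -> float:
    # Cosine similarity between the model's simulated next state and the
    # state actually returned by the environment: 1.0 for a perfect match,
    # lower as the simulation diverges from reality.
    a, b = embed(simulated_state), embed(observed_state)
    return sum(x * y for x, y in zip(a, b))
```

Under this assumption, the world-model objective simply rewards the policy for predicting next states whose embeddings land close to the observed ones.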

This dependence limits RWML's applicability to smaller or less capable models. Closing the transfer gap would extend RWML's reach and robustness, enabling broader adoption without reliance on high-capacity base models.
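One candidate training strategy (an assumption of this card, not something the paper prescribes) is to anneal the weight on the world-model reward so that a weaker model first consolidates dynamics knowledge and then shifts credit toward task success:

```python
def mixed_reward(task_reward: float, wm_reward: float,
                 step: int, total_steps: int, beta_max: float = 0.5) -> float:
    # Hypothetical curriculum: beta starts at beta_max and decays linearly
    # to 0, moving the training signal from world-model alignment toward
    # the downstream decision-making objective.
    beta = beta_max * (1.0 - step / total_steps)
    return (1.0 - beta) * task_reward + beta * wm_reward
```

For example, early in training half the signal comes from the world-model reward, while by the final step the policy is optimized purely for task reward. Whether such a curriculum actually helps weaker models transfer is precisely the open question this task poses.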

References

On the challenging τ²-bench, we find the ability to learn and transfer world model knowledge from RWML to decision-making is dependent on the capability of the base model. We leave improving transfer abilities for weaker models to future work.

Reinforcement World Model Learning for LLM-based Agents (arXiv:2602.05842, Yu et al., 5 Feb 2026), Section 4.3 (Impact of Base Model Capability)