Dataset structuring for generalizable multimodal reasoning across tasks

Determine how to structure multimodal training datasets to induce generalizable representations across diverse reasoning tasks when training a single model to be simultaneously proficient at mathematics and computer‑use.

Background

The authors aim to train one model that excels at both mathematical/scientific reasoning and computer‑use (GUI grounding and interaction). They highlight that data composition and scaling strategies may drive different design decisions, such as using a single model versus multiple specialized models.

Their ablations show promising cross‑task transfer at modest scales, but they note that the broader question of how to design datasets that yield generalizable representations across disparate reasoning domains remains unresolved.
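One concrete axis the report ablates is the proportion of mathematics/science versus computer‑use data in the training mixture. A minimal sketch of a fixed‑ratio mixture builder is below; the function name, the per‑example domain tags, and the sampling‑with‑replacement scheme are illustrative assumptions, not the report's actual pipeline.

```python
import random


def mix_domains(math_data, computer_use_data, math_fraction=0.5, seed=0):
    """Build a training mixture by sampling (with replacement) from two
    domain corpora at a target proportion.

    math_fraction controls the expected share of mathematics examples;
    the mixture size equals the combined corpus size. All names here are
    hypothetical placeholders for illustration.
    """
    rng = random.Random(seed)
    total = len(math_data) + len(computer_use_data)
    mixture = []
    for _ in range(total):
        # Draw the domain first, then a uniformly random example from it.
        if rng.random() < math_fraction:
            mixture.append(("math", rng.choice(math_data)))
        else:
            mixture.append(("computer_use", rng.choice(computer_use_data)))
    return mixture
```

Sweeping `math_fraction` and measuring held‑out accuracy on both domains is one simple way to probe the single‑model‑versus‑specialists trade‑off the authors raise, though richer schemes (curricula, temperature‑based sampling) are equally plausible.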

References

How datasets should be structured to induce generalizable representations across diverse reasoning tasks remains an open question in the research community.

Phi-4-reasoning-vision-15B Technical Report  (2603.03975 - Aneja et al., 4 Mar 2026) in Section 3.2 (Mathematics and Science vs. Computer-Use Data Proportion)