Effective scaling to billion-parameter radiology foundation models

Investigate why the 1.3-billion-parameter Curia-2 g Vision Transformer does not significantly outperform the 303-million-parameter Curia-2 L Vision Transformer. Determine which training strategies or architectural modifications would allow self-supervised multi-modal CT and MRI foundation models to scale effectively into the billion-parameter regime.

Background

Curia-2 introduces several modifications to DINOv2 for radiology and scales Vision Transformer architectures from 86M (ViT-B) and 303M (ViT-L) to 1.3B (ViT-g) parameters. The experiments show a clear scaling benefit from ViT-B to ViT-L, but the 1.3B-parameter Curia-2 g performs only on par with Curia-2 L rather than clearly improving on it.
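The three model sizes follow the usual ViT width/depth recipes. A rough rule of thumb for a pre-norm transformer encoder is 12·d²·L weight parameters per model (4d² for attention and 8d² for a 4× MLP, per block), ignoring embeddings, biases, and norms. A minimal sketch, assuming the standard ViT-B/L shapes and a DINOv2-style ViT-g shape (the exact Curia-2 configurations are not stated here, which is one reason the estimate for ViT-g comes out near 1.1B rather than the reported 1.3B):

```python
def approx_encoder_params(embed_dim: int, depth: int, mlp_ratio: float = 4.0) -> int:
    """Rule-of-thumb transformer encoder size: attention weights (4*d^2)
    plus MLP weights (2*mlp_ratio*d^2) per block, times depth. Ignores
    patch/position embeddings, biases, layer norms, and any head."""
    per_block = 4 * embed_dim**2 + 2 * int(mlp_ratio * embed_dim) * embed_dim
    return per_block * depth

# Standard ViT-B/L shapes and a DINOv2-style ViT-g shape (assumed here;
# Curia-2's actual ViT-g configuration may differ).
configs = {
    "ViT-B": dict(embed_dim=768, depth=12),
    "ViT-L": dict(embed_dim=1024, depth=24),
    "ViT-g": dict(embed_dim=1536, depth=40),
}

for name, cfg in configs.items():
    print(f"{name}: ~{approx_encoder_params(**cfg) / 1e6:.0f}M params")
# ViT-B: ~85M params
# ViT-L: ~302M params
# ViT-g: ~1132M params
```

The estimates track the paper's reported 86M and 303M closely, which suggests the headline counts are dominated by the encoder blocks rather than embeddings or heads.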

This observation suggests that current training recipes and design choices do not fully exploit the capacity of billion-parameter models in radiological self-supervised learning, leaving open the question of what changes are required to realize consistent gains at this scale.

References

Yet, the performance of Curia-2 g remains close to that of Curia-2 L, indicating that the challenge of scaling to billion-parameter models in medical imaging is not yet fully resolved.

Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models (2604.01987 - Saporta et al., 2 Apr 2026), Conclusion