SynthFM-3D: Synthetic Medical Volume Generation
- SynthFM-3D is an analytical framework generating synthetic 3D volumetric data with controlled anatomical structure, contrast, and noise variability.
- It employs recursive mask evolution and multi-scale Gaussian texture modeling to mimic imaging characteristics across CT, MR, and ultrasound modalities.
- The framework enables training of promptable segmentation models like SAM 2 with strong zero-shot and few-shot performance improvements over supervised methods.
SynthFM-3D is an analytical framework designed for generating synthetic 3D volumetric data with mathematically controlled variability in anatomical structure, contrast, boundary sharpness, and noise. Developed to address the domain gap between natural and medical image statistics in deep foundation models, SynthFM-3D enables the training of promptable segmentation models—such as SAM 2—using exclusively synthetic data. The framework achieves strong zero-shot generalization across modalities (CT, MR, ultrasound) and anatomical structures, outperforming supervised medical segmentation models when real annotations are scarce or unavailable (Chakrabarty et al., 18 Jan 2026).
1. Mathematical Parameterization of Volumetric Variability
The core of SynthFM-3D is a family of parametric generative models for 3D medical image volumes. Anatomical label volumes are produced by recursively evolving a per-slice binary mask for each label class along the depth dimension, deriving each slice's mask from the previous slice's by stochastic morphological growth or shrinkage.
Here, the growth/shrinkage rates, the slice index at which erosion begins, and a per-organ "birth" flag are randomly sampled structural parameters for each label class.
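The recursive mask evolution can be sketched as follows. This is an illustrative reading of the shape model, not the paper's implementation: the function name `evolve_masks`, the rate ranges, and the use of `scipy.ndimage` morphology are all assumptions.

```python
import numpy as np
from scipy import ndimage


def evolve_masks(seed_mask: np.ndarray, depth: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Grow a (depth, H, W) binary volume from a 2D seed mask by
    stochastic slice-wise dilation/erosion (illustrative parameters)."""
    grow_rate = rng.uniform(0.3, 0.7)             # per-slice dilation probability
    erode_from = rng.integers(depth // 2, depth)  # slice where erosion begins
    volume = np.zeros((depth,) + seed_mask.shape, dtype=bool)
    volume[0] = seed_mask.astype(bool)
    for z in range(1, depth):
        prev = volume[z - 1]
        if z >= erode_from:
            # past the erosion-start slice, the structure shrinks
            volume[z] = ndimage.binary_erosion(prev)
        elif rng.random() < grow_rate:
            volume[z] = ndimage.binary_dilation(prev)
        else:
            volume[z] = prev
    return volume
```

Each sampled parameter set yields a different plausible 3D shape from the same 2D seed, which is the source of the geometric diversity described above.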
The framework employs an explicit intensity model that layers multi-scale Gaussian textures and randomly samples a per-label intensity scale. At each slice depth, partial-volume effects are produced by blurring the label masks with a Gaussian kernel of randomly sampled width. Backgrounds and per-label textures are perturbed by Gaussian noise, controlling contrast and apparent noise statistics.
Structural and appearance parameters are sampled independently, yielding a high diversity of 3D volumetric realizations per input 2D seed mask. All image volumes are finally normalized to a standard intensity range for compatibility with image-model input conventions (Chakrabarty et al., 18 Jan 2026).
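The appearance model can be sketched per slice as below. This is a minimal illustration of the layered construction, not the paper's code: the texture scales, intensity ranges, blur widths, and the choice of [0, 1] as the normalization range are all placeholder assumptions.

```python
import numpy as np
from scipy import ndimage


def render_slice(label_map: np.ndarray, n_labels: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Render one image slice from an integer label map: noisy background,
    multi-scale Gaussian texture and a sampled intensity per label, and
    Gaussian mask blur to mimic partial-volume boundaries."""
    h, w = label_map.shape
    img = rng.normal(0.0, 1.0, (h, w))  # Gaussian-noise background
    for c in range(1, n_labels + 1):
        # multi-scale texture: sum of noise fields smoothed at several scales
        tex = sum(ndimage.gaussian_filter(rng.normal(0, 1, (h, w)), s)
                  for s in (1.0, 4.0, 16.0))
        intensity = rng.uniform(0.5, 2.0)  # per-label intensity scale
        soft = ndimage.gaussian_filter((label_map == c).astype(float),
                                       rng.uniform(0.5, 2.0))  # soft boundary
        img = img * (1.0 - soft) + soft * (intensity + 0.2 * tex)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-8)  # normalize to [0, 1] (assumed range)
```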
2. Synthetic Data Generation Pipeline
A typical generation pipeline for a synthetic volume proceeds as follows:
- Input: a 2D multi-class seed mask and a target depth.
- Parameter Sampling: Randomly sample shape and appearance parameter sets for each label class.
- Shape Synthesis: Initialize the first slice from the seed mask, then evolve the class masks along the depth dimension by stochastic morphological growth/shrinkage.
- Label Map Composition: At each depth, assign class labels exclusively by mask priority.
- Texture and Intensity Sampling: For each class, construct multi-scale textures and sample intensity scales.
- Appearance Synthesis: At each depth, blend background noise, textures, and blurred label masks to form the image slice.
- Normalization: Linearly scale the volume to the standard intensity range.
Because the itemized randomizations cover geometric, radiometric, and partial-volume factors, the generated dataset collectively spans broad anatomical and appearance diversity, approximating cross-modality (CT, MR, US) variation.
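The label-map composition step, where overlapping class masks are resolved so that each voxel carries exactly one label, can be illustrated as follows. The priority convention here (lower class index wins on overlap) is an assumption for illustration; the paper's exact rule may differ.

```python
import numpy as np


def compose_labels(masks: np.ndarray) -> np.ndarray:
    """masks: (C, H, W) boolean stack, class 0 highest priority.
    Returns an (H, W) integer label map: 0 = background, c + 1 = class c,
    with higher-priority classes overwriting lower-priority ones."""
    labels = np.zeros(masks.shape[1:], dtype=np.int32)
    for c in range(masks.shape[0] - 1, -1, -1):  # lowest priority first
        labels[masks[c]] = c + 1                 # later (higher-priority) writes win
    return labels
```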
3. Model Tuning Protocol with SynthFM-3D
SynthFM-3D is explicitly designed for training foundation models—particularly promptable segmentation architectures—without recourse to real medical image labels. The SAM 2.1 (Hiera-B+) checkpoint is fine-tuned on a set of 10,000 synthetic SynthFM-3D volumes using the default SAM 2 hyperparameters (learning rate, optimizer configuration, batch size). Only synthetic data is used; real images and annotations are not incorporated at any phase.
Prompts at training time include:
- Single-click (object centroid) and triple-click (centroid plus two random interior clicks).
- Prompt-targeted mask prediction loss (IoU-based), as in the original SAM 2 formulation.
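The click-prompt scheme above can be sketched as a simple sampling routine. This is an illustrative construction, not the paper's code: the function name and the exact sampling of interior clicks are assumptions, and note that the centroid of a highly non-convex mask can fall outside it (not handled here).

```python
import numpy as np


def sample_clicks(mask: np.ndarray, rng: np.random.Generator,
                  triple: bool = True) -> list:
    """Return foreground click prompts for a binary mask:
    the centroid, plus two random interior points when triple=True."""
    ys, xs = np.nonzero(mask)
    clicks = [(int(ys.mean()), int(xs.mean()))]  # centroid click
    if triple:
        idx = rng.choice(len(ys), size=2, replace=False)
        clicks += [(int(ys[i]), int(xs[i])) for i in idx]  # interior clicks
    return clicks  # list of (row, col) prompts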
No additional forms of regularization or task-specific adaptation are introduced. Training is conducted for 40 epochs, which matches the schedule used in the VideoSAM variant of SAM 2 (Chakrabarty et al., 18 Jan 2026).
4. Empirical Results and Comparative Performance
SynthFM-3D’s pre-training regime is evaluated in zero-shot and few-shot paradigms across three medical imaging modalities (CT, MR, ultrasound) on public datasets including TotalSegmentator, AMOS, and CAMUS. Eleven anatomical structures are assessed, ranging from abdominal organs to cardiac chambers.
Zero-shot results: When fine-tuned on SynthFM-3D volumes, SAM 2 exhibits substantial Dice Similarity Coefficient (DSC) increases over the vanilla SAM 2 baseline:
| Modality | Structure | SAM 2 DSC (%) | SynthFM-3D DSC (%) |
|---|---|---|---|
| CT | Kidney (L) | 63.5 | 79.9 |
| CT | Spleen | 44.7 | 83.2 |
| MR | Bladder | 7.2 | 72.2 |
| MR | Spleen | 31.8 | 71.9 |
| US | Left Atrium | 19.5 | 57.4 |
| US | LV Endocardium | 27.5 | 63.8 |
On cardiac ultrasound (CAMUS), SynthFM-3D achieves 2–3× higher Dice than the supervised SAM-Med3D (e.g., left atrium: 57.4 vs 25.8; LV endocardium: 63.8 vs 21.6). In few-shot fine-tuning with only 2–5 annotated samples, SynthFM-3D pre-training closes the gap to the fully supervised nnU-Net on some structures (Chakrabarty et al., 18 Jan 2026).
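The Dice Similarity Coefficient (DSC) reported throughout these comparisons is the standard overlap measure; for binary masks it reduces to:

```python
import numpy as np


def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """DSC = 2|P ∩ G| / (|P| + |G|) for binary masks; eps guards
    against division by zero when both masks are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum() + eps))
```

The tabulated scores are this quantity expressed as a percentage.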
No formal ablation of individual synthetic variability components is reported, leaving open the precise contribution of shape versus intensity randomness.
5. Architectural Insights and Core Principles
SynthFM-3D’s success is ascribed to several design traits:
- Explicit 3D shape variability: Recursive slice-wise mask operations (growth/shrinkage, birth/death) ensure the synthetic data captures realistic volumetric continuity and anatomical context—characteristics underrepresented in 2D natural or synthetic training schemes.
- Analytical appearance modeling: The use of multi-scale Gaussian mixtures for texture, sampled intensity scales per anatomical structure, and parametric blur for boundary softening ensures coverage of the contrast and partial-volume statistics seen in CT, MR, and US.
- Noise diversity: Gaussian perturbations at both background and label texture levels confer cross-scanner invariance, enabling promptable models to generalize to new modalities.
- Decoupling structure and appearance: Independent sampling of anatomical and visual parameters yields a combinatorial data manifold, providing diversity beyond what template-driven or weakly-augmented approaches afford.
These principles facilitate the training of foundation models that are insensitive to specific scanner properties, annotation artifacts, or organ shapes.
6. Limitations and Prospects for Extension
The current instantiation of SynthFM-3D employs basic Gaussian noise models and texture mixtures. Real-world imaging phenomena—such as ultrasound speckle, MR field inhomogeneity, or complex soft-tissue boundary artifacts—are not explicitly captured. Potential directions include:
- Physics-based simulation of scanner-specific distortions (e.g., wave-propagation models for US speckle).
- More sophisticated generative models for texture, potentially learned from real distributions.
- Extension to motion sequences (4D data in cardiac or Doppler imaging).
- Incorporation of multi-modal synthesis (joint PET-CT simulation).
This suggests that SynthFM-3D can serve as a general analytical backbone for annotation-free domain adaptation in a variety of volumetric imaging modalities.
7. Significance for Foundation Model Training
SynthFM-3D establishes a scalable, annotation-free methodology for bridging the performance gap of promptable segmentation models across biomedical modalities. By sufficiently sampling the 3D parametric space of anatomical and contrast variation, SynthFM-3D makes it possible for segmentation models to learn features that are robust to scanner physics, tissue contrast, and shape statistics. The observed zero-shot and few-shot improvements over both vanilla SAM 2 and advanced supervised methods (e.g., SAM-Med3D) indicate that analytical data synthesis can outperform supervised transfer in out-of-distribution and under-annotated regimes (Chakrabarty et al., 18 Jan 2026). This analytic approach is modality-agnostic and extensible to additional domains and tasks.