Atlas-Free Voxel-Level Models
- Atlas-free voxel-level models represent volumetric data without pre-defined anatomical templates, preserving native spatial fidelity and avoiding template-induced bias.
- They employ advanced architectures like 3D U-Nets, Vision Transformers, and FPNs with self-supervised and contrastive learning techniques.
- These models enhance segmentation, classification, and generalization in 3D medical imaging, neuroimaging, and materials science applications.
Atlas-free voxel-level foundation models constitute a paradigm within machine learning that enables direct modeling and representation of volumetric data at native spatial resolutions, without recourse to external anatomical templates or spatial atlases. These models are predominantly applied in 3D medical imaging, neuroimaging, and materials science, and are characterized by their ability to generalize, segment, or encode arbitrary input volumes on a per-voxel basis, relying entirely on data-driven priors and learned hierarchical features. The absence of explicit atlas-based alignment distinguishes them from traditional region-of-interest (ROI) or atlas-parcellated approaches, ensuring unbiased spatial representation and maximal spatial fidelity.
1. Foundational Concepts and Motivations
Atlas-free, voxel-level foundation models address the need for unbiased, generalizable volumetric representation learning in domains where spatial correspondence to a template is infeasible, potentially misleading, or computationally prohibitive. In 3D medical imaging and neuroimaging, imposing an atlas or spatial template may introduce interpolation artifacts, anatomical bias, or fail to account for population-level variability—especially in pathologic or cross-modality settings. These models forgo any spatial normalization, external coordinate warping, or handcrafted region definitions. Instead, they employ strategies that capture multi-scale, local-to-global information across the full native voxel grid, learning representations that transfer across downstream tasks such as classification, regression, segmentation, or structural analysis (An et al., 11 Jul 2025, Wang et al., 26 Dec 2025, Wang et al., 30 Jan 2026, He et al., 2024).
2. Model Architectures and Embedding Strategies
Several families of atlas-free voxel-level models have emerged, differing in their neural architectures, pretraining objectives, and input paradigms:
- 3D U-Net and CNN Backbones: Models such as vesselFM and VISTA3D use deep 3D U-Nets with instance normalization, skip connections, and multi-resolution feature maps, enabling direct segmentation and per-voxel prediction (Wittmann et al., 2024, He et al., 2024).
- Vision Transformers (ViT) and Hybrid Architectures: TAP-CT adapts 3D ViTs with patch embedding and depth-aware positional encodings, supporting volumetric masked or contrastive self-supervised learning (Veenboer et al., 30 Nov 2025). Omni-fMRI and SLIM-Brain employ transformer-based tokenizers with hierarchical or dynamic 3D partitioning schemes (Wang et al., 30 Jan 2026, Wang et al., 26 Dec 2025).
- Feature Pyramid Networks (FPNs): vox2vec constructs per-voxel embeddings by concatenating features from multi-scale FPN levels, supporting fine-grained as well as global semantic representation (Goncharov et al., 2023).
- Masked Autoencoders (MAE) and Contrastive Learning: A common objective is to reconstruct masked subvolumes (patches, units) or maximize agreement between representations from different augmentations of the same voxel, as in SLIM-Brain's Hiera-JEPA or vox2vec’s InfoNCE contrastive loss (Wang et al., 26 Dec 2025, Goncharov et al., 2023).
- Random Planar Reduction and 2D-3D Fusion: Raptor generates volumetric embeddings by embedding triaxial 2D slices using frozen 2D foundation models and compressing them via random projections, leveraging the Johnson–Lindenstrauss lemma for geometric distance preservation (An et al., 11 Jul 2025).
- Atlas-free in Materials Science: Polycrystal foundation models operate on raw quaternion fields using 3D ViT masked autoencoders, aggregating spatial and orientational coherence without morphological templates (Wei et al., 7 Dec 2025).
The distinguishing feature remains the commitment to learning on the native voxel grid (no affine alignment or parcellation), with only local spatial pooling, patchification, or random spatial masking as architectural biases.
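As a concrete illustration of the per-voxel embedding idea used by FPN-based models such as vox2vec, the following is a minimal numpy sketch (not the actual vox2vec implementation; the function name, toy pyramid shapes, and nearest-neighbor index mapping are illustrative assumptions): each voxel's descriptor is formed by indexing every pyramid level at the voxel's down-scaled coordinates and concatenating the resulting feature vectors.

```python
import numpy as np

def voxel_embedding(feature_maps, voxel_zyx, volume_shape):
    """Concatenate, for one voxel, the feature vectors from every pyramid
    level, indexing each level at the voxel's down-scaled coordinates.

    feature_maps: list of arrays shaped (C_l, D_l, H_l, W_l), one per level.
    voxel_zyx:    (z, y, x) index of the voxel in the full-resolution grid.
    volume_shape: (D, H, W) of the full-resolution volume.
    """
    parts = []
    for fm in feature_maps:
        _, d, h, w = fm.shape
        # Map the full-resolution coordinate onto this level's coarser grid.
        z = min(voxel_zyx[0] * d // volume_shape[0], d - 1)
        y = min(voxel_zyx[1] * h // volume_shape[1], h - 1)
        x = min(voxel_zyx[2] * w // volume_shape[2], w - 1)
        parts.append(fm[:, z, y, x])
    return np.concatenate(parts)  # one dense descriptor per voxel

# Toy pyramid: three levels at strides 1, 2, 4 over an 8^3 volume.
rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(16, 8, 8, 8)),
           rng.normal(size=(32, 4, 4, 4)),
           rng.normal(size=(64, 2, 2, 2))]
emb = voxel_embedding(pyramid, (5, 3, 7), (8, 8, 8))
print(emb.shape)  # (112,) = 16 + 32 + 64
```

Because every voxel receives the same fixed-length descriptor regardless of its position, the embedding can feed both dense prediction heads and global pooling without any atlas alignment.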
3. Training Regimes and Data Pipelines
Atlas-free models are typically trained under large-scale self-supervised or multi-source supervised paradigms:
- Self-Supervised Learning (SSL): Masked reconstruction (MAE), hierarchical contrastive frameworks (vox2vec, Adam), and teacher–student momentum (DINOv2 in TAP-CT) are widely employed. In SSL, the foundation model may be pretrained on hundreds of thousands of volumes (e.g., TAP-CT: 105K CTs; polycrystal informatics: 100K microstructures) (Veenboer et al., 30 Nov 2025, Wei et al., 7 Dec 2025).
- Synthetic Data and Domain Randomization: Models such as SynthFM-3D and vesselFM use mathematically-parameterized generators to synthesize anatomically and contrast-diverse volumes, supporting analytical control over label evolution, appearance, and noise, and enabling zero/few-shot generalization to modalities absent from the real training set (Chakrabarty et al., 18 Jan 2026, Wittmann et al., 2024).
- Multi-branch and Multi-modal Training: VISTA3D integrates "prompt-indexed" automatic heads and supervoxel-distilled interactive heads, combining robust class-based annotation pipelines with zero-shot region segmentation via distilled supervoxels from 2D backbones (He et al., 2024).
- Dynamic or Adaptive Subsampling: To mitigate the high memory/compute demand, mechanisms such as dynamic patch partitioning (Omni-fMRI), top-k temporal window selection (SLIM-Brain), and selective patch merging are employed to focus learning capacity on salient or information-rich subvolumes (Wang et al., 30 Jan 2026, Wang et al., 26 Dec 2025).
Typical implementations eschew anatomical registration, spatial normalization, or handcrafted label spaces, instead building all spatial and category priors from data or synthetic generation.
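The masked-reconstruction objective underlying MAE-style pretraining can be sketched as follows. This is a simplified numpy illustration under stated assumptions (cubic volume evenly divisible by the patch size, a fixed 75% mask ratio, MSE over masked voxels only); the helper names are not from any of the cited models.

```python
import numpy as np

def mask_patches(volume, patch=4, mask_ratio=0.75, seed=0):
    """Split a cubic volume into non-overlapping patch^3 blocks and zero out
    a random subset; returns the masked volume and the per-block mask."""
    d, h, w = volume.shape
    gd, gh, gw = d // patch, h // patch, w // patch
    n = gd * gh * gw
    rng = np.random.default_rng(seed)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(mask_ratio * n), replace=False)] = True
    out = volume.copy()
    for i in np.flatnonzero(mask):
        z, rem = divmod(i, gh * gw)
        y, x = divmod(rem, gw)
        out[z*patch:(z+1)*patch,
            y*patch:(y+1)*patch,
            x*patch:(x+1)*patch] = 0.0
    return out, mask.reshape(gd, gh, gw)

def reconstruction_loss(pred, target, mask, patch=4):
    """MSE computed only over the masked patches, as in MAE-style training."""
    vox_mask = np.kron(mask, np.ones((patch, patch, patch), dtype=bool))
    return float(np.mean((pred[vox_mask] - target[vox_mask]) ** 2))

vol = np.random.default_rng(1).normal(size=(8, 8, 8))
masked, mask = mask_patches(vol)
print(int(mask.sum()), "of", mask.size, "patches masked")  # 6 of 8
```

Restricting the loss to masked patches forces the encoder to infer occluded anatomy from visible context, which is the source of the transferable representations these models report.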
4. Evaluation Metrics, Benchmarks, and Empirical Performance
Performance of atlas-free voxel-level foundation models is assessed on diverse 3D benchmarks, often spanning multiple domains and tasks. Key metrics include:
- Per-voxel segmentation accuracy (Dice coefficient, centerline Dice, mean IoU): Used in vesselFM, VISTA3D, TAP-CT, vox2vec (Wittmann et al., 2024, He et al., 2024, Veenboer et al., 30 Nov 2025, Goncharov et al., 2023).
- Classification/Regression on volumetric descriptors: AUROC, accuracy, r², or MSE for disease prediction, structure quantification, or demographic inference (e.g., Raptor on MedMNIST, Omni-fMRI on fMRI phenotypes) (An et al., 11 Jul 2025, Wang et al., 30 Jan 2026).
- Zero-shot and few-shot generalization: Structural segmentation with no or minimal fine-tuning on domains unseen during pretraining, e.g., vesselFM’s >45 Dice-point margin on several modalities, SynthFM-3D’s 2-3× DSC improvement over supervised models in ultrasound (Wittmann et al., 2024, Chakrabarty et al., 18 Jan 2026).
- Probing schemes: Linear and non-linear heads on frozen backbones are standard for evaluating the intrinsic quality of the representation, as in vox2vec’s multi-task CT organ/tumor benchmarks (Goncharov et al., 2023).
Parameter efficiency, memory/computation savings (e.g., dynamic patching in Omni-fMRI reduces attention FLOPs by ~10× (Wang et al., 30 Jan 2026)), and sample efficiency (e.g., SLIM-Brain achieves SOTA with only ~4K fMRI sessions (Wang et al., 26 Dec 2025)) are also reported.
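The Dice coefficient, the dominant per-voxel segmentation metric above, is straightforward to compute; a minimal numpy sketch (the epsilon smoothing term is a common convention, not specific to any cited model):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Soerensen-Dice coefficient between two binary voxel masks:
    2|A ∩ B| / (|A| + |B|), smoothed by eps for empty masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Two partially overlapping 8^3 cubes in a 16^3 grid (512 voxels each,
# overlap region [6:10]^3 = 64 voxels).
a = np.zeros((16, 16, 16), dtype=bool); a[2:10, 2:10, 2:10] = True
b = np.zeros((16, 16, 16), dtype=bool); b[6:14, 6:14, 6:14] = True
print(round(dice(a, b), 3))  # 2*64 / (512 + 512) = 0.125
```

Variants such as centerline Dice reweight this quantity along extracted skeletons to emphasize thin tubular structures like vessels.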
5. Theoretical Guarantees and Methodological Properties
Atlas-free foundation models often leverage theoretical properties to justify or explain their efficacy:
- Distance preservation and random projections: Raptor’s use of the Johnson–Lindenstrauss lemma ensures slice-embedding distances are preserved after random planar tensor reduction, underpinning a formal guarantee on semantic geometry retention (An et al., 11 Jul 2025).
- Hierarchical self-supervision: Anatomically-driven SSL models (Adam) enforce locality and compositionality in the learned feature space, leading to dense, part-whole-aware embedding manifolds (Taher et al., 2023).
- Dynamic scale and masking: Theoretical trade-offs between compressiveness and locality are established via ablation and neural scaling studies, e.g., effect of patch-complexity thresholds in Omni-fMRI (Wang et al., 30 Jan 2026).
- Generalization via synthetic diversity: vesselFM’s and SynthFM-3D’s zero-shot performance is attributed to extensive synthetic domain randomization and flow matching, which enriches the sampling of plausible 3D scenes and intensity distributions, enabling transfer without domain-adaptive modules (Wittmann et al., 2024, Chakrabarty et al., 18 Jan 2026).
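The Johnson–Lindenstrauss distance-preservation property invoked by Raptor can be checked empirically with a Gaussian random projection; this is a generic numerical sketch of the lemma, not Raptor's actual projection pipeline, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k, n = 4096, 256, 50            # ambient dim, projected dim, num points
X = rng.normal(size=(n, d))

# Gaussian random matrix scaled by 1/sqrt(k): a standard JL construction.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

def pairwise_dist(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

iu = np.triu_indices(n, k=1)
ratios = pairwise_dist(Y)[iu] / pairwise_dist(X)[iu]
print(ratios.min(), ratios.max())  # distortions concentrate near 1
```

With k on the order of log(n)/eps^2, all pairwise-distance ratios land in [1 - eps, 1 + eps] with high probability, which is the formal basis for claiming that semantic geometry survives the reduction.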
6. Practical Limitations, Extensions, and Future Directions
Despite their strengths, atlas-free voxel-level foundation models exhibit several practical and methodological limitations:
- Resource constraints: Scaling to extremely large 3D volumes or high spatial resolutions remains memory- and compute-intensive, necessitating innovations such as hierarchical masking or dynamic patching (Wang et al., 30 Jan 2026, Wang et al., 26 Dec 2025).
- Partial geometry handling: Some models (Raptor) assume approximate orthogonality or isotropy in slice features; highly anisotropic or irregular structures may degrade performance (An et al., 11 Jul 2025).
- Task restriction: Binary segmentation (vesselFM), organ-specific adaptation, or lack of explicit modeling of very small-scale features (tiny vessels, fiber tracts) are noted limitations (Wittmann et al., 2024).
- Lack of explicit inter-slice continuity models: Methods such as Raptor and slice-based pipelines do not explicitly encode local 3D neighborhood continuity beyond multi-view aggregation (An et al., 11 Jul 2025).
Potential extensions include:
- Adaptive random projections and structured masking: To improve embedding efficiency or downstream task alignment (e.g., sparse Johnson–Lindenstrauss transforms in Raptor) (An et al., 11 Jul 2025).
- Multi-modal, multi-resolution, and cross-domain fusion: Integrating clinical, anatomical, and imaging data to realize truly universal volumetric foundation models (Veenboer et al., 30 Nov 2025, He et al., 2024).
- Generalization to new physics and sciences: Atlas-free design in crystallography/material informatics (Wei et al., 7 Dec 2025), or synthetic parameterization for transfer to unseen imaging or scientific domains (Chakrabarty et al., 18 Jan 2026).
7. Summary Table: Representative Atlas-Free Voxel-Level Foundation Models
| Model | Architecture | Training Paradigm | Application Domain | Notable Empirical Metrics |
|---|---|---|---|---|
| Raptor | 2D encoder + random projection | Train-free, random projection | 3D medical imaging (MRI/CT) | +3–14% AUROC vs. baselines; 0.8–0.9 r² regression (An et al., 11 Jul 2025) |
| TAP-CT | Volumetric ViT | DINOv2-style SSL | CT, multi-task | DSC 0.582 vs. 0.489 (2D DINOv2) |
| Omni-fMRI | Dynamic-patch ViT | MAE with dynamic patching | fMRI | Outperforms NeuroSTORM, BrainLM |
| SLIM-Brain | 4D Hiera-JEPA | Window selection + JEPA | fMRI, 7 benchmarks | 91.1% ACC (sex), 98.5% (fingers), <2.4 GB |
| vox2vec | 3D FPN | Multi-scale contrastive | CT (organs/tumors) | Linear-probe Dice 69.2–75.5% |
| vesselFM | 3D U-Net | Real + synthetic + flow matching | 3D vessel segmentation | OCTA: 46.9 DSC zero-shot; cross-modality |
| VISTA3D | 3D U-Net with interactive head | Supervoxel distillation & interactive workflows | Multi-organ segmentation | Dice 0.711–0.85 zero-shot, 127 classes |
Each model listed demonstrates direct voxel-level inference in a fully atlas-free setting across diverse imaging modalities, yielding state-of-the-art performance in established evaluation regimes.
References: (An et al., 11 Jul 2025, Veenboer et al., 30 Nov 2025, Wang et al., 30 Jan 2026, Wang et al., 26 Dec 2025, Wittmann et al., 2024, He et al., 2024, Taher et al., 2023, Wei et al., 7 Dec 2025, Goncharov et al., 2023, Chakrabarty et al., 18 Jan 2026)