Voxel-wise/Atlas-free 4D Encoders
- Voxel-wise/atlas-free 4D encoders directly operate on raw spatiotemporal data at individual voxel resolution to capture fine-grained structural details.
- They employ dynamic tokenization, masked autoencoding, and spatio-temporal decompositions to efficiently process vast 4D datasets.
- These methods yield state-of-the-art results in domains such as fMRI, LiDAR, and 3D scene modeling while reducing memory usage and computational cost.
Voxel-wise, or atlas-free, 4D encoders refer to computational architectures that process spatiotemporal data directly at the native resolution of individual voxels, without spatial aggregation into predefined atlas regions. Such methods are essential for tasks where preserving voxel-level spatial fidelity is critical, including dynamic brain imaging (fMRI), LiDAR sequence analysis, dynamic 3D scene modeling, and free-viewpoint video compression. They are characterized by high computational and memory demands, but enable unbiased, fine-grained modeling of both spatial and temporal structures in voluminous 4D datasets.
1. Fundamentals of Voxel-wise/Atlas-free 4D Encoding
Voxel-wise 4D encoders operate directly on raw or minimally preprocessed data indexed by three spatial dimensions and one temporal dimension, explicitly avoiding the parcellation or aggregation steps of atlas-based approaches. This retains individual voxel signals, which matters in studies where regional definitions introduce bias or obscure fine-grained local effects, such as functional heterogeneity in neuroimaging (Wang et al., 26 Dec 2025, Wang et al., 30 Jan 2026, Huang et al., 30 Sep 2025).
Two principal approaches dominate the field:
- Grid-based 4D tensors: Each voxel (or a patch thereof) is represented as a feature entity spanning time, enabling convolutional, attention-based, or implicit neural processing (Kim et al., 2024, Wang et al., 26 Dec 2025, Wang et al., 30 Jan 2026, Li et al., 2018).
- Dynamic/Adaptive tokenization: Patch or token boundaries are data-driven, determined via spatial/temporal complexity, eliminating the fixed bias of grid or atlas boundaries (Wang et al., 30 Jan 2026, Wang et al., 26 Dec 2025).
The inherent high dimensionality requires designs to carefully balance spatial/temporal fidelity with scalability and memory/computation efficiency.
2. Architectural Variants and Core Methodologies
a. Transformer and Masked Autoencoding Approaches
Transformers with multi-head self-attention are used to jointly model 4D voxel-level dependencies, typically with masked autoencoding (MAE) objectives that make large-scale self-supervised pretraining feasible. For example, Omni-fMRI applies a content-adaptive, scale-aware masked autoencoder that partitions the input 4D volume into background-pruned, variance-adaptive spatiotemporal blocks and embeds each block as a token before processing with a 12-layer ViT (Wang et al., 30 Jan 2026). Masking serves both as regularization and as a way to cut compute cost sharply: Omni-fMRI and SLIM-Brain both mask approximately 70–75% of input patches or tokens during pretraining (Wang et al., 30 Jan 2026, Wang et al., 26 Dec 2025).
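As a rough illustration of why high mask ratios cut cost, the sketch below counts the tokens produced by a fixed 4D patch grid and applies a random 75% mask. The volume size, patch size, and the helper names `count_tokens` and `random_token_mask` are assumptions chosen for illustration; neither Omni-fMRI nor SLIM-Brain is claimed to use this exact scheme (both use adaptive rather than fixed patching).

```python
import random

def count_tokens(shape, patch):
    """Number of non-overlapping 4D patches in a volume of shape (X, Y, Z, T)."""
    n = 1
    for dim, p in zip(shape, patch):
        if dim % p != 0:
            raise ValueError("each dimension must be divisible by its patch size")
        n *= dim // p
    return n

def random_token_mask(num_tokens, mask_ratio=0.75, seed=0):
    """Return (visible_ids, masked_ids) for an MAE-style random mask."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    ids = list(range(num_tokens))
    rng.shuffle(ids)
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(ids[num_masked:]), sorted(ids[:num_masked])

# Hypothetical sizes: a 96^3 volume with 40 frames, cut into 16x16x16x10 patches.
num_tokens = count_tokens((96, 96, 96, 40), (16, 16, 16, 10))
visible, masked = random_token_mask(num_tokens, mask_ratio=0.75)

# Only visible tokens enter the encoder; self-attention cost grows roughly
# quadratically with sequence length, so the saving is about (N / N_visible)^2.
attention_saving = (num_tokens / len(visible)) ** 2
print(num_tokens, len(visible), len(masked), attention_saving)
```

Under these assumed sizes, only a quarter of the tokens reach the encoder, so the quadratic attention term shrinks by roughly a factor of sixteen.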
b. Multi-stage and Hierarchical Encoders
SLIM-Brain exemplifies multistage approaches, with a temporal extractor (global masked autoencoder) first identifying the most informative temporal windows at a coarse spatial resolution, followed by a hierarchical, patch-based 4D encoder (Hiera-JEPA) that targets voxel-level features only on those selected windows (Wang et al., 26 Dec 2025). This adaptivity, using mutual masked reconstruction scores to prune uninformative windows, enables significant memory and FLOPs reduction compared to naive 4D ViTs.
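The window-selection step can be pictured with a minimal sketch. It assumes each temporal window has already received a scalar saliency score from the first stage, with higher meaning more informative; the scores, window length, and function names below are hypothetical and do not reproduce SLIM-Brain's actual scoring.

```python
def split_into_windows(num_frames, window, stride):
    """Return (start, end) frame ranges covering the sequence."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]

def select_top_k_windows(scores, k):
    """Indices of the k highest-scoring windows, returned in temporal order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of windows")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical scan of 120 frames split into six non-overlapping windows.
windows = split_into_windows(num_frames=120, window=20, stride=20)
scores = [0.12, 0.45, 0.08, 0.51, 0.30, 0.49]  # assumed stage-one saliency scores

kept = select_top_k_windows(scores, k=2)
kept_ranges = [windows[i] for i in kept]
print(kept, kept_ranges)
```

Only the frames in `kept_ranges` would then be passed to the expensive voxel-level encoder, which is where the memory and FLOPs savings come from.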
c. Convolutional and Spatio-Temporal Decomposition Blocks
In voxel-wise LiDAR sequence analysis, Flow4D employs a 3D intra-voxel MLP encoder per frame and stacks the resulting features over time before passing them to an hourglass-style sparse 4D convolutional network (Kim et al., 2024). To mitigate the prohibitive cost of full 4D convolutions, Flow4D introduces Spatio-Temporal Decomposition Blocks (STDBs) that replace the 4D kernel with parallel/serial 3D spatial and 1D temporal convolutions and then fuse the two outputs with convolution kernels. This maintains spatiotemporal expressivity at 2–3× lower computation than standard 4D convolution, with only marginal accuracy loss.
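A back-of-the-envelope weight count shows where a saving of this order can come from. The sketch assumes a kernel size of 3 and 64 input and output channels, ignores the fusion step, and uses weight count as a proxy for per-site multiply-adds; these are illustrative assumptions, not Flow4D's reported configuration.

```python
def conv_weights(kernel_dims, c_in, c_out):
    """Weight count of a convolution with the given kernel extent per axis."""
    n = c_in * c_out
    for k in kernel_dims:
        n *= k
    return n

def full_4d_weights(k, c_in, c_out):
    """Dense k x k x k x k kernel over (x, y, z, t)."""
    return conv_weights((k, k, k, k), c_in, c_out)

def decomposed_weights(k, c_in, c_out):
    """One 3D spatial kernel (k,k,k,1) plus one 1D temporal kernel (1,1,1,k)."""
    spatial = conv_weights((k, k, k, 1), c_in, c_out)
    temporal = conv_weights((1, 1, 1, k), c_in, c_out)
    return spatial + temporal

full = full_4d_weights(3, 64, 64)
decomposed = decomposed_weights(3, 64, 64)
ratio = full / decomposed
print(full, decomposed, round(ratio, 2))
```

With k = 3 the dense kernel has 81 taps per channel pair against 27 + 3 = 30 for the decomposed pair, a ratio of 2.7, which is consistent with the 2–3× range quoted above.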
d. Implicit Neural Representations & Hash-encoded Grids
F-Hash presents a multi-level hash-based tesseract (4D grid) encoding, with collision-free hash functions mapping dynamic multi-resolution 4D coordinates into embedding tables (Sun et al., 4 Jul 2025). This structure is suitable for time-varying volumetric data where features (e.g., isosurfaces) occupy only a sparse subset of the full grid; convergence is achieved 10–100× faster than dense encoding and typical MLPs.
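The idea of storing features only for occupied 4D cells, at several resolutions, can be sketched as follows. This is a simplified stand-in: it uses a Python dictionary that assigns each occupied cell its own table row (collision-free by construction) rather than F-Hash's actual hash function, and the class name, resolutions, and sample points are all hypothetical.

```python
class SparseTesseractLevel:
    """One resolution level of a sparse 4D (x, y, z, t) feature grid."""

    def __init__(self, resolution, feature_dim=2):
        self.resolution = resolution
        self.feature_dim = feature_dim
        self.slot_of = {}  # 4D cell index -> row in the embedding table
        self.table = []    # one feature row per occupied cell

    def cell(self, point):
        """Quantize normalized coordinates in [0, 1) to an integer 4D cell."""
        r = self.resolution
        return tuple(min(int(c * r), r - 1) for c in point)

    def insert(self, point):
        """Allocate a row for the cell containing `point` if it has none."""
        key = self.cell(point)
        if key not in self.slot_of:
            self.slot_of[key] = len(self.table)
            self.table.append([0.0] * self.feature_dim)  # trainable in practice
        return self.slot_of[key]

    def lookup(self, point):
        slot = self.slot_of.get(self.cell(point))
        return [0.0] * self.feature_dim if slot is None else self.table[slot]

def encode(levels, point):
    """Concatenate the features found at every resolution level."""
    features = []
    for level in levels:
        features.extend(level.lookup(point))
    return features

resolutions = (4, 8, 16)
levels = [SparseTesseractLevel(r) for r in resolutions]
active_points = [(0.10, 0.20, 0.30, 0.00), (0.15, 0.20, 0.30, 0.00), (0.80, 0.75, 0.60, 0.50)]
for level in levels:
    for p in active_points:
        level.insert(p)

table_sizes = [len(level.table) for level in levels]
dense_sizes = [r ** 4 for r in resolutions]
code = encode(levels, active_points[0])
print(table_sizes, dense_sizes, len(code))
```

The table sizes track the number of occupied cells, while a dense grid grows with the fourth power of the resolution; that gap is what makes sparse 4D encodings attractive when features occupy a small part of the volume.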
3. Domain-specific Instantiations
| Method/Paper | Application | Voxel-wise/Atlas-free Key Aspect |
|---|---|---|
| SLIM-Brain (Wang et al., 26 Dec 2025) | fMRI foundation model | Two-stage top-k window selection; 4D Hiera encoder |
| Omni-fMRI (Wang et al., 30 Jan 2026) | Universal fMRI encoder | Dynamic content-adaptive patching and MAE |
| Atlas-free BNT (Huang et al., 30 Sep 2025) | Brain connectome analysis | Individualized voxel-wise feature derivation |
| Flow4D (Kim et al., 2024) | LiDAR scene flow | 4D spatial+temporal conv, STDB decomposition |
| F-Hash (Sun et al., 4 Jul 2025) | Volume visualization | Collision-free 4D tesseract grid + coreset |
| V4D (Gan et al., 2022) | Dynamic NeRF/view synthesis | 3D voxel grid, time-modulated by MLPs |
| PDM-based 4D encoder (Li et al., 2018) | Mesh compression (video) | Pixel-wise descriptors, vote-based 4D AE |
Each architecture tailors its memory, computation, and masking schemes to the task: e.g., in fMRI, attention or hierarchical transformers mitigate cost; in sensor data, sparse convolutions or hash-based addressing target only active or feature-rich regions.
4. Memory, Scalability, and Efficiency Strategies
The principal limitation of atlas-free voxelwise 4D encoding remains memory and computational scalability. Several countermeasures are notable:
- Masking and subsampling: Mask up to 75% of tokens (Wang et al., 26 Dec 2025, Wang et al., 30 Jan 2026).
- Saliency-based selection: Only process top-k temporally salient windows (Wang et al., 26 Dec 2025).
- Dynamic patching: Use spatial and variance-based subdivision to minimize redundant token representation (Wang et al., 30 Jan 2026).
- Sparse tensor and hash encodings: Target only the active coreset or feature-bounding box, avoiding empty or background regions (Sun et al., 4 Jul 2025, Kim et al., 2024).
- Decomposition blocks: Replace dense 4D convolution with parallel/serial 3D+1D blocks to reduce GFLOPs while preserving spatiotemporal receptive field (Kim et al., 2024).
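Dynamic patching with background pruning can be illustrated on a toy example. The sketch assumes a precomputed 3D map of per-voxel temporal variance and splits a cube into octants whenever its mean variance exceeds a threshold; the splitting criterion, threshold, and function names are assumptions for illustration and are not taken from Omni-fMRI.

```python
def block_values(var_map, origin, size):
    """Flatten the cubic block of edge `size` starting at `origin`."""
    x0, y0, z0 = origin
    return [var_map[x][y][z]
            for x in range(x0, x0 + size)
            for y in range(y0, y0 + size)
            for z in range(z0, z0 + size)]

def adaptive_patches(var_map, origin, size, min_size, threshold):
    """Return (origin, size) patches: coarse where smooth, fine where variable."""
    values = block_values(var_map, origin, size)
    if max(values) == 0.0:
        return []  # pure background: prune, no token is emitted
    mean_var = sum(values) / len(values)
    if size <= min_size or mean_var <= threshold:
        return [(origin, size)]
    half = size // 2
    patches = []
    for dx in (0, half):
        for dy in (0, half):
            for dz in (0, half):
                child = (origin[0] + dx, origin[1] + dy, origin[2] + dz)
                patches += adaptive_patches(var_map, child, half, min_size, threshold)
    return patches

# Toy 8^3 variance map: zero background, one smooth octant, one variable octant.
N = 8
var_map = [[[0.0] * N for _ in range(N)] for _ in range(N)]
for x in range(4):
    for y in range(4):
        for z in range(4):
            var_map[x][y][z] = 0.1
for x in range(4, 8):
    for y in range(4, 8):
        for z in range(4, 8):
            var_map[x][y][z] = 2.0

patches = adaptive_patches(var_map, (0, 0, 0), size=8, min_size=2, threshold=0.2)
uniform_count = (N // 2) ** 3  # tokens from a fixed grid of 2^3 patches
print(len(patches), uniform_count)
```

In this toy case the smooth octant becomes a single coarse patch, the variable octant is split into fine patches, and the six empty octants produce no tokens at all, so far fewer tokens are needed than with a uniform fine grid.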
Such designs enable state-of-the-art performance in domains previously dominated by atlas- or region-based aggregation, as shown by both higher accuracy and major reductions in pretraining time and hardware requirements (Wang et al., 26 Dec 2025, Wang et al., 30 Jan 2026).
5. Pretraining Objectives and Optimization
Self-supervised learning via masked reconstruction dominates. Both Omni-fMRI and SLIM-Brain leverage scale-aware MAE losses, with loss normalization by token volume and adaptive reconstruction head selection depending on patch scale (Wang et al., 30 Jan 2026). SLIM-Brain further incorporates a joint-embedding predictive architecture (JEPA) that predicts masked block representations via context encoders, penalizing deviation with a Smooth-L1 loss (Wang et al., 26 Dec 2025).
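Two of the loss ingredients mentioned here can be written down compactly. The sketch below gives the standard Smooth-L1 definition (with transition point `beta`) and one plausible reading of volume normalization, in which each token's squared error is divided by its own voxel count so that large and small patches weigh equally. The function names and the exact normalization are assumptions, not the papers' published formulations.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth-L1 loss: quadratic below `beta`, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)

def volume_normalized_mse(tokens):
    """Average of per-token MSE, each normalized by that token's voxel count.

    `tokens` is a list of (pred_voxels, target_voxels) pairs whose lengths
    may differ from token to token (patches of different scale).
    """
    per_token = []
    for pred, target in tokens:
        squared_error = sum((p - t) ** 2 for p, t in zip(pred, target))
        per_token.append(squared_error / len(pred))
    return sum(per_token) / len(per_token)

# A small residual falls in the quadratic regime, a large one in the linear regime.
jepa_loss = smooth_l1([0.0, 2.0], [0.5, 0.0])

# A 2-voxel token and an 8-voxel token with the same per-voxel error
# contribute equally after normalization.
small_token = ([1.0] * 2, [0.0] * 2)
large_token = ([1.0] * 8, [0.0] * 8)
mae_loss = volume_normalized_mse([small_token, large_token])
print(jepa_loss, mae_loss)
```

Without the per-token division, the 8-voxel token would contribute four times the summed error of the 2-voxel token, which would bias training toward coarse patches.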
In computer vision and 3D domains, voxel-wise 4D autoencoder compression is critical for mesh and video compression (Li et al., 2018, Gan et al., 2022); MLP decoders regress scalar or color values, supervised via direct regression (e.g., mean squared error) on multiframe reconstructions. Some methods use additional terms for inter-part distinctiveness or geodesic/temporal refinement (Li et al., 2018).
6. Performance Benchmarks and Empirical Observations
Voxel-wise/atlas-free 4D encoders have set new state-of-the-art results on diverse benchmarks:
- Omni-fMRI outperforms ROI-based and windowed-Transformer baselines in age regression, disease classification, and high-resolution image retrieval, with effect sizes in direct comparisons reported as Cohen's d (Wang et al., 30 Jan 2026).
- Flow4D achieves 45.9% reduction in scene flow estimation error versus prior top-scoring methods, with real-time throughput (15 FPS, 1.62 GiB VRAM) (Kim et al., 2024).
- SLIM-Brain reports a 2–10× reduction in pretraining data and 70% memory savings relative to standard ViT/Swin baselines, while establishing new benchmarks on seven fMRI datasets (Wang et al., 26 Dec 2025).
- Mesh compression with pixel-wise descriptor–autoencoder pipelines obtains >50× reduction in storage and sub-millimeter Hausdorff error (Li et al., 2018).
A recurring observation is the need to balance spatial/temporal expressivity against complexity; adaptive masking and tokenization are central to resolving this.
7. Practical Limitations and Future Directions
Atlas-free, voxel-wise 4D encoders have clear limitations: rare or large topological changes can break correspondences in mesh or surface methods (Li et al., 2018), while highly scattered or background-dominated data can nullify the efficiency gains from coreset or patch pruning (Sun et al., 4 Jul 2025). Autoencoding or time-trajectory-based pipelines may also lack explicit spatial regularization, with global smoothness enforced only through loss terms or additional network design.
Potential extensions cited include:
- Transitioning from per-vertex or per-patch encoders to full 4D graph CNNs or volumetric-convolutional architectures for greater spatiotemporal expressiveness (Li et al., 2018).
- Generalizing fMRI pipelines to non-human, non-brain dynamic objects, as the object-agnostic design permits retraining on new domains (Wang et al., 26 Dec 2025).
- Integrating differentiable rendering in end-to-end pipelines to unify correspondence estimation and compression (Li et al., 2018).
- Exploiting content-adaptive patching and efficient hashing for broader spatiotemporal datasets beyond neuroscience or vision, e.g., geospatial, climate, and remote sensing (Sun et al., 4 Jul 2025).
The domain continues to evolve rapidly, with a focus on improved scalability, generalizability, and eliminating domain-specific biases by architecturally enforcing atlas-free, voxel-level fidelity throughout the processing chain.