Volumetric Tokenization for MRI

Updated 4 December 2025

The paper demonstrates a novel train-free approach using slicewise token extraction and random projection that maintains over 95% semantic fidelity while reducing computational costs.
It integrates 3D patch embedding with transformer-based local self-attention to efficiently capture structural context in MRI volumes.
Frequency-aware encoder-decoder frameworks with wavelet compression are leveraged to balance high-frequency detail retention with token compactness for clinical applications.

Volumetric tokenization strategy for MRI encompasses the systematic conversion of three-dimensional magnetic resonance imaging (MRI) volumes into discrete, compact representations—volumetric tokens—enabling efficient downstream deep learning tasks without prohibitive computational cost or dependency on extensive labeled datasets. Recent advances span train-free cross-sectional aggregation and random projections, windowed convolutional encoding, and transformer-based local self-attention, each designed to maintain high semantic fidelity and scalability within the constraints of large medical imaging corpora (An et al., 11 Jul 2025, Forigua et al., 2024, Hamamci et al., 23 Oct 2025).

1. Foundational Approaches to Volumetric Tokenization

Research on volumetric tokenization for MRI primarily delineates three paradigms:

Slicewise 2D Token Extraction and Compression (Raptor): Exploits frozen 2D vision transformers to serialize volumetric context across anatomical axes, followed by random projection-based reduction (An et al., 11 Jul 2025).
Direct 3D Patch Embedding and Attention (SuperFormer): Utilizes 3D non-overlapping patch extraction and positional encoding to embed spatial context explicitly, forming the basis for volumetric transformers (Forigua et al., 2024).
Frequency-Aware Encoder-Decoder with Quantization (BTB3D): Implements causal factorized 3D convolutions with wavelet-based pre-processing and binary quantization to generate compact, discrete tokens, optimized for long-context vision-language modeling in 3D (Hamamci et al., 23 Oct 2025).

Each method prioritizes computational tractability, memory efficiency, and semantic preservation, with methodological innovations targeting the balance between anatomical fidelity and embedding compactness.

2. Slicewise Tokenization and Random Planar Tensor Reduction

Raptor (An et al., 11 Jul 2025) defines a three-stage, train-free tokenization pipeline:

2D Token Extraction: For an MRI volume $\mathcal V\in\mathbb R^{D\times H\times W}$ , cross-sectional slices are extracted along axial, coronal, and sagittal planes.
Patchwise Tokenization: Each $H\times W$ slice is partitioned into $N_p$ patches ( $T\times T$ pixels each), and a frozen pretrained 2D ViT encodes each patch into a $C$ -dimensional embedding, producing a 3D token tensor $\mathcal T^{(v)}\in\mathbb R^{D\times N_p\times C}$ per view.
Slice-Axis Aggregation: Tokens are mean-pooled along the depth axis for each view: $\overline{\mathcal T}^{(v)}_{i,c} = \frac1D\sum_{d=1}^D \mathcal T^{(v)}_{\,d,i,c}$ , resulting in $N_p\times C$ matrices.
Random Projection: Aggregated tokens are concatenated for all three views, flattened, and projected using a fixed Gaussian random matrix $R\in\mathbb R^{K\times m}$ to produce a length- $L=3K$ embedding vector.

This stacking and projection avoid any further training beyond the frozen 2D backbone, yielding volumetric tokens suitable for classification and regression tasks without training overhead or overfitting risk in low-data settings.

3. Patchwise 3D Embedding and Transformer Architectures

SuperFormer (Forigua et al., 2024) introduces direct volumetric patch tokenization and transformer-based context integration:

3D Patch Extraction: Input volumes $X\in\mathbb R^{H\times W\times D\times C}$ are split into non-overlapping $2\times 2\times 2$ patches, each flattened to a vector and linearly mapped via $W\in\mathbb R^{D\times(p_xp_yp_zC)}$ .
Linear Embedding: Embedding yields tokens $E_k=W x_k + b$ for each patch, resulting in a 3D token grid.
3D Relative Positional Encoding: Local 3D spatial context is preserved using trainable bias tables $R\in\mathbb R^{(2M-1)^3}$ , added to the attention logits to encode positional dependencies within each $M\times M\times M$ window.
Windowed Self-Attention: Local self-attention is computed within each window, enabling efficient modeling of in-plane and through-plane dependencies without cubic attention complexity.

This strategy robustly encodes both local and extended structural context, facilitating anatomically consistent super-resolution, as validated on the Human Connectome Project dataset.

4. Causal Encoder-Decoder and Frequency-Aware Token Generation

BTB3D (Hamamci et al., 23 Oct 2025) implements a frequency-aware, convolutional encoder-decoder pipeline with a multi-stage training curriculum for scalable volumetric tokenization:

Wavelet Compression: MRI volumes undergo 3D Haar wavelet transforms, yielding low-frequency and high-frequency channels for each $2\times 2\times 2$ voxel block.
Factorized 3D Convolutional Encoding: Residual blocks perform spatial $(1\times k\times k)$ and temporal $(k\times1\times1)$ convolutions, with downsampling implemented via strided convolutions.
Binary Quantization: Post-encoder feature blocks $y$ are binarized via $\mathrm{sign}(y)$ and bit-packed to form integer-valued discrete tokens $z$ .
Overlapping-Window Tiling: Long volumes are decomposed into overlapping windows (e.g., 9 slices, stride 8), with causal context maintained via token overlap and a decoder-only refinement stage ensuring global consistency.

Frequency domain operations, entropy regularization, and sequence-aware overlaps collectively counteract memory bottlenecks and preserve clinical detail across long MRI sequences.

5. Hyperparameter Choices and Compression Trade-offs

Key parameters for volumetric tokenization strategies include:

Approach	Patch Size	Embedding Dim.	Token Sequence Length	Projection/Code Dim.
Raptor	$T=16$	$C=1024$	$3 \times D \times N_p$	$L=3K$ , $K=10/100$
SuperFormer	$2\times2\times2$	$D=252$	$N = (H/p_x)(W/p_y)(D/p_z)$	N/A
BTB3D	$2\times2\times2$ (wavelet)	$d=16/18/20$	$D'/S$ for windowed tiling	$K=2^d$

Smaller projection or code dimensions yield greater compression (lower storage and compute), with empirical analyses in Raptor showing $K=100$ preserves $>99\%$ of semantic pairwise distances, and $K=10$ still retaining $>95\%$ accuracy. SuperFormer sets $D=252$ and window size $M=8$ , with ablation studies guiding the design. BTB3D tunes code dimension $d$ per acquisition protocol (T1, T2, FLAIR), balancing high-frequency fidelity and token compactness.

6. Comparative Evaluation and Domain Adaptation

Semantic Preservation: Raptor leverages the Johnson–Lindenstrauss lemma to guarantee pairwise distance preservation after random projection, achieving consistent AUROC performance plateaus at moderate $K$ .
Expressivity vs. Efficiency: Raptor bypasses cubic compute associated with 3D attention and convolution; SuperFormer retains efficient context integration via local windows; BTB3D maintains global structure with linear overlap-based scaling.
MRI-Specific Customization: BTB3D recommends adjusting window/patch sizes and code dimensions to suit volumetric anisotropy and contrast. Frequency-channel weighting is recommended for sequence-specific denoising or enhancement (e.g., up-weight T2 high-frequency bands).
Resilience to Data Scarcity: The non-adaptive, train-free nature of Raptor’s tokenization is robust even with as few as 10 samples, which still yields $75$– $80\%$ of peak accuracy.

7. Methodological Impact and Applications

Volumetric tokenization strategies enable MRI volumes to be embedded for diverse tasks including classification, segmentation, super-resolution, and text-image vision-language modeling:

Raptor achieves up to $+14\%$ AUC gain over state-of-the-art 3D pretraining approaches in multi-task evaluation, while reducing latent size by $10$– $100\times$ (An et al., 11 Jul 2025).
SuperFormer’s transformer architecture yields superior PSNR and SSIM in super-resolved MRI, outperforming 3D CNN baselines (Forigua et al., 2024).
BTB3D reduces FID and FVD for text-to-image volumetric synthesis tasks, enabling long-context report generation at previously unattainable memory and token sequence lengths (Hamamci et al., 23 Oct 2025).

These methodological innovations substantially reduce computational barriers to large-scale learning on high-resolution MRI data, enabling new regimes of scalable foundation modeling, automated report generation, and clinically faithful image synthesis in 3D medical imaging.

References:

Raptor (An et al., 11 Jul 2025) SuperFormer (Forigua et al., 2024) BTB3D (Hamamci et al., 23 Oct 2025)