Complex-Valued Vision Transformer (kViT)
- Complex-Valued Vision Transformer (kViT) is an architecture that processes raw complex k-space data, preserving both magnitude and phase information for accurate imaging analysis.
- It employs a physics-informed radial patch embedding and complex-valued layers, attention, and normalization to align with the spectral properties of MRI data.
- kViT demonstrates significant memory efficiency and improved robustness under aggressive undersampling, outperforming traditional real-valued models in key metrics.
A Complex-Valued Vision Transformer (kViT) is an architecture designed to perform end-to-end learning directly on complex-valued data, such as MRI k-space, utilizing all data components and preserving phase information throughout the model pipeline. Unlike standard Vision Transformers (ViT) that operate on real-valued image data post-magnitude reconstruction, kViT models process raw frequency-domain (k-space) samples and natively employ complex-valued layers, attention, and normalization, thereby aligning the model structure with the physical properties of the input modality (Eilers et al., 2023, Rempe et al., 26 Jan 2026).
1. Core Principles and Motivation
The motivation for kViT arises from the inherent complex-valued nature of many scientific imaging modalities, notably MRI, where k-space data contains both magnitude and phase information crucial to accurate modeling and diagnosis. Traditional pipelines discard phase information and rely on local operations that are incongruent with the global, non-local structure of k-space. kViT architectures reframe the problem by operating directly on complex-valued signals without projecting into $\mathbb{R}$, thus preserving spectral integrity and enabling direct-from-scanner AI analysis (Rempe et al., 26 Jan 2026).
2. Complex-Valued k-Space Patch Embedding
In kViT, the initial input is a complex-valued k-space array $X \in \mathbb{C}^{H \times W}$. Rather than extracting local grid patches, the kViT leverages a physics-informed radial patching strategy:
- Each k-space sample is mapped to polar coordinates $(r, \theta)$.
- The full grid is sorted by $r$ and partitioned into $P$ concentric "rings" (radial patches) of equal cardinality, each containing $m = N/P$ samples (where $N$ is the total number of k-space samples).
- Each patch $p_i$ comprises all samples whose rank in the radius ordering falls in the interval $((i-1)m,\; im]$, i.e., the $i$-th radial band.
- Every radial patch $p_i$ (a complex vector of size $m$) is flattened and embedded using a complex weight matrix $W_E \in \mathbb{C}^{d \times m}$ and bias $b_E \in \mathbb{C}^{d}$:

$$z_i = W_E\, p_i + b_E$$
This radial scheme better reflects the spectral locality and energy distribution of MRI acquisition compared to Cartesian patches, and is critical for maintaining robustness under high acceleration (undersampling) (Rempe et al., 26 Jan 2026).
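This patching pipeline can be illustrated with a short NumPy sketch (a minimal illustration under stated assumptions: radii measured from the array center, equal-cardinality rings with any remainder dropped, and randomly initialized embedding weights; not the reference implementation):

```python
import numpy as np

def radial_patches(kspace: np.ndarray, num_patches: int) -> np.ndarray:
    """Partition a 2-D complex k-space array into concentric rings
    of equal cardinality, sorted by distance from the k-space center."""
    H, W = kspace.shape
    ky, kx = np.indices((H, W))
    # Radii measured from the center of k-space (the DC component).
    r = np.hypot(ky - H / 2, kx - W / 2).ravel()
    order = np.argsort(r, kind="stable")       # sort samples by radius
    flat = kspace.ravel()[order]
    m = flat.size // num_patches               # samples per ring
    # Drop any remainder so every ring has exactly m samples.
    return flat[: m * num_patches].reshape(num_patches, m)

def embed_patches(patches: np.ndarray, W_e: np.ndarray, b_e: np.ndarray):
    """Complex linear embedding z_i = W_e @ p_i + b_e for each ring."""
    return patches @ W_e.T + b_e

rng = np.random.default_rng(0)
k = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
patches = radial_patches(k, num_patches=16)    # 16 rings of 256 samples each
d, m = 32, patches.shape[1]
W_e = (rng.standard_normal((d, m)) + 1j * rng.standard_normal((d, m))) / np.sqrt(m)
b_e = np.zeros(d, dtype=complex)
tokens = embed_patches(patches, W_e, b_e)      # shape (16, 32), complex tokens
```

For a 64×64 grid and 16 rings, each ring holds 256 samples, satisfying the equal-cardinality constraint.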
3. Complex-Valued Transformer Building Blocks
The kViT replaces all core components of the standard ViT with their complex-valued analogues as defined in (Eilers et al., 2023):
- Complex Linear Layers: Given $x \in \mathbb{C}^{n_{\text{in}}}$, the layer computes $y = Wx + b$, with $W \in \mathbb{C}^{n_{\text{out}} \times n_{\text{in}}}$ and $b \in \mathbb{C}^{n_{\text{out}}}$ complex. Real and imaginary parts are parameterized separately for implementation. Initialization uses the complex Glorot (Xavier) scheme, sampling the weight magnitude from a Rayleigh distribution with $\sigma = 1/\sqrt{n_{\text{in}} + n_{\text{out}}}$.
- Complex Attention: Multi-head self-attention operates with complex queries, keys, and values ($Q, K, V \in \mathbb{C}^{P \times d}$). The standard variant ("CAtt") computes attention as:

$$\mathrm{CAtt}(Q, K, V) = \mathrm{softmax}\!\left(\Re\!\left(\frac{Q K^{H}}{\sqrt{d}}\right)\right) V$$
Here, $(\cdot)^{H}$ denotes the Hermitian (conjugate) transpose, and the softmax is computed over the real part of the score matrix. Several alternative attention mechanisms are possible: absolute-value, phase-preserving, and real/imaginary-split variants.
- Complex Layer Normalization: Each complex vector of dimension $d$ is treated as a real $d \times 2$ matrix (stacking real and imaginary parts); a mean $\mu \in \mathbb{R}^{2}$ and a $2 \times 2$ covariance matrix are estimated, and the data are whitened accordingly. A learnable affine transform with positive-definite scaling is applied.
- Complex Feed-Forward Networks: Use two stacked complex linear layers separated by complex-valued nonlinearities (e.g., modReLU, zReLU, or split ReLU on real/imag components).
- Positional Encoding: Patch positions are injected via either (a) learnable complex positional embeddings or (b) rotary position encoding (RoPE) adapted to the complex domain. RoPE rotates each complex component $z_{p,j}$ of the token at position $p$ by a position-dependent angle $p\,\omega_j$:

$$z_{p,j} \mapsto z_{p,j}\, e^{i p \omega_j}$$
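The attention, nonlinearity, and rotary components above can be sketched in NumPy as follows (the frequency schedule `base ** (-j/d)` and the modReLU bias value are illustrative assumptions; multi-head splitting and learnable parameters are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def complex_attention(Q, K, V):
    """'CAtt': scores are the real part of the Hermitian product Q K^H;
    a real-valued softmax then weights the complex values."""
    d = Q.shape[-1]
    scores = (Q @ K.conj().T).real / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

def modrelu(z, bias=-0.5):
    """modReLU: thresholds the magnitude, preserves the phase."""
    mag = np.abs(z)
    return np.maximum(mag + bias, 0.0) / np.maximum(mag, 1e-12) * z

def complex_rope(tokens, base=10000.0):
    """Rotate channel j of the token at position p by angle p * omega_j,
    i.e. multiply by exp(i * p * omega_j)."""
    P, d = tokens.shape
    omega = base ** (-np.arange(d) / d)            # per-channel frequencies
    angles = np.arange(P)[:, None] * omega[None, :]
    return tokens * np.exp(1j * angles)

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 32)) + 1j * rng.standard_normal((16, 32))
out = complex_attention(complex_rope(X), complex_rope(X), modrelu(X))
```

Note that the rotary rotation is magnitude-preserving by construction, which keeps the spectral energy of each token unchanged while encoding position in the phase.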
4. Model Architecture and Training Details
The typical architecture follows a transformer encoder paradigm, stacking complex transformer layers, each containing:
- Complex LayerNorm
- Complex Multi-Head Attention (using radial-patched tokens)
- Complex LayerNorm (post-attention)
- Complex Feed-Forward Network with residual connections
Classification models prepend a learnable complex class token $z_{\text{cls}} \in \mathbb{C}^{d}$, whose output representation is used as the summary for downstream decision layers, feeding a final real-valued softmax head.
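A sketch of this readout, assuming two plausible real-feature choices (the token's magnitude, or its stacked real/imaginary parts; the actual head configuration may differ):

```python
import numpy as np

def readout(cls_token: np.ndarray, W: np.ndarray, mode: str = "magnitude"):
    """Map the complex class token to class probabilities via a real head.
    mode="magnitude" uses |z|        (W has shape (classes, d));
    mode="stack"     uses [Re(z); Im(z)] (W has shape (classes, 2d))."""
    if mode == "magnitude":
        features = np.abs(cls_token)
    else:
        features = np.concatenate([cls_token.real, cls_token.imag])
    logits = W @ features
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
z_cls = rng.standard_normal(32) + 1j * rng.standard_normal(32)
probs = readout(z_cls, rng.standard_normal((3, 32)), mode="magnitude")
```

Either choice hands the decision layer a real vector, so the final classifier itself can remain a standard real-valued softmax head.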
Key hyperparameters from (Rempe et al., 26 Jan 2026, Eilers et al., 2023):
| Dataset/Task | Layers | Heads | Embedding dim | Patch Tokens | MLP dim | Dropout | Batch Size | Params (M) | VRAM (GB) |
|---|---|---|---|---|---|---|---|---|---|
| fastMRI Prostate/Knee | 6 | 16 | 256 | 16 | 768 | 0.1 | 64 | 4.9 | 0.96 |
| In-house MIL (brain) | 2 | 4 | 48 | -- | 92 | -- | 1 | -- | 0.52 |
The models are trained via AdamW with cross-entropy loss and early stopping (patience = 15). Data are subjected to a range of acceleration (undersampling) factors, with cutout augmentation applied as necessary to enhance generalization (Rempe et al., 26 Jan 2026, Eilers et al., 2023).
5. Efficiency, Robustness, and Empirical Results
kViT models demonstrate significant computational and statistical advantages over real-valued baselines:
- Memory Efficiency: Attention operates over only $P$ radial-patch tokens ($P = 16$ in the fastMRI configuration), yielding $O(P^2)$ attention cost, as opposed to quadratic cost in the far larger Cartesian patch count for standard ViTs. On fastMRI Prostate (batch size 64), kViT required 0.96 GB VRAM vs. 10.6 GB for ResNet50 and 3.7 GB for ViT-Tiny; in multi-instance learning, kViT used 0.52 GB versus 11.7 GB for ViT-Tiny and 35.4 GB for EfficientNet-B0, a 68× reduction (Rempe et al., 26 Jan 2026).
- Overfitting and Generalization: Complex-valued transformers exhibit reduced overfitting relative to doubled-real ("stacked R/I") counterparts; on MusicNet, the real ViT overfits by epoch 10, while complex variants remain stable (Eilers et al., 2023).
- Performance and Robustness: On fastMRI Prostate test data, kViT matches or outperforms real-valued models in AUROC at full sampling and retains its performance under undersampling, whereas ViT-Tiny drops precipitously under the same conditions. In-house multi-instance learning (MIL) shows kViT maintains AUPRC around 54% even under aggressive undersampling, whereas ViT-Tiny degrades to 43% (Rempe et al., 26 Jan 2026).
- Ablation Findings: Using fewer or more rings (radial patches) affects AUROC non-monotonically (optimum at 16), and removing RoPE or augmentation reduces accuracy. Discarding phase produces a severe collapse in performance (AUROC falling from 0.685 to 0.551 under acceleration), confirming the necessity of preserving phase.
6. Key Implementation Considerations and Required Modules
Implementation draws upon standardized complex-valued NN recipes (e.g., Trabelsi et al. ICLR 2018) and the architectural utilities from (Eilers et al., 2023):
- Complex Linear Layer: Trabelsi-form complex FC, forward and Wirtinger-based backward pass.
- Complex Multi-Head Attention: CAtt preferred; variant selection (AAtt, APAtt, RI Att) possible.
- Complex LayerNorm: Covariance-based normalization with learnable positive-definite scaling.
- Complex Feed-Forward: Two layers, nonlinearities selected from modReLU/zReLU/split ReLU.
- Residuals, Dropout, Optimizer: Standard transformer scaffolding; Adam on split real/imag parameters suffices. No specialized optimizer is required.
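A single-vector sketch of the covariance-based complex layer normalization listed above (the eps regularization and the eigendecomposition-based inverse square root are implementation choices here; the learnable affine transform is omitted):

```python
import numpy as np

def complex_layernorm(z: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Whiten one complex feature vector: treat each entry as a 2-vector
    (real, imag), subtract the mean, and multiply by the inverse square
    root of the 2x2 covariance matrix shared across the feature axis."""
    v = np.stack([z.real, z.imag], axis=-1)        # shape (d, 2)
    c = v - v.mean(axis=0, keepdims=True)
    # 2x2 covariance over the feature dimension, regularized for stability.
    cov = np.einsum("di,dj->ij", c, c) / z.shape[0] + eps * np.eye(2)
    # Inverse matrix square root via eigendecomposition (cov is symmetric PD).
    w, U = np.linalg.eigh(cov)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
    white = c @ inv_sqrt
    return white[:, 0] + 1j * white[:, 1]

rng = np.random.default_rng(3)
# Deliberately unbalanced input: offset real part, inflated imaginary variance.
z = rng.standard_normal(256) + 1j * 3.0 * rng.standard_normal(256) + 2.0
y = complex_layernorm(z)
```

After whitening, the real and imaginary components are zero-mean, unit-variance, and decorrelated, which is the property the learnable positive-definite affine transform then rescales.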
These core modules enable operation throughout on raw complex-valued data, thus avoiding the loss of representational fidelity incurred by mapping $\mathbb{C} \to \mathbb{R}$, while yielding both accuracy and significant gains in robustness and computational efficiency (Eilers et al., 2023, Rempe et al., 26 Jan 2026).
7. Significance and Domain-Specific Impact
kViT establishes a framework for domain-aligned, resource-efficient analysis of frequency-domain data, particularly in MRI where phase and spectral locality are essential. The architecture achieves SOTA robustness under aggressive undersampling, sustains classification accuracy competitive with real-valued baselines, and demonstrates substantial reductions in memory and computational load (Rempe et al., 26 Jan 2026). The use of physics-informed patching and complex-valued operations provides an architectural blueprint for extending Transformers to other complex-data-intensive domains, including remote sensing and scientific imaging. These advances suggest kViT may become the preferred paradigm for direct-from-scanner AI analysis, providing a pathway for improved fidelity and efficiency in complex-valued signal processing.