Complex-Valued Vision Transformer (kViT)
- Complex-Valued Vision Transformer (kViT) is an architecture that processes raw complex k-space data, preserving both magnitude and phase information for accurate imaging analysis.
- It employs a physics-informed radial patch embedding and complex-valued layers, attention, and normalization to align with the spectral properties of MRI data.
- kViT demonstrates significant memory efficiency and improved robustness under aggressive undersampling, outperforming traditional real-valued models in key metrics.
A Complex-Valued Vision Transformer (kViT) is an architecture designed to perform end-to-end learning directly on complex-valued data, such as MRI k-space, utilizing all data components and preserving phase information throughout the model pipeline. Unlike standard Vision Transformers (ViT) that operate on real-valued image data post-magnitude reconstruction, kViT models process raw frequency-domain (k-space) samples and natively employ complex-valued layers, attention, and normalization, thereby aligning the model structure with the physical properties of the input modality (Eilers et al., 2023, Rempe et al., 26 Jan 2026).
1. Core Principles and Motivation
The motivation for kViT arises from the inherent complex-valued nature of many scientific imaging modalities, notably MRI, where k-space data contains both magnitude and phase information crucial to accurate modeling and diagnosis. Traditional pipelines discard phase information and rely on local operations that are incongruent with the global, non-local structure of k-space. kViT architectures reframe the problem by operating directly on complex-valued signals without projecting into $\mathbb{R}$, thus preserving spectral integrity and enabling direct-from-scanner AI analysis (Rempe et al., 26 Jan 2026).
2. Complex-Valued k-Space Patch Embedding
In kViT, the initial input is a complex-valued k-space array $X \in \mathbb{C}^{H \times W}$. Rather than extracting local grid patches, the kViT leverages a physics-informed radial patching strategy:
- Each k-space sample is mapped to polar coordinates $(r, \theta)$.
- The full grid is sorted by $r$ and partitioned into $P$ concentric "rings" (radial patches) of equal cardinality, each containing $m = N/P$ samples (where $N$ is the total number of k-space samples).
- Each patch $p_i$ comprises all samples whose rank in the radius ordering falls in the interval $((i-1)m,\; im]$, i.e., the $i$-th radial band.
- Every radial patch $p_i$ (a complex vector of size $m$) is flattened and embedded using a complex weight matrix $W_E \in \mathbb{C}^{d \times m}$ and bias $b_E \in \mathbb{C}^{d}$:

$$z_i = W_E\, p_i + b_E$$
This radial scheme better reflects the spectral locality and energy distribution of MRI acquisition compared to Cartesian patches, and is critical for maintaining robustness under high acceleration (undersampling) (Rempe et al., 26 Jan 2026).
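This patching pipeline can be illustrated with a short NumPy sketch (a minimal illustration under stated assumptions: radii measured from the array center, equal-cardinality rings with any remainder dropped, and randomly initialized embedding weights; not the reference implementation):

```python
import numpy as np

def radial_patches(kspace: np.ndarray, num_patches: int) -> np.ndarray:
    """Partition a 2-D complex k-space array into concentric rings
    of equal cardinality, sorted by distance from the k-space center."""
    H, W = kspace.shape
    ky, kx = np.indices((H, W))
    # Radii measured from the center of k-space (the DC component).
    r = np.hypot(ky - H / 2, kx - W / 2).ravel()
    order = np.argsort(r, kind="stable")       # sort samples by radius
    flat = kspace.ravel()[order]
    m = flat.size // num_patches               # samples per ring
    # Drop any remainder so every ring has exactly m samples.
    return flat[: m * num_patches].reshape(num_patches, m)

def embed_patches(patches: np.ndarray, W_e: np.ndarray, b_e: np.ndarray):
    """Complex linear embedding z_i = W_e @ p_i + b_e for each ring."""
    return patches @ W_e.T + b_e

rng = np.random.default_rng(0)
k = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
patches = radial_patches(k, num_patches=16)    # 16 rings of 256 samples each
d, m = 32, patches.shape[1]
W_e = (rng.standard_normal((d, m)) + 1j * rng.standard_normal((d, m))) / np.sqrt(m)
b_e = np.zeros(d, dtype=complex)
tokens = embed_patches(patches, W_e, b_e)      # shape (16, 32), complex tokens
```

For a 64×64 grid and 16 rings, each ring holds 256 samples, satisfying the equal-cardinality constraint.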
3. Complex-Valued Transformer Building Blocks
The kViT replaces all core components of the standard ViT with their complex-valued analogues as defined in (Eilers et al., 2023):
- Complex Linear Layers: Given $x \in \mathbb{C}^{n_{\text{in}}}$, the layer computes $y = Wx + b$, with $W \in \mathbb{C}^{n_{\text{out}} \times n_{\text{in}}}$ and $b \in \mathbb{C}^{n_{\text{out}}}$ complex. Real and imaginary parts are parameterized separately for implementation. Initialization uses the complex Glorot (Xavier) scheme, sampling the weight magnitude from a Rayleigh distribution with $\sigma = 1/\sqrt{n_{\text{in}} + n_{\text{out}}}$.
- Complex Attention: Multi-head self-attention operates with complex queries, keys, and values ($Q, K, V \in \mathbb{C}^{P \times d}$). The standard variant ("CAtt") computes attention as:

$$\mathrm{CAtt}(Q, K, V) = \mathrm{softmax}\!\left(\Re\!\left(\frac{Q K^{H}}{\sqrt{d}}\right)\right) V$$
Here, $(\cdot)^{H}$ denotes the Hermitian (conjugate) transpose, and the softmax is computed over the real part of the score matrix. Several alternative attention mechanisms are possible: absolute-value, phase-preserving, and real/imaginary-split variants.
- Complex Layer Normalization: Each complex vector of dimension $d$ is treated as a real $d \times 2$ matrix (stacking real and imaginary parts); a mean $\mu \in \mathbb{R}^{2}$ and a $2 \times 2$ covariance matrix are estimated, and the data are whitened accordingly. A learnable affine transform with positive-definite scaling is applied.
- Complex Feed-Forward Networks: Use two stacked complex linear layers separated by complex-valued nonlinearities (e.g., modReLU, zReLU, or split ReLU on real/imag components).
- Positional Encoding: Patch positions are injected via either (a) learnable complex positional embeddings or (b) rotary position encoding (RoPE) adapted to the complex domain. RoPE rotates each complex component $z_{p,j}$ of the token at position $p$ by a position-dependent angle $p\,\omega_j$:

$$z_{p,j} \mapsto z_{p,j}\, e^{i p \omega_j}$$
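The attention, nonlinearity, and rotary components above can be sketched in NumPy as follows (the frequency schedule `base ** (-j/d)` and the modReLU bias value are illustrative assumptions; multi-head splitting and learnable parameters are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def complex_attention(Q, K, V):
    """'CAtt': scores are the real part of the Hermitian product Q K^H;
    a real-valued softmax then weights the complex values."""
    d = Q.shape[-1]
    scores = (Q @ K.conj().T).real / np.sqrt(d)
    return softmax(scores, axis=-1) @ V

def modrelu(z, bias=-0.5):
    """modReLU: thresholds the magnitude, preserves the phase."""
    mag = np.abs(z)
    return np.maximum(mag + bias, 0.0) / np.maximum(mag, 1e-12) * z

def complex_rope(tokens, base=10000.0):
    """Rotate channel j of the token at position p by angle p * omega_j,
    i.e. multiply by exp(i * p * omega_j)."""
    P, d = tokens.shape
    omega = base ** (-np.arange(d) / d)            # per-channel frequencies
    angles = np.arange(P)[:, None] * omega[None, :]
    return tokens * np.exp(1j * angles)

rng = np.random.default_rng(1)
X = rng.standard_normal((16, 32)) + 1j * rng.standard_normal((16, 32))
out = complex_attention(complex_rope(X), complex_rope(X), modrelu(X))
```

Note that the rotary rotation is magnitude-preserving by construction, which keeps the spectral energy of each token unchanged while encoding position in the phase.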
4. Model Architecture and Training Details
The typical architecture follows a transformer encoder paradigm, stacking complex transformer layers, each containing:
- Complex LayerNorm
- Complex Multi-Head Attention (using radial-patched tokens)
- Complex LayerNorm (post-attention)
- Complex Feed-Forward Network with residual connections
Classification models prepend a learnable complex class token $z_{\text{cls}} \in \mathbb{C}^{d}$, whose output representation is used as the summary for downstream decision layers, feeding a final real-valued softmax head.
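A sketch of this readout, assuming two plausible real-feature choices (the token's magnitude, or its stacked real/imaginary parts; the actual head configuration may differ):

```python
import numpy as np

def readout(cls_token: np.ndarray, W: np.ndarray, mode: str = "magnitude"):
    """Map the complex class token to class probabilities via a real head.
    mode="magnitude" uses |z|        (W has shape (classes, d));
    mode="stack"     uses [Re(z); Im(z)] (W has shape (classes, 2d))."""
    if mode == "magnitude":
        features = np.abs(cls_token)
    else:
        features = np.concatenate([cls_token.real, cls_token.imag])
    logits = W @ features
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
z_cls = rng.standard_normal(32) + 1j * rng.standard_normal(32)
probs = readout(z_cls, rng.standard_normal((3, 32)), mode="magnitude")
```

Either choice hands the decision layer a real vector, so the final classifier itself can remain a standard real-valued softmax head.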
Key hyperparameters from (Rempe et al., 26 Jan 2026, Eilers et al., 2023):
| Dataset/Task | Layers | Heads | Embedding dim | Patch Tokens | MLP dim | Dropout | Batch Size | Params (M) | VRAM (GB) |
|---|---|---|---|---|---|---|---|---|---|
| fastMRI Prostate/Knee | 6 | 16 | 256 | 16 | 768 | 0.1 | 64 | 4.9 | 0.96 |
| In-house MIL (brain) | 2 | 4 | 48 | -- | 92 | -- | 1 | -- | 0.52 |
The models are trained via AdamW with cross-entropy loss and early stopping (patience = 15). Data are subjected to a range of acceleration (undersampling) factors, with cutout augmentation applied as necessary to enhance generalization (Rempe et al., 26 Jan 2026, Eilers et al., 2023).
5. Efficiency, Robustness, and Empirical Results
kViT models demonstrate significant computational and statistical advantages over real-valued baselines:
- Memory Efficiency: Attention operates over only $P$ radial-patch tokens ($P = 16$ in the fastMRI configuration), yielding $O(P^2)$ attention cost, as opposed to quadratic cost in the far larger Cartesian patch count for standard ViTs. On fastMRI Prostate (batch size 64), kViT required 0.96 GB VRAM vs. 10.6 GB for ResNet50 and 3.7 GB for ViT-Tiny; in multi-instance learning, kViT used 0.52 GB versus 11.7 GB for ViT-Tiny and 35.4 GB for EfficientNet-B0, a 68× reduction (Rempe et al., 26 Jan 2026).
- Overfitting and Generalization: Complex-valued transformers exhibit reduced overfitting relative to doubled-real ("stacked R/I") counterparts; on MusicNet, the real ViT overfits by epoch 10, while complex variants remain stable (Eilers et al., 2023).
- Performance and Robustness: On fastMRI Prostate test data, kViT matches or outperforms real-valued models in AUROC at full sampling and retains its performance under undersampling, whereas ViT-Tiny drops precipitously under the same conditions. In-house multi-instance learning (MIL) shows kViT maintains AUPRC around 54% even under aggressive undersampling, whereas ViT-Tiny degrades to 43% (Rempe et al., 26 Jan 2026).
- Ablation Findings: Using fewer or more rings (radial patches) affects AUROC non-monotonically (optimum at 16), and removing RoPE or augmentation reduces accuracy. Discarding phase produces a severe collapse in performance (AUROC falling from 0.685 to 0.551 under acceleration), confirming the necessity of preserving phase.
6. Key Implementation Considerations and Required Modules
Implementation draws upon standardized complex-valued NN recipes (e.g., Trabelsi et al. ICLR 2018) and the architectural utilities from (Eilers et al., 2023):
- Complex Linear Layer: Trabelsi-form complex FC, forward and Wirtinger-based backward pass.
- Complex Multi-Head Attention: CAtt preferred; variant selection (AAtt, APAtt, RI Att) possible.
- Complex LayerNorm: Covariance-based normalization with learnable positive-definite scaling.
- Complex Feed-Forward: Two layers, nonlinearities selected from modReLU/zReLU/split ReLU.
- Residuals, Dropout, Optimizer: Standard transformer scaffolding; Adam on split real/imag parameters suffices. No specialized optimizer is required.
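A single-vector sketch of the covariance-based complex layer normalization listed above (the eps regularization and the eigendecomposition-based inverse square root are implementation choices here; the learnable affine transform is omitted):

```python
import numpy as np

def complex_layernorm(z: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Whiten one complex feature vector: treat each entry as a 2-vector
    (real, imag), subtract the mean, and multiply by the inverse square
    root of the 2x2 covariance matrix shared across the feature axis."""
    v = np.stack([z.real, z.imag], axis=-1)        # shape (d, 2)
    c = v - v.mean(axis=0, keepdims=True)
    # 2x2 covariance over the feature dimension, regularized for stability.
    cov = np.einsum("di,dj->ij", c, c) / z.shape[0] + eps * np.eye(2)
    # Inverse matrix square root via eigendecomposition (cov is symmetric PD).
    w, U = np.linalg.eigh(cov)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
    white = c @ inv_sqrt
    return white[:, 0] + 1j * white[:, 1]

rng = np.random.default_rng(3)
# Deliberately unbalanced input: offset real part, inflated imaginary variance.
z = rng.standard_normal(256) + 1j * 3.0 * rng.standard_normal(256) + 2.0
y = complex_layernorm(z)
```

After whitening, the real and imaginary components are zero-mean, unit-variance, and decorrelated, which is the property the learnable positive-definite affine transform then rescales.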
These core modules enable operation throughout on raw complex-valued data, thus avoiding the loss of representational fidelity incurred by mapping $\mathbb{C} \to \mathbb{R}$, while yielding both accuracy and significant gains in robustness and computational efficiency (Eilers et al., 2023, Rempe et al., 26 Jan 2026).
7. Significance and Domain-Specific Impact
kViT establishes a framework for domain-aligned, resource-efficient analysis of frequency-domain data, particularly in MRI where phase and spectral locality are essential. The architecture achieves SOTA robustness under aggressive undersampling, sustains classification accuracy competitive with real-valued baselines, and demonstrates substantial reductions in memory and computational load (Rempe et al., 26 Jan 2026). The use of physics-informed patching and complex-valued operations provides an architectural blueprint for extending Transformers to other complex-data-intensive domains, including remote sensing and scientific imaging. These advances suggest kViT may become the preferred paradigm for direct-from-scanner AI analysis, providing a pathway for improved fidelity and efficiency in complex-valued signal processing.