
Complex-Valued Vision Transformer (kViT)

Updated 2 February 2026
  • Complex-Valued Vision Transformer (kViT) is an architecture that processes raw complex k-space data, preserving both magnitude and phase information for accurate imaging analysis.
  • It employs a physics-informed radial patch embedding and complex-valued layers, attention, and normalization to align with the spectral properties of MRI data.
  • kViT demonstrates significant memory efficiency and improved robustness under aggressive undersampling, outperforming traditional real-valued models in key metrics.

A Complex-Valued Vision Transformer (kViT) is an architecture designed to perform end-to-end learning directly on complex-valued data, such as MRI k-space, utilizing all data components and preserving phase information throughout the model pipeline. Unlike standard Vision Transformers (ViT) that operate on real-valued image data post-magnitude reconstruction, kViT models process raw frequency-domain (k-space) samples and natively employ complex-valued layers, attention, and normalization, thereby aligning the model structure with the physical properties of the input modality (Eilers et al., 2023, Rempe et al., 26 Jan 2026).

1. Core Principles and Motivation

The motivation for kViT arises from the inherently complex-valued nature of many scientific imaging modalities, notably MRI, where k-space data contains both magnitude and phase crucial to accurate modeling and diagnosis. Traditional pipelines discard phase information and rely on local operations that are incongruent with the global, non-local structure of k-space. kViT architectures reframe the problem by operating directly on complex-valued signals without projecting into $\mathbb{R}^2$, thus preserving spectral integrity and enabling direct-from-scanner AI analysis (Rempe et al., 26 Jan 2026).

2. Complex-Valued k-Space Patch Embedding

In kViT, the initial input is a complex-valued k-space array $X \in \mathbb{C}^{H \times W}$. Rather than extracting local grid patches, the kViT leverages a physics-informed radial patching strategy:

  • Each k-space sample $X(u_i, v_i)$ is mapped to polar coordinates $(r_i, \theta_i)$.
  • The full $H \times W$ grid is sorted by $r_i$ and partitioned into $N$ concentric ā€œringsā€ (radial patches) of equal cardinality, each containing $P$ samples (where $N = HW/P$).
  • Each patch $P_n$ comprises all $X(u, v)$ such that $R_{(n-1)P+1} \leq \sqrt{u^2+v^2} < R_{nP}$.
  • Every radial patch (a complex vector of size $P$) is flattened and embedded using a complex weight matrix $W \in \mathbb{C}^{d \times P}$ and bias $b \in \mathbb{C}^d$:

$$\mathbf{e}_n = W x_n + b = \left(W_r x_n^{(r)} - W_i x_n^{(i)} + b_r\right) + i\left(W_r x_n^{(i)} + W_i x_n^{(r)} + b_i\right)$$

This radial scheme better reflects the spectral locality and energy distribution of MRI acquisition compared to Cartesian patches, and is critical for maintaining robustness under high acceleration (undersampling) (Rempe et al., 26 Jan 2026).
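The radial patching and complex embedding steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the papers' released code; function names, shapes, and the centering convention are assumptions:

```python
import numpy as np

def radial_patches(kspace, num_patches):
    """Sort k-space samples by radius r_i and split into equal-size rings."""
    H, W = kspace.shape
    u, v = np.meshgrid(np.arange(H) - H / 2, np.arange(W) - W / 2, indexing="ij")
    r = np.sqrt(u**2 + v**2).ravel()
    order = np.argsort(r, kind="stable")       # low-frequency samples first
    P = (H * W) // num_patches                 # samples per concentric ring
    flat = kspace.ravel()[order][: num_patches * P]
    return flat.reshape(num_patches, P)        # one row per radial patch

def complex_embed(patches, W_c, b_c):
    """e_n = W x_n + b, computed directly on the complex dtype."""
    return patches @ W_c.T + b_c

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
tokens = radial_patches(X, num_patches=16)     # (16, 64) complex tokens
W_c = rng.standard_normal((8, 64)) + 1j * rng.standard_normal((8, 64))
b_c = rng.standard_normal(8) + 1j * rng.standard_normal(8)
emb = complex_embed(tokens, W_c, b_c)          # (16, 8) complex embeddings
```

Note that NumPy's native complex matrix product realizes exactly the real/imaginary split form of the embedding equation, so no manual bookkeeping of $W_r$ and $W_i$ is needed.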

3. Complex-Valued Transformer Building Blocks

The kViT replaces all core components of the standard ViT with their complex-valued analogues as defined in (Eilers et al., 2023):

  • Complex Linear Layers: Given $x \in \mathbb{C}^n$, $y = Wx + b$ with $W$ and $b$ complex. Real and imaginary parts are parameterized separately for implementation. Initialization uses the complex Glorot (Xavier) scheme, sampling the magnitude $|W_{jk}|$ from a Rayleigh$(\sigma)$ distribution with $\sigma^2 = 2/(\mathrm{fan}_{\mathrm{in}} + \mathrm{fan}_{\mathrm{out}})$.
  • Complex Attention: Multi-head self-attention operates with complex queries, keys, and values ($Q, K, V \in \mathbb{C}^{N \times d}$). The standard variant (ā€œCAttā€) computes attention as:

$$\mathrm{CAtt}(Q, K, V) = \mathrm{softmax}\left(\Re(QK^H)/\sqrt{d}\right) V$$

Here, $K^H$ denotes the Hermitian (conjugate) transpose, and the softmax is computed over the real part of the score matrix. Several alternative attention mechanisms (absolute, phase-preserving, and real/imaginary split) are possible.
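A single-head sketch of CAtt under these definitions (multi-head splitting and the input/output projections are omitted; all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def c_att(Q, K, V):
    """CAtt: real softmax over Re(Q K^H)/sqrt(d), complex-weighted sum of V."""
    d = Q.shape[-1]
    scores = np.real(Q @ K.conj().T) / np.sqrt(d)  # K^H = conjugate transpose
    return softmax(scores) @ V                     # weights real, output complex

rng = np.random.default_rng(1)
N, d = 16, 8
Q = rng.standard_normal((N, d)) + 1j * rng.standard_normal((N, d))
K = rng.standard_normal((N, d)) + 1j * rng.standard_normal((N, d))
V = rng.standard_normal((N, d)) + 1j * rng.standard_normal((N, d))
out = c_att(Q, K, V)                               # (16, 8) complex
```

Because the attention weights are real and rows of the softmax sum to one, each output token is a convex combination of the complex value vectors, which is what keeps phase information intact through the attention step.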

  • Complex Layer Normalization: Each complex vector $z \in \mathbb{C}^d$ is treated as a $2 \times d$ real matrix (stacking real and imaginary parts); the mean $\mu$ and the $2 \times 2$ covariance matrix are estimated, and the input is whitened accordingly. A learnable affine transform with positive-definite scaling is then applied.
  • Complex Feed-Forward Networks: Two stacked complex linear layers separated by complex-valued nonlinearities (e.g., modReLU, zReLU, or split ReLU applied to the real and imaginary components).
  • Positional Encoding: Patch positions are injected via either (a) learnable complex positional embeddings $\phi_n \in \mathbb{C}^d$ or (b) rotary position encoding (RoPE) adapted to the complex domain. RoPE rotates each $(2i, 2i+1)$ pair by the angle $\theta_n = n\omega_i$:

$$\begin{pmatrix} \cos(n\omega_i) & -\sin(n\omega_i) \\ \sin(n\omega_i) & \cos(n\omega_i) \end{pmatrix}$$
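The whitening step of complex layer normalization can be sketched as follows. This is an illustration under the $2 \times d$ stacking described above; the learnable positive-definite affine transform is omitted, and the epsilon placement is an assumption:

```python
import numpy as np

def complex_layernorm(z, eps=1e-5):
    """Whiten a complex vector by treating (real, imag) as a 2-D real variable."""
    x = np.stack([z.real, z.imag])                 # shape (2, d)
    mu = x.mean(axis=1, keepdims=True)
    xc = x - mu
    cov = (xc @ xc.T) / z.size + eps * np.eye(2)   # regularized 2x2 covariance
    # inverse matrix square root of the 2x2 covariance via eigendecomposition
    w, U = np.linalg.eigh(cov)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
    y = inv_sqrt @ xc                              # whitened (real, imag) rows
    return y[0] + 1j * y[1]

rng = np.random.default_rng(2)
z = rng.standard_normal(256) * 3 + 1j * (rng.standard_normal(256) * 0.5 + 2.0)
zn = complex_layernorm(z)
```

After whitening, the real and imaginary components are zero-mean, unit-variance, and decorrelated, which is the complex analogue of what standard LayerNorm guarantees for real activations.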

4. Model Architecture and Training Details

The typical architecture follows a transformer encoder paradigm, stacking $L$ complex transformer layers, each containing:

  1. Complex LayerNorm
  2. Complex Multi-Head Attention (using radial-patched tokens)
  3. Complex LayerNorm (post-attention)
  4. Complex Feed-Forward Network with residual connections

Classification models prepend a learnable complex class token $x_{\mathrm{cls}} \in \mathbb{C}^d$, which serves as the summary representation for downstream decision layers (either $|x_{\mathrm{cls}}|$ or $\Re(x_{\mathrm{cls}})$ is fed to a final real-valued softmax head).
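A minimal readout sketch for the magnitude variant of this head (the weight shapes and names here are hypothetical placeholders, not from the papers):

```python
import numpy as np

def cls_head(x_cls, W_head, b_head):
    """Real-valued softmax head on |x_cls|; the Re(x_cls) variant would
    substitute x_cls.real for np.abs(x_cls)."""
    logits = W_head @ np.abs(x_cls) + b_head   # magnitude makes features real
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
d, num_classes = 8, 2
x_cls = rng.standard_normal(d) + 1j * rng.standard_normal(d)
probs = cls_head(x_cls,
                 rng.standard_normal((num_classes, d)),
                 rng.standard_normal(num_classes))
```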

Key hyperparameters from (Rempe et al., 26 Jan 2026, Eilers et al., 2023):

| Dataset/Task | Layers | Heads | Embedding $d$ | Patch Tokens $N$ | MLP dim | Dropout | Batch Size | Params (M) | VRAM (GB) |
|---|---|---|---|---|---|---|---|---|---|
| fastMRI Prostate/Knee | 6 | 16 | 256 | 16 | 768 | 0.1 | 64 | 4.9 | 0.96 |
| In-house MIL (brain) | 2 | 4 | 48 | -- | 92 | -- | 1 | -- | 0.52 |

The models are trained with AdamW (lr $= 1 \times 10^{-4}$), cross-entropy loss, and early stopping (patience = 15). Data are subjected to acceleration factors $R \in \{2, 4, \ldots, 24\}$, with cutout augmentation applied as needed to enhance generalization (Rempe et al., 26 Jan 2026, Eilers et al., 2023).

5. Efficiency, Robustness, and Empirical Results

kViT models demonstrate significant computational and statistical advantages over real-valued baselines:

  • Memory Efficiency: Attention operates over $N = 16$ tokens (radial patches), yielding $O(N^2 d)$ complexity, as opposed to $O(196^2 d)$ for standard $14 \times 14$ ViTs. On fastMRI Prostate (batch size 64), kViT required 0.96 GB VRAM vs. 10.6 GB for ResNet50 and 3.7 GB for ViT-Tiny; in multi-instance learning, kViT used 0.52 GB versus 11.7 GB for ViT-Tiny and 35.4 GB for EfficientNet-B0, a $68\times$ reduction (Rempe et al., 26 Jan 2026).
  • Overfitting and Generalization: Complex-valued transformers exhibit reduced overfitting relative to doubled-real (ā€œstacked R/Iā€) counterparts; in MusicNet, real ViT overfits by epoch 10, while complex variants remain stable (Eilers et al., 2023).
  • Performance and Robustness: On fastMRI Prostate test data, kViT achieves AUROC $78.2 \pm 2.2$ at full sampling and $77.0 \pm 3.0$ under $16\times$ undersampling, outperforming or matching real-valued models; ViT-Tiny drops precipitously to $56.3 \pm 6.2$ under the same conditions. In in-house multi-instance learning (MIL), kViT maintains AUPRC around 54 even under aggressive undersampling, whereas ViT-Tiny degrades to 43 (Rempe et al., 26 Jan 2026).
  • Ablation Findings: Using fewer or more rings (radial patches) affects AUROC non-monotonically (optimum at 16), and removing RoPE or augmentation reduces accuracy. Discarding phase causes a severe collapse in performance (AUROC from 0.685 to 0.551 at $16\times$ acceleration), confirming the necessity of preserving phase.
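The quadratic savings quoted in the memory-efficiency bullet follow directly from the token counts:

```python
# attention score matrix has tokens^2 entries per head
kvit_tokens = 16     # radial patches
vit_tokens = 196     # 14 x 14 Cartesian patches
ratio = vit_tokens**2 / kvit_tokens**2
print(ratio)         # 150.0625 -> ~150x fewer attention score entries for kViT
```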

6. Key Implementation Considerations and Required Modules

Implementation draws upon standardized complex-valued NN recipes (e.g., Trabelsi et al. ICLR 2018) and the architectural utilities from (Eilers et al., 2023):

  • Complex Linear Layer: Trabelsi-form complex FC, forward and Wirtinger-based backward pass.
  • Complex Multi-Head Attention: CAtt preferred; variant selection (AAtt, APAtt, RI Att) possible.
  • Complex LayerNorm: Covariance-based normalization with learnable positive-definite scaling.
  • Complex Feed-Forward: Two layers, nonlinearities selected from modReLU/zReLU/split ReLU.
  • Residuals, Dropout, Optimizer: Standard transformer scaffolding; Adam on split real/imag parameters suffices. No specialized optimizer is required.

These core modules enable operation throughout on raw complex-valued data, thus avoiding the loss of representational fidelity incurred by mapping $\mathbb{C} \to \mathbb{R}^2$, while yielding both accuracy and significant gains in robustness and computational efficiency (Eilers et al., 2023, Rempe et al., 26 Jan 2026).
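As one concrete module example, the complex feed-forward block with a modReLU nonlinearity can be sketched as follows (weight shapes, the bias value, and the numerical-safety epsilon are illustrative assumptions):

```python
import numpy as np

def mod_relu(z, b=-0.1):
    """modReLU: thresholds the magnitude by a (normally learnable) bias b
    while preserving the phase of each element."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * z / (mag + 1e-9)

def complex_ffn(z, W1, b1, W2, b2):
    """Two complex linear layers with a modReLU in between."""
    return mod_relu(z @ W1.T + b1) @ W2.T + b2

rng = np.random.default_rng(4)
z = rng.standard_normal((16, 8)) + 1j * rng.standard_normal((16, 8))
W1 = rng.standard_normal((32, 8)) + 1j * rng.standard_normal((32, 8))
W2 = rng.standard_normal((8, 32)) + 1j * rng.standard_normal((8, 32))
b1 = np.zeros(32, dtype=complex)
b2 = np.zeros(8, dtype=complex)
out = complex_ffn(z, W1, b1, W2, b2)           # (16, 8) complex
```

modReLU is a natural fit here because it gates on magnitude alone, so the phase that kViT is designed to preserve passes through the nonlinearity unchanged.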

7. Significance and Domain-Specific Impact

kViT establishes a framework for domain-aligned, resource-efficient analysis of frequency-domain data, particularly in MRI where phase and spectral locality are essential. The architecture achieves SOTA robustness under aggressive undersampling, sustains classification accuracy competitive with real-valued baselines, and demonstrates substantial reductions in memory and computational load (Rempe et al., 26 Jan 2026). The use of physics-informed patching and complex-valued operations provides an architectural blueprint for extending Transformers to other complex-data-intensive domains, including remote sensing and scientific imaging. These advances suggest kViT may become the preferred paradigm for direct-from-scanner AI analysis, providing a pathway for improved fidelity and efficiency in complex-valued signal processing.
