
Zero-Shot fMRI-to-Image Reconstruction

Updated 28 January 2026
  • The paper presents innovative frameworks that enable zero-shot fMRI-to-image reconstruction by overcoming inter-individual variability through latent factorization and alignment.
  • It leverages large, standardized multi-subject datasets and advanced preprocessing techniques to robustly map fMRI signals to images, improving metrics like PixCorr and CLIP similarity.
  • Robust loss functions—including adversarial, cycle-consistency, and contrastive losses—ensure semantic fidelity and enhance generalization for practical brain-computer interfaces.

Zero-shot cross-subject fMRI-to-image reconstruction refers to the automated synthesis of visual stimuli from fMRI data acquired from previously unseen individuals, without requiring subject-specific model fine-tuning or supervised adaptation. This paradigm addresses the fundamental challenge of inter-individual variability in cortical responses, which has historically restricted neural decoding to a single-subject regime and limited the deployment of brain-computer interfaces (BCIs) and neuroscientific exploration at scale. Recent research has established several technically and conceptually distinct frameworks capable of robust zero-shot cross-subject decoding, typically leveraging latent-space disentanglement, explicit factorization strategies, or brain-inspired architectural priors to achieve universal mapping between stimulus-evoked brain activity and image domains.

1. Inter-individual Variability and the Cross-Subject Challenge

The essential barrier to generalizable fMRI-to-image reconstruction is the pronounced variability in cortical organization, hemodynamic response, and stimulus representation across individuals. Even with tightly controlled stimuli, anatomical and functional idiosyncrasies render the mapping from neural data to images highly non-injective and confounded, complicating direct transfer of decoder models between subjects. Classical approaches relying on within-subject training or indirect region-of-interest alignment (e.g., anatomical, functional, or hyperalignment) are insufficient for pure zero-shot transfer, as they typically require at least minimal calibration data from the target subject or assume fixed ROI schemes unsuitable for universal deployment (Dai et al., 7 Feb 2025, Wang et al., 31 Oct 2025, Huo et al., 21 Jan 2026).

2. Datasets and Standardization for Cross-Subject Modeling

Robust zero-shot decoding necessitates large, well-curated multi-subject datasets with standardized preprocessing. The UniCortex-fMRI dataset exemplifies this principle, aggregating data from the Natural Scenes Dataset (NSD), BOLD5000, Natural Object Dataset (NOD), and HCP-Movie cohorts—totaling over 219 subjects and 816,660 fMRI–image pairs—while adhering to the fsLR–32k cortical surface representation and harmonized ROI definitions (Huo et al., 21 Jan 2026). Such standardization enables subject-invariant spatial indexing of cortical activations and systematic evaluation across a broad diversity of stimuli and individuals.

Typical preprocessing involves mapping volumetric fMRI to the fsLR cortical surface, z-scoring per session, restricting to visual cortex ROIs (e.g., V1, V2, LO1–3, FFA, PPA), and rasterizing activity to 256×256 cortical maps, as performed in ZEBRA and PictorialCortex (Wang et al., 31 Oct 2025, Huo et al., 21 Jan 2026). The use of session-average activation, spatial normalization, and dataset- or subject-specific train/test splits is critical to isolating true cross-subject generalization effects.
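As a rough illustration, the per-session normalization and cortical-map rasterization described above can be sketched as follows; the array shapes, the random data, and the vertex-to-pixel flattening index are all hypothetical stand-ins, not the actual fsLR mapping:

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Z-score per-vertex activation estimates within one session."""
    return (x - x.mean()) / (x.std() + eps)

def rasterize_to_map(values, flat_indices, size=256):
    """Scatter per-vertex activations onto a 2-D cortical map.

    flat_indices: a precomputed flat pixel index per vertex under a fixed,
    subject-invariant cortical flattening (hypothetical here).
    """
    img = np.zeros(size * size, dtype=np.float32)
    np.add.at(img, flat_indices, values)  # accumulate overlapping vertices
    return img.reshape(size, size)

# Toy example: activation estimates for 100 visual-cortex vertices,
# from two sessions of the same subject viewing the same stimulus.
rng = np.random.default_rng(0)
sessions = [rng.normal(size=100) for _ in range(2)]
beta = np.mean([zscore(s) for s in sessions], axis=0)  # session-average activation
idx = rng.integers(0, 256 * 256, size=100)             # hypothetical flattening
cmap = rasterize_to_map(beta, idx)
print(cmap.shape)  # (256, 256)
```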

3. Core Methodological Frameworks

Three principal architectural paradigms have demonstrated efficacy for zero-shot cross-subject fMRI-to-image synthesis, each with unique technical foundations:

3.1 Universal/Compositional Latent Modeling

PictorialCortex introduces a universal latent encoding of cortical activity via a transformer-based autoencoder pretrained on large-scale population datasets. A latent factorization–composition module then decomposes the universal latent z into a stimulus-driven code c, a subject embedding e_sub, a dataset embedding e_data, and a nuisance code n, using paired queries and transformer blocks (Huo et al., 21 Jan 2026). The compositional equation ẑ = C(F(z | e_sub, e_data), e_sub, e_data) allows surrogate latents to be synthesized for any unseen subject via aggregation across the training subject embeddings, thereby enabling zero-shot image reconstruction. Re-factorizing consistency losses further reinforce alignment of latent codes with semantic ground truth.
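A heavily simplified sketch of this factorization–composition idea follows, with random linear maps standing in for the paper's transformer-based F and C, and a mean over training-subject embeddings standing in for the aggregation step; every name and dimension here is a hypothetical placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # latent width (hypothetical)

# Linear stand-ins for the transformer-based factorizer F and composer C.
W_c = rng.normal(size=(D, D)) / np.sqrt(D)
W_n = rng.normal(size=(D, D)) / np.sqrt(D)
W_comp = rng.normal(size=(D, 3 * D)) / np.sqrt(3 * D)

def factorize(z, e_sub, e_data):
    """F: split universal latent z into stimulus code c and nuisance n,
    conditioned on subject/dataset embeddings (here by simple addition)."""
    h = z + e_sub + e_data
    return W_c @ h, W_n @ h  # (c, n)

def compose(c, e_sub, e_data):
    """C: re-synthesize a surrogate latent from c plus identity embeddings."""
    return W_comp @ np.concatenate([c, e_sub, e_data])

# Zero-shot surrogate for an unseen subject: aggregate (here, average)
# the training subjects' embeddings in place of a learned one.
z = rng.normal(size=D)
train_e_subs = rng.normal(size=(5, D))
e_data = rng.normal(size=D)
e_agg = train_e_subs.mean(axis=0)
c, n = factorize(z, e_agg, e_data)
z_hat = compose(c, e_agg, e_data)
print(z_hat.shape)  # (64,)
```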

3.2 Explicit Disentanglement of Subject and Semantic Components

ZEBRA decomposes fMRI-derived representations into subject-invariant and subject-specific components through adversarial training. A pretrained ViT encoder produces patch-token features E, which are split via a shallow self-attention network F_i into E_i (invariant) and E_s (residual). A subject discriminator enforces the invariance of E_i, while an auxiliary classifier and a BiMixCo contrastive loss anchor the semantic-specific subspace. With this architecture, no adaptation to the test subject is necessary: the invariant pipeline alone suffices for zero-shot decoding (Wang et al., 31 Oct 2025).
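The forward split into invariant and residual token features might look as follows; this toy numpy version replaces the trained shallow self-attention network with random weights and omits the adversarial and contrastive training losses entirely:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, Wq, Wk, Wv):
    """One shallow self-attention layer standing in for F_i (a sketch)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

rng = np.random.default_rng(2)
T, D = 16, 32                    # patch tokens x feature dim (hypothetical)
E = rng.normal(size=(T, D))      # ViT patch-token features
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

E_i = self_attention(E, Wq, Wk, Wv)  # subject-invariant branch
E_s = E - E_i                         # subject-specific residual
# During training, a subject discriminator (adversarial) acts on E_i, and a
# classifier plus BiMixCo contrastive loss anchor the semantic subspace;
# at test time only the E_i pipeline is used.
print(E_i.shape, E_s.shape)
```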

3.3 Cross-Subject Alignment and Transfer Matrices

MindAligner implements a low-rank Brain Transfer Matrix (BTM) M = AB that linearly maps fMRI data from a novel subject S_N to the voxel space of a known subject S_K, using only 1 h of unlabeled data from S_N and freezing a pre-trained decoder D. The BTM is optimized via a multi-level loss comprising signal reconstruction, KL divergence, and latent representational similarity. A soft cross-stimulus FiLM-based functional alignment module facilitates mapping in the absence of perfectly paired stimuli (Dai et al., 7 Feb 2025). Once M is trained, truly zero-shot decoding is achieved by pushing M·F_N through D.
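In spirit, applying a trained low-rank transfer matrix at test time reduces to one matrix product before the frozen decoder. The sketch below uses random weights and a stand-in decoder, so every dimension, weight, and function name is a hypothetical placeholder:

```python
import numpy as np

rng = np.random.default_rng(3)
V_new, V_known, r = 500, 600, 16  # voxel counts and rank (hypothetical)

# Low-rank Brain Transfer Matrix M = A @ B mapping new-subject voxels
# into the known subject's voxel space; rank r << min(V_new, V_known).
A = rng.normal(size=(V_known, r)) / np.sqrt(r)
B = rng.normal(size=(r, V_new)) / np.sqrt(V_new)

def frozen_decoder(f_known):
    """Stand-in for the frozen pre-trained decoder D (hypothetical)."""
    return np.tanh(f_known[:64])  # pretend the first 64 dims drive an image latent

f_new = rng.normal(size=V_new)              # fMRI pattern from the novel subject
image_latent = frozen_decoder((A @ B) @ f_new)  # zero-shot: D(M f_N)
print(image_latent.shape)  # (64,)
```

The low-rank factorization keeps the adaptable parameter count at r·(V_new + V_known) rather than V_new·V_known, which is what makes fitting it on limited unlabeled data plausible.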

3.4 Hierarchical Mixture-of-Experts and Routing

MoRE-Brain employs a hierarchical Mixture-of-Experts (MoE) fMRI encoder with multiple router layers that partition voxels into functional subnetworks, mapping activity onto specialized expert MLPs. Cross-subject generalization is achieved by freezing the expert networks and adapting only the per-subject routers (thin parameter layers mapping voxels to expert assignments), which can be efficiently trained on minimal data. The diffusion-based decoder incorporates dual-stage routing (time and space) for dynamic conditioning on semantic and spatial attributes (Wei et al., 21 May 2025).
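A minimal sketch of soft voxel-to-expert routing, with frozen single-layer "experts" and a thin trainable per-subject router; all shapes and the single-layer experts are hypothetical simplifications of the paper's expert MLPs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
V, E, H = 200, 4, 32  # voxels, experts, hidden width (hypothetical)

# Frozen, population-shared expert networks (one weight matrix each here).
experts = [rng.normal(size=(H, V)) / np.sqrt(V) for _ in range(E)]

def encode(x, router_logits):
    """Route voxels to experts softly, then sum the expert outputs."""
    gate = softmax(router_logits)               # (V, E) voxel->expert weights
    outs = [W @ (gate[:, e] * x) for e, W in enumerate(experts)]
    return np.sum(outs, axis=0)                 # (H,)

# Per-subject adaptation trains ONLY this thin router; experts stay frozen.
router_logits_subject = rng.normal(size=(V, E))
x = rng.normal(size=V)                          # one fMRI pattern
h = encode(x, router_logits_subject)
print(h.shape)  # (32,)
```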

4. Training Objectives and Losses

All architectures for zero-shot cross-subject reconstruction employ multi-component loss functions to jointly optimize for subject invariance, semantic fidelity, and signal-level alignment. The loss types in use across these frameworks include adversarial subject-discrimination losses, contrastive losses (e.g., BiMixCo), consistency losses (cycle- and re-factorizing), signal-level reconstruction losses, KL-divergence regularization, and latent representational-similarity terms.

Optimization is typically staged: pretraining on large subject pools, then loss-based adaptation (if necessary) with strict parameter freezing or thin adaptation layers to preserve zero-shot characteristics (Dai et al., 7 Feb 2025, Wei et al., 21 May 2025, Wang et al., 31 Oct 2025, Huo et al., 21 Jan 2026).

5. Experimental Protocols and Quantitative Evaluation

Standardized evaluation leverages shared test splits of fMRI–image pairs from subjects unseen during training, with metrics spanning low-level fidelity and high-level semantic precision. Prominent measures include:

  • Pixel-wise Pearson correlation (PixCorr)
  • SSIM (Structural Similarity Index)
  • LPIPS (Learned Perceptual Image Patch Similarity)
  • CNN feature matching (AlexNet, Inception, EfficientNet-B, SwAV)
  • CLIP cosine similarity (vision and text embeddings)
  • Retrieval tasks (image and brain, top-1 among candidates)
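Several of these metrics are simple enough to state directly. The sketch below implements PixCorr, embedding cosine similarity, and top-1 retrieval on toy data; the embeddings here are random stand-ins, not real CLIP outputs:

```python
import numpy as np

def pixcorr(img_a, img_b):
    """Pixel-wise Pearson correlation between two images."""
    a, b = img_a.ravel(), img_b.ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def cosine_sim(emb_a, emb_b):
    """Cosine similarity between two (e.g. CLIP) embeddings."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))

def top1_retrieval(query_emb, candidate_embs):
    """Index of the best-matching candidate (top-1 retrieval)."""
    sims = [cosine_sim(query_emb, c) for c in candidate_embs]
    return int(np.argmax(sims))

rng = np.random.default_rng(5)
img = rng.random((64, 64))
print(round(pixcorr(img, img), 3))  # 1.0 -- identical images correlate perfectly

emb = rng.normal(size=16)
cands = [rng.normal(size=16), emb, rng.normal(size=16)]
print(top1_retrieval(emb, cands))  # 1
```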

Notable empirical findings:

  • PictorialCortex achieves mean PixCorr of 0.104 and CLIP accuracy of 71.2%, outperforming MindEye2 (PixCorr 0.095, CLIP 56.5%) and other zero-shot baselines across four datasets (Huo et al., 21 Jan 2026).
  • ZEBRA reports PixCorr 0.131, AlexNet(5) accuracy 81.2%, CLIP similarity 71.5%, approaching the performance of fully-tuned models and surpassing previous zero-shot approaches by significant margins (Wang et al., 31 Oct 2025).
  • MindAligner attains a 17.9-percentage-point improvement in brain-retrieval accuracy (57.4% → 75.3%) over baselines in data-limited zero-shot adaptation (Dai et al., 7 Feb 2025).
  • MoRE-Brain demonstrates state-of-the-art reconstruction fidelity and strong dependence on fMRI signal bottlenecks, validating true neural decoding rather than prior-driven synthesis (Wei et al., 21 May 2025).

Ablation studies confirm the essential role of compositional latent modeling, adversarial disentanglement, and multi-level-alignment losses in achieving robust cross-subject transfer.

6. Neuroscientific Insights and Interpretability

Zero-shot cross-subject frameworks facilitate new analyses of population-level brain function:

  • Transfer Quantity (TQ) heatmaps (MindAligner) reveal that early visual cortices (V1/V2) exhibit high conservation and low inter-individual transfer costs, while higher-order regions (e.g., FFA, PPA, OPA) are more variable, echoing their roles in semantic and object processing (Dai et al., 7 Feb 2025).
  • PictorialCortex shows that latent representations of the same stimulus cluster tightly in c-space across subjects, indicating a well-aligned, subject-invariant embedding for visual information (Huo et al., 21 Jan 2026).
  • MoRE-Brain leverages explicit routing and Mixture-of-Experts mechanisms, allowing attributions of decoded semantic and spatial features to identifiable cortical subnetworks. ICA on attribution maps produces interpretable components localized to expected ROIs (e.g., visual, dorsal-attention, DMN). Expert activations correlate with specific COCO image categories and spatial salience, supporting modular specialization (Wei et al., 21 May 2025).

A plausible implication is that compositional or modular decoding architectures not only permit superior cross-individual transfer but also provide new avenues for functional parcellation and causal investigation of cortical processing.

7. Limitations and Future Perspectives

  • Semantic limitation: Despite improved generalization, zero-shot decoders lag behind fully personalized models for rare, fine-grained, or ambiguous categories (Wang et al., 31 Oct 2025, Huo et al., 21 Jan 2026).
  • Dependence on universal representations: Errors and biases in universal cortical autoencoders or CLIP-aligned embedding spaces can propagate and amplify, suggesting the need for ongoing development of more ecologically valid or end-to-end learned representations (Huo et al., 21 Jan 2026).
  • Dataset scale and diversity: Performance saturates as the number of training subjects increases beyond ∼150, but broader population coverage remains beneficial (Huo et al., 21 Jan 2026). Expanding datasets and exploring video or multimodal stimuli are promising avenues.
  • Static imagery focus: Extension to dynamic stimuli, imagined content, or nonvisual paradigms is an open technical frontier (Wang et al., 31 Oct 2025, Huo et al., 21 Jan 2026).
  • ROI and spatial basis dependence: Most current models require strict cortical parcellation and mapping conventions; generalization to less preprocessed, raw fMRI volumes is unresolved.

Future approaches may integrate stronger language-vision co-training, explicit functional alignment priors, or joint end-to-end training of universal encoders and diffusion decoders to further advance the fidelity and robustness of zero-shot neural decoding (Huo et al., 21 Jan 2026, Wang et al., 31 Oct 2025, Dai et al., 7 Feb 2025).
