Dynadiff: Single-stage Decoding of Images from Continuously Evolving fMRI

Published 20 May 2025 in cs.CV | (2505.14556v1)

Abstract: Brain-to-image decoding has been recently propelled by the progress in generative AI models and the availability of large ultra-high field functional Magnetic Resonance Imaging (fMRI). However, current approaches depend on complicated multi-stage pipelines and preprocessing steps that typically collapse the temporal dimension of brain recordings, thereby limiting time-resolved brain decoders. Here, we introduce Dynadiff (Dynamic Neural Activity Diffusion for Image Reconstruction), a new single-stage diffusion model designed for reconstructing images from dynamically evolving fMRI recordings. Our approach offers three main contributions. First, Dynadiff simplifies training as compared to existing approaches. Second, our model outperforms state-of-the-art models on time-resolved fMRI signals, especially on high-level semantic image reconstruction metrics, while remaining competitive on preprocessed fMRI data that collapse time. Third, this approach allows a precise characterization of the evolution of image representations in brain activity. Overall, this work lays the foundation for time-resolved brain-to-image decoding.


Summary

The paper introduces a novel single-stage diffusion-based fMRI decoding framework that preserves temporal information for improved image reconstruction.
It leverages subject-specific linear projections and per-timestep parameterization to convert continuous BOLD signals into effective conditioning for latent diffusion models.
Quantitative experiments demonstrate superior performance on metrics like SSIM, CLIP, and AlexNet, underscoring its potential for dynamic brain decoding and cross-subject generalization.


Context and Motivation

Decoding images from fMRI has advanced rapidly due to increases in data scale and the adoption of diffusion models for image synthesis. However, prominent methods typically employ multi-stage pipelines with time-collapsed preprocessing, which discards the dynamic information intrinsic to BOLD fMRI. Moreover, existing approaches often rely on extensive ad-hoc feature engineering and multi-stage fine-tuning, and they require averaging across multiple repetitions of each stimulus, which restricts both generalizability and temporal resolution.

Single-stage Diffusion-based Decoding Pipeline

The Dynadiff framework proposes an end-to-end, single-stage training pipeline that directly decodes images from continuous, time-resolved BOLD fMRI data. Unlike prevailing methods, Dynadiff handles the full time series of fMRI, thereby preserving temporal information. The pipeline consists of a brain module and a diffusion-based image generation module:

The brain module projects fMRI time series from a subject-specific voxel space to the conditional embedding space of a latent diffusion image generation model via a sequence of linear projections, normalization, and temporal aggregation. Key to this is (1) a subject-specific linear projection, (2) per-timestep parameterization (non-shared weights across time), and (3) late temporal aggregation.
The image generation module is adapted from pretrained latent diffusion models, interfaced via cross-attention layers using brain-derived embeddings as conditioning signals (with null text prompts).
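
The three ingredients of the brain module can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sizes, the random weights, the layer normalization, the mean used for aggregation, and the single output vector per trial are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_voxels, n_timesteps, d_hidden, d_cond = 500, 6, 32, 16

# (1) Subject-specific projection from voxel space to a hidden space.
W_subject = rng.normal(scale=n_voxels ** -0.5, size=(n_voxels, d_hidden))
# (2) Per-timestep parameterization: one weight matrix per time step, not shared.
W_time = rng.normal(scale=d_hidden ** -0.5, size=(n_timesteps, d_hidden, d_cond))


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and (approximately) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def brain_module(bold):
    """Map a (batch, time, voxels) BOLD window to (batch, d_cond) embeddings."""
    hidden = layer_norm(bold @ W_subject)                 # (batch, time, d_hidden)
    per_step = np.einsum("bth,thc->btc", hidden, W_time)  # (batch, time, d_cond)
    # (3) Late temporal aggregation: time is collapsed only at the very end.
    return per_step.mean(axis=1)


bold = rng.normal(size=(4, n_timesteps, n_voxels))
cond = brain_module(bold)
print(cond.shape)
```

Because each time step has its own weights, reversing the order of the volumes in a window changes the output, which would not happen if the time axis were averaged before projection.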

This architecture enables joint training of the brain module and the image generator: the diffusion model's weights remain frozen, and only the LoRA adapters inserted into its cross-attention layers are updated alongside the brain module.
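
For readers unfamiliar with LoRA, the sketch below shows the generic idea on a single linear projection: the pretrained weight stays fixed and a low-rank product is added to it. The sizes, rank, and scaling factor are hypothetical, and nothing here reflects the specific adapter configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 48, 4   # hypothetical sizes

W_frozen = rng.normal(size=(d_in, d_out))       # pretrained weight, never updated
A = rng.normal(scale=0.01, size=(d_in, rank))   # trainable down-projection
B = np.zeros((rank, d_out))                     # trainable up-projection, zero at init
alpha = 8.0                                     # hypothetical scaling factor


def lora_linear(x):
    """Frozen projection plus a scaled low-rank update."""
    return x @ W_frozen + (alpha / rank) * (x @ A @ B)


x = rng.normal(size=(2, d_in))
# With B = 0 the adapter is a no-op, so training starts from the pretrained behavior.
starts_unchanged = np.allclose(lora_linear(x), x @ W_frozen)

trainable = A.size + B.size
print(starts_unchanged, trainable, W_frozen.size)
```

In this toy case the adapter adds 448 trainable values next to 3072 frozen ones, which is the sense in which the approach is parameter-efficient.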

Figure 1: Schematic bird's-eye view of four foundational fMRI-to-image pipelines. Dynadiff enables single-stage end-to-end training and time-resolved conditioning, in contrast to prior multi-stage approaches.

Experimental Protocol

Experiments use the Natural Scenes Dataset (NSD): ultra-high-field (7T) fMRI from subjects who viewed 10,000 unique images, each shown three times. No repetition averaging is performed; models are evaluated on single-trial time series, using BOLD signals restricted to a posterior cortex ROI. Each trial's fMRI sequence is preprocessed with slice-timing correction, motion and distortion correction, spatial resampling, detrending, and z-scoring. High-pass filtering and other potentially information-distorting steps are intentionally avoided.
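
The last two preprocessing steps, detrending and z-scoring, can be sketched as follows. This is an illustration under assumptions: the trend is taken to be linear and is fitted per voxel by least squares, and the function name `detrend_and_zscore` is hypothetical. The summary does not say which detrending model or which normalization scope (run, session) the authors use.

```python
import numpy as np


def detrend_and_zscore(ts):
    """Remove a per-voxel linear trend, then z-score each voxel over time.

    ts: array of shape (time, voxels).
    """
    n_time = ts.shape[0]
    # Design matrix with an intercept and a linear drift term.
    design = np.column_stack([np.ones(n_time), np.arange(n_time)])
    coef, *_ = np.linalg.lstsq(design, ts, rcond=None)
    residual = ts - design @ coef
    std = residual.std(axis=0)
    std[std == 0] = 1.0   # leave constant voxels at zero instead of dividing by zero
    return residual / std


rng = np.random.default_rng(0)
drift = np.linspace(0.0, 5.0, 200)[:, None]   # synthetic scanner drift
ts = rng.normal(size=(200, 10)) + drift
clean = detrend_and_zscore(ts)
print(clean.shape)
```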

Quantitative and Qualitative Results

Dynadiff consistently outperforms state-of-the-art baselines, both classical (Brain-Diffuser) and modern multi-stage methods (MindEye1/2, WAVE), on low-level (SSIM, AlexNet) and high-level (CLIP, DreamSim, mIoU) metrics. For example, Dynadiff achieves up to 98.20 on AlexNet(5), 93.53 on CLIP-12, and 8.50 on mIoU, all higher than the corresponding MindEye2 scores.

Qualitative reconstructions also show better compositional alignment and object localization than the alternatives, consistent with the quantitative gains.
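
Feature-based scores such as AlexNet(5) and CLIP-12 are often computed in this literature as a two-way identification accuracy: a reconstruction counts as correct when its features correlate more with those of its own ground-truth image than with those of another image. The summary does not state the exact protocol used here, so the sketch below is an assumption about the metric, and the function name and toy features are hypothetical.

```python
import numpy as np


def two_way_identification(recon_feats, true_feats):
    """Fraction of pairwise comparisons won by the matching image.

    recon_feats, true_feats: (n_images, n_features) arrays from the same network layer.
    """
    n = len(recon_feats)
    # Correlation between every reconstruction (rows) and every ground truth (columns).
    corr = np.corrcoef(recon_feats, true_feats)[:n, n:]
    matched = np.diag(corr)[:, None]
    wins = (matched > corr).sum()   # the diagonal never beats itself
    return wins / (n * (n - 1))


rng = np.random.default_rng(0)
true = rng.normal(size=(20, 50))                          # toy ground-truth features
good = true + 0.1 * rng.normal(size=true.shape)           # near-perfect reconstructions
mismatched = np.roll(true, 1, axis=0)                     # each row paired with the wrong image
score_good = two_way_identification(good, true)
score_mismatched = two_way_identification(mismatched, true)
print(score_good, score_mismatched)
```

Chance level for this kind of score is 50%, which is why values in the nineties are read as strong results.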

Temporal Generalization and Time-resolved Decoding

A central finding is that Dynadiff supports time-resolved decoding. A single decoder generalizes to windows shifted in time relative to stimulus onset, and its performance peaks for windows beginning about 3 s after stimulus onset, in line with the canonical latency of the hemodynamic response function (HRF). Training a specialized decoder for each window, however, yields better results at every time point. This indicates that the neural codes representing an image evolve rapidly, with distinct patterns that support accurate reconstruction at multiple temporal offsets.

Figure 3: Image reconstruction metrics (e.g., SSIM, CLIP, AlexNet) as a function of fMRI time window, contrasting generalist and specialist decoders. Gray area denotes the image presentation interval.
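
Shifting a decoding window relative to stimulus onset amounts to slicing the run at a different starting volume. The helper below is a hypothetical sketch of that bookkeeping; the repetition time `tr_s` is an assumed value, since the summary does not state the sampling rate.

```python
import numpy as np


def extract_window(bold, onset_index, offset_s, duration_s, tr_s):
    """Slice a (time, voxels) run into one decoding window.

    offset_s is the window start relative to stimulus onset, in seconds.
    """
    start = onset_index + int(round(offset_s / tr_s))
    length = int(round(duration_s / tr_s))
    if start < 0 or start + length > bold.shape[0]:
        raise ValueError("window falls outside the recording")
    return bold[start:start + length]


tr_s = 1.3  # hypothetical repetition time in seconds; not stated in this summary
run = np.arange(40 * 3, dtype=float).reshape(40, 3)  # toy run: 40 volumes, 3 voxels
window = extract_window(run, onset_index=10, offset_s=3.0, duration_s=7.8, tr_s=tr_s)
print(window.shape)
```

A generalist decoder would be trained on one offset and evaluated on windows extracted at other offsets, whereas a specialist decoder would be trained and evaluated on the same offset.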

Impact of Observation Window and Model Design

An ablation that varies the duration of the decoding window shows that decoding fidelity saturates for windows between 3.9 s and 7.8 s and does not improve with longer ones, which points to a temporally precise mapping between the fMRI signal and perceptual encoding.

Figure 5: Evolution of AlexNet, CLIP, and Inception metrics vs. window duration. Metrics plateau at moderate time windows, emphasizing importance of aligning window length with HRF dynamics.

Ablations further show that dedicated time-step layers and late temporal aggregation in the brain module are both essential for optimal performance. The architecture's simplicity, which dispenses with separate alignment, candidate selection, and staged semantic refinement, has direct advantages for interpretability, reproducibility, and training efficiency.

Cross-subject Generalization

Dynadiff incorporates a parameter-efficient strategy for cross-subject training: only the initial subject- and time-specific projections are individualized, with the larger brain module shared among subjects. Pretraining on multiple subjects and fine-tuning on small amounts of target subject data provides measurable performance benefits, partially mitigating the per-subject data requirement and aligning with aspirations towards subject-agnostic brain decoding.

Figure 2: Cross-subject decoding accuracy (e.g., CLIP, SSIM) versus number of subject-specific data samples. Pretraining provides a clear sample efficiency improvement.
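
The parameter sharing described in this section can be sketched as a dictionary of per-subject input projections feeding one shared set of weights. Subject names, voxel counts, and layer sizes are hypothetical, the time axis is omitted for brevity, and the shared part is reduced to a single matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_cond = 32, 16   # hypothetical sizes

# Voxel counts differ across subjects, so each subject gets its own input projection.
voxels_per_subject = {"subj01": 500, "subj02": 450}   # hypothetical values
subject_proj = {
    name: rng.normal(scale=n ** -0.5, size=(n, d_hidden))
    for name, n in voxels_per_subject.items()
}
W_shared = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_cond))  # shared weights


def encode(bold, subject):
    """(batch, voxels) -> (batch, d_cond) via subject-specific then shared weights."""
    return bold @ subject_proj[subject] @ W_shared


# A new subject only introduces a new input projection; the shared weights are reused.
subject_proj["subj03"] = rng.normal(scale=480 ** -0.5, size=(480, d_hidden))
out = encode(rng.normal(size=(4, 480)), "subj03")
print(out.shape)
```

Under this layout, fine-tuning on a small amount of data from the new subject mainly has to fit the new projection, which is one plausible reading of why pretraining improves sample efficiency.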

Theoretical and Practical Implications

The paper empirically demonstrates that the temporal dimension in fMRI signals carries non-redundant, dynamically evolving representations supporting perceptual decoding. The data-driven discovery that temporal generalization of decoders is limited and that specialized models are required to accurately capture evolving perceptual codes aligns with theories of dynamic coding in neural circuits, now extended to the fMRI regime.

Dynadiff's reduction of pipeline complexity demonstrates the feasibility of single-stage training for neural decoding and sets a strong baseline for temporally resolved image (and potentially video) reconstruction. It suggests that subsequent work should use the continuous, high-dimensional BOLD time series rather than rely on GLM-based, time-collapsed representations.

Practically, the results refine the known limits of decoding fidelity from non-averaged, single-trial fMRI. They inform the design of future brain–computer interfaces, especially perceptual prostheses, and bear on the neuroethical questions such interfaces raise.

Limitations and Future Developments

The approach is limited by its reliance on large, highly repeated, and potentially biased datasets such as NSD, and by the subject-specific components that training still requires. The pipeline also presumes access to standardized preprocessing, a requirement that could be relaxed if preprocessing were streamlined via foundation models. Importantly, generalization to subjects unseen during training and to less stereotypical visual distributions remains an open problem.

Future work should target:

Generalization to unseen subjects via subject-invariant encoding,
Extension of the approach to temporally continuous visual input (video) and other task domains (e.g., speech, language),
Exploration of transfer learning across imaging modalities and tasks,
Further reduction of data and subject specificity requirements, possibly leveraging contrastive or self-supervised learning in the neuroimaging space.

Conclusion

Dynadiff advances the state of fMRI-to-image decoding by introducing a single-stage, end-to-end pipeline capable of decoding from dynamic, non-collapsed BOLD signals. It achieves superior empirical performance, facilitates fine-grained temporal characterization of perceptual neural codes, and reduces reliance on elaborate multi-stage pipelines, with implications for both the methodology and the theory of neural decoding from fMRI data (2505.14556).