HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Published 3 Oct 2025 in cs.CV and cs.AI | (2510.03122v1)

Abstract: The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

Abstract PDF Upgrade to Chat

Summary

The paper introduces the HAVIR model, a hierarchical framework that decouples structural and semantic processing to achieve superior fMRI-to-image reconstruction.
It employs a Structural Generator and a Semantic Extractor whose outputs are fused via a pre-trained Versatile Diffusion model guided by CLIP, achieving top metrics like PixCorr and Inception Score.
Ablation studies and cross-subject adaptability demonstrate HAVIR’s capability to maintain detailed spatial layouts and semantic accuracy, setting a new standard in brain decoding.

Hierarchical fMRI-to-Image Reconstruction via CLIP-Guided Versatile Diffusion: The HAVIR Model

Introduction

The HAVIR model introduces a hierarchical approach to reconstructing visual stimuli from fMRI data, leveraging the biological organization of the human visual cortex. By explicitly separating structural and semantic processing pathways, HAVIR addresses the limitations of prior methods in decoding complex scenes, where low-level features are heterogeneous and high-level features are semantically entangled. The framework integrates a Structural Generator and a Semantic Extractor, whose outputs are fused via a pre-trained Versatile Diffusion (VD) model guided by CLIP embeddings. This design enables HAVIR to achieve superior reconstruction fidelity, both in terms of spatial structure and semantic content, as demonstrated on the NSD dataset.

Figure 1: Overall framework of HAVIR. Different data are used for training and testing.

Model Architecture

Structural Generator

The Structural Generator maps spatial-processing voxels from early visual cortex regions to latent diffusion priors. The architecture consists of linear projections, LayerNorm, SiLU activation, Dropout, and residual MLP blocks, followed by upsampling and GroupNorm. The output is a latent vector $\mathrm{z}'_{\text{img}}$ aligned with the AutoKL-encoded latent representation of the ground truth image. Training objectives combine MSE loss for global alignment and Sobel loss for edge-aware consistency, ensuring preservation of spatial topology and high-frequency details.

Semantic Extractor

The Semantic Extractor processes semantic-processing voxels from anterior visual cortex regions, projecting them into CLIP image and text embedding spaces. The network employs linear mapping, normalization, activation, dropout, and residual MLP blocks, with separate heads for image and text modalities. Training utilizes a combination of SoftCLIP contrastive loss and MSE loss to align predicted embeddings with those from pretrained CLIP encoders, mitigating artifacts from batch-wise soft labels and ensuring robust semantic mapping.

Versatile Diffusion Integration

The VD model synthesizes the final image by integrating the structural prior and semantic constraints. Initialization starts from $\mathrm{z}'_{\text{img}}$ with controlled noise addition, parameterized by a structural strength coefficient. Semantic conditioning is imposed via cross-attention mechanisms, guiding the UNet at each denoising step with both image and text CLIP embeddings. The iterative denoising process progressively refines the latent representation, balancing structural fidelity and semantic accuracy.

Experimental Results

Reconstruction Quality

HAVIR demonstrates robust reconstruction of complex scenes, accurately capturing both spatial layouts and semantic attributes across subjects. Qualitative results show preservation of ambient lighting, object positions, and key semantic features such as color and context.

Figure 2: Examples of the reconstruction results from HAVIR.

Comparison with State-of-the-Art

HAVIR outperforms five recent SOTA methods (BrainGuard, NeuroPictor, Psychometry, MindBridge, MindEye) in both qualitative and quantitative evaluations. In challenging cases, HAVIR uniquely recovers fine-grained semantic details and spatial arrangements that other models fail to reconstruct.

Figure 3: Qualitative comparisons on the NSD test dataset. The results of HAVIR demonstrate superior reconstruction accuracy compared to the five recent SOTA methods.

Quantitative metrics show HAVIR achieves the highest PixCorr (0.327) and Inception Score (96.4%), and near-optimal AlexNet(5) feature similarity (98.1%). The model ranks third in SSIM (0.332), indicating strong structural preservation. High-level semantic metrics (CLIP, EffNet-B, SwAV) further confirm HAVIR's superior semantic fidelity.

Ablation Studies

Systematic ablation reveals the necessity of multimodal joint guidance. Removing CLIP guidance collapses semantic accuracy, while omitting structural priors degrades spatial fidelity. The full HAVIR model achieves balanced optimization across all metrics, validating the complementary roles of structural and semantic pathways.

Figure 4: Qualitative results of the full model and its ablated configurations

Interpretability and Cross-Subject Adaptability

Voxel-wise contribution mapping demonstrates that HAVIR dynamically adapts to individual neuroanatomical and functional variability. The model assigns distinct weights to different brain regions for each subject, enabling personalized decoding pathways and stable reconstruction performance.

Figure 5: Spatial mapping of brain region contributions to Structural Generator (A) and Semantic Extractor (B) on Subj01.

Figure 6: Voxel-wise contribution intensity maps for Subj02, Subj05 and Subj07, demonstrating cross-subject adaptability.

Implementation Details

HAVIR is trained from scratch for 200 epochs on Subj01 (40 hours fMRI, NVIDIA A800 80GB, batch size 30), then fine-tuned via transfer learning on Subj02, 05, and 07 (1 hour each, 3×RTX 4090 24GB, batch size 50). Individualized ROI masks are used for voxel selection, enhancing anatomical specificity and functional mapping precision.

Implications and Future Directions

HAVIR's hierarchical design, multimodal fusion, and individualized brain mapping set a new standard for fMRI-based visual reconstruction. The model's ability to disentangle and jointly optimize structural and semantic features opens avenues for more interpretable and generalizable brain decoding systems. Future work may extend HAVIR to other sensory modalities, integrate temporal dynamics, or explore zero-shot generalization across unseen subjects and stimuli. The approach also has potential applications in brain-computer interfaces, cognitive neuroscience, and clinical diagnostics.

Conclusion

HAVIR introduces a biologically inspired, hierarchical framework for fMRI-to-image reconstruction, achieving state-of-the-art performance in both structural and semantic fidelity. By decomposing brain signals into distinct processing streams and leveraging multimodal diffusion guidance, HAVIR overcomes the limitations of prior models in complex scene reconstruction. The model's adaptability to individual brain architectures and its robust quantitative and qualitative results underscore its significance for computational neuroscience and AI-driven brain decoding.