Disentangling the Factors of Convergence between Brains and Computer Vision Models
Abstract: Many AI models trained on natural images develop representations that resemble those of the human brain. However, the factors that drive this brain-model similarity remain poorly understood. To disentangle how model architecture, training amount, and data type each lead a neural network to develop brain-like representations, we trained a family of self-supervised vision transformers (DINOv3) that systematically varied these factors. We compare their representations of images to those of the human brain recorded with both fMRI and MEG, which provide high spatial and temporal resolution, respectively. We assess brain-model similarity with three complementary metrics focusing on overall representational similarity, topographical organization, and temporal dynamics. We show that all three factors - model size, training amount, and image type - independently and interactively impact each of these brain-similarity metrics. In particular, the largest DINOv3 models trained on the most human-centric images reach the highest brain similarity. This emergence of brain-like representations follows a specific chronology during training: models first align with the early representations of the sensory cortices, and only align with the late and prefrontal representations of the brain after considerably more training. Finally, this developmental trajectory is indexed by both structural and functional properties of the human cortex: the representations acquired last by the models specifically align with the cortical areas showing the largest developmental expansion, greatest thickness, least myelination, and slowest timescales. Overall, these findings disentangle the interplay between architecture and experience in shaping how artificial neural networks come to see the world as humans do, offering a promising framework for understanding how the human brain comes to represent its visual world.
Knowledge Gaps
A single, actionable list of knowledge gaps, limitations, and open questions: what remains missing, uncertain, or unexplored in the paper.
- Exclusivity of DINOv3: The study investigates only a single self-supervised ViT family (DINOv3). It remains unknown whether the observed spatial, temporal, and encoding alignments generalize to other architectures (e.g., CNNs, CoAtNet, MLP-Mixers), training objectives (supervised, contrastive, masked modeling such as MAE, generative), or multimodal models (e.g., CLIP, SigLIP).
- Confounds in model scaling: "Size" varies both depth and width across variants; the separate contributions of depth, width, token capacity, and patch size to brain alignment are not disentangled. Controlled ablations are needed.
- Training recipe confounds: Training schedules differ (e.g., batch size 4096 vs 2048; total steps and data scale vary by model/data type). The causal attribution to “data type” vs “training recipe” remains ambiguous without fully matched hyperparameters and compute budgets.
- Data type attribution: Human-centric vs satellite vs cellular datasets are matched in quantity (10M), but not in low-level statistics (spectral content, color channels), resolution, camera geometry, or augmentations. Which visual statistics drive higher brain alignment is not isolated.
- Augmentation effects: DINO-style strong augmentations (crops, color jitter, blur) may shape invariances relevant for cortical alignment, but their role is not explicitly varied or measured.
- Layer selection and feature choice: The paper references "22 layers" even though the model variants have 12–40 layers, and it is unclear whether features are taken from the CLS token, the patch tokens, or pooled embeddings. This ambiguity can bias the layer-to-ROI mapping and the temporal analyses.
- Argmax-layer mapping biases: Using the "best" (argmax) layer per ROI/timepoint may obscure multi-layer contributions. Comparing single layers against learned linear combinations of layers (layer mixing), and testing whether the hierarchy effects persist under mixing, would clarify the robustness of the hierarchy claims (see the layer-mixing sketch after this list).
- Linear-only encoding: Similarity is assessed only with linear ridge models. Potential nonlinear relationships, recurrent dynamics, and cross-layer interactions remain unexplored; RSA/CKA or nonlinear encoding could reveal additional structure (see the encoding and CKA sketch after this list).
- Noise ceilings and reliability: No explicit noise ceiling or split-half reliability is reported for the fMRI/MEG data, making it hard to judge how close model performance is to the measurable upper bound and to compare ROIs fairly (see the noise-ceiling sketch after this list).
- Spatial autocorrelation controls: Correlations between half-time maps and cortical properties (thickness, myelin, expansion, timescales) are not corrected for spatial autocorrelation (e.g., with spin tests); results may be inflated by spatial smoothness (see the spin-test sketch after this list).
- Coarse spatial hierarchy proxy: The “distance from V1” is a very coarse proxy for cortical hierarchy. Testing against established anatomical/functional hierarchies (ventral vs dorsal stream parcellations, connectivity-based gradients, cytoarchitectonic maps) is needed.
- Limited cortical coverage: Analyses focus on 15 ROIs and largely on visual and prefrontal areas; dorsal stream, parietal, and multimodal association regions are undercharacterized. Whole-cortex, fine-grained parcellations could refine conclusions.
- MEG preprocessing ambiguities: The MEG description includes a likely typo ("time-lock to words") and an unusual temporal-score definition (averaging the windows where the normalized brain score is ≥95%). A more standard treatment (e.g., latency of peak, onset latency, sustained responses; broader frequency content) is warranted (see the latency sketch after this list).
- Temporal dynamics vs recurrence: ViTs are feedforward, yet late MEG responses and PFC alignment may depend on recurrent/feedback processes, attention, or task engagement in the brain. Testing recurrent or attention-modulated models could probe these mechanisms.
- Initial negative spatial/temporal scores: Deep layers of random networks initially best predict early/low-level responses. The mechanistic cause of this inversion remains unexplained: does it arise from positional embeddings, patch-token statistics, or the initialization?
- Training chronology attribution: "Half-times" are reported as fractions of training steps but are not normalized for the number of unique images seen, optimizer state, or augmentation intensity across models. The extent to which the chronology reflects data exposure vs optimization dynamics remains unclear (see the half-time sketch after this list).
- Static image limitation: Stimuli are static. Alignment to motion, temporal integration, and dorsal-stream functions cannot be assessed without videos or dynamic stimuli.
- Adult-only brain data: The developmental interpretation is speculative without infant/child or longitudinal data to test whether model training trajectories mirror cortical maturation.
- Task and cognitive state mismatch: NSD involves a recognition task and THINGS-MEG involves fixation, while models are trained with self-supervised objectives. How attention, task demands, and behavioral goals modulate alignment is not tested.
- Generalization and OOD robustness: Alignment is measured on natural images similar to model training. Whether alignment persists for out-of-distribution stimuli (e.g., textures, adversarial patterns, illusions, synthetic shapes) is unknown.
- Semantic content drivers: The gain from human-centric data could be driven by faces, bodies, text, scene semantics, or social content. Systematic content-controlled experiments are needed to isolate which semantic factors boost alignment, especially in PFC.
- Cross-model reproducibility: Results rely on single training runs per variant. Variance across seeds, checkpoints, and random initializations is not reported, limiting claims about robustness of trends.
- Alignment vs task performance: No analysis links computer vision performance (e.g., segmentation, detection) to brain alignment. Are improvements in certain tasks predictive of alignment in specific ROIs?
- Multimodal objectives: Prior work suggests that high-level cortical alignment benefits from language model representations; testing vision-language pretraining (e.g., CLIP/SigLIP) could clarify whether multimodal objectives selectively improve alignment in associative cortex.
- Token-space topology and retinotopy: How patch-level attention and positional embeddings align to retinotopic maps (e.g., eccentricity, polar angle) is not examined.
- Hemodynamic modeling: fMRI analyses use a fixed 5.5s post-onset sample. Region-specific HRF variability and deconvolution are not modeled, which may bias spatial conclusions.
- Parcellation and atlas dependence: ROI choices, surface mapping, and thresholding (FDR p<0.01) can affect which regions appear aligned. Testing across multiple parcellations and thresholds would assess robustness.
- Cellular/satellite preprocessing: Differences in channels (e.g., fluorescence) and preprocessing may degrade alignment in non-human-centric datasets. Matching photometric statistics and camera geometry could test whether low-level harmonization closes the gap.
- Depth-dependent timescales: The strong temporal-score correlation (R≈0.96) could partly reflect the monotonic ordering of layers rather than a genuine neurophysiological mapping. Control analyses (e.g., shuffled layer indices, equalized feature dimensionality) are needed to rule out trivial order effects (see the layer-shuffle sketch after this list).
- Causal interventions: No causal tests (e.g., feature ablations, synthetic stimuli controlling shape/texture, phase scrambling) to identify which representational dimensions drive alignment across ROIs and time.
- Feedback/attention modeling: The paper does not test models with explicit top-down mechanisms (task signals, attention modules) that may be critical for aligning higher-order cortical regions (e.g., IFG/IFS).
- Species and modality generality: Alignment is only evaluated in human vision. Whether the identified principles extend to nonhuman primates or to other sensory modalities (audition, somatosensation) remains open.
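Method sketches (illustrative). The snippets below sketch the control analyses suggested above. All function names, variable names, and array shapes are assumptions made for illustration; they do not describe the paper's actual pipeline.

First, a minimal sketch of the argmax-layer vs layer-mixing comparison, assuming `layer_feats` is a list of per-layer activation matrices (n_images × d_layer) and `y` is the response of one voxel or ROI:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def argmax_vs_mix(layer_feats, y, alpha=100.0, cv=5):
    """Compare the best single layer against a naive learned layer mixture."""
    # best single layer under cross-validation (the "argmax" strategy)
    r_single = []
    for X in layer_feats:
        pred = cross_val_predict(Ridge(alpha=alpha), X, y, cv=cv)
        r_single.append(np.corrcoef(pred, y)[0, 1])
    # naive layer mixing: one ridge fit over the concatenated features
    X_all = np.hstack(layer_feats)
    pred = cross_val_predict(Ridge(alpha=alpha), X_all, y, cv=cv)
    return max(r_single), np.corrcoef(pred, y)[0, 1]
```

A banded ridge with per-layer regularization would be the more principled version of this comparison; the sketch only shows the simplest contrast.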
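Next, a sketch of the linear ridge encoding baseline together with a linear CKA alternative, assuming `X` holds model activations (n_images × n_features) and `Y` holds brain responses (n_images × n_voxels):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

def ridge_encoding_scores(X, Y, alphas=np.logspace(-1, 5, 7), n_splits=5):
    """Per-voxel correlation between held-out predictions and measurements."""
    scores = np.zeros((n_splits, Y.shape[1]))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for i, (tr, te) in enumerate(kf.split(X)):
        scaler = StandardScaler().fit(X[tr])
        model = RidgeCV(alphas=alphas).fit(scaler.transform(X[tr]), Y[tr])
        pred = model.predict(scaler.transform(X[te]))
        # z-score per voxel, then average products = Pearson correlation
        pz = (pred - pred.mean(0)) / pred.std(0)
        yz = (Y[te] - Y[te].mean(0)) / Y[te].std(0)
        scores[i] = (pz * yz).mean(0)
    return scores.mean(0)

def linear_cka(X, Y):
    """Linear CKA: a parameter-free similarity that needs no fitted mapping."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```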
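A sketch of a split-half noise ceiling with Spearman-Brown correction, assuming `trials` holds repeated measurements with shape (n_repeats, n_images, n_voxels):

```python
import numpy as np

def split_half_ceiling(trials, n_perm=100, seed=0):
    """Average per-voxel split-half correlation, corrected to the full data."""
    rng = np.random.default_rng(seed)
    n_rep = trials.shape[0]
    rs = []
    for _ in range(n_perm):
        order = rng.permutation(n_rep)
        a = trials[order[: n_rep // 2]].mean(0)  # half 1, averaged over repeats
        b = trials[order[n_rep // 2 :]].mean(0)  # half 2
        az = (a - a.mean(0)) / a.std(0)
        bz = (b - b.mean(0)) / b.std(0)
        r = (az * bz).mean(0)                    # per-voxel split-half correlation
        rs.append(2 * r / (1 + r))               # Spearman-Brown correction
    return np.mean(rs, axis=0)
```

Encoding scores can then be reported as a fraction of this ceiling, which makes comparisons across ROIs with different reliabilities fairer.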
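A sketch of a spin test for the map-to-map correlations, assuming `coords` holds unit-sphere vertex coordinates and `map_a`, `map_b` are two cortical maps sampled at those vertices:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import special_ortho_group

def spin_test(map_a, map_b, coords, n_spins=1000, seed=0):
    """Null distribution that preserves spatial autocorrelation by rotating
    one map over the sphere and reassigning values via nearest vertices."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(map_a, map_b)[0, 1]
    tree = cKDTree(coords)
    null = np.empty(n_spins)
    for i in range(n_spins):
        rot = special_ortho_group.rvs(3, random_state=rng)  # random 3D rotation
        _, idx = tree.query(coords @ rot.T)                 # nearest original vertex
        null[i] = np.corrcoef(map_a[idx], map_b)[0, 1]
    p = (np.abs(null) >= np.abs(observed)).mean()
    return observed, p
```

Dedicated implementations (e.g., in neuromaps or BrainSMASH) additionally handle the medial wall and hemisphere symmetry, which this sketch omits.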
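A sketch of the more standard temporal summaries suggested for the MEG analyses, assuming `times` (in seconds) and `scores` are the time axis and the time-resolved brain scores:

```python
import numpy as np

def peak_and_onset_latency(times, scores, baseline=(-0.2, 0.0), k=3.0):
    """Peak latency, plus onset defined as the first crossing of
    baseline mean + k standard deviations."""
    base = scores[(times >= baseline[0]) & (times < baseline[1])]
    thresh = base.mean() + k * base.std()
    above = np.nonzero(scores > thresh)[0]
    onset = times[above[0]] if above.size else np.nan
    return times[np.argmax(scores)], onset
```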
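A sketch of a half-time estimate, assuming `steps` and `scores` are per-checkpoint arrays for one ROI and that the score increases roughly monotonically over training:

```python
import numpy as np

def half_time(steps, scores):
    """First training step at which the score reaches half its final value,
    linearly interpolated between checkpoints."""
    target = scores[-1] / 2.0
    k = np.nonzero(scores >= target)[0][0]
    if k == 0:
        return steps[0]
    frac = (target - scores[k - 1]) / (scores[k] - scores[k - 1])
    return steps[k - 1] + frac * (steps[k] - steps[k - 1])
```

Re-expressing `steps` in units of unique images seen, epochs, or compute would directly test whether the chronology tracks data exposure rather than optimization dynamics.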
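Finally, a sketch of the layer-shuffle control for the depth-timescale correlation, assuming `best_layers` holds the best-layer index per MEG time window and `peak_times` the corresponding latencies:

```python
import numpy as np

def layer_order_null(best_layers, peak_times, n_perm=10000, seed=0):
    """Permutation p-value for the observed layer-depth vs latency correlation."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(best_layers, peak_times)[0, 1]
    null = np.array([
        np.corrcoef(rng.permutation(best_layers), peak_times)[0, 1]
        for _ in range(n_perm)
    ])
    return observed, (np.abs(null) >= np.abs(observed)).mean()
```

Note that this generic permutation only tests whether the correlation exceeds chance; a null that preserves monotonic layer ordering (e.g., random monotonic assignments) would target the trivial-order concern more directly.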