Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

Published 2 Dec 2025 in cs.LG | (2512.03276v1)

Abstract: Training vision-language models (VLMs) aims to align visual representations from a vision encoder with the textual representations of a pretrained LLM. However, many VLMs exhibit reduced factual recall performance compared to their LLM backbones, raising the question of how effective multimodal fine-tuning is at extending existing mechanisms within the LLM to visual inputs. We argue that factual recall based on visual inputs requires VLMs to solve a two-hop problem: (1) forming entity representations from visual inputs, and (2) recalling associated factual knowledge based on these entity representations. By benchmarking 14 VLMs with various architectures (LLaVA, Native, Cross-Attention), sizes (7B-124B parameters), and training setups on factual recall tasks against their original LLM backbone models, we find that 11 of 14 models exhibit factual recall degradation. We select three models with high and two models with low performance degradation, and use attribution patching, activation patching, and probing to show that degraded VLMs struggle to use the existing factual recall circuit of their LLM backbone, because they resolve the first hop too late in the computation. In contrast, high-performing VLMs resolve entity representations early enough to reuse the existing factual recall mechanism. Finally, we demonstrate two methods to recover performance: patching entity representations from the LLM backbone into the VLM, and prompting with chain-of-thought reasoning. Our results highlight that the speed of early entity resolution critically determines how effective VLMs are in using preexisting LLM mechanisms. More broadly, our work illustrates how mechanistic analysis can explain and unveil systematic failures in multimodal alignment.

Summary

  • The paper demonstrates that factual recall in VLMs is a two-hop process in which late entity resolution bypasses the early factual recall circuits inherited from the LLM backbone.
  • It employs attribution and activation patching to reveal that adapter-based VLMs can experience up to 43.9% degradation, while native multimodal models largely preserve recall.
  • The study implies that enhanced multimodal pretraining or architectural redesign is crucial for aligning visual and textual representations to improve factual retrieval.

Mechanistic Analysis of Factual Recall Failures in Vision-Language Models

Introduction

"Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval" (2512.03276) provides a comprehensive mechanistic investigation into why state-of-the-art Vision-Language Models (VLMs) exhibit degraded factual recall compared to their LLM backbones. Through a combination of large-scale empirical benchmarking and interpretability experiments, including attribution patching, activation patching, and linear probing, the authors characterize VLM factual recall as a "two-hop" process and diagnose a critical architectural misalignment: VLMs must first recognize entities from visual data, and often only achieve robust entity representations in deeper layers, so the early factual recall circuits established by LLM pretraining are bypassed. The study systematically deconstructs this failure mode and verifies its hypotheses across multiple VLM architectures and scales.

The Two-Hop Problem: Early Entity Resolution as a Bottleneck

Entity-based factual recall in VLMs is fundamentally a two-hop computation: (1) the VLM must resolve which entity is present in the visual input, then (2) retrieve and deploy the LLM's factual knowledge about that entity. In VLM architectures with Adapter-based or Cross-Attention integration, the first hop—entity identification—often occurs too late in the computational graph. As a result, the early-layer multi-layer perceptrons (MLPs) in the LLM backbone, which are causally important for factual recall, are not engaged with a token representation of the entity, impairing subsequent retrieval of associated knowledge. In contrast, textual LLMs are provided subject tokens directly, allowing them to leverage early circuits for knowledge retrieval.

Figure 1: Illustration of the two-hop problem comparing VLM and LLM factual recall pathways.

The figure exemplifies this contrast: a VLM must first infer "Perito Moreno Glacier" from an image, with the entity representation materializing in late layers, bypassing early MLPs; whereas the LLM receives the entity textually, activating early factual recall mechanisms and yielding the correct answer.

Empirical Benchmarking of Factual Recall

The authors curate a rigorous factual recall benchmark—15,000 entity-centric multimodal questions with images sourced from WIT, question templates crafted for a controlled comparison, and careful filtering to exclude NER errors. Fourteen VLMs, spanning Adapter, Native, and Cross-Attention architectures and sizes from 7B to 124B parameters, are evaluated directly against their LLM backbones.

Key results:

  • 11/14 VLMs show measurable factual recall degradation compared to their LLMs.
  • Adapter-based models (e.g., LLaVA-1.5-7B: -36.7%, LLaVA-MORE-8B: -43.9%) perform dramatically worse on factual recall.
  • Native VLMs—pretrained with multimodal objectives—preserve factual recall (Gemma-3-12B-it: 0.0% degradation, Gemini-2.0-Flash: -4.5%).
  • Exception: Qwen2.5-VL-72B-it, extensively multimodal-trained, outperforms its backbone (+8.2%), supporting the hypothesis that sufficient multimodal pretraining can enable early alignment.

These findings invalidate a simple parameter-scaling mitigation: even very large Adapter-based models degrade, while modality-native or massively multimodal-finetuned models can retain or improve factual recall.

Mechanistic Attribution of Factual Knowledge Circuits

Attribution Patching: Quantifying Causal Relevance

Attribution patching unveils the subnetworks in LLMs and VLMs critical for factual recall by systematically corrupting input embeddings and analyzing the KL divergence of the output distributions with and without patching. Gradient × Delta is computed per (layer, branch), establishing where and how information about the entity propagates.

Figure 2: Causal attribution heatmap for MLP/Attention sublayers in LLM backbones; early-layer MLPs at entity-token positions dominate.

Figure 3: Attribution heatmap in VLMs; Adapter-based models shift causal reliance away from early MLPs, in contrast to LLMs.

Results demonstrate that in LLMs, early MLPs over entity-token positions are maximally causal for factual recall. By contrast, degraded Adapter-based VLMs lack attribution in these sublayers, instead activating only deeper sites, whereas Native VLMs (Gemma-3-12B, Qwen2.5-VL-7B) maintain the dual-site pattern, matching their LLM backbones.
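The Gradient × Delta score described above can be sketched as follows. This is a minimal illustration on toy arrays, not the authors' implementation; the shapes, random stand-ins, and variable names are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-(layer, branch) activations from a clean and a corrupted run.
n_layers, d_model = 4, 8
clean_acts = rng.normal(size=(n_layers, d_model))    # e.g. MLP outputs, clean input
corrupt_acts = rng.normal(size=(n_layers, d_model))  # same sites, corrupted input

# Gradient of the output metric (e.g. KL divergence to the clean output
# distribution) w.r.t. each activation, taken on the clean run. Random here for
# illustration; in practice it comes from a single backward pass.
grads = rng.normal(size=(n_layers, d_model))

# Attribution patching: first-order (Gradient x Delta) estimate of the effect of
# patching each site, summed over the hidden dimension.
scores = np.einsum("ld,ld->l", grads, corrupt_acts - clean_acts)

for layer, score in enumerate(scores):
    print(f"layer {layer}: attribution {score:+.3f}")
```

The appeal of this linearized estimate is that one forward and one backward pass score every candidate site at once, instead of one full patched forward pass per site.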

Activation Patching: Restoring Degraded Factual Recall

The authors further execute activation patching, transplanting entity-token MLP outputs from the LLM into the VLM at early backbone layers. This directly tests whether bypassed early-layer circuits are sufficient for factual retrieval if correctly stimulated.

Figure 4: Performance recovery in LLaVA-derived VLMs via MLP activation patching; up to 35% of the factual gap closed.

Heuristic patching closes ~35% of the factual recall gap—substantially more than random or back-patching baselines—functionally validating the causal bottleneck at early MLP sites.
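In sketch form, activation patching amounts to swapping cached donor activations into the recipient model's forward pass. The toy additive residual stream below is illustrative only; the layer count, shapes, and the simplification of each layer to a fixed MLP output are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, d_model = 6, 8

# Toy residual-stream "model": each layer contributes a fixed MLP output.
vlm_mlp_out = rng.normal(size=(n_layers, d_model))  # the VLM's own per-layer outputs
llm_mlp_out = rng.normal(size=(n_layers, d_model))  # cached from the LLM backbone run

def forward(mlp_out, patch_layers=(), donor=None):
    """Run the toy residual stream, optionally swapping in donor MLP outputs."""
    resid = np.zeros(d_model)
    for layer in range(n_layers):
        out = donor[layer] if layer in patch_layers else mlp_out[layer]
        resid = resid + out
    return resid

baseline = forward(vlm_mlp_out)
# Patch the backbone's early entity-token MLP outputs into the first three layers;
# in the paper, this kind of intervention recovers part of the factual recall gap.
patched = forward(vlm_mlp_out, patch_layers={0, 1, 2}, donor=llm_mlp_out)

print(np.linalg.norm(patched - baseline))
```

In a real model the swap would be implemented with forward hooks at the chosen MLP sublayers rather than a hand-rolled loop.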

Representation Probing: Layerwise Emergence of Entity Knowledge

Layerwise linear probes are trained on residual streams to decode the entity class from representations at each transformer layer for five models. This identifies where in the backbone robust visual entity representations arise.

Figure 5: Linear probe accuracy for entity prediction as a function of backbone depth; delayed emergence in LLaVA-style models, immediate in Native and highly multimodal-finetuned models.

  • LLaVA-style models only achieve high probe accuracy mid-to-late in the backbone, confirming late entity representation.
  • Native (Gemma-3-12B) and Qwen2.5-VL-7B show high probe accuracy even in early layers, consistent with their preserved recall.

These results directly map empirical degradation to representational emergence, mechanistically substantiating the two-hop bottleneck.
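A layerwise probe of this kind can be sketched with a least-squares linear classifier on synthetic residual streams, where entity signal is injected with growing strength in deeper layers to mimic the reported late emergence. Everything here is synthetic; the paper's probe architecture and features are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, n_samples, d_model, n_classes = 5, 200, 16, 4

# Synthetic residual streams: the entity class becomes linearly decodable
# only as depth increases.
labels = rng.integers(0, n_classes, size=n_samples)
onehot = np.eye(n_classes)[labels]
streams = rng.normal(size=(n_layers, n_samples, d_model))
for layer in range(n_layers):
    strength = 4.0 * layer / (n_layers - 1)  # no signal at layer 0, strong at the top
    streams[layer, :, :n_classes] += strength * onehot

def probe_accuracy(X, y):
    """Least-squares linear probe: fit a map from features to one-hot labels."""
    W, *_ = np.linalg.lstsq(X, np.eye(n_classes)[y], rcond=None)
    return float((np.argmax(X @ W, axis=1) == y).mean())

accs = [probe_accuracy(streams[layer], labels) for layer in range(n_layers)]
for layer, acc in enumerate(accs):
    print(f"layer {layer}: probe accuracy {acc:.2f}")
```

On this toy data, accuracy rises from near chance at layer 0 to near perfect at the top layer, which is the qualitative signature reported for LLaVA-style models (with the rise occurring too late in the backbone).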

Dataset Design and Validation

The benchmark's construction pipeline ensures that factual recall assessment is not confounded by recognition failures. Images are paired with entity names and question templates, and models must demonstrate successful entity recognition prior to factual evaluation.

Figure 6: Dataset construction pipeline ensuring reliable, entity-anchored factual recall evaluation in a multimodal setting.

A case study of filtered samples shows that errors are randomly distributed, with no systematic bias.

Quantifying Projector Misalignment

The authors confirm prior findings that MLP-based visual projectors in Adapter-style VLMs produce output embeddings nearly orthogonal to token embeddings in the backbone LLM.

Figure 7: Distribution of cosine similarity between visual projector outputs and LLM token embeddings; peaked at near-zero, indicating strong misalignment.

This represents a fundamental architectural disconnect, explaining why Adapter-projected visual information cannot stimulate early factual recall circuits.
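The orthogonality measurement reduces to cosine similarity between projector outputs and the backbone's embedding table. The sketch below reproduces the high-dimensional concentration effect with random vectors; the dimensions and random stand-ins are assumptions, not the paper's checkpoints:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model = 4096

# Random stand-ins for projected image tokens and the backbone's embedding table.
projector_out = rng.normal(size=(32, d_model))
token_embeds = rng.normal(size=(1000, d_model))

def cosine(a, b):
    """Pairwise cosine similarity between the rows of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

sims = cosine(projector_out, token_embeds)
# In high dimensions, unrelated vectors concentrate near zero cosine similarity,
# the same near-orthogonality pattern the paper reports for adapter projectors.
print(f"mean |cos|: {np.abs(sims).mean():.4f}")
```

The caveat cuts both ways: a near-zero peak is expected for any pair of unrelated high-dimensional distributions, so the diagnostic finding is that projector outputs behave like vectors with no preferential alignment to the token-embedding subspace.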

Prompt-Based Mitigation: Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting is evaluated as an inference-time mitigation. By instructing models to reason via an intermediate textual step describing the image, VLMs are induced to produce entity-token representations, thereby enabling factual knowledge retrieval circuits in the LLM backbone.

  • CoT prompting yields model-dependent gains, closing the factual recall gap entirely in Pixtral-12B, Pixtral-Large-124B (Adapter), and Gemini-2.0-Flash (Native).
  • In other VLMs, gains are smaller, making CoT a partial rather than universal fix.
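As an illustration of the mitigation, a CoT prompt splits the query into an entity-naming step followed by the factual question. The templates below are hypothetical sketches; the paper's exact prompt wording is not reproduced here:

```python
# Hypothetical prompt templates; the paper's exact wording is not reproduced here.
direct_prompt = (
    "<image>\n"
    "In which country is the landmark in the image located? Answer concisely."
)

cot_prompt = (
    "<image>\n"
    "First, name the landmark shown in the image. "
    "Then, using that name, state in which country it is located."
)

# The CoT variant forces the model to emit the entity as text, so the backbone's
# early factual recall circuits receive real entity tokens in the second step.
print(cot_prompt)
```

The design mirrors the paper's mechanism: once the entity name exists as generated text, the second hop reduces to ordinary textual factual recall, which the LLM backbone already performs well.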

Implications and Future Directions

These results underscore that successful multimodal alignment is not simply about representational compatibility or brute-force scaling, but about integrating visual representations into the functional circuits established by LLM pretraining. The observed failure mode is not remediable by generic fine-tuning; it requires structural solutions that deliver early, entity-level alignment between modalities.

Practical Implications

  • Adapter-based plug-and-play VLMs without sufficient multimodal pretraining are fundamentally limited in factual grounding; this impacts their trustworthiness in sensitive real-world settings.
  • Chain-of-thought prompting offers a partial, inference-time fix, but it is inefficient and unreliable across models.

Theoretical Perspective

  • The study situates VLM mechanistic analysis within the broader literature on multi-hop reasoning and factual recall in LLMs, extending insights from unimodal circuits to the multimodal case.
  • Future research should focus on data-efficient adapter designs and multimodal pretraining strategies that induce early token-space alignment, or fundamentally new architectures.

Conclusion

This work provides a mechanistic diagnosis and causal evidence for factual recall degradation in most vision-language models, tracing the issue to late emergence of robust entity representations and bypassed factual circuits in the backbone. Only VLMs with native multimodal training or extremely extensive multimodal fine-tuning escape this bottleneck. Systematic interpretability, intervention experiments, and layerwise probing together delineate a clear research agenda for improving reliable multimodal knowledge retrieval, with implications for the design of the next generation of VLM architectures and benchmarks.
