HippoRAG 2: Enhanced Memory for LLMs
- The paper introduces HippoRAG 2, a framework that unifies dense and sparse retrieval to boost factual recall, associative reasoning, and sense-making in LLMs.
- It employs a dual-node knowledge graph with passage and phrase nodes, enhanced by Personalized PageRank and LLM-based triple filtering for deep contextual retrieval.
- Benchmark experiments demonstrate a 7-point F1 gain over embedding retrievers on associative tasks while significantly reducing LLM token usage.
HippoRAG 2 is a non-parametric continual learning framework for LLMs that augments retrieval-augmented generation (RAG) with explicit, context-rich memory mechanisms, specifically constructed knowledge graphs (KGs) and Personalized PageRank (PPR), achieving superior performance in factual recall, sense-making, and associative retrieval. Designed to mimic the dynamic, interconnected character of human long-term memory, HippoRAG 2 builds upon its predecessor HippoRAG by incorporating deeper passage integration, more contextualized query–triple linking, and an online LLM loop both for knowledge extraction/filtering and answer generation. The system addresses deficiencies in previous graph-augmented RAG architectures, which often compromised factual recall or sense-making for associativity, by unifying dense and sparse representations, adding passage nodes to the KG, and leveraging LLM-powered recognition filtering. In benchmark experiments, HippoRAG 2 lifts associative QA F1 by 7 points over state-of-the-art embedding retrievers, while also excelling in factual and discourse-oriented tasks (Gutiérrez et al., 20 Feb 2025).
1. Motivation and Design Principles
The core motivation for HippoRAG 2 is to endow LLMs with a non-parametric continual memory system capable of context-sensitive knowledge acquisition, recall, and integration: key features of human memory. Standard RAG workflows, reliant on nearest-neighbor vector retrieval, struggle with catastrophic forgetting in fine-tuning and lack the capacity for multi-hop associations ("associativity") and deep context interpretation ("sense-making"). Structure-augmented approaches with knowledge graphs partly address associativity but have yielded trade-offs, usually reducing performance on basic factual QA. HippoRAG 2 aims to unify memory recall mechanisms to optimize all three memory modalities (factual, associative, and sense-making) through innovations including:
- Dense–sparse integration: Passage nodes and phrase nodes co-exist in the KG.
- Deep contextualization: Queries directly link to full triples.
- Recognition memory: LLM-based filtering of relevant triples.
- Online LLM loop: LLM handles both knowledge graph maintenance and final-answer reading.
2. Memory Graph Construction
The HippoRAG 2 knowledge graph includes both phrase nodes (text spans from OpenIE triple extraction) and passage nodes (full passages or documents from the corpus). This dual-node structure enables dense–sparse integration. Edges fall into three categories:
- Relation edges: Connect the subject and object phrase nodes of each KG triple $(s, r, o)$, with an undirected weight $w_{\text{rel}}$.
- Synonym edges: Between phrase nodes $i$ and $j$ whose embeddings satisfy $\cos(\mathbf{e}_i, \mathbf{e}_j) \ge \tau$ for a similarity threshold $\tau$; the edge weight $w_{\text{syn}}$ is derived from that similarity.
- Context edges: Link every phrase node extracted from a passage to that passage node, with weight $w_{\text{ctx}}$.
The graph is represented as an adjacency matrix $A$ over the full node set $V$ (phrase and passage nodes together), normalized row-wise:

$$\tilde{A} = D^{-1} A, \qquad D_{ii} = \sum_{j} A_{ij},$$

where $D$ is the diagonal degree matrix. The KG is static offline; online, only the personalization vector for PPR changes following LLM-driven triple filtering.
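As an illustration, the dual-node graph and its row-normalized adjacency matrix can be sketched in a few lines of NumPy. The node layout, the example edges, and the convention of using the cosine similarity directly as the synonym-edge weight are toy assumptions for this sketch, not the paper's exact settings:

```python
import numpy as np

# Toy graph: 4 phrase nodes (indices 0-3) + 2 passage nodes (indices 4-5).
n_phrase, n_passage = 4, 2
n = n_phrase + n_passage
A = np.zeros((n, n))

def add_undirected(i, j, w):
    A[i, j] += w
    A[j, i] += w

# Relation edges: subject/object phrase nodes of each OpenIE triple.
add_undirected(0, 1, 1.0)
add_undirected(2, 3, 1.0)

# Synonym edge: a phrase pair whose embedding similarity clears the threshold
# (here the similarity itself is used as the weight -- an assumed convention).
add_undirected(1, 2, 0.95)

# Context edges: each phrase node links to the passage it was extracted from.
for phrase, passage in [(0, 4), (1, 4), (2, 5), (3, 5)]:
    add_undirected(phrase, passage, 1.0)

# Row-wise normalization: A_tilde = D^{-1} A with D_ii = sum_j A_ij.
row_sums = A.sum(axis=1, keepdims=True)
A_tilde = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)

assert np.allclose(A_tilde.sum(axis=1), 1.0)  # each row is a distribution
```

Each row of `A_tilde` is then a transition distribution over neighbors, which is exactly the form PPR consumes in the next section.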
3. Personalized PageRank and Retrieval
HippoRAG 2 applies PPR over the normalized adjacency matrix $\tilde{A}$, producing a contextually ranked retrieval via the fixed-point iteration

$$\pi^{(t+1)} = (1-\alpha)\, p + \alpha\, \tilde{A}^{\top} \pi^{(t)},$$

where the damping factor $\alpha$ is fixed and $p$ is a personalization vector defined over the phrase and passage seed nodes selected from the query and triple scores:

$$p_v = \begin{cases} s_v, & v \text{ is a seed node} \\ 0, & \text{otherwise}, \end{cases}$$

with $s_v$ being the average retrieval score for triple-generating phrase nodes, or the weighted embedding similarity for passage nodes, normalized so that $\sum_v p_v = 1$. PPR is solved by power iteration. The top-ranked passage nodes select the contextual passages for the downstream LLM reader.
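The iteration above can be sketched as a short power-iteration routine. The default damping value, iteration cap, and tolerance here are illustrative assumptions, not the paper's reported hyperparameters:

```python
import numpy as np

def personalized_pagerank(A_tilde, p, alpha=0.5, iters=100, tol=1e-10):
    """Power iteration for PPR: pi = (1 - alpha) * p + alpha * A_tilde^T @ pi.

    A_tilde : row-normalized adjacency matrix
    p       : personalization vector over seed nodes (sums to 1)
    alpha   : damping factor (0.5 here is an assumed default)
    """
    pi = p.copy()
    for _ in range(iters):
        nxt = (1 - alpha) * p + alpha * A_tilde.T @ pi
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi
```

Because each row of `A_tilde` sums to one, the walk term preserves probability mass, so the returned vector remains a distribution; ranking passage-node entries of `pi` yields the retrieved context.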
4. Deep Passage Integration and Prompt Construction
"Deep" passage integration involves encoding both the passage $P$ and the query $Q$ into dense vectors $\mathbf{e}_P$ and $\mathbf{e}_Q$, concatenated to form the final context-aware prompt representation:

$$\mathbf{h} = [\,\mathbf{e}_P \,;\, \mathbf{e}_Q\,].$$

In transformer-based LLMs, this concatenation is consumed by the encoder layers, or the memory bank is injected through cross-attention at each generation step:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad K, V \text{ drawn from the memory bank}.$$
Practically, passages are prepended (delimited) in natural language before the query for answering.
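A minimal sketch of that practical prompt construction; the delimiter and template wording are hypothetical choices, not the paper's exact prompt:

```python
def build_prompt(passages, query, delimiter="\n---\n"):
    """Prepend retrieved passages, delimited, before the question --
    the natural-language realization of 'deep' passage integration."""
    context = delimiter.join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Keeping the integration purely textual means any off-the-shelf instruction-tuned LLM can serve as the reader, with no architectural changes.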
5. Online Retrieval and Generation Workflow
The online operational logic comprises four sequential steps:
a. Query–Triple Linking and Passage Ranking: Compute an embedding for the query and match it against KG triple texts and passage embeddings; retrieve the top-$k$ triples $T_q$ and candidate passages $P_q$.
b. Recognition Memory (Triple Filtering): Feed the query plus candidate triples $T_q$ into a secondary LLM (e.g., Llama-3.3-70B-Instruct) with a prompt designed to filter out triples irrelevant to the query (see paper Appendix A). The resulting filtered set is $T_f$.
c. Seed Node Selection and PPR: Extract up to five phrase nodes from $T_f$, scored by their triple scores, plus all passage nodes with scaled embedding similarity; construct the personalization vector $p$ for PPR, returning the top-ranked passages.
d. Final Generation: Concatenate the retrieved passages as context and prompt the LLM for the answer output.
Optional post-processing allows addition of new high-confidence facts back into the KG via OpenIE and synonym detection, supporting continual learning.
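Putting steps (a)–(d) together, a hypothetical end-to-end sketch might look as follows; `embed`, the `kg` object, and the `llm` interface are illustrative stand-ins, not interfaces specified by the paper:

```python
def answer_query(query, kg, embed, llm, k_triples=20, k_passages=5):
    # (a) Query-triple linking and passage ranking via embedding similarity.
    q = embed(query)
    triples = kg.top_triples(q, k=k_triples)
    # (b) Recognition memory: a secondary LLM drops irrelevant triples.
    kept = llm.filter_triples(query, triples)
    # (c) Seed selection + PPR: phrase nodes from the kept triples plus
    #     similarity-weighted passage nodes form the personalization vector.
    seeds = kg.seed_vector(kept, q)
    passages = kg.ppr_top_passages(seeds, k=k_passages)
    # (d) Final generation: retrieved passages are prepended as context.
    context = "\n---\n".join(passages)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

The key design point visible here is that the LLM appears twice online (filtering and reading) while the graph itself stays fixed; only the seed vector changes per query.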
6. Experimental Protocol and Evaluation
HippoRAG 2 was empirically validated on three major task types:
| Task Type | Benchmarks | Metrics |
|---|---|---|
| Factual Recall | NaturalQuestions (NQ), PopQA | Recall@5, EM, F1 |
| Associativity | MuSiQue, 2Wiki, HotpotQA, LV-Eval | Recall@5, EM, F1 |
| Sense-making | NarrativeQA | EM, F1 |
Passage Recall@5 measures the percentage of queries for which a supporting passage is retrieved in the top 5. Exact Match (EM) and F1 reflect generation accuracy. Key result: on associative benchmarks, HippoRAG 2 achieves a mean +7 F1 gain over NV-Embed-v2, the embedding-retriever baseline. Factual and sense-making tasks also show modest improvements.
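For concreteness, the two metric families can be sketched as follows; this is a standard token-level F1 as commonly used in open-domain QA, and the paper's exact answer-normalization details are not reproduced here:

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of queries whose supporting passage appears in the top k.

    retrieved : list of ranked passage lists, one per query
    gold      : list of gold supporting-passage lists, one per query
    """
    hits = sum(any(g in r[:k] for g in golds)
               for r, golds in zip(retrieved, gold))
    return hits / len(gold)

def token_f1(pred, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)
```

EM is simply the stricter special case: 1 when the normalized prediction equals the gold answer exactly, else 0.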
7. Comparative Evaluation and Limitations
Compared to state-of-the-art pure-embedding RAG (NV-Embed-v2 + LLM), HippoRAG 2 lifts multi-hop QA F1 (e.g., MuSiQue: 44.8 → 51.9) and Recall@5 (e.g., MuSiQue: 69.7% → 74.7%; 2Wiki: 76.5% → 90.4%). Structure-based approaches (RAPTOR, GraphRAG, LightRAG, HippoRAG) may improve associativity or sense-making, but generally reduce performance on simple QA by 5–10 F1, a trade-off HippoRAG 2 avoids. It also requires significantly fewer LLM tokens for indexing (e.g., 9M versus 115M for MuSiQue). Nevertheless, the LLM triple filter exhibits a 7% miss rate, and sparse seed sets can limit PPR effectiveness.
8. Future Directions
Future work concentrates on:
- Integrating episodic memory for extended dialogue contexts.
- Automatic consolidation/pruning of memory over large document collections.
- Dynamic graph adaptation reflecting ongoing conversation context.
These directions aim to further approximate human-like conversational memory and scalability in continual learning (Gutiérrez et al., 20 Feb 2025).